12 changes: 10 additions & 2 deletions docs/query-data/udf/python-user-defined-function.md
Original file line number Diff line number Diff line change
@@ -40,6 +40,7 @@ PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "entry_function_name",
"runtime_version" = "python_version",
"deterministic" = "true|false",
"always_nullable" = "true|false"
)
AS $$
@@ -58,7 +59,8 @@ RETURNS INT
PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "evaluate",
"runtime_version" = "3.10.12",
"deterministic" = "true"
)
AS $$
def evaluate(a, b):
@@ -77,7 +79,8 @@ RETURNS STRING
PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "evaluate",
"runtime_version" = "3.10.12",
"deterministic" = "true"
)
AS $$
def evaluate(s1, s2):
@@ -362,6 +365,7 @@ DROP FUNCTION IF EXISTS py_is_prime(INT);
| `symbol` | Yes | - | Python function entry name.<br>• **Inline Mode**: Write function name directly, such as `"evaluate"`<br>• **Module Mode**: Format is `[package_name.]module_name.func_name`, see module mode description |
| `file` | No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
| `runtime_version` | Yes | - | Python runtime version, such as `"3.10.12"`, requires complete version number |
| `deterministic` | No | `false` | Whether the Python UDF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs and the implementation does not depend on the current time, random numbers, or external mutable state.<br>Marking this property correctly lets the optimizer apply query rewrites and other optimizations safely; an incorrect marking may cause wrong query rewrite or pushdown behavior. |
| `always_nullable` | No | `true` | Whether to always return nullable results |
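
Before setting `deterministic` to `true`, it can help to sanity-check the function body outside the database. The sketch below is plain Python and not part of any Doris API; `looks_deterministic`, `add`, `add_jitter`, and `add_counter` are hypothetical names used only for illustration. It calls a candidate function several times with the same arguments and reports whether the outputs ever differ.

```python
import random


def looks_deterministic(func, sample_args, repeats=3):
    """Return False if any sample input yields differing outputs across calls.

    A True result is only evidence, not proof: a function that depends on
    the current date, for example, can pass today and fail tomorrow.
    """
    for args in sample_args:
        first = func(*args)
        for _ in range(repeats - 1):
            if func(*args) != first:
                return False
    return True


def add(a, b):
    # Pure: the result depends only on the arguments.
    return a + b


def add_jitter(a, b):
    # Depends on random numbers, so it should keep deterministic = false.
    return a + b + random.random()


_calls = 0


def add_counter(a, b):
    # Depends on external mutable state (a module-level counter),
    # so it should keep deterministic = false.
    global _calls
    _calls += 1
    return a + b + _calls


print(looks_deterministic(add, [(1, 2), (3, 4)]))  # True
print(looks_deterministic(add_counter, [(1, 2)]))  # False
```

A passing check does not justify the marking on its own; the implementation still has to be reviewed for hidden dependencies on time, randomness, or external state.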

#### Runtime Version Description
@@ -979,6 +983,7 @@ PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "ClassName",
"runtime_version" = "python_version",
"deterministic" = "true|false",
"always_nullable" = "true|false"
)
AS $$
@@ -1409,6 +1414,7 @@ DROP FUNCTION IF EXISTS py_variance(DOUBLE);
| `symbol` | Yes | - | Python class name.<br>• **Inline Mode**: Write class name directly, such as `"SumUDAF"`<br>• **Module Mode**: Format is `[package_name.]module_name.ClassName` |
| `file` | No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
| `runtime_version` | Yes | - | Python runtime version, such as `"3.10.12"` |
| `deterministic` | No | `false` | Whether the Python UDAF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs and the implementation does not depend on the current time, random numbers, or external mutable state.<br>Marking this property correctly lets the optimizer apply query rewrites and other optimizations safely; an incorrect marking may cause wrong query rewrite or pushdown behavior. |
| `always_nullable` | No | `true` | Whether to always return nullable results |

#### runtime_version Description
@@ -1907,6 +1913,7 @@ PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "function_name",
"runtime_version" = "python_version",
"deterministic" = "true|false",
"always_nullable" = "true|false"
)
AS $$
@@ -2405,6 +2412,7 @@ CREATE TABLES FUNCTION py_split(STRING, STRING) ...;
| `symbol` | Yes | - | Python function name.<br>• **Inline Mode**: Write function name directly, such as `"split_string_udtf"`<br>• **Module Mode**: Format is `[package_name.]module_name.function_name` |
| `file` | No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
| `runtime_version` | Yes | - | Python runtime version, such as `"3.10.12"` |
| `deterministic` | No | `false` | Whether the Python UDTF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs and the implementation does not depend on the current time, random numbers, or external mutable state.<br>Marking this property correctly lets the optimizer apply query rewrites and other optimizations safely; an incorrect marking may cause wrong query rewrite or pushdown behavior. |
| `always_nullable` | No | `true` | Whether to always return nullable results |

#### runtime_version Description
90 changes: 89 additions & 1 deletion docs/sql-manual/sql-statements/function/CREATE-FUNCTION.md
Original file line number Diff line number Diff line change
@@ -75,6 +75,7 @@ CREATE [ GLOBAL ]
> - `symbol`: Indicates the class name containing the UDF class. This parameter is mandatory.
> - `type`: Indicates the UDF call type. The default is Native. Use JAVA_UDF when using a Java UDF.
> - `always_nullable`: Indicates whether the UDF result may contain NULL values. This is an optional parameter with a default value of true.
> - `deterministic`: Indicates whether a Java UDF or Python UDF is deterministic. This is an optional parameter with a default value of false. Set it to true only when identical inputs always produce identical outputs and the implementation does not depend on the current time, random numbers, or external mutable state. Correct marking allows the optimizer to handle query rewrites more safely; incorrect marking may lead to wrong query results.

## Access Control Requirements

@@ -135,4 +136,91 @@ To execute this command, the user must have `ADMIN_PRIV` privileges.

```sql
CREATE GLOBAL ALIAS FUNCTION id_masking(INT) WITH PARAMETER(id) AS CONCAT(LEFT(id, 3), '****', RIGHT(id, 4));
```

6. Create a non-deterministic Python UDF. Functions such as `uuid.uuid4()` that depend on randomness should keep the default `deterministic = false` and must not be incorrectly marked as `true`.

```sql
CREATE TABLE cte_uuid_seed (id INT) ENGINE=OLAP DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ("replication_num" = "1");
INSERT INTO cte_uuid_seed VALUES (1),(2),(3);

DROP FUNCTION IF EXISTS py_uuid_token(INT);
CREATE FUNCTION py_uuid_token(INT)
RETURNS STRING
PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "py_uuid_token_impl",
"always_nullable" = "false",
"runtime_version" = "3.12.11"
)
AS $$
import uuid
def py_uuid_token_impl(x):
    return f"{x}-{uuid.uuid4()}"
$$;

SET enable_cte_materialize = true;
SET inline_cte_referenced_threshold = 10;

WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed)
SELECT id, COUNT(DISTINCT token) AS distinct_tokens
FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u
GROUP BY id ORDER BY id;
```

Correct result:

```text
+------+-----------------+
| id   | distinct_tokens |
+------+-----------------+
|    1 |               1 |
|    2 |               1 |
|    3 |               1 |
+------+-----------------+
```

For this function, the following definition is incorrect:

```sql
DROP FUNCTION IF EXISTS py_uuid_token(INT);
CREATE FUNCTION py_uuid_token(INT)
RETURNS STRING
PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "py_uuid_token_impl",
"always_nullable" = "false",
"runtime_version" = "3.12.11",
"deterministic" = "true"
)
AS $$
import uuid
def py_uuid_token_impl(x):
    return f"{x}-{uuid.uuid4()}"
$$;
```

Run the same query again:

```sql
WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed)
SELECT id, COUNT(DISTINCT token) AS distinct_tokens
FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u
GROUP BY id ORDER BY id;
```

Incorrect result:

```text
+------+-----------------+
| id   | distinct_tokens |
+------+-----------------+
|    1 |               2 |
|    2 |               2 |
|    3 |               2 |
+------+-----------------+
```

Why this is wrong:
Because `py_uuid_token` is non-deterministic, each call to `uuid.uuid4()` generates a new value. If the function is incorrectly marked as `deterministic = true`, the optimizer may treat repeated references as safe to rewrite and may choose a plan that evaluates the UDF separately on both sides of `UNION ALL`. As a result, the same `id` can produce two different `token` values, and `COUNT(DISTINCT token)` changes from `1` to `2`.
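
The difference between the two results can be reproduced outside the database. The following is a minimal sketch in plain Python, not Doris code: it assumes that the "correct" plan computes the CTE once and lets both `UNION ALL` branches read the stored rows, while the "incorrect" plan re-evaluates the UDF in each branch. The helper `distinct_tokens` is a hypothetical stand-in for the `COUNT(DISTINCT token) ... GROUP BY id` step.

```python
import uuid


def py_uuid_token_impl(x):
    # Same body as the UDF in the example above.
    return f"{x}-{uuid.uuid4()}"


def distinct_tokens(rows):
    # Emulates: SELECT id, COUNT(DISTINCT token) ... GROUP BY id ORDER BY id
    groups = {}
    for row_id, token in rows:
        groups.setdefault(row_id, set()).add(token)
    return {row_id: len(tokens) for row_id, tokens in sorted(groups.items())}


ids = [1, 2, 3]

# Materialized CTE: the UDF runs once per row, and both UNION ALL
# branches read the same stored rows.
cte = [(i, py_uuid_token_impl(i)) for i in ids]
materialized = distinct_tokens(cte + cte)

# Inlined CTE: each UNION ALL branch evaluates the UDF again.
left = [(i, py_uuid_token_impl(i)) for i in ids]
right = [(i, py_uuid_token_impl(i)) for i in ids]
inlined = distinct_tokens(left + right)

print(materialized)  # {1: 1, 2: 1, 3: 1}
print(inlined)       # {1: 2, 2: 2, 3: 2}
```

For a truly deterministic function the two strategies would return the same counts, which is why the optimizer is free to choose between them only when the marking is accurate.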
Original file line number Diff line number Diff line change
@@ -40,6 +40,7 @@ PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "entry_function_name",
"runtime_version" = "python_version",
"deterministic" = "true|false",
"always_nullable" = "true|false"
)
AS $$
@@ -58,7 +59,8 @@ RETURNS INT
PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "evaluate",
"runtime_version" = "3.10.12",
"deterministic" = "true"
)
AS $$
def evaluate(a, b):
@@ -77,7 +79,8 @@ RETURNS STRING
PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "evaluate",
"runtime_version" = "3.10.12",
"deterministic" = "true"
)
AS $$
def evaluate(s1, s2):
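@@ -362,6 +365,7 @@ DROP FUNCTION IF EXISTS py_is_prime(INT);
| `symbol` | Yes | - | Python function entry name.<br>• **Inline Mode**: Write the function name directly, such as `"evaluate"`<br>• **Module Mode**: Format is `[package_name.]module_name.func_name`; see the module mode description |
| `file` | No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
| `runtime_version` | Yes | - | Python runtime version, such as `"3.10.12"`; the complete version number is required |
| `deterministic` | No | `false` | Whether the Python UDF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs and the implementation does not depend on the current time, random numbers, or external mutable state.<br>With a correct marking, the optimizer can rely on stable semantics to apply query rewrites and other optimizations safely; an incorrect marking may cause wrong query rewrite or pushdown behavior. |
| `always_nullable` | No | `true` | Whether to always return nullable results |

#### Runtime Version Description
@@ -979,6 +983,7 @@ PROPERTIES (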
"type" = "PYTHON_UDF",
"symbol" = "ClassName",
"runtime_version" = "python_version",
"deterministic" = "true|false",
"always_nullable" = "true|false"
)
AS $$
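@@ -1409,6 +1414,7 @@ DROP FUNCTION IF EXISTS py_variance(DOUBLE);
| `symbol` | Yes | - | Python class name.<br>• **Inline Mode**: Write the class name directly, such as `"SumUDAF"`<br>• **Module Mode**: Format is `[package_name.]module_name.ClassName` |
| `file` | No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
| `runtime_version` | Yes | - | Python runtime version, such as `"3.10.12"` |
| `deterministic` | No | `false` | Whether the Python UDAF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs and the implementation does not depend on the current time, random numbers, or external mutable state.<br>With a correct marking, the optimizer can rely on stable semantics to apply query rewrites and other optimizations safely; an incorrect marking may cause wrong query rewrite or pushdown behavior. |
| `always_nullable` | No | `true` | Whether to always return nullable results |

#### runtime_version Description
@@ -1907,6 +1913,7 @@ PROPERTIES (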
"type" = "PYTHON_UDF",
"symbol" = "function_name",
"runtime_version" = "python_version",
"deterministic" = "true|false",
"always_nullable" = "true|false"
)
AS $$
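@@ -2405,6 +2412,7 @@ CREATE TABLES FUNCTION py_split(STRING, STRING) ...;
| `symbol` | Yes | - | Python function name.<br>• **Inline Mode**: Write the function name directly, such as `"split_string_udtf"`<br>• **Module Mode**: Format is `[package_name.]module_name.function_name` |
| `file` | No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
| `runtime_version` | Yes | - | Python runtime version, such as `"3.10.12"` |
| `deterministic` | No | `false` | Whether the Python UDTF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs and the implementation does not depend on the current time, random numbers, or external mutable state.<br>With a correct marking, the optimizer can rely on stable semantics to apply query rewrites and other optimizations safely; an incorrect marking may cause wrong query rewrite or pushdown behavior. |
| `always_nullable` | No | `true` | Whether to always return nullable results |

#### runtime_version Description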
Original file line number Diff line number Diff line change
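@@ -74,6 +74,7 @@ CREATE [ GLOBAL ]
> - `symbol`: Indicates the class name containing the UDF class. This parameter is mandatory.
> - `type`: Indicates the UDF call type. The default is Native. Use JAVA_UDF when using a Java UDF.
> - `always_nullable`: Indicates whether the UDF result may contain NULL values. This is an optional parameter with a default value of true.
> - `deterministic`: Indicates whether a Java UDF or Python UDF is deterministic. This is an optional parameter with a default value of false. Set it to true only when identical inputs always produce identical outputs and the implementation does not depend on the current time, random numbers, or external mutable state. With a correct marking, the optimizer can rely on stable semantics to handle query rewrites and similar scenarios more safely; an incorrect marking may lead to wrong query results.

## Access Control Requirements

@@ -124,4 +125,91 @@ CREATE [ GLOBAL ]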

```sql
CREATE GLOBAL ALIAS FUNCTION id_masking(INT) WITH PARAMETER(id) AS CONCAT(LEFT(id, 3), '****', RIGHT(id, 4));
```

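6. Create a non-deterministic Python UDF. Functions such as `uuid.uuid4()` that depend on random numbers should keep the default `deterministic` value of `false` and must not be incorrectly marked as `true`.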

```sql
CREATE TABLE cte_uuid_seed (id INT) ENGINE=OLAP DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ("replication_num" = "1");
INSERT INTO cte_uuid_seed VALUES (1),(2),(3);

DROP FUNCTION IF EXISTS py_uuid_token(INT);
CREATE FUNCTION py_uuid_token(INT)
RETURNS STRING
PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "py_uuid_token_impl",
"always_nullable" = "false",
"runtime_version" = "3.12.11"
)
AS $$
import uuid
def py_uuid_token_impl(x):
    return f"{x}-{uuid.uuid4()}"
$$;

SET enable_cte_materialize = true;
SET inline_cte_referenced_threshold = 10;

WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed)
SELECT id, COUNT(DISTINCT token) AS distinct_tokens
FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u
GROUP BY id ORDER BY id;
```

Correct result:

```text
+------+-----------------+
| id   | distinct_tokens |
+------+-----------------+
|    1 |               1 |
|    2 |               1 |
|    3 |               1 |
+------+-----------------+
```

For the function above, the following definition is incorrect:

```sql
DROP FUNCTION IF EXISTS py_uuid_token(INT);
CREATE FUNCTION py_uuid_token(INT)
RETURNS STRING
PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "py_uuid_token_impl",
"always_nullable" = "false",
"runtime_version" = "3.12.11",
"deterministic" = "true"
)
AS $$
import uuid
def py_uuid_token_impl(x):
    return f"{x}-{uuid.uuid4()}"
$$;
```

Run the same query again:

```sql
WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed)
SELECT id, COUNT(DISTINCT token) AS distinct_tokens
FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u
GROUP BY id ORDER BY id;
```

Incorrect result:

```text
+------+-----------------+
| id   | distinct_tokens |
+------+-----------------+
|    1 |               2 |
|    2 |               2 |
|    3 |               2 |
+------+-----------------+
```

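Why this is wrong:
`py_uuid_token` is a non-deterministic function: each call to `uuid.uuid4()` generates a new value. If the function is incorrectly marked as `deterministic = true`, the optimizer may treat repeated references as safe to rewrite and may choose a plan that evaluates the UDF separately on both sides of `UNION ALL`. As a result, the same `id` produces two different `token` values, and `COUNT(DISTINCT token)` changes from `1` to `2`.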