[Enhancement](udf) support deterministic property for udf by linrrzqqq · Pull Request #62698 · apache/doris

linrrzqqq · 2026-04-22T06:53:29Z

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Previously, UDFs could be treated as deterministic in optimizer-related paths, which is unsafe for UDFs whose results are not stable across evaluations. That may cause invalid rewrite/planning decisions and lead to incorrect query semantics in some cases.

With this change, users can explicitly specify "deterministic" = "true" or "deterministic" = "false" in the PROPERTIES clause when creating a UDF. If the property is not specified, it defaults to false, so the UDF is treated as nondeterministic.
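The defaulting rule can be modeled as follows. This is a minimal Python sketch, not the Doris FE code (which is Java). The helper name `parse_deterministic`, the case-insensitive matching, and the choice to reject values other than true/false are assumptions made for illustration.

```python
from typing import Mapping


def parse_deterministic(properties: Mapping[str, str]) -> bool:
    """Return the value of the "deterministic" UDF property.

    Hypothetical helper: it mirrors the rule described above, where a
    missing property means the UDF is treated as nondeterministic.
    """
    raw = properties.get("deterministic")
    if raw is None:
        return False  # property not specified: default to false
    value = raw.strip().lower()
    if value not in ("true", "false"):
        # Assumption: invalid values are rejected rather than coerced.
        raise ValueError(
            f'"deterministic" must be "true" or "false", got {raw!r}'
        )
    return value == "true"
```

For example, `parse_deterministic({"type": "PYTHON_UDF"})` returns `False`, while `parse_deterministic({"deterministic": "true"})` returns `True`.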

CREATE TABLE cte_uuid_seed (id INT) ENGINE=OLAP DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ("replication_num" = "1");
INSERT INTO cte_uuid_seed VALUES (1),(2),(3);

DROP FUNCTION IF EXISTS py_uuid_token(INT);
CREATE FUNCTION py_uuid_token(INT)
RETURNS STRING
PROPERTIES (
    "type" = "PYTHON_UDF",
    "symbol" = "py_uuid_token_impl",
    "always_nullable" = "false",
    "runtime_version" = "3.12.11"
)
AS $$
import uuid
def py_uuid_token_impl(x):
    return f"{x}-{uuid.uuid4()}"
$$;

before:

SET enable_cte_materialize = true;
SET inline_cte_referenced_threshold = 10;

-- py_uuid_token was treated as a deterministic function, which caused wrong planning
WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed)
SELECT id, COUNT(DISTINCT token) AS distinct_tokens
FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u
GROUP BY id ORDER BY id;
+------+-----------------+
| id   | distinct_tokens |
+------+-----------------+
|    1 |               2 |
|    2 |               2 |
|    3 |               2 |
+------+-----------------+

now:

+------+-----------------+
| id   | distinct_tokens |
+------+-----------------+
|    1 |               1 |
|    2 |               1 |
|    3 |               1 |
+------+-----------------+
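One way to read the two result sets, assuming the "before" plan inlined the CTE into both UNION ALL branches and the "now" plan computes it once: the Python sketch below reuses the UDF body from the example and models the two evaluation strategies. The names `run_inlined`, `run_materialized`, and `distinct_tokens` are hypothetical and do not correspond to Doris internals.

```python
import uuid


def py_uuid_token_impl(x):
    # Same body as the UDF in the example above.
    return f"{x}-{uuid.uuid4()}"


def run_inlined(ids):
    """Each CTE reference re-evaluates the UDF (CTE inlined twice)."""
    branch_a = [(i, py_uuid_token_impl(i)) for i in ids]
    branch_b = [(i, py_uuid_token_impl(i)) for i in ids]
    return branch_a + branch_b  # UNION ALL


def run_materialized(ids):
    """The CTE is computed once; both references read the same rows."""
    cte = [(i, py_uuid_token_impl(i)) for i in ids]
    return cte + cte  # UNION ALL


def distinct_tokens(union_rows):
    """COUNT(DISTINCT token) ... GROUP BY id ORDER BY id."""
    groups = {}
    for i, token in union_rows:
        groups.setdefault(i, set()).add(token)
    return [(i, len(tokens)) for i, tokens in sorted(groups.items())]


ids = [1, 2, 3]
inlined_result = distinct_tokens(run_inlined(ids))
materialized_result = distinct_tokens(run_materialized(ids))
```

Under these assumptions, `inlined_result` has two distinct tokens per id and `materialized_result` has one, matching the "before" and "now" tables.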

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes. [Enhancement](udf) support deterministic property for udf doris-website#3570

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

Thearas · 2026-04-22T06:53:35Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

linrrzqqq · 2026-04-23T09:06:22Z

run buildall

linrrzqqq · 2026-04-23T09:07:56Z

/review

github-actions · 2026-04-23T09:12:44Z

OpenCode automated review failed and did not complete.

Error: Review step was skipped (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24826801894

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

hello-stephen · 2026-04-23T10:36:27Z

FE UT Coverage Report

Increment line coverage 0.00% (0/6) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-04-23T11:13:35Z

FE Regression Coverage Report

Increment line coverage 100.00% (6/6) 🎉
Increment coverage report
Complete coverage report

linrrzqqq · 2026-04-23T11:27:11Z

run buildall

linrrzqqq · 2026-04-23T11:27:19Z

/review

github-actions

I found blocking correctness issues.

Blanket isDeterministic() == false for all Java/Python UDF/UDAF/UDTF classes is too broad. The modified regression coverage uses deterministic helpers (IntTest, MySumInt, FloatTest, float_test.py), so this change now rejects pure external UDFs from MV/MTMV unless users opt into enable_nondeterministic_function=true and stops existing MV rewrite coverage from applying.
The optimizer fix is incomplete. Several rewrite paths that can duplicate or relocate expressions still key off containsUniqueFunction() rather than !isDeterministic(), so external UDFs can still be evaluated multiple times or moved incorrectly. A simple filter(project(udf(...) as a)) case is still vulnerable.
The bundled BE fix correctly handles unaligned Decimal256 Arrow reads, but the adjacent TYPE_DECIMALV2 Arrow branch still performs the same unsafe reinterpret_cast load and keeps the misalignment UB for DECIMALV2.

Critical checkpoints:

Goal / correctness: The PR fixes the specific CTE nondeterminism example, but not the end-to-end optimizer correctness problem and it broadens semantics for all external UDFs.
Scope / minimality: Not minimal; it changes default behavior for every Java/Python UDF/UDAF/UDTF and rewrites existing deterministic-UDF MV tests around that change.
Concurrency / lifecycle: No new concurrency or lifecycle issue found in the touched code.
Config / compatibility: No new config or compatibility issue found.
Parallel paths: Not all relevant rewrite paths were updated.
Tests: Added CTE/MTMV coverage is useful, but the modified float/java tests now encode the broader regression for deterministic UDFs and there is no regression case for the remaining filter/project duplication path.
Observability / transaction / FE-BE variable passing: Not applicable in this diff.
Performance: Blanket nondeterminism disables valid optimizations for pure UDFs.

No additional user-provided focus was supplied.
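The filter(project(udf(...) as a)) concern raised in this review can be illustrated with a small Python model. It is a sketch of a generic alias-substitution pushdown, not of any specific Doris rewrite rule. The stand-in UDF, whose result depends on call order rather than on its argument, and the names `plan_original` and `plan_rewritten` are assumptions made for illustration.

```python
import itertools


def make_flaky_udf():
    """Stand-in for a nondeterministic UDF: returns 1, 0, 1, 0, ... per call."""
    calls = itertools.count(1)

    def udf(_x):
        return next(calls) % 2

    return udf


def plan_original(ids, udf):
    # filter(a = 1) over project(udf(id) AS a): one evaluation per row.
    projected = [(i, udf(i)) for i in ids]
    return [(i, a) for i, a in projected if a == 1]


def plan_rewritten(ids, udf):
    # Filter pushed below the project by replacing the alias with its
    # expression: the predicate and the output column each call udf(id).
    kept = [i for i in ids if udf(i) == 1]
    return [(i, udf(i)) for i in kept]


original = plan_original([1, 2, 3], make_flaky_udf())
rewritten = plan_rewritten([1, 2, 3], make_flaky_udf())
```

In this model every row of `original` satisfies `a = 1`, while `rewritten` contains a row whose output value was produced by a second evaluation and no longer satisfies the predicate. This is the kind of result a determinism check on the rewrite is meant to prevent.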

linrrzqqq · 2026-04-23T13:15:55Z

run buildall

linrrzqqq · 2026-04-23T13:16:01Z

/review

hello-stephen · 2026-04-23T13:27:05Z

FE Regression Coverage Report

Increment line coverage 100.00% (6/6) 🎉
Increment coverage report
Complete coverage report

github-actions

Additional blockers beyond the existing inline threads:

Pre-upgrade UDF metadata is not backward-compatible with the new persisted deterministic field. Old Java/Python UDFs/UDAFs/UDTFs replay as nondeterministic after upgrade, which silently changes SHOW CREATE FUNCTION output and can strand existing MTMVs in SCHEMA_CHANGE when ensureMTMVQueryUsable() re-analyzes the stored query.
The new per-UDF deterministic property is still not honored in cache paths. Deterministic external UDFs remain blanket-ineligible for SQL cache and fragment query cache because those paths still special-case Udf/AliasFunction instead of consulting isDeterministic().
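To make the first blocker concrete, here is a minimal Python sketch of replaying function metadata that predates the new field. The JSON layout, the function names, and the helper `load_function_meta` are hypothetical. The real persistence format lives in the Java FE and is not reproduced here.

```python
import json


def load_function_meta(raw_json, default_when_missing):
    """Deserialize persisted function metadata (hypothetical JSON layout)."""
    meta = json.loads(raw_json)
    # Metadata written before the upgrade has no "deterministic" key, so
    # the value chosen here decides how legacy functions behave on replay.
    meta.setdefault("deterministic", default_when_missing)
    return meta


# Written before the upgrade: the field is absent.
legacy_meta = '{"name": "legacy_udf", "type": "PYTHON_UDF"}'
# Written after the upgrade: the field is explicit.
new_meta = '{"name": "new_udf", "type": "PYTHON_UDF", "deterministic": false}'

replayed_as_nondeterministic = load_function_meta(legacy_meta, False)
replayed_as_deterministic = load_function_meta(legacy_meta, True)
explicit_value_kept = load_function_meta(new_meta, True)
```

With a missing-field default of `False`, the legacy function changes behavior on replay, which is the situation described above. A separate default for legacy metadata is one possible way to preserve the pre-upgrade treatment, and an explicit value written by new DDL is unaffected either way.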

Critical checkpoints:

Goal: Partially met. The PR plumbs a user-facing determinism flag through create/show/Nereids translation and adds regression coverage for CTE/MTMV scenarios, but it does not preserve compatibility for existing persisted UDFs and it does not propagate the flag to cache consumers.
Scope/focus: The code changes are localized, but the end-to-end contract of the new property is incomplete.
Concurrency: No new lock-order or thread-safety issue found in the touched code; the blocking problems are replay/analysis-path correctness issues.
Lifecycle/static init: No special lifecycle or static initialization issue found.
Config/properties: A new user-visible UDF property (deterministic) is added and parsed, but not all relevant consumers honor it.
Compatibility/persistence: Blocking issue. Function JSON persistence changed without a backward-compatible upgrade path for existing function metadata.
Parallel code paths: Blocking issue. Cache/query-normalization paths still bypass the new flag.
Special conditional checks: No additional issue beyond the above.
Test coverage: Added DDL/CTE/MV coverage is useful, but there is no coverage for upgrade/replay compatibility or SQL/query cache behavior.
Test result files: The new .out files look consistent with the added regression cases.
Observability: No new observability gap found for this change.
Transaction/persistence/data writes: Metadata persistence is affected; the replay compatibility issue above is the main blocker.
FE-BE variable passing: Not applicable here.
Performance: No material performance regression found in the touched code.
Other issues: No additional distinct blocker beyond the two comments here and the already-open review threads.

User focus: none provided.
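The cache-path gap in this review comes down to which predicate decides eligibility. The Python sketch below contrasts the two; the `FunctionCall` type and both function names are hypothetical and only model the distinction, not the Doris SQL cache or fragment query cache code.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FunctionCall:
    name: str
    is_udf: bool
    deterministic: bool


def cacheable_by_kind(calls: Iterable[FunctionCall]) -> bool:
    """Behavior the review describes: any UDF disqualifies the query."""
    return not any(c.is_udf for c in calls)


def cacheable_by_determinism(calls: Iterable[FunctionCall]) -> bool:
    """Alternative: only nondeterministic calls disqualify the query."""
    return all(c.deterministic for c in calls)


pure_udf_query = [FunctionCall("pure_udf", is_udf=True, deterministic=True)]
uuid_udf_query = [FunctionCall("py_uuid_token", is_udf=True, deterministic=False)]
```

In this model `cacheable_by_kind(pure_udf_query)` is `False` even though the UDF is declared deterministic, whereas `cacheable_by_determinism` accepts it and still rejects `uuid_udf_query`.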

hello-stephen · 2026-04-23T15:26:20Z

FE Regression Coverage Report

Increment line coverage 87.76% (43/49) 🎉
Increment coverage report
Complete coverage report

linrrzqqq · 2026-04-24T07:25:39Z

run buildall

hello-stephen · 2026-04-24T08:57:03Z

FE UT Coverage Report

Increment line coverage 20.41% (10/49) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-04-24T09:37:36Z

FE Regression Coverage Report

Increment line coverage 66.15% (43/65) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-04-24T11:37:38Z

FE Regression Coverage Report

Increment line coverage 42.57% (43/101) 🎉
Increment coverage report
Complete coverage report

HappenLee · 2026-05-07T13:08:04Z

/review

github-actions

I am requesting changes based on the already-open inline review threads; I did not add duplicate inline comments for the same issues.

Critical checkpoint conclusions:

Goal/test coverage: The PR aims to add an explicit deterministic property for Java/Python UDF, UDAF, and UDTF handling. The property is plumbed through catalog/Nereids classes and new tests exist, but existing review threads identify unresolved end-to-end gaps in cache/planner behavior and optimizer rewrites.
Scope/focus: The code changes are mostly localized, but the behavior change is broad because UDF determinism affects MV/MTMV eligibility, SQL cache, fragment query cache, CTE inlining, and optimizer rewrites.
Concurrency/lifecycle: I did not find new lock ordering, shared mutable state, or static initialization risks in this diff.
Configuration: No new config item is added; this is a persisted function property.
Compatibility/persistence: Existing review context already flags that old persisted functions lack the new field and may be replayed with a changed default, which is a blocking compatibility concern.
Parallel paths: Java/Python scalar, aggregate, and table-function wrappers are touched, but existing review context already flags remaining parallel paths that still do not respect the new determinism bit.
Conditional checks/error handling: Boolean property parsing follows existing property parsing style; I did not find an additional distinct error-handling issue.
Tests/results: Regression and FE unit coverage were added, with generated .out files present. Coverage still does not close the existing end-to-end gaps noted in the inline threads.
Observability/transaction/data-write/FE-BE protocol: No new observability need, data write path, or FE-BE protocol field was identified beyond persisted function metadata.
Performance: I did not find an additional distinct performance regression in the changed code.

User focus: No additional user-provided review focus was specified. I reviewed the full PR with extra attention to determinism propagation and did not find a new non-duplicate issue beyond the existing review threads.

linrrzqqq force-pushed the udf-nondeterministic branch from f2edf43 to 3678922 Compare April 23, 2026 11:25

github-actions Bot requested changes Apr 23, 2026

linrrzqqq force-pushed the udf-nondeterministic branch from 3678922 to f66187b Compare April 23, 2026 13:15

github-actions Bot requested changes Apr 23, 2026

Comment thread fe/fe-catalog/src/main/java/org/apache/doris/catalog/Function.java

Comment thread ...-core/src/main/java/org/apache/doris/nereids/trees/plans/commands/CreateFunctionCommand.java

[Enhancement](udf) support deterministic property for udf

9d73507

linrrzqqq force-pushed the udf-nondeterministic branch from f66187b to 9d73507 Compare April 24, 2026 02:22

linrrzqqq changed the title ~~[Fix](udf) mark udf nondeterministic~~ [Enhancement](udf) support deterministic property for udf Apr 24, 2026

github-actions Bot requested changes May 7, 2026
