
[Enhancement](udf) support deterministic property for udf #62698

Open
linrrzqqq wants to merge 1 commit into apache:master from linrrzqqq:udf-nondeterministic

@linrrzqqq
Collaborator

@linrrzqqq linrrzqqq commented Apr 22, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Previously, UDFs could be treated as deterministic in optimizer-related paths, which is unsafe for UDFs whose results are not stable across evaluations. That may cause invalid rewrite/planning decisions and lead to incorrect query semantics in some cases.

With this change, users can explicitly specify "deterministic"="true|false" when creating a UDF. If the property is not specified, it defaults to false.
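The default described above can be sketched as a small parser. This is only an illustration, not the PR's Java code: the function name `parse_deterministic` is hypothetical, and the case-insensitive matching and the error raised on other values are assumptions.

```python
def parse_deterministic(properties: dict[str, str]) -> bool:
    """Read the "deterministic" UDF property; hypothetical helper, not Doris code."""
    raw = properties.get("deterministic")
    if raw is None:
        # Per the PR description, an unspecified property defaults to false.
        return False
    value = raw.strip().lower()  # assumption: values are matched case-insensitively
    if value not in ("true", "false"):
        # Assumption: anything other than true/false is rejected.
        raise ValueError(f"deterministic must be 'true' or 'false', got {raw!r}")
    return value == "true"


unspecified = parse_deterministic({"type": "PYTHON_UDF"})
explicit_true = parse_deterministic({"deterministic": "true"})
explicit_false = parse_deterministic({"deterministic": "false"})
```

With this sketch, a UDF is only treated as deterministic when its author says so explicitly.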

CREATE TABLE cte_uuid_seed (id INT) ENGINE=OLAP DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ("replication_num" = "1");
INSERT INTO cte_uuid_seed VALUES (1),(2),(3);

DROP FUNCTION IF EXISTS py_uuid_token(INT);
CREATE FUNCTION py_uuid_token(INT)
RETURNS STRING
PROPERTIES (
    "type" = "PYTHON_UDF",
    "symbol" = "py_uuid_token_impl",
    "always_nullable" = "false",
    "runtime_version" = "3.12.11"
)
AS $$
import uuid
def py_uuid_token_impl(x):
    return f"{x}-{uuid.uuid4()}"
$$;

before:

SET enable_cte_materialize = true;
SET inline_cte_referenced_threshold = 10;

-- py_uuid_token was treated as a deterministic function, which caused wrong planning:
-- each reference to the CTE produced its own token
WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed)
SELECT id, COUNT(DISTINCT token) AS distinct_tokens
FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u
GROUP BY id ORDER BY id;
+------+-----------------+
| id   | distinct_tokens |
+------+-----------------+
|    1 |               2 |
|    2 |               2 |
|    3 |               2 |
+------+-----------------+

now:

+------+-----------------+
| id   | distinct_tokens |
+------+-----------------+
|    1 |               1 |
|    2 |               1 |
|    3 |               1 |
+------+-----------------+
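The difference between the two result sets can be reproduced outside Doris with a short simulation. It is a sketch that assumes the "before" numbers come from the CTE body being evaluated once per reference and the "now" numbers from it being evaluated once and reused. The helpers `cte` and `distinct_tokens` are hypothetical; `py_uuid_token_impl` is the UDF body from the example.

```python
import uuid


def py_uuid_token_impl(x):
    # Same body as the Python UDF in the PR description.
    return f"{x}-{uuid.uuid4()}"


seed = [1, 2, 3]  # rows of cte_uuid_seed


def cte():
    """Evaluate the CTE body once: SELECT id, py_uuid_token(id) AS token."""
    return [(i, py_uuid_token_impl(i)) for i in seed]


def distinct_tokens(rows):
    """GROUP BY id, COUNT(DISTINCT token), ordered by id."""
    groups = {}
    for i, token in rows:
        groups.setdefault(i, set()).add(token)
    return {i: len(tokens) for i, tokens in sorted(groups.items())}


# Each CTE reference re-runs the body, so the UDF runs twice per id.
inlined = distinct_tokens(cte() + cte())

# The body runs once and both references read the same rows.
shared_rows = cte()
materialized = distinct_tokens(shared_rows + shared_rows)
```

Under that assumption, `inlined` matches the "before" table (2 per id) and `materialized` matches the "now" table (1 per id).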

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Apr 22, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@linrrzqqq
Collaborator Author

run buildall

@linrrzqqq
Collaborator Author

/review

@github-actions
Contributor

OpenCode automated review failed and did not complete.

Error: Review step was skipped (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24826801894

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/6) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 100.00% (6/6) 🎉
Increment coverage report
Complete coverage report

@linrrzqqq linrrzqqq force-pushed the udf-nondeterministic branch from f2edf43 to 3678922 on April 23, 2026 11:25
@linrrzqqq
Collaborator Author

run buildall

@linrrzqqq
Collaborator Author

/review

Contributor

@github-actions github-actions Bot left a comment


I found blocking correctness issues.

  1. Blanket isDeterministic() == false for all Java/Python UDF/UDAF/UDTF classes is too broad. The modified regression coverage uses deterministic helpers (IntTest, MySumInt, FloatTest, float_test.py), so this change now rejects pure external UDFs from MV/MTMV unless users opt into enable_nondeterministic_function=true and stops existing MV rewrite coverage from applying.

  2. The optimizer fix is incomplete. Several rewrite paths that can duplicate or relocate expressions still key off containsUniqueFunction() rather than !isDeterministic(), so external UDFs can still be evaluated multiple times or moved incorrectly. A simple filter(project(udf(...) as a)) case is still vulnerable.

  3. The bundled BE fix correctly handles unaligned Decimal256 Arrow reads, but the adjacent TYPE_DECIMALV2 Arrow branch still performs the same unsafe reinterpret_cast load and keeps the misalignment UB for DECIMALV2.
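The filter(project(udf(...) as a)) concern in item 2 can be illustrated with a toy model. This is a sketch, not the Nereids rewrite itself: `make_udf`, `plan_as_written`, and `plan_after_merge` are hypothetical names, and the counter-based function stands in for any UDF whose result changes between calls.

```python
from itertools import count


def make_udf():
    """Hypothetical non-deterministic UDF: returns a new number on every call."""
    ticket = count(1)
    calls = []

    def udf(x):
        calls.append(x)
        return next(ticket)

    return udf, calls


rows = [10, 20, 30, 40]


def plan_as_written(udf):
    """filter(a % 2 == 1) over project(udf(x) AS a): the UDF runs once per row."""
    projected = [udf(x) for x in rows]
    return [a for a in projected if a % 2 == 1]


def plan_after_merge(udf):
    """The alias is substituted into the predicate, so the UDF runs twice per row."""
    return [udf(x) for x in rows if udf(x) % 2 == 1]


udf_a, calls_a = make_udf()
safe = plan_as_written(udf_a)

udf_b, calls_b = make_udf()
merged = plan_after_merge(udf_b)
```

In this model the merged plan doubles the number of UDF calls, and every value it returns fails the predicate that was supposed to select it, which is the kind of inconsistency the review describes.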

Critical checkpoints:

  • Goal / correctness: The PR fixes the specific CTE nondeterminism example, but not the end-to-end optimizer correctness problem and it broadens semantics for all external UDFs.
  • Scope / minimality: Not minimal; it changes default behavior for every Java/Python UDF/UDAF/UDTF and rewrites existing deterministic-UDF MV tests around that change.
  • Concurrency / lifecycle: No new concurrency or lifecycle issue found in the touched code.
  • Config / compatibility: No new config or compatibility issue found.
  • Parallel paths: Not all relevant rewrite paths were updated.
  • Tests: Added CTE/MTMV coverage is useful, but the modified float/java tests now encode the broader regression for deterministic UDFs and there is no regression case for the remaining filter/project duplication path.
  • Observability / transaction / FE-BE variable passing: Not applicable in this diff.
  • Performance: Blanket nondeterminism disables valid optimizations for pure UDFs.

No additional user-provided focus was supplied.

Comment thread be/src/core/data_type_serde/data_type_decimal_serde.cpp Outdated
@linrrzqqq linrrzqqq force-pushed the udf-nondeterministic branch from 3678922 to f66187b on April 23, 2026 13:15
@linrrzqqq
Collaborator Author

run buildall

@linrrzqqq
Collaborator Author

/review

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 100.00% (6/6) 🎉
Increment coverage report
Complete coverage report

Contributor

@github-actions github-actions Bot left a comment


Additional blockers beyond the existing inline threads:

  1. Pre-upgrade UDF metadata is not backward-compatible with the new persisted deterministic field. Old Java/Python UDFs/UDAFs/UDTFs replay as nondeterministic after upgrade, which silently changes SHOW CREATE FUNCTION output and can strand existing MTMVs in SCHEMA_CHANGE when ensureMTMVQueryUsable() re-analyzes the stored query.
  2. The new per-UDF deterministic property is still not honored in cache paths. Deterministic external UDFs remain blanket-ineligible for SQL cache and fragment query cache because those paths still special-case Udf/AliasFunction instead of consulting isDeterministic().
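The replay concern in item 1 comes down to telling "field absent" apart from "field explicitly false" when old metadata is loaded. The sketch below shows one way to do that. It is not the Doris persistence code: the JSON layout, the `deterministic` key name, and `load_function_meta` are assumptions, and which legacy default is correct is a decision for the PR.

```python
import json


def load_function_meta(raw_json: str, legacy_default: bool) -> dict:
    """Deserialize function metadata, filling in the flag only when it is absent."""
    meta = json.loads(raw_json)
    if "deterministic" not in meta:
        # Metadata written before the field existed: keep the pre-upgrade
        # behavior instead of silently adopting the new default.
        meta["deterministic"] = legacy_default
    return meta


# Hypothetical persisted images: one from before the upgrade, one from after.
old_image = '{"name": "my_udf"}'
new_image = '{"name": "my_udf", "deterministic": false}'

replayed_old = load_function_meta(old_image, legacy_default=True)
replayed_new = load_function_meta(new_image, legacy_default=True)
```

Here the pre-upgrade function keeps the legacy value, while a function created after the upgrade keeps whatever its creator specified.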

Critical checkpoints:

  • Goal: Partially met. The PR plumbs a user-facing determinism flag through create/show/Nereids translation and adds regression coverage for CTE/MTMV scenarios, but it does not preserve compatibility for existing persisted UDFs and it does not propagate the flag to cache consumers.
  • Scope/focus: The code changes are localized, but the end-to-end contract of the new property is incomplete.
  • Concurrency: No new lock-order or thread-safety issue found in the touched code; the blocking problems are replay/analysis-path correctness issues.
  • Lifecycle/static init: No special lifecycle or static initialization issue found.
  • Config/properties: A new user-visible UDF property (deterministic) is added and parsed, but not all relevant consumers honor it.
  • Compatibility/persistence: Blocking issue. Function JSON persistence changed without a backward-compatible upgrade path for existing function metadata.
  • Parallel code paths: Blocking issue. Cache/query-normalization paths still bypass the new flag.
  • Special conditional checks: No additional issue beyond the above.
  • Test coverage: Added DDL/CTE/MV coverage is useful, but there is no coverage for upgrade/replay compatibility or SQL/query cache behavior.
  • Test result files: The new .out files look consistent with the added regression cases.
  • Observability: No new observability gap found for this change.
  • Transaction/persistence/data writes: Metadata persistence is affected; the replay compatibility issue above is the main blocker.
  • FE-BE variable passing: Not applicable here.
  • Performance: No material performance regression found in the touched code.
  • Other issues: No additional distinct blocker beyond the two comments here and the already-open review threads.

User focus: none provided.

Comment thread fe/fe-catalog/src/main/java/org/apache/doris/catalog/Function.java
@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 87.76% (43/49) 🎉
Increment coverage report
Complete coverage report

@linrrzqqq linrrzqqq force-pushed the udf-nondeterministic branch from f66187b to 9d73507 on April 24, 2026 02:22
@linrrzqqq
Collaborator Author

run buildall

@linrrzqqq linrrzqqq changed the title from "[Fix](udf) mark udf nondeterministic" to "[Enhancement](udf) support deterministic property for udf" on Apr 24, 2026
@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 20.41% (10/49) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 66.15% (43/65) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 42.57% (43/101) 🎉
Increment coverage report
Complete coverage report

@HappenLee
Contributor

/review

Contributor

@github-actions github-actions Bot left a comment


I am requesting changes based on the already-open inline review threads; I did not add duplicate inline comments for the same issues.

Critical checkpoint conclusions:

  • Goal/test coverage: The PR aims to add an explicit deterministic property for Java/Python UDF, UDAF, and UDTF handling. The property is plumbed through catalog/Nereids classes and new tests exist, but existing review threads identify unresolved end-to-end gaps in cache/planner behavior and optimizer rewrites.
  • Scope/focus: The code changes are mostly localized, but the behavior change is broad because UDF determinism affects MV/MTMV eligibility, SQL cache, fragment query cache, CTE inlining, and optimizer rewrites.
  • Concurrency/lifecycle: I did not find new lock ordering, shared mutable state, or static initialization risks in this diff.
  • Configuration: No new config item is added; this is a persisted function property.
  • Compatibility/persistence: Existing review context already flags that old persisted functions lack the new field and may be replayed with a changed default, which is a blocking compatibility concern.
  • Parallel paths: Java/Python scalar, aggregate, and table-function wrappers are touched, but existing review context already flags remaining parallel paths that still do not respect the new determinism bit.
  • Conditional checks/error handling: Boolean property parsing follows existing property parsing style; I did not find an additional distinct error-handling issue.
  • Tests/results: Regression and FE unit coverage were added, with generated .out files present. Coverage still does not close the existing end-to-end gaps noted in the inline threads.
  • Observability/transaction/data-write/FE-BE protocol: No new observability need, data write path, or FE-BE protocol field was identified beyond persisted function metadata.
  • Performance: I did not find an additional distinct performance regression in the changed code.

User focus: No additional user-provided review focus was specified. I reviewed the full PR with extra attention to determinism propagation and did not find a new non-duplicate issue beyond the existing review threads.

4 participants