[Enhancement](udf) support deterministic property for udf#62698
[Enhancement](udf) support deterministic property for udf#62698linrrzqqq wants to merge 1 commit intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
/review |
|
OpenCode automated review failed and did not complete. Error: Review step was skipped (possibly timeout or cancelled) Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
f2edf43 to
3678922
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found blocking correctness issues.
-
Blanket
isDeterministic() == falsefor all Java/Python UDF/UDAF/UDTF classes is too broad. The modified regression coverage uses deterministic helpers (IntTest,MySumInt,FloatTest,float_test.py), so this change now rejects pure external UDFs from MV/MTMV unless users opt intoenable_nondeterministic_function=trueand stops existing MV rewrite coverage from applying. -
The optimizer fix is incomplete. Several rewrite paths that can duplicate or relocate expressions still key off
containsUniqueFunction()rather than!isDeterministic(), so external UDFs can still be evaluated multiple times or moved incorrectly. A simplefilter(project(udf(...) as a))case is still vulnerable. -
The bundled BE fix correctly handles unaligned
Decimal256Arrow reads, but the adjacentTYPE_DECIMALV2Arrow branch still performs the same unsafereinterpret_castload and keeps the misalignment UB for DECIMALV2.
Critical checkpoints:
- Goal / correctness: The PR fixes the specific CTE nondeterminism example, but not the end-to-end optimizer correctness problem and it broadens semantics for all external UDFs.
- Scope / minimality: Not minimal; it changes default behavior for every Java/Python UDF/UDAF/UDTF and rewrites existing deterministic-UDF MV tests around that change.
- Concurrency / lifecycle: No new concurrency or lifecycle issue found in the touched code.
- Config / compatibility: No new config or compatibility issue found.
- Parallel paths: Not all relevant rewrite paths were updated.
- Tests: Added CTE/MTMV coverage is useful, but the modified float/java tests now encode the broader regression for deterministic UDFs and there is no regression case for the remaining filter/project duplication path.
- Observability / transaction / FE-BE variable passing: Not applicable in this diff.
- Performance: Blanket nondeterminism disables valid optimizations for pure UDFs.
No additional user-provided focus was supplied.
3678922 to
f66187b
Compare
|
run buildall |
|
/review |
FE Regression Coverage ReportIncrement line coverage |
There was a problem hiding this comment.
Additional blockers beyond the existing inline threads:
- Pre-upgrade UDF metadata is not backward-compatible with the new persisted
deterministicfield. Old Java/Python UDFs/UDAFs/UDTFs replay as nondeterministic after upgrade, which silently changesSHOW CREATE FUNCTIONoutput and can strand existing MTMVs inSCHEMA_CHANGEwhenensureMTMVQueryUsable()re-analyzes the stored query. - The new per-UDF
deterministicproperty is still not honored in cache paths. Deterministic external UDFs remain blanket-ineligible for SQL cache and fragment query cache because those paths still special-caseUdf/AliasFunctioninstead of consultingisDeterministic().
Critical checkpoints:
- Goal: Partially met. The PR plumbs a user-facing determinism flag through create/show/Nereids translation and adds regression coverage for CTE/MTMV scenarios, but it does not preserve compatibility for existing persisted UDFs and it does not propagate the flag to cache consumers.
- Scope/focus: The code changes are localized, but the end-to-end contract of the new property is incomplete.
- Concurrency: No new lock-order or thread-safety issue found in the touched code; the blocking problems are replay/analysis-path correctness issues.
- Lifecycle/static init: No special lifecycle or static initialization issue found.
- Config/properties: A new user-visible UDF property (
deterministic) is added and parsed, but not all relevant consumers honor it. - Compatibility/persistence: Blocking issue.
FunctionJSON persistence changed without a backward-compatible upgrade path for existing function metadata. - Parallel code paths: Blocking issue. Cache/query-normalization paths still bypass the new flag.
- Special conditional checks: No additional issue beyond the above.
- Test coverage: Added DDL/CTE/MV coverage is useful, but there is no coverage for upgrade/replay compatibility or SQL/query cache behavior.
- Test result files: The new
.outfiles look consistent with the added regression cases. - Observability: No new observability gap found for this change.
- Transaction/persistence/data writes: Metadata persistence is affected; the replay compatibility issue above is the main blocker.
- FE-BE variable passing: Not applicable here.
- Performance: No material performance regression found in the touched code.
- Other issues: No additional distinct blocker beyond the two comments here and the already-open review threads.
User focus: none provided.
FE Regression Coverage ReportIncrement line coverage |
f66187b to
9d73507
Compare
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
|
/review |
There was a problem hiding this comment.
I am requesting changes based on the already-open inline review threads; I did not add duplicate inline comments for the same issues.
Critical checkpoint conclusions:
- Goal/test coverage: The PR aims to add an explicit deterministic property for Java/Python UDF, UDAF, and UDTF handling. The property is plumbed through catalog/Nereids classes and new tests exist, but existing review threads identify unresolved end-to-end gaps in cache/planner behavior and optimizer rewrites.
- Scope/focus: The code changes are mostly localized, but the behavior change is broad because UDF determinism affects MV/MTMV eligibility, SQL cache, fragment query cache, CTE inlining, and optimizer rewrites.
- Concurrency/lifecycle: I did not find new lock ordering, shared mutable state, or static initialization risks in this diff.
- Configuration: No new config item is added; this is a persisted function property.
- Compatibility/persistence: Existing review context already flags that old persisted functions lack the new field and may be replayed with a changed default, which is a blocking compatibility concern.
- Parallel paths: Java/Python scalar, aggregate, and table-function wrappers are touched, but existing review context already flags remaining parallel paths that still do not respect the new determinism bit.
- Conditional checks/error handling: Boolean property parsing follows existing property parsing style; I did not find an additional distinct error-handling issue.
- Tests/results: Regression and FE unit coverage were added, with generated
.outfiles present. Coverage still does not close the existing end-to-end gaps noted in the inline threads. - Observability/transaction/data-write/FE-BE protocol: No new observability need, data write path, or FE-BE protocol field was identified beyond persisted function metadata.
- Performance: I did not find an additional distinct performance regression in the changed code.
User focus: No additional user-provided review focus was specified. I reviewed the full PR with extra attention to determinism propagation and did not find a new non-duplicate issue beyond the existing review threads.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Previously, UDFs could be treated as deterministic in optimizer-related paths, which is unsafe for UDFs whose results are not stable across evaluations. That may cause invalid rewrite/planning decisions and lead to incorrect query semantics in some cases.
With this change, users can explicitly specify
"deterministic"="true|false"when creating a UDF. If the property is not specified, it defaults to false.before:
now
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)