Add ExpressionAnalyzer for pluggable expression-level statistics estimation#21122
Add ExpressionAnalyzer for pluggable expression-level statistics estimation#21122asolimando wants to merge 15 commits intoapache:mainfrom
Conversation
322b97f to
f101c51
Compare
dfa1324 to
f6f27ac
Compare
|
@2010YOUY01: FYI I took a final pass on the PR and marked it as "reviewable" |
Thanks to you @kosiew for the spot-on review, and for sharing your feedback. I have addressed the requested changes, happy to iterate further if needed! |
There was a problem hiding this comment.
Thanks for the follow-up here. The issues called out in the earlier review look addressed, including the projection ordering fix, the equality and inequality selectivity registry lookup, and narrowing the injective arithmetic NDV rule. I did notice one remaining gap around registry propagation through planner and optimizer-created projections.
| .map(|(expr, alias)| ProjectionExpr { expr, alias }) | ||
| .collect(); | ||
| Ok(Arc::new(ProjectionExec::try_new(proj_exprs, input_exec)?)) | ||
| let mut proj_exec = ProjectionExec::try_new(proj_exprs, input_exec)?; |
There was a problem hiding this comment.
Thanks for wiring the analyzer registry into planner-built projections. I think there is still one hole here though.
Right now this gets attached when the planner initially creates a ProjectionExec, but several later paths still rebuild or insert projections with plain ProjectionExec::try_new(...), for example in datafusion/physical-plan/src/projection.rs, datafusion/physical-optimizer/src/projection_pushdown.rs, and datafusion/physical-optimizer/src/aggregate_statistics.rs.
After one of those rewrites, expression stats seem to fall back to unknown again even when datafusion.optimizer.enable_expression_analyzer = true.
Could we propagate the registry when cloning or rebuilding projections, or store it somewhere rebuilt plan nodes can recover it from? As written, this still looks plan-shape dependent after physical optimization.
There was a problem hiding this comment.
@kosiew: this is a known limitation under the constraint of avoiding breaking changes entirely in this first PR: only the planner has access to SessionState, so operator nodes created later by optimizer rules don't receive the registry. The idea was to lift the limitation via an operator-level statistics registry (see #21443, which would be the second and final part of the pluggable design for statistics).
The limitation is documented in ProjectionExprs::with_expression_analyzer_registry and FilterExecBuilder::with_expression_analyzer_registry.
Apologies for having you figure it out independently, I should have called this out more prominently in the PR description rather than only in code comments.
If you see a better way to propagate the registry that I might have overlooked or if you think a breaking change is the right call here, I'd be happy to hear more about that.
There was a problem hiding this comment.
Thanks for the detailed explanation. I understand the non-breaking-change constraint here, and the code comments on ProjectionExprs::with_expression_analyzer_registry / FilterExecBuilder::with_expression_analyzer_registry do make the current boundary clearer. I still think this remains a blocking limitation for the PR as it stands, though, because the feature is exposed behind datafusion.optimizer.enable_expression_analyzer = true yet its effect still depends on whether later physical rewrites rebuild the operator. Several common paths still create fresh ProjectionExec::try_new(...) nodes without a registry (datafusion/physical-plan/src/projection.rs, datafusion/physical-optimizer/src/projection_pushdown.rs, datafusion/physical-optimizer/src/aggregate_statistics.rs, etc.), so expression stats revert to unknown after optimization.
If carrying the registry through optimizer-created operators is not possible in this PR without the broader operator-level design from #21443, I think the safer options would be either:
- narrow the current scope/docs so users do not read this as generally-enabled projection/filter expression analysis across the final physical plan, or
- treat the operator-level propagation work as required before we rely on this config as a general optimizer feature.
I do not have a clean non-breaking propagation approach to suggest beyond the operator-level registry direction you mentioned, but I wanted to be explicit that I still see the current plan-shape dependency as an unresolved product/correctness concern rather than just a documentation gap.
There was a problem hiding this comment.
Thanks for your guidance, @kosiew!
I have implemented option 1 optimistically in 54a0df7: tightened the config description and doc comments to make the scope explicit. But we can drop it in case option 2 is finally retained.
For option 2, I have opened #21483 (it was originally stacked on this PR, but I have reworked it to make it independent, the two can be composed later if #21483 gets in first) if you want to take a concrete look to evaluate whether you'd prefer to hold this PR until that one is ready.
There was a problem hiding this comment.
hi @asolimando
I'll take a look again after you incorporate the merged #21483.
There was a problem hiding this comment.
Thanks a lot @kosiew for your help, I got sidetracked today but I plan to finalize the rebasing in the next few days, I will re-request a review once ready!
There was a problem hiding this comment.
@kosiew, thanks for your patience, here's what changed since your last review.
Addressing the plan-shape dependency:
To address this I finally included a re-injection loop in the physical planner that runs after each optimizer rule that modifies the plan (detected via Arc::ptr_eq). This ensures optimizer-created nodes (e.g. FilterExec from filter pushdown into unions, repositioned filters through joins, etc.) always receive the registry. The walk is entirely skipped when the ExpressionAnalyzer is disabled, no additional overhead for the common path.
Caveat: nodes created within a rule (e.g. a new FilterExec) receive the registry only after that rule completes, before the next rule runs. No such rule (EnforceDistribution, CombinePartialFinalAggregate, ProjectionPushdown, AggregateStatistics and JoinSelection) reads stats from the nodes it just created within the same rule execution, so there's no gap at the moment.
Since there is no gap and that this concern will go away if we integrate fully with partitions_statistics (see the "Future direction" section), which is the natural evolution of this proposal, I consider the integration satisfactory, what do you think?
Integration with StatisticsRegistry (#21483):
The AggregateStatisticsProvider and JoinStatisticsProvider providers consume ExpressionAnalyzer and this adds support for aggregations and joins. When no registry is injected, they delegate to the existing code path, purely additive as the rest.
Future direction: removing the injection overhead (just FYI, no action for this PR)
The re-injection loop for ExpressionAnalyzer and the separate StatisticsRegistry tree walk both exist because partition_statistics has no way to receive external context and we don't want breaking changes.
But we could piggyback on #20184 (child_stats in partition_statistics) as it plans to update partition_statistics's signature anyway.
By generalizing slightly with a StatisticsContext parameter that carries both the ExpressionAnalyzerRegistry and StatisticsRegistry providers, we would get headroom for future changes while limiting breaking changes.
This would collapse both mechanisms into the natural partition_statistics recursion, eliminating the injection walks and the separate bottom-up traversal entirely while keeping the config flags to turn either framework on or off.
The separate walks are acceptable now as a trade-off for a purely additive and non-breaking change, but this would be the natural evolution of the proposal.
New commits in detail:
I had to rebase on #21483, this will make incremental reviewing harder for you, so let me walk you through the changes since your last review (commits 9-12 of 12):
- Inject ExpressionAnalyzerRegistry into exec nodes and re-inject via optimizer loop - the core change addressing your feedback, includes integration tests
- test(filter): verify ExpressionAnalyzer inclusion-exclusion selectivity for OR predicates - unit test for OR selectivity in FilterExec
- refactor(expression_analyzer): delegate when no NDV statistics are available - I realized we were trying to do too much when some columns statistics are unset, now we delegate to built-in
- test(expression_analyzer): add StatisticsTable and end-to-end SLT for OR selectivity - end-to-end SLT tests verifying the registry survives TopK sort, UNION ALL filter pushdown, and hash join filter pushdown (queries chosen as they generate nodes that would need the
ExpressionAnalyzer, exactly the gap we had before)
Looking forward to your feedback!
kosiew
left a comment
There was a problem hiding this comment.
@asolimando, I spotted an issue with NDV handling that can affect selectivity estimates depending on operand order.
…propagation (#21483) ## Which issue does this PR close? - Part of #21443 (Pluggable operator-level statistics propagation) - Part of #8227 (statistics improvements epic) ## Rationale for this change DataFusion's built-in statistics propagation has no extension point: downstream projects cannot inject external catalog stats, override built-in estimation, or plug in custom strategies without forking. This PR introduces `StatisticsRegistry`, a pluggable chain-of-responsibility for operator-level statistics following the same pattern as `RelationPlanner` for SQL parsing and `ExpressionAnalyzer` (#21120) for expression-level stats. See #21443 for full motivation and design context. ## What changes are included in this PR? 1. Framework (`operator_statistics/mod.rs`): `StatisticsProvider` trait, `StatisticsRegistry` (chain-of-responsibility), `ExtendedStatistics` (Statistics + type-erased extension map), `DefaultStatisticsProvider`. `PhysicalOptimizerContext` trait with `optimize_with_context` dispatch. `SessionState` integration. 2. Built-in providers for Filter, Projection, Passthrough (sort/repartition/etc), Aggregate, Join (hash/sort-merge/nested-loop/cross), Limit, and Union. NDV utilities: `num_distinct_vals`, `ndv_after_selectivity`. 3. `ClosureStatisticsProvider`: closure-based provider for test injection and cardinality feedback. 4. JoinSelection integration: `use_statistics_registry` config flag (default false), registry-aware `optimize_with_context`, SLT test demonstrating plan difference on skewed data. ## Are these changes tested? - 39 unit tests covering all providers, NDV utilities, chain priority, and edge cases (Inexact precision, Absent propagation, Partial aggregate delegation, GROUPING SETS delegation, join-type bounds, multi-key NDV, exact Cartesian product, CrossJoin, GlobalLimit skip+fetch) - 1 SLT test (`statistics_registry.slt`): three-table join on skewed data (8:1:1 customer_id distribution) where the built-in NDV formula estimates 33 rows (wrong; actual=66) and the registry conservatively estimates 100, producing the correct build-side swap ## Are there any user-facing changes? New public API (purely additive, non-breaking): - `StatisticsProvider` trait and `StatisticsRegistry` in `datafusion-physical-plan` - `ExtendedStatistics`, `StatisticsResult` types; built-in provider structs; `num_distinct_vals`, `ndv_after_selectivity` utilities - `PhysicalOptimizerContext` trait and `ConfigOnlyContext` in `datafusion-physical-optimizer` - `SessionState::statistics_registry()`, `SessionStateBuilder::with_statistics_registry()` - Config: `datafusion.optimizer.use_statistics_registry` (default false) Default behavior is unchanged. The registry is only consulted when the flag is explicitly enabled. Known limitations: - Column-level stats (NDV, min/max) at Join/Aggregate/Union/Limit boundaries are not improved: these operators call `partition_statistics(None)` internally, re-fetching raw child stats and discarding registry enrichment. 4 TODO comments mark the affected call sites; #20184 would close this gap. - No `ExpressionAnalyzer` integration yet (#21122). --- Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.
…propagation (apache#21483) ## Which issue does this PR close? - Part of apache#21443 (Pluggable operator-level statistics propagation) - Part of apache#8227 (statistics improvements epic) ## Rationale for this change DataFusion's built-in statistics propagation has no extension point: downstream projects cannot inject external catalog stats, override built-in estimation, or plug in custom strategies without forking. This PR introduces `StatisticsRegistry`, a pluggable chain-of-responsibility for operator-level statistics following the same pattern as `RelationPlanner` for SQL parsing and `ExpressionAnalyzer` (apache#21120) for expression-level stats. See apache#21443 for full motivation and design context. ## What changes are included in this PR? 1. Framework (`operator_statistics/mod.rs`): `StatisticsProvider` trait, `StatisticsRegistry` (chain-of-responsibility), `ExtendedStatistics` (Statistics + type-erased extension map), `DefaultStatisticsProvider`. `PhysicalOptimizerContext` trait with `optimize_with_context` dispatch. `SessionState` integration. 2. Built-in providers for Filter, Projection, Passthrough (sort/repartition/etc), Aggregate, Join (hash/sort-merge/nested-loop/cross), Limit, and Union. NDV utilities: `num_distinct_vals`, `ndv_after_selectivity`. 3. `ClosureStatisticsProvider`: closure-based provider for test injection and cardinality feedback. 4. JoinSelection integration: `use_statistics_registry` config flag (default false), registry-aware `optimize_with_context`, SLT test demonstrating plan difference on skewed data. ## Are these changes tested? - 39 unit tests covering all providers, NDV utilities, chain priority, and edge cases (Inexact precision, Absent propagation, Partial aggregate delegation, GROUPING SETS delegation, join-type bounds, multi-key NDV, exact Cartesian product, CrossJoin, GlobalLimit skip+fetch) - 1 SLT test (`statistics_registry.slt`): three-table join on skewed data (8:1:1 customer_id distribution) where the built-in NDV formula estimates 33 rows (wrong; actual=66) and the registry conservatively estimates 100, producing the correct build-side swap ## Are there any user-facing changes? New public API (purely additive, non-breaking): - `StatisticsProvider` trait and `StatisticsRegistry` in `datafusion-physical-plan` - `ExtendedStatistics`, `StatisticsResult` types; built-in provider structs; `num_distinct_vals`, `ndv_after_selectivity` utilities - `PhysicalOptimizerContext` trait and `ConfigOnlyContext` in `datafusion-physical-optimizer` - `SessionState::statistics_registry()`, `SessionStateBuilder::with_statistics_registry()` - Config: `datafusion.optimizer.use_statistics_registry` (default false) Default behavior is unchanged. The registry is only consulted when the flag is explicitly enabled. Known limitations: - Column-level stats (NDV, min/max) at Join/Aggregate/Union/Limit boundaries are not improved: these operators call `partition_statistics(None)` internally, re-fetching raw child stats and discarding registry enrichment. 4 TODO comments mark the affected call sites; apache#20184 would close this gap. - No `ExpressionAnalyzer` integration yet (apache#21122). --- Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.
Introduce ExpressionAnalyzer, a chain-of-responsibility framework for expression-level statistics estimation (NDV, selectivity, min/max). Framework: - ExpressionAnalyzer trait with registry parameter for chain delegation - ExpressionAnalyzerRegistry to chain analyzers (first Computed wins) - DefaultExpressionAnalyzer: Selinger-style estimation for columns, literals, binary expressions, NOT, boolean predicates Integration: - ExpressionAnalyzerRegistry stored in SessionState, initialized once - ProjectionExprs stores optional registry (non-breaking, no signature changes to project_statistics) - ProjectionExec sets registry via Projector, injected by planner - FilterExec uses registry for selectivity when interval analysis cannot handle the predicate - Custom nodes get builtin analyzer as fallback when registry is absent
- Regenerate configs.md for new enable_expression_analyzer option - Add enable_expression_analyzer to information_schema.slt expected output - Fix unresolved doc links to SessionState and DefaultExpressionAnalyzer (cross-crate references use backticks instead of doc links) - Simplify config description
…putation - Fix expression_analyzer_registry doc comment misplaced between function_factory's doc comment and field declaration - Fix module doc example import path (physical_plan -> physical_expr) - Extract expression_analyzer_registry() helper in planner to avoid repeating the config check 4 times - Defer left_sel/right_sel computation to AND/OR arms only, avoiding unnecessary sub-expression selectivity estimation for comparison operators
54a0df7 to
ae2a0b8
Compare
eb44380 to
1068f39
Compare
…ptimizer loop Add trait methods on ExecutionPlan for expression-level statistics injection (uses_expression_level_statistics, with_expression_analyzer_registry, expression_analyzer_registry). The physical planner injects the registry after plan creation and re-injects after each optimizer rule that modifies the plan, gated by the use_expression_analyzer config flag.
…ty for OR predicates OR predicates are inherently outside interval arithmetic (a union of two disjoint intervals cannot be represented as a single interval). This test confirms that ExpressionAnalyzerRegistry computes the correct inclusion-exclusion selectivity (0.28 = 0.1 + 0.2 - 0.02) on a 1000-row input, versus the default 20% (200 rows) without a registry.
…ailable Return Delegate for all leaf predicates when NDV is unavailable, and propagate Delegate upward through AND/OR/NOT when any child has no estimate. DefaultExpressionAnalyzer now only produces a result when it has a genuine information advantage (NDV from column statistics).
… OR selectivity Add a reusable StatisticsTable (TableProvider + ExecutionPlan with user-supplied statistics) to the sqllogictest harness, and use it in expression_analyzer.slt
…hortened use_expression_analyzer doc
c37838b to
42d0f8e
Compare
|
@kosiew the CI error seems unrelated, the same test suite passed for |
Which issue does this PR close?
Part of #21120 (framework + projection/filter integration)
Rationale for this change
DataFusion currently loses expression-level statistics when computing plan metadata. Projected expressions that aren't bare columns or literals get unknown statistics, and filter selectivity falls back to a hardcoded 20% when interval analysis cannot handle the predicate (e.g. OR predicates, which are not expressible as a single interval). There is also no extension point for users to provide statistics for their own UDFs.
This PR introduces
ExpressionAnalyzer, a pluggable chain-of-responsibility framework that addresses these gaps. It follows the same extensibility pattern used elsewhere in DataFusion (ExprPlanner,OptimizerRule,StatisticsRegistry).What changes are included in this PR?
ExpressionAnalyzertrait andExpressionAnalyzerRegistry(chain-of-responsibility, firstComputedwins)DefaultExpressionAnalyzerwith Selinger-style estimation: equality/inequality via NDV, AND/OR via inclusion-exclusion, injective arithmetic (+/-), literals, NOTProjectionExprsandFilterExecuse the registry for expression-level statisticsExecutionPlan(uses_expression_level_statistics,with_expression_analyzer_registry,expression_analyzer_registry) for injection, overridden byFilterExec,ProjectionExec,AggregateExec,HashJoinExec, andSortMergeJoinExecAggregateStatisticsProviderandJoinStatisticsProvider(feat: Add pluggable StatisticsRegistry for operator-level statistics propagation #21483) consume the registry via the trait getteroptimizer.use_expression_analyzer(default false), zero overhead when disabledAre these changes tested?
Are there any user-facing changes?
New public API (purely additive, non-breaking):
ExpressionAnalyzertrait andExpressionAnalyzerRegistryindatafusion-physical-exprSessionState::expression_analyzer_registry()getterSessionStateBuilder::with_expression_analyzer_registry()setterExecutionPlan:uses_expression_level_statistics(),with_expression_analyzer_registry(),expression_analyzer_registry()datafusion.optimizer.use_expression_analyzerNo breaking changes. Default behavior is unchanged (config defaults to false).
Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.