Add ExpressionAnalyzer for pluggable expression-level statistics estimation by asolimando · Pull Request #21122 · apache/datafusion

asolimando · 2026-03-23T18:03:49Z

Which issue does this PR close?

Part of #21120 (framework + projection/filter integration)

Rationale for this change

DataFusion currently loses expression-level statistics when computing plan metadata. Projected expressions that aren't bare columns or literals get unknown statistics, and filter selectivity falls back to a hardcoded 20% when interval analysis cannot handle the predicate (e.g. OR predicates, which are not expressible as a single interval). There is also no extension point for users to provide statistics for their own UDFs.

This PR introduces ExpressionAnalyzer, a pluggable chain-of-responsibility framework that addresses these gaps. It follows the same extensibility pattern used elsewhere in DataFusion (ExprPlanner, OptimizerRule, StatisticsRegistry).

What changes are included in this PR?

ExpressionAnalyzer trait and ExpressionAnalyzerRegistry (chain-of-responsibility, first Computed wins)
DefaultExpressionAnalyzer with Selinger-style estimation: equality/inequality via NDV, AND/OR via inclusion-exclusion, injective arithmetic (+/-), literals, NOT
ProjectionExprs and FilterExec use the registry for expression-level statistics
Three new default trait methods on ExecutionPlan (uses_expression_level_statistics, with_expression_analyzer_registry, expression_analyzer_registry) for injection, overridden by FilterExec, ProjectionExec, AggregateExec, HashJoinExec, and SortMergeJoinExec
The physical planner injects the registry after plan creation and re-injects after each optimizer rule that modifies the plan, ensuring optimizer-created nodes always carry it
AggregateStatisticsProvider and JoinStatisticsProvider (feat: Add pluggable StatisticsRegistry for operator-level statistics propagation #21483) consume the registry via the trait getter
Config option optimizer.use_expression_analyzer (default false), zero overhead when disabled

Are these changes tested?

21 unit tests for ExpressionAnalyzer
4 integration tests verifying registry injection survives optimizer rules
5 end-to-end SLT tests (OR selectivity, TopK, UNION ALL, hash join filter pushdown)
Existing projection and filter test suites pass unchanged

Are there any user-facing changes?

New public API (purely additive, non-breaking):

ExpressionAnalyzer trait and ExpressionAnalyzerRegistry in datafusion-physical-expr
SessionState::expression_analyzer_registry() getter
SessionStateBuilder::with_expression_analyzer_registry() setter
Three default trait methods on ExecutionPlan: uses_expression_level_statistics(), with_expression_analyzer_registry(), expression_analyzer_registry()
Config option datafusion.optimizer.use_expression_analyzer

No breaking changes. Default behavior is unchanged (config defaults to false).

Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.

asolimando · 2026-04-01T17:23:04Z

@2010YOUY01: FYI I took a final pass on the PR and marked it as "reviewable"

kosiew

@asolimando

Thanks for the solid work here. The new analyzer direction looks promising, but I ran into a couple of issues that could affect correctness and usefulness of the stats. Left a few detailed comments below.

asolimando · 2026-04-02T17:38:10Z

@asolimando

Thanks for the solid work here. The new analyzer direction looks promising, but I ran into a couple of issues that could affect correctness and usefulness of the stats. Left a few detailed comments below.

Thanks to you @kosiew for the spot-on review, and for sharing your feedback. I have addressed the requested changes, happy to iterate further if needed!

kosiew

@asolimando

Thanks for the follow-up here. The issues called out in the earlier review look addressed, including the projection ordering fix, the equality and inequality selectivity registry lookup, and narrowing the injective arithmetic NDV rule. I did notice one remaining gap around registry propagation through planner and optimizer-created projections.

kosiew · 2026-04-04T08:56:55Z

                    .map(|(expr, alias)| ProjectionExpr { expr, alias })
                    .collect();
-                Ok(Arc::new(ProjectionExec::try_new(proj_exprs, input_exec)?))
+                let mut proj_exec = ProjectionExec::try_new(proj_exprs, input_exec)?;


Thanks for wiring the analyzer registry into planner-built projections. I think there is still one hole here though.

Right now this gets attached when the planner initially creates a ProjectionExec, but several later paths still rebuild or insert projections with plain ProjectionExec::try_new(...), for example in datafusion/physical-plan/src/projection.rs, datafusion/physical-optimizer/src/projection_pushdown.rs, and datafusion/physical-optimizer/src/aggregate_statistics.rs.

After one of those rewrites, expression stats seem to fall back to unknown again even when datafusion.optimizer.enable_expression_analyzer = true.

Could we propagate the registry when cloning or rebuilding projections, or store it somewhere rebuilt plan nodes can recover it from? As written, this still looks plan-shape dependent after physical optimization.

@kosiew: this is a known limitation under the constraint of avoiding breaking changes entirely in this first PR: only the planner has access to SessionState, so operator nodes created later by optimizer rules don't receive the registry. The idea was to lift the limitation via an operator-level statistics registry (see #21443, which would be the second and final part of the pluggable design for statistics).

The limitation is documented in ProjectionExprs::with_expression_analyzer_registry and FilterExecBuilder::with_expression_analyzer_registry.

Apologies for having you figure it out independently, I should have called this out more prominently in the PR description rather than only in code comments.

If you see a better way to propagate the registry that I might have overlooked or if you think a breaking change is the right call here, I'd be happy to hear more about that.

@asolimando

Thanks for the detailed explanation. I understand the non-breaking-change constraint here, and the code comments on ProjectionExprs::with_expression_analyzer_registry / FilterExecBuilder::with_expression_analyzer_registry do make the current boundary clearer. I still think this remains a blocking limitation for the PR as it stands, though, because the feature is exposed behind datafusion.optimizer.enable_expression_analyzer = true yet its effect still depends on whether later physical rewrites rebuild the operator. Several common paths still create fresh ProjectionExec::try_new(...) nodes without a registry (datafusion/physical-plan/src/projection.rs, datafusion/physical-optimizer/src/projection_pushdown.rs, datafusion/physical-optimizer/src/aggregate_statistics.rs, etc.), so expression stats revert to unknown after optimization.

If carrying the registry through optimizer-created operators is not possible in this PR without the broader operator-level design from #21443, I think the safer options would be either:

narrow the current scope/docs so users do not read this as generally-enabled projection/filter expression analysis across the final physical plan, or

treat the operator-level propagation work as required before we rely on this config as a general optimizer feature.

I do not have a clean non-breaking propagation approach to suggest beyond the operator-level registry direction you mentioned, but I wanted to be explicit that I still see the current plan-shape dependency as an unresolved product/correctness concern rather than just a documentation gap.

Thanks for your guidance, @kosiew!

I have implemented option 1 optimistically in 54a0df7: tightened the config description and doc comments to make the scope explicit. But we can drop it in case option 2 is finally retained.

For option 2, I have opened #21483 (it was originally stacked on this PR, but I have reworked it to make it independent, the two can be composed later if #21483 gets in first) if you want to take a concrete look to evaluate whether you'd prefer to hold this PR until that one is ready.

hi @asolimando

I'll take a look again after you incorporate the merged #21483.

Thanks a lot @kosiew for your help, I got sidetracked today but I plan to finalize the rebasing in the next few days, I will re-request a review once ready!

@kosiew, thanks for your patience, here's what changed since your last review.

Addressing the plan-shape dependency:

To address this I finally included a re-injection loop in the physical planner that runs after each optimizer rule that modifies the plan (detected via Arc::ptr_eq). This ensures optimizer-created nodes (e.g. FilterExec from filter pushdown into unions, repositioned filters through joins, etc.) always receive the registry. The walk is entirely skipped when the ExpressionAnalyzer is disabled, no additional overhead for the common path.

Caveat: nodes created within a rule (e.g. a new FilterExec) receive the registry only after that rule completes, before the next rule runs. No such rule (EnforceDistribution, CombinePartialFinalAggregate, ProjectionPushdown, AggregateStatistics and JoinSelection) reads stats from the nodes it just created within the same rule execution, so there's no gap at the moment.

Since there is no gap and that this concern will go away if we integrate fully with partitions_statistics (see the "Future direction" section), which is the natural evolution of this proposal, I consider the integration satisfactory, what do you think?

Integration with StatisticsRegistry (#21483):

The AggregateStatisticsProvider and JoinStatisticsProvider providers consume ExpressionAnalyzer and this adds support for aggregations and joins. When no registry is injected, they delegate to the existing code path, purely additive as the rest.

Future direction: removing the injection overhead (just FYI, no action for this PR)

The re-injection loop for ExpressionAnalyzer and the separate StatisticsRegistry tree walk both exist because partition_statistics has no way to receive external context and we don't want breaking changes.
But we could piggyback on #20184 (child_stats in partition_statistics) as it plans to update partition_statistics's signature anyway.

By generalizing slightly with a StatisticsContext parameter that carries both the ExpressionAnalyzerRegistry and StatisticsRegistry providers, we would get headroom for future changes while limiting breaking changes.

This would collapse both mechanisms into the natural partition_statistics recursion, eliminating the injection walks and the separate bottom-up traversal entirely while keeping the config flags to turn either framework on or off.

The separate walks are acceptable now as a trade-off for a purely additive and non-breaking change, but this would be the natural evolution of the proposal.

New commits in detail:

I had to rebase on #21483, this will make incremental reviewing harder for you, so let me walk you through the changes since your last review (commits 9-12 of 12):

Inject ExpressionAnalyzerRegistry into exec nodes and re-inject via optimizer loop - the core change addressing your feedback, includes integration tests

test(filter): verify ExpressionAnalyzer inclusion-exclusion selectivity for OR predicates - unit test for OR selectivity in FilterExec

refactor(expression_analyzer): delegate when no NDV statistics are available - I realized we were trying to do too much when some columns statistics are unset, now we delegate to built-in

test(expression_analyzer): add StatisticsTable and end-to-end SLT for OR selectivity - end-to-end SLT tests verifying the registry survives TopK sort, UNION ALL filter pushdown, and hash join filter pushdown (queries chosen as they generate nodes that would need the ExpressionAnalyzer, exactly the gap we had before)

Looking forward to your feedback!

kosiew

@asolimando, I spotted an issue with NDV handling that can affect selectivity estimates depending on operand order.

…propagation (#21483) ## Which issue does this PR close? - Part of #21443 (Pluggable operator-level statistics propagation) - Part of #8227 (statistics improvements epic) ## Rationale for this change DataFusion's built-in statistics propagation has no extension point: downstream projects cannot inject external catalog stats, override built-in estimation, or plug in custom strategies without forking. This PR introduces `StatisticsRegistry`, a pluggable chain-of-responsibility for operator-level statistics following the same pattern as `RelationPlanner` for SQL parsing and `ExpressionAnalyzer` (#21120) for expression-level stats. See #21443 for full motivation and design context. ## What changes are included in this PR? 1. Framework (`operator_statistics/mod.rs`): `StatisticsProvider` trait, `StatisticsRegistry` (chain-of-responsibility), `ExtendedStatistics` (Statistics + type-erased extension map), `DefaultStatisticsProvider`. `PhysicalOptimizerContext` trait with `optimize_with_context` dispatch. `SessionState` integration. 2. Built-in providers for Filter, Projection, Passthrough (sort/repartition/etc), Aggregate, Join (hash/sort-merge/nested-loop/cross), Limit, and Union. NDV utilities: `num_distinct_vals`, `ndv_after_selectivity`. 3. `ClosureStatisticsProvider`: closure-based provider for test injection and cardinality feedback. 4. JoinSelection integration: `use_statistics_registry` config flag (default false), registry-aware `optimize_with_context`, SLT test demonstrating plan difference on skewed data. ## Are these changes tested? - 39 unit tests covering all providers, NDV utilities, chain priority, and edge cases (Inexact precision, Absent propagation, Partial aggregate delegation, GROUPING SETS delegation, join-type bounds, multi-key NDV, exact Cartesian product, CrossJoin, GlobalLimit skip+fetch) - 1 SLT test (`statistics_registry.slt`): three-table join on skewed data (8:1:1 customer_id distribution) where the built-in NDV formula estimates 33 rows (wrong; actual=66) and the registry conservatively estimates 100, producing the correct build-side swap ## Are there any user-facing changes? New public API (purely additive, non-breaking): - `StatisticsProvider` trait and `StatisticsRegistry` in `datafusion-physical-plan` - `ExtendedStatistics`, `StatisticsResult` types; built-in provider structs; `num_distinct_vals`, `ndv_after_selectivity` utilities - `PhysicalOptimizerContext` trait and `ConfigOnlyContext` in `datafusion-physical-optimizer` - `SessionState::statistics_registry()`, `SessionStateBuilder::with_statistics_registry()` - Config: `datafusion.optimizer.use_statistics_registry` (default false) Default behavior is unchanged. The registry is only consulted when the flag is explicitly enabled. Known limitations: - Column-level stats (NDV, min/max) at Join/Aggregate/Union/Limit boundaries are not improved: these operators call `partition_statistics(None)` internally, re-fetching raw child stats and discarding registry enrichment. 4 TODO comments mark the affected call sites; #20184 would close this gap. - No `ExpressionAnalyzer` integration yet (#21122). --- Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.

…propagation (apache#21483) ## Which issue does this PR close? - Part of apache#21443 (Pluggable operator-level statistics propagation) - Part of apache#8227 (statistics improvements epic) ## Rationale for this change DataFusion's built-in statistics propagation has no extension point: downstream projects cannot inject external catalog stats, override built-in estimation, or plug in custom strategies without forking. This PR introduces `StatisticsRegistry`, a pluggable chain-of-responsibility for operator-level statistics following the same pattern as `RelationPlanner` for SQL parsing and `ExpressionAnalyzer` (apache#21120) for expression-level stats. See apache#21443 for full motivation and design context. ## What changes are included in this PR? 1. Framework (`operator_statistics/mod.rs`): `StatisticsProvider` trait, `StatisticsRegistry` (chain-of-responsibility), `ExtendedStatistics` (Statistics + type-erased extension map), `DefaultStatisticsProvider`. `PhysicalOptimizerContext` trait with `optimize_with_context` dispatch. `SessionState` integration. 2. Built-in providers for Filter, Projection, Passthrough (sort/repartition/etc), Aggregate, Join (hash/sort-merge/nested-loop/cross), Limit, and Union. NDV utilities: `num_distinct_vals`, `ndv_after_selectivity`. 3. `ClosureStatisticsProvider`: closure-based provider for test injection and cardinality feedback. 4. JoinSelection integration: `use_statistics_registry` config flag (default false), registry-aware `optimize_with_context`, SLT test demonstrating plan difference on skewed data. ## Are these changes tested? - 39 unit tests covering all providers, NDV utilities, chain priority, and edge cases (Inexact precision, Absent propagation, Partial aggregate delegation, GROUPING SETS delegation, join-type bounds, multi-key NDV, exact Cartesian product, CrossJoin, GlobalLimit skip+fetch) - 1 SLT test (`statistics_registry.slt`): three-table join on skewed data (8:1:1 customer_id distribution) where the built-in NDV formula estimates 33 rows (wrong; actual=66) and the registry conservatively estimates 100, producing the correct build-side swap ## Are there any user-facing changes? New public API (purely additive, non-breaking): - `StatisticsProvider` trait and `StatisticsRegistry` in `datafusion-physical-plan` - `ExtendedStatistics`, `StatisticsResult` types; built-in provider structs; `num_distinct_vals`, `ndv_after_selectivity` utilities - `PhysicalOptimizerContext` trait and `ConfigOnlyContext` in `datafusion-physical-optimizer` - `SessionState::statistics_registry()`, `SessionStateBuilder::with_statistics_registry()` - Config: `datafusion.optimizer.use_statistics_registry` (default false) Default behavior is unchanged. The registry is only consulted when the flag is explicitly enabled. Known limitations: - Column-level stats (NDV, min/max) at Join/Aggregate/Union/Limit boundaries are not improved: these operators call `partition_statistics(None)` internally, re-fetching raw child stats and discarding registry enrichment. 4 TODO comments mark the affected call sites; apache#20184 would close this gap. - No `ExpressionAnalyzer` integration yet (apache#21122). --- Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.

Introduce ExpressionAnalyzer, a chain-of-responsibility framework for expression-level statistics estimation (NDV, selectivity, min/max). Framework: - ExpressionAnalyzer trait with registry parameter for chain delegation - ExpressionAnalyzerRegistry to chain analyzers (first Computed wins) - DefaultExpressionAnalyzer: Selinger-style estimation for columns, literals, binary expressions, NOT, boolean predicates Integration: - ExpressionAnalyzerRegistry stored in SessionState, initialized once - ProjectionExprs stores optional registry (non-breaking, no signature changes to project_statistics) - ProjectionExec sets registry via Projector, injected by planner - FilterExec uses registry for selectivity when interval analysis cannot handle the predicate - Custom nodes get builtin analyzer as fallback when registry is absent

- Regenerate configs.md for new enable_expression_analyzer option - Add enable_expression_analyzer to information_schema.slt expected output - Fix unresolved doc links to SessionState and DefaultExpressionAnalyzer (cross-crate references use backticks instead of doc links) - Simplify config description

…putation - Fix expression_analyzer_registry doc comment misplaced between function_factory's doc comment and field declaration - Fix module doc example import path (physical_plan -> physical_expr) - Extract expression_analyzer_registry() helper in planner to avoid repeating the config check 4 times - Defer left_sel/right_sel computation to AND/OR arms only, avoiding unnecessary sub-expression selectivity estimation for comparison operators

…ptimizer loop Add trait methods on ExecutionPlan for expression-level statistics injection (uses_expression_level_statistics, with_expression_analyzer_registry, expression_analyzer_registry). The physical planner injects the registry after plan creation and re-injects after each optimizer rule that modifies the plan, gated by the use_expression_analyzer config flag.

…ty for OR predicates OR predicates are inherently outside interval arithmetic (a union of two disjoint intervals cannot be represented as a single interval). This test confirms that ExpressionAnalyzerRegistry computes the correct inclusion-exclusion selectivity (0.28 = 0.1 + 0.2 - 0.02) on a 1000-row input, versus the default 20% (200 rows) without a registry.

…ailable Return Delegate for all leaf predicates when NDV is unavailable, and propagate Delegate upward through AND/OR/NOT when any child has no estimate. DefaultExpressionAnalyzer now only produces a result when it has a genuine information advantage (NDV from column statistics).

… OR selectivity Add a reusable StatisticsTable (TableProvider + ExecutionPlan with user-supplied statistics) to the sqllogictest harness, and use it in expression_analyzer.slt

…hortened use_expression_analyzer doc

asolimando · 2026-04-15T20:43:10Z

@kosiew the CI error seems unrelated, the same test suite passed for macos but not amd64, I can't reproduce and it doesn't seem it can be caused by changes in this PR, it feels like a flaky test but I'd like to get your opinion.

github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate common Related to common crate physical-plan Changes to the physical-plan crate documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Mar 23, 2026

asolimando force-pushed the asolimando/ndv-expression-analyzer branch from 322b97f to f101c51 Compare March 25, 2026 12:05

2010YOUY01 mentioned this pull request Mar 26, 2026

Pluggable expression-level statistics estimation (ExpressionAnalyzer) #21120

Open

9 tasks

asolimando force-pushed the asolimando/ndv-expression-analyzer branch 2 times, most recently from dfa1324 to f6f27ac Compare April 1, 2026 17:18

asolimando marked this pull request as ready for review April 1, 2026 17:18

kosiew requested changes Apr 2, 2026

View reviewed changes

Comment thread datafusion/physical-expr/src/projection.rs

Comment thread datafusion/physical-expr/src/expression_analyzer/default.rs Outdated

Comment thread datafusion/physical-expr/src/expression_analyzer/default.rs Outdated

asolimando requested a review from kosiew April 2, 2026 17:38

kosiew requested changes Apr 4, 2026

View reviewed changes

2010YOUY01 mentioned this pull request Apr 5, 2026

Estimate aggregate output rows using existing NDV statistics #20926

Merged

kosiew requested changes Apr 8, 2026

View reviewed changes

Comment thread datafusion/physical-expr/src/expression_analyzer/default.rs Outdated

asolimando added 7 commits April 15, 2026 18:52

Snapshot column stats for analyzer to avoid ordering dependency

2c9ee37

Use registry NDV for equality/inequality selectivity on expressions

698caf2

Narrow injective arithmetic to addition and subtraction only

5c85fe8

Make resolve_ndv symmetric: use max NDV across both operands

3f10a1d

Clarify enable_expression_analyzer scope in config and doc comments

9b91988

asolimando force-pushed the asolimando/ndv-expression-analyzer branch from 54a0df7 to ae2a0b8 Compare April 15, 2026 17:43

github-actions bot added the optimizer Optimizer rules label Apr 15, 2026

asolimando force-pushed the asolimando/ndv-expression-analyzer branch from eb44380 to 1068f39 Compare April 15, 2026 18:30

asolimando added 6 commits April 15, 2026 21:41

test(expression_analyzer): add StatisticsTable and end-to-end SLT for…

4be9245

… OR selectivity Add a reusable StatisticsTable (TableProvider + ExecutionPlan with user-supplied statistics) to the sqllogictest harness, and use it in expression_analyzer.slt

fix: cargo fmt, clippy clone_on_ref_ptr, update configs.md

04450ee

fix(slt): update information_schema.slt config description to match s…

42d0f8e

…hortened use_expression_analyzer doc

asolimando force-pushed the asolimando/ndv-expression-analyzer branch from c37838b to 42d0f8e Compare April 15, 2026 19:41

fix(doc): remove broken intra-doc link for Statistics in StatisticsTable

b6d2621

asolimando requested a review from kosiew April 15, 2026 20:15

Conversation

asolimando commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

asolimando commented Apr 1, 2026

Uh oh!

kosiew left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

asolimando commented Apr 2, 2026

Uh oh!

kosiew left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kosiew Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

asolimando Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kosiew Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

asolimando Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

kosiew Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

asolimando Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

asolimando Apr 15, 2026

Choose a reason for hiding this comment

Addressing the plan-shape dependency:

Integration with StatisticsRegistry (#21483):

Future direction: removing the injection overhead (just FYI, no action for this PR)

New commits in detail:

Uh oh!

kosiew left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

asolimando commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

asolimando commented Mar 23, 2026 •

edited

Loading

kosiew left a comment •

edited

Loading

kosiew left a comment •

edited

Loading

asolimando Apr 7, 2026 •

edited

Loading