Skip to content

Add ExpressionAnalyzer for pluggable expression-level statistics estimation#21122

Open
asolimando wants to merge 15 commits intoapache:mainfrom
asolimando:asolimando/ndv-expression-analyzer
Open

Add ExpressionAnalyzer for pluggable expression-level statistics estimation#21122
asolimando wants to merge 15 commits intoapache:mainfrom
asolimando:asolimando/ndv-expression-analyzer

Conversation

@asolimando
Copy link
Copy Markdown
Member

@asolimando asolimando commented Mar 23, 2026

Which issue does this PR close?

Part of #21120 (framework + projection/filter integration)

Rationale for this change

DataFusion currently loses expression-level statistics when computing plan metadata. Projected expressions that aren't bare columns or literals get unknown statistics, and filter selectivity falls back to a hardcoded 20% when interval analysis cannot handle the predicate (e.g. OR predicates, which are not expressible as a single interval). There is also no extension point for users to provide statistics for their own UDFs.

This PR introduces ExpressionAnalyzer, a pluggable chain-of-responsibility framework that addresses these gaps. It follows the same extensibility pattern used elsewhere in DataFusion (ExprPlanner, OptimizerRule, StatisticsRegistry).

What changes are included in this PR?

  • ExpressionAnalyzer trait and ExpressionAnalyzerRegistry (chain-of-responsibility, first Computed wins)
  • DefaultExpressionAnalyzer with Selinger-style estimation: equality/inequality via NDV, AND/OR via inclusion-exclusion, injective arithmetic (+/-), literals, NOT
  • ProjectionExprs and FilterExec use the registry for expression-level statistics
  • Three new default trait methods on ExecutionPlan (uses_expression_level_statistics, with_expression_analyzer_registry, expression_analyzer_registry) for injection, overridden by FilterExec, ProjectionExec, AggregateExec, HashJoinExec, and SortMergeJoinExec
  • The physical planner injects the registry after plan creation and re-injects after each optimizer rule that modifies the plan, ensuring optimizer-created nodes always carry it
  • AggregateStatisticsProvider and JoinStatisticsProvider (feat: Add pluggable StatisticsRegistry for operator-level statistics propagation #21483) consume the registry via the trait getter
  • Config option optimizer.use_expression_analyzer (default false), zero overhead when disabled

Are these changes tested?

  • 21 unit tests for ExpressionAnalyzer
  • 4 integration tests verifying registry injection survives optimizer rules
  • 5 end-to-end SLT tests (OR selectivity, TopK, UNION ALL, hash join filter pushdown)
  • Existing projection and filter test suites pass unchanged

Are there any user-facing changes?

New public API (purely additive, non-breaking):

  • ExpressionAnalyzer trait and ExpressionAnalyzerRegistry in datafusion-physical-expr
  • SessionState::expression_analyzer_registry() getter
  • SessionStateBuilder::with_expression_analyzer_registry() setter
  • Three default trait methods on ExecutionPlan: uses_expression_level_statistics(), with_expression_analyzer_registry(), expression_analyzer_registry()
  • Config option datafusion.optimizer.use_expression_analyzer

No breaking changes. Default behavior is unchanged (config defaults to false).


Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.

@github-actions github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate common Related to common crate physical-plan Changes to the physical-plan crate documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Mar 23, 2026
@asolimando asolimando force-pushed the asolimando/ndv-expression-analyzer branch from 322b97f to f101c51 Compare March 25, 2026 12:05
@asolimando asolimando force-pushed the asolimando/ndv-expression-analyzer branch 2 times, most recently from dfa1324 to f6f27ac Compare April 1, 2026 17:18
@asolimando asolimando marked this pull request as ready for review April 1, 2026 17:18
@asolimando
Copy link
Copy Markdown
Member Author

@2010YOUY01: FYI I took a final pass on the PR and marked it as "reviewable"

Copy link
Copy Markdown
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asolimando

Thanks for the solid work here. The new analyzer direction looks promising, but I ran into a couple of issues that could affect correctness and usefulness of the stats. Left a few detailed comments below.

Comment thread datafusion/physical-expr/src/projection.rs
Comment thread datafusion/physical-expr/src/expression_analyzer/default.rs Outdated
Comment thread datafusion/physical-expr/src/expression_analyzer/default.rs Outdated
@asolimando
Copy link
Copy Markdown
Member Author

@asolimando

Thanks for the solid work here. The new analyzer direction looks promising, but I ran into a couple of issues that could affect correctness and usefulness of the stats. Left a few detailed comments below.

Thanks to you @kosiew for the spot-on review, and for sharing your feedback. I have addressed the requested changes, happy to iterate further if needed!

@asolimando asolimando requested a review from kosiew April 2, 2026 17:38
Copy link
Copy Markdown
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asolimando

Thanks for the follow-up here. The issues called out in the earlier review look addressed, including the projection ordering fix, the equality and inequality selectivity registry lookup, and narrowing the injective arithmetic NDV rule. I did notice one remaining gap around registry propagation through planner and optimizer-created projections.

Comment thread datafusion/core/src/physical_planner.rs Outdated
.map(|(expr, alias)| ProjectionExpr { expr, alias })
.collect();
Ok(Arc::new(ProjectionExec::try_new(proj_exprs, input_exec)?))
let mut proj_exec = ProjectionExec::try_new(proj_exprs, input_exec)?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for wiring the analyzer registry into planner-built projections. I think there is still one hole here though.

Right now this gets attached when the planner initially creates a ProjectionExec, but several later paths still rebuild or insert projections with plain ProjectionExec::try_new(...), for example in datafusion/physical-plan/src/projection.rs, datafusion/physical-optimizer/src/projection_pushdown.rs, and datafusion/physical-optimizer/src/aggregate_statistics.rs.

After one of those rewrites, expression stats seem to fall back to unknown again even when datafusion.optimizer.enable_expression_analyzer = true.

Could we propagate the registry when cloning or rebuilding projections, or store it somewhere rebuilt plan nodes can recover it from? As written, this still looks plan-shape dependent after physical optimization.

Copy link
Copy Markdown
Member Author

@asolimando asolimando Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kosiew: this is a known limitation under the constraint of avoiding breaking changes entirely in this first PR: only the planner has access to SessionState, so operator nodes created later by optimizer rules don't receive the registry. The idea was to lift the limitation via an operator-level statistics registry (see #21443, which would be the second and final part of the pluggable design for statistics).

The limitation is documented in ProjectionExprs::with_expression_analyzer_registry and FilterExecBuilder::with_expression_analyzer_registry.

Apologies for having you figure it out independently, I should have called this out more prominently in the PR description rather than only in code comments.

If you see a better way to propagate the registry that I might have overlooked or if you think a breaking change is the right call here, I'd be happy to hear more about that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asolimando

Thanks for the detailed explanation. I understand the non-breaking-change constraint here, and the code comments on ProjectionExprs::with_expression_analyzer_registry / FilterExecBuilder::with_expression_analyzer_registry do make the current boundary clearer. I still think this remains a blocking limitation for the PR as it stands, though, because the feature is exposed behind datafusion.optimizer.enable_expression_analyzer = true yet its effect still depends on whether later physical rewrites rebuild the operator. Several common paths still create fresh ProjectionExec::try_new(...) nodes without a registry (datafusion/physical-plan/src/projection.rs, datafusion/physical-optimizer/src/projection_pushdown.rs, datafusion/physical-optimizer/src/aggregate_statistics.rs, etc.), so expression stats revert to unknown after optimization.

If carrying the registry through optimizer-created operators is not possible in this PR without the broader operator-level design from #21443, I think the safer options would be either:

  • narrow the current scope/docs so users do not read this as generally-enabled projection/filter expression analysis across the final physical plan, or
  • treat the operator-level propagation work as required before we rely on this config as a general optimizer feature.

I do not have a clean non-breaking propagation approach to suggest beyond the operator-level registry direction you mentioned, but I wanted to be explicit that I still see the current plan-shape dependency as an unresolved product/correctness concern rather than just a documentation gap.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your guidance, @kosiew!

I have implemented option 1 optimistically in 54a0df7: tightened the config description and doc comments to make the scope explicit. But we can drop it in case option 2 is finally retained.

For option 2, I have opened #21483 (it was originally stacked on this PR, but I have reworked it to make it independent, the two can be composed later if #21483 gets in first) if you want to take a concrete look to evaluate whether you'd prefer to hold this PR until that one is ready.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @asolimando

I'll take a look again after you incorporate the merged #21483.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @kosiew for your help, I got sidetracked today but I plan to finalize the rebasing in the next few days, I will re-request a review once ready!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kosiew, thanks for your patience, here's what changed since your last review.

Addressing the plan-shape dependency:

To address this I finally included a re-injection loop in the physical planner that runs after each optimizer rule that modifies the plan (detected via Arc::ptr_eq). This ensures optimizer-created nodes (e.g. FilterExec from filter pushdown into unions, repositioned filters through joins, etc.) always receive the registry. The walk is entirely skipped when the ExpressionAnalyzer is disabled, no additional overhead for the common path.

Caveat: nodes created within a rule (e.g. a new FilterExec) receive the registry only after that rule completes, before the next rule runs. No such rule (EnforceDistribution, CombinePartialFinalAggregate, ProjectionPushdown, AggregateStatistics and JoinSelection) reads stats from the nodes it just created within the same rule execution, so there's no gap at the moment.

Since there is no gap and that this concern will go away if we integrate fully with partitions_statistics (see the "Future direction" section), which is the natural evolution of this proposal, I consider the integration satisfactory, what do you think?

Integration with StatisticsRegistry (#21483):

The AggregateStatisticsProvider and JoinStatisticsProvider providers consume ExpressionAnalyzer and this adds support for aggregations and joins. When no registry is injected, they delegate to the existing code path, purely additive as the rest.

Future direction: removing the injection overhead (just FYI, no action for this PR)

The re-injection loop for ExpressionAnalyzer and the separate StatisticsRegistry tree walk both exist because partition_statistics has no way to receive external context and we don't want breaking changes.
But we could piggyback on #20184 (child_stats in partition_statistics) as it plans to update partition_statistics's signature anyway.

By generalizing slightly with a StatisticsContext parameter that carries both the ExpressionAnalyzerRegistry and StatisticsRegistry providers, we would get headroom for future changes while limiting breaking changes.

This would collapse both mechanisms into the natural partition_statistics recursion, eliminating the injection walks and the separate bottom-up traversal entirely while keeping the config flags to turn either framework on or off.

The separate walks are acceptable now as a trade-off for a purely additive and non-breaking change, but this would be the natural evolution of the proposal.

New commits in detail:

I had to rebase on #21483, this will make incremental reviewing harder for you, so let me walk you through the changes since your last review (commits 9-12 of 12):

  1. Inject ExpressionAnalyzerRegistry into exec nodes and re-inject via optimizer loop - the core change addressing your feedback, includes integration tests
  2. test(filter): verify ExpressionAnalyzer inclusion-exclusion selectivity for OR predicates - unit test for OR selectivity in FilterExec
  3. refactor(expression_analyzer): delegate when no NDV statistics are available - I realized we were trying to do too much when some columns statistics are unset, now we delegate to built-in
  4. test(expression_analyzer): add StatisticsTable and end-to-end SLT for OR selectivity - end-to-end SLT tests verifying the registry survives TopK sort, UNION ALL filter pushdown, and hash join filter pushdown (queries chosen as they generate nodes that would need the ExpressionAnalyzer, exactly the gap we had before)

Looking forward to your feedback!

Copy link
Copy Markdown
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asolimando, I spotted an issue with NDV handling that can affect selectivity estimates depending on operand order.

Comment thread datafusion/physical-expr/src/expression_analyzer/default.rs Outdated
github-merge-queue bot pushed a commit that referenced this pull request Apr 13, 2026
…propagation (#21483)

## Which issue does this PR close?

- Part of #21443 (Pluggable operator-level statistics propagation)
- Part of #8227 (statistics improvements epic)

## Rationale for this change

DataFusion's built-in statistics propagation has no extension point:
downstream projects cannot inject external catalog stats, override
built-in estimation, or plug in custom strategies without forking.

This PR introduces `StatisticsRegistry`, a pluggable
chain-of-responsibility for operator-level statistics following the same
pattern as `RelationPlanner` for SQL parsing and `ExpressionAnalyzer`
(#21120) for expression-level stats. See #21443 for full motivation and
design context.

## What changes are included in this PR?

1. Framework (`operator_statistics/mod.rs`): `StatisticsProvider` trait,
`StatisticsRegistry` (chain-of-responsibility), `ExtendedStatistics`
(Statistics + type-erased extension map), `DefaultStatisticsProvider`.
`PhysicalOptimizerContext` trait with `optimize_with_context` dispatch.
`SessionState` integration.

2. Built-in providers for Filter, Projection, Passthrough
(sort/repartition/etc), Aggregate, Join
(hash/sort-merge/nested-loop/cross), Limit, and Union. NDV utilities:
`num_distinct_vals`, `ndv_after_selectivity`.

3. `ClosureStatisticsProvider`: closure-based provider for test
injection and cardinality feedback.

4. JoinSelection integration: `use_statistics_registry` config flag
(default false), registry-aware `optimize_with_context`, SLT test
demonstrating plan difference on skewed data.

## Are these changes tested?

- 39 unit tests covering all providers, NDV utilities, chain priority,
and edge cases (Inexact precision, Absent propagation, Partial aggregate
delegation, GROUPING SETS delegation, join-type bounds, multi-key NDV,
exact Cartesian product, CrossJoin, GlobalLimit skip+fetch)
- 1 SLT test (`statistics_registry.slt`): three-table join on skewed
data (8:1:1 customer_id distribution) where the built-in NDV formula
estimates 33 rows (wrong; actual=66) and the registry conservatively
estimates 100, producing the correct build-side swap

## Are there any user-facing changes?

New public API (purely additive, non-breaking):
- `StatisticsProvider` trait and `StatisticsRegistry` in
`datafusion-physical-plan`
- `ExtendedStatistics`, `StatisticsResult` types; built-in provider
structs; `num_distinct_vals`, `ndv_after_selectivity` utilities
- `PhysicalOptimizerContext` trait and `ConfigOnlyContext` in
`datafusion-physical-optimizer`
- `SessionState::statistics_registry()`,
`SessionStateBuilder::with_statistics_registry()`
- Config: `datafusion.optimizer.use_statistics_registry` (default false)

Default behavior is unchanged. The registry is only consulted when the
flag is explicitly enabled.

Known limitations:
- Column-level stats (NDV, min/max) at Join/Aggregate/Union/Limit
boundaries are not improved: these operators call
`partition_statistics(None)` internally, re-fetching raw child stats and
discarding registry enrichment. 4 TODO comments mark the affected call
sites; #20184 would close this gap.
- No `ExpressionAnalyzer` integration yet (#21122).

---
Disclaimer: I used AI to assist in the code generation, I have manually
reviewed the output and it matches my intention and understanding.
coderfender pushed a commit to coderfender/datafusion that referenced this pull request Apr 14, 2026
…propagation (apache#21483)

## Which issue does this PR close?

- Part of apache#21443 (Pluggable operator-level statistics propagation)
- Part of apache#8227 (statistics improvements epic)

## Rationale for this change

DataFusion's built-in statistics propagation has no extension point:
downstream projects cannot inject external catalog stats, override
built-in estimation, or plug in custom strategies without forking.

This PR introduces `StatisticsRegistry`, a pluggable
chain-of-responsibility for operator-level statistics following the same
pattern as `RelationPlanner` for SQL parsing and `ExpressionAnalyzer`
(apache#21120) for expression-level stats. See apache#21443 for full motivation and
design context.

## What changes are included in this PR?

1. Framework (`operator_statistics/mod.rs`): `StatisticsProvider` trait,
`StatisticsRegistry` (chain-of-responsibility), `ExtendedStatistics`
(Statistics + type-erased extension map), `DefaultStatisticsProvider`.
`PhysicalOptimizerContext` trait with `optimize_with_context` dispatch.
`SessionState` integration.

2. Built-in providers for Filter, Projection, Passthrough
(sort/repartition/etc), Aggregate, Join
(hash/sort-merge/nested-loop/cross), Limit, and Union. NDV utilities:
`num_distinct_vals`, `ndv_after_selectivity`.

3. `ClosureStatisticsProvider`: closure-based provider for test
injection and cardinality feedback.

4. JoinSelection integration: `use_statistics_registry` config flag
(default false), registry-aware `optimize_with_context`, SLT test
demonstrating plan difference on skewed data.

## Are these changes tested?

- 39 unit tests covering all providers, NDV utilities, chain priority,
and edge cases (Inexact precision, Absent propagation, Partial aggregate
delegation, GROUPING SETS delegation, join-type bounds, multi-key NDV,
exact Cartesian product, CrossJoin, GlobalLimit skip+fetch)
- 1 SLT test (`statistics_registry.slt`): three-table join on skewed
data (8:1:1 customer_id distribution) where the built-in NDV formula
estimates 33 rows (wrong; actual=66) and the registry conservatively
estimates 100, producing the correct build-side swap

## Are there any user-facing changes?

New public API (purely additive, non-breaking):
- `StatisticsProvider` trait and `StatisticsRegistry` in
`datafusion-physical-plan`
- `ExtendedStatistics`, `StatisticsResult` types; built-in provider
structs; `num_distinct_vals`, `ndv_after_selectivity` utilities
- `PhysicalOptimizerContext` trait and `ConfigOnlyContext` in
`datafusion-physical-optimizer`
- `SessionState::statistics_registry()`,
`SessionStateBuilder::with_statistics_registry()`
- Config: `datafusion.optimizer.use_statistics_registry` (default false)

Default behavior is unchanged. The registry is only consulted when the
flag is explicitly enabled.

Known limitations:
- Column-level stats (NDV, min/max) at Join/Aggregate/Union/Limit
boundaries are not improved: these operators call
`partition_statistics(None)` internally, re-fetching raw child stats and
discarding registry enrichment. 4 TODO comments mark the affected call
sites; apache#20184 would close this gap.
- No `ExpressionAnalyzer` integration yet (apache#21122).

---
Disclaimer: I used AI to assist in the code generation, I have manually
reviewed the output and it matches my intention and understanding.
Introduce ExpressionAnalyzer, a chain-of-responsibility framework for
expression-level statistics estimation (NDV, selectivity, min/max).

Framework:
- ExpressionAnalyzer trait with registry parameter for chain delegation
- ExpressionAnalyzerRegistry to chain analyzers (first Computed wins)
- DefaultExpressionAnalyzer: Selinger-style estimation for columns,
  literals, binary expressions, NOT, boolean predicates

Integration:
- ExpressionAnalyzerRegistry stored in SessionState, initialized once
- ProjectionExprs stores optional registry (non-breaking, no signature
  changes to project_statistics)
- ProjectionExec sets registry via Projector, injected by planner
- FilterExec uses registry for selectivity when interval analysis
  cannot handle the predicate
- Custom nodes get builtin analyzer as fallback when registry is absent
- Regenerate configs.md for new enable_expression_analyzer option
- Add enable_expression_analyzer to information_schema.slt expected output
- Fix unresolved doc links to SessionState and DefaultExpressionAnalyzer
  (cross-crate references use backticks instead of doc links)
- Simplify config description
…putation

- Fix expression_analyzer_registry doc comment misplaced between
  function_factory's doc comment and field declaration
- Fix module doc example import path (physical_plan -> physical_expr)
- Extract expression_analyzer_registry() helper in planner to avoid
  repeating the config check 4 times
- Defer left_sel/right_sel computation to AND/OR arms only, avoiding
  unnecessary sub-expression selectivity estimation for comparison
  operators
@asolimando asolimando force-pushed the asolimando/ndv-expression-analyzer branch from 54a0df7 to ae2a0b8 Compare April 15, 2026 17:43
@github-actions github-actions bot added the optimizer Optimizer rules label Apr 15, 2026
@asolimando asolimando force-pushed the asolimando/ndv-expression-analyzer branch from eb44380 to 1068f39 Compare April 15, 2026 18:30
…ptimizer loop

Add trait methods on ExecutionPlan for expression-level statistics injection
(uses_expression_level_statistics, with_expression_analyzer_registry,
expression_analyzer_registry). The physical planner injects the registry
after plan creation and re-injects after each optimizer rule that modifies
the plan, gated by the use_expression_analyzer config flag.
…ty for OR predicates

OR predicates are inherently outside interval arithmetic (a union of two
disjoint intervals cannot be represented as a single interval). This test
confirms that ExpressionAnalyzerRegistry computes the correct
inclusion-exclusion selectivity (0.28 = 0.1 + 0.2 - 0.02) on a 1000-row
input, versus the default 20% (200 rows) without a registry.
…ailable

Return Delegate for all leaf predicates when NDV is unavailable, and
propagate Delegate upward through AND/OR/NOT when any child has no
estimate. DefaultExpressionAnalyzer now only produces a result when it
has a genuine information advantage (NDV from column statistics).
… OR selectivity

Add a reusable StatisticsTable (TableProvider + ExecutionPlan with user-supplied
statistics) to the sqllogictest harness, and use it in expression_analyzer.slt
@asolimando asolimando force-pushed the asolimando/ndv-expression-analyzer branch from c37838b to 42d0f8e Compare April 15, 2026 19:41
@asolimando asolimando requested a review from kosiew April 15, 2026 20:15
@asolimando
Copy link
Copy Markdown
Member Author

@kosiew the CI error seems unrelated, the same test suite passed for macos but not amd64, I can't reproduce and it doesn't seem it can be caused by changes in this PR, it feels like a flaky test but I'd like to get your opinion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation optimizer Optimizer rules physical-expr Changes to the physical-expr crates physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants