Resolve MIN/MAX from Parquet metadata for Single-mode aggregates and CAST projections#21651
Resolve MIN/MAX from Parquet metadata for Single-mode aggregates and CAST projections#21651Dandandan wants to merge 4 commits intoapache:mainfrom
Conversation
…CAST projections
Two changes enable the AggregateStatistics optimizer to resolve
MIN/MAX (and COUNT) from file metadata instead of scanning data:
1. `take_optimizable` now handles `Single`/`SinglePartitioned`
aggregate modes, not just `Final` wrapping `Partial`. This matters
for single-partition scans where the planner skips the
Partial→Final split.
2. `project_statistics` now propagates min/max statistics through
CAST expressions (recursively), so that projections like
`CAST(CAST(col AS Int32) AS Date32)` preserve the underlying
column statistics instead of returning unknown.
Together these turn ClickBench Q6
(`SELECT MIN("EventDate"), MAX("EventDate") FROM hits`) from a full
column scan into a metadata-only lookup (PlaceholderRowExec).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
run benchmark clickbench_partitioned |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing aggregate-stats-single-mode-and-cast (38736c2) to 5c653be (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing aggregate-stats-single-mode-and-cast (aa1db03) to 5c653be (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing aggregate-stats-single-mode-and-cast (aa1db03) to 5c653be (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing aggregate-stats-single-mode-and-cast (aa1db03) to 5c653be (merge-base) diff using: tpch File an issue against this benchmark runner |
…T-specific code Replace the CAST-specific statistics propagation with a generic approach using PhysicalExpr::evaluate_bounds(). This works for any expression that implements evaluate_bounds — including CAST, negation, and arithmetic with literals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aa1db03 to
077c595
Compare
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
The PhysicalExpr trait no longer exposes as_any() after upstream changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
Related to improving ClickBench performance (metadata-only query resolution)
Rationale for this change
ClickBench Q6 (
SELECT MIN("EventDate"), MAX("EventDate") FROM hits) was doing a full column scan despite the answer being available in Parquet row group statistics. Two issues prevented theAggregateStatisticsoptimizer from firing:take_optimizablemissedSinglemode — it only matched theFinal → Partialpair, not single-partition scans.project_statisticsreturnedunknownfor any non-Column/Literal expression, discarding Parquet min/max through casts likeCAST(CAST(EventDate AS Int32) AS Date32).This now avoids scanning any columns, going from ~6ms to ~1.5ms
What changes are included in this PR?
aggregate_statistics.rs:take_optimizablenow also matchesSingle/SinglePartitionedaggregates.projection.rs: Addedproject_column_statistics_through_expr()which propagates min/max statistics throughCastExpr.Result: Q6 now resolves entirely from Parquet metadata (zero I/O).
Are these changes tested?
Yes — existing tests pass, ClickBench sqllogictest updated with new expected plan for Q6.
Are there any user-facing changes?
Scalar
MIN/MAXaggregates over CAST projections now resolve from file metadata when statistics are available, avoiding unnecessary I/O.🤖 Generated with Claude Code