Add `arrow_reader_clickbench` benchmark #7470

alamb · 2025-05-05T17:14:44Z

Which issue does this PR close?

Closes arrow_reader_row_filter benchmark doesn't capture page cache improvements #7460
Part of [EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456

Rationale for this change

We are trying to improve the performance of row filter application in the Parquet arrow reader and part of that is a benchmark that we can use to guide optimization efforts.

Mostly I want to be able to approve improvements to the reader that will not regress other queries.

However, as discussed in #7428, the arrow_reader_row_filter microbenchmark doesn't currently reflect the actual performance we see in our end-to-end application (DataFusion).

cargo bench --all-features --bench arrow_reader_row_filter

Thus, we think we need to create a benchmark that uses the actual ClickBench dataset with appropriate filtering.

See arrow_reader_row_filter benchmark doesn't capture page cache improvements #7460 for more details.

What changes are included in this PR?

Adds a new arrow_reader_clickbench benchmark

This benchmark tests applying the actual ClickBench filters (and column materialization) to:

hits_1.parquet (one of the data files in ClickBench)
async and sync readers
All ClickBench query and materialization patterns

If we find additional discrepancies in performance we can extend the benchmark further.
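To make "filters (and column materialization)" concrete, here is a minimal sketch of the two-phase pattern the benchmark exercises: evaluate a predicate on the filter column first, then decode only the selected rows of the projected column. It uses plain slices rather than the parquet or arrow types, and the function names `evaluate_filter` and `materialize` are hypothetical, chosen for illustration only; the real benchmark goes through the parquet arrow reader.

```rust
/// Phase 1: evaluate the predicate on the filter column only,
/// producing one boolean per row (the row selection).
/// `evaluate_filter` is a hypothetical helper, not a parquet API.
fn evaluate_filter(filter_col: &[i64], predicate: impl Fn(i64) -> bool) -> Vec<bool> {
    filter_col.iter().map(|v| predicate(*v)).collect()
}

/// Phase 2: materialize only the selected rows of a projected column.
/// `materialize` is likewise a hypothetical helper.
fn materialize(projected_col: &[i64], selection: &[bool]) -> Vec<i64> {
    projected_col
        .iter()
        .zip(selection)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}
```

With a predicate shaped like `AdvEngineID <> 0`, `evaluate_filter(&adv_engine_id, |v| v != 0)` yields the selection, and `materialize(&resolution_width, &selection)` then returns only the matching rows. The less selective the filter, the more rows the second phase has to decode, which is the case the EPIC in #7456 is concerned with.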

Are there any user-facing changes?

New benchmark, and hopefully thus improved filter / projection performance; no actual code changes.

TODO

Change String types to use Utf8View
Add sync/async reader
Add hits_partitioned / hits
Complete other predicate types

alamb · 2025-05-06T20:11:41Z

This benchmark is now looking pretty nice -- it tests just the parquet reading and has all the query predicate patterns. Tomorrow I need to finish adding all the other query patterns and give it a final polish.

alamb · 2025-05-08T16:45:29Z

I have all the query patterns now, but some of the queries fail because the predicates refer to the wrong columns. I will work on fixing that.

.gitignore

alamb · 2025-05-11T11:02:23Z

I ran this several times on @zhuqi-lucas's PRs; overall it is looking quite good, and I think it will be a useful addition.

Poc for adaptive parquet predicate pushdown(bitmap/range) with page cache(3 data pages) #7454 (comment)

zhuqi-lucas

LGTM, thank you @alamb. The result is also reasonable for
#7454 (comment)

That is because my result there compares the unified select PR against the main branch (with no Parquet filter pushdown):

#7454 (comment)

So while we fix most of the regressions of filter pushdown relative to no pushdown, we may also introduce some regressions relative to the original default pushdown; we can improve that further.

The sync results are unchanged because the improvement PR does not yet implement a sync version. The improvement is for async only.

zhuqi-lucas · 2025-05-11T11:38:15Z

parquet/benches/arrow_reader_clickbench.rs

+///
+/// [ClickBench queries]: https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/queries.sql
+/// [Apache DataFusion]: https://datafusion.apache.org/
+struct Query {


Very good! Thank you @alamb

alamb · 2025-05-11T11:55:52Z

@zhuqi-lucas or @Dandandan or @etseidl , could I possibly trouble one of you for a review of this PR?

I think it is ready for a review

Dandandan · 2025-05-12T12:03:46Z

parquet/benches/arrow_reader_clickbench.rs

+/// Return a map from `column_names` in `filter_columns` to the index in the schema
+fn column_indices(schema: &SchemaDescriptor, column_names: &Vec<&str>) -> Vec<usize> {
+    let fields = schema.root_schema().get_fields();
+    let mut indicices = vec![];


Suggested change

-    let mut indicices = vec![];
+    let mut indices = vec![];

Thank you -- fixed in b9782b6
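For readers following this thread, here is a minimal sketch of the name-to-index lookup that the doc comment on `column_indices` describes. It assumes the schema's field names are available as plain string slices instead of going through `SchemaDescriptor`, and it assumes that a missing column should panic; neither detail is confirmed by the excerpt above. It also takes `&[&str]` rather than `&Vec<&str>`, which is the form clippy generally prefers for borrowed arguments.

```rust
/// For each name in `column_names`, return its position in `field_names`.
/// Sketch only: `field_names` stands in for the fields of the real schema,
/// and panicking on a missing column is an assumption.
fn column_indices(field_names: &[&str], column_names: &[&str]) -> Vec<usize> {
    column_names
        .iter()
        .map(|name| {
            field_names
                .iter()
                .position(|field| field == name)
                .unwrap_or_else(|| panic!("column {name} not found in schema"))
        })
        .collect()
}
```

The indices come back in the order of `column_names`, not in schema order, so a caller can pair each requested column with its index directly.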


alamb · 2025-05-12T17:46:29Z

Thank you @Dandandan 🙏

I'll plan to merge this tomorrow unless anyone else would like more time to review. I am also happy to make changes in subsequent PRs as well

alamb · 2025-05-13T15:13:10Z

let's gogogogogo

etseidl · 2025-05-13T15:29:06Z

Sorry @alamb, I intended to review, but between vacation and pestilence contracted while on vacation 😷 I haven't had the bandwidth 😢. Hopefully I'll be back on line this week 🤞

alamb · 2025-05-13T15:37:07Z

> Sorry @alamb, I intended to review, but between vacation and pestilence contracted while on vacation 😷 I haven't had the bandwidth 😢. Hopefully I'll be back on line this week 🤞

No worries and hope you feel better soon!

github-actions bot added the parquet Changes to the parquet crate label May 5, 2025

This was referenced May 5, 2025

arrow_reader_row_filter benchmark doesn't capture page cache improvements #7460

Closed

Update arrow_reader_row_filter benchmark to reflect ClickBench distribution #7461

Merged

alamb force-pushed the alamb/clickbench_filter_benchmark branch from fef38a7 to 85fff8d May 6, 2025 20:10

alamb mentioned this pull request May 7, 2025

Improve documentation and add examples for ArrowPredicateFn #7480

Merged

alamb force-pushed the alamb/clickbench_filter_benchmark branch 2 times, most recently from 23f726a to 7ca86a7 May 8, 2025 16:44

alamb mentioned this pull request May 8, 2025

Parquet decoder / decoded page Cache #7363

Open

5 tasks

zhuqi-lucas reviewed May 9, 2025


alamb force-pushed the alamb/clickbench_filter_benchmark branch 3 times, most recently from 55aa92e to b9cac68 May 9, 2025 19:37

alamb mentioned this pull request May 9, 2025

Poc for adaptive parquet predicate pushdown(bitmap/range) with page cache(3 data pages) #7454

Open

Add arrow_reader_clickbench

0446630

alamb force-pushed the alamb/clickbench_filter_benchmark branch from 6ad44dd to 0446630 May 10, 2025 11:40

This was referenced May 10, 2025

Weekly Plan: Andrew Lamb 2025-05-05 apache/datafusion#15943

Closed

Weekly Plan: Andrew Lamb 2025-05-12 apache/datafusion#16022

Closed

alamb marked this pull request as ready for review May 11, 2025 11:01

alamb requested a review from zhuqi-lucas May 11, 2025 11:02

alamb mentioned this pull request May 11, 2025

Improve performance of reading int8/int16 Parquet data #7055

Merged

alamb added 2 commits May 11, 2025 07:15

update comments, fix Q1 bug

7b45489

Polish comments

1816df2

zhuqi-lucas approved these changes May 11, 2025


Dandandan reviewed May 12, 2025


alamb added 2 commits May 12, 2025 12:25

fix typo

b9782b6

Merge remote-tracking branch 'apache/main' into alamb/clickbench_filt…

8e065e9

…er_benchmark

Dandandan approved these changes May 12, 2025


alamb merged commit 1f15130 into apache:main May 13, 2025
17 checks passed

alamb deleted the alamb/clickbench_filter_benchmark branch May 13, 2025 15:37
