Skip to content

Add arrow_reader_clickbench benchmark #7470

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 13, 2025

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 5, 2025

Which issue does this PR close?

Rationale for this change

We are trying to improve the performance of row filter application in the Parquet arrow reader and part of that is a benchmark that we can use to guide optimization efforts.

Mostly I want to be able to approve improvements to the reader that will not regress other queries.

However, as discussed in #7428 the arrow_reader_row_filter microbenchmark doesn't currently reflect the actual performance we see in our end to end application (DataFusion).

cargo bench --all-features --bench arrow_reader_row_filter

Thus, we think we need to create a benchmark that uses the actual ClickBench dataset with appropriate filtering

What changes are included in this PR?

  1. Adds a new arrow_reader_clickbench benchmark

This benchmark tests applying the actual clickbench filters (and column materialization) to

  1. hits_1.parquet (one of the data files in ClickBench)
  2. async and sync readers
  3. All ClickBench query and materialization patterns

If we find additional discrepancies in performance we can increase the benchmark further.

Are there any user-facing changes?

New benchmark, and hopefully thus improved filter / projection performance, no actual code hanges

TODO

  • Change String types to use Utf8View
  • Add sync/async reader
  • Add hits_partitioned / hits
  • Complete other predicate types

@alamb
Copy link
Contributor Author

alamb commented May 6, 2025

This benchmark is now looking pretty nice -- it tests just the parquet reading and has all the query predicate patterns. Tomorrow I need to finish adding all the other query patterns and give it a final polish.

@alamb alamb force-pushed the alamb/clickbench_filter_benchmark branch 2 times, most recently from 23f726a to 7ca86a7 Compare May 8, 2025 16:44
@alamb
Copy link
Contributor Author

alamb commented May 8, 2025

I have all the query patterns now, but some of the queries fail because the predicates refer to the wrong columns. I will work on fixing that.

@alamb alamb force-pushed the alamb/clickbench_filter_benchmark branch from 6ad44dd to 0446630 Compare May 10, 2025 11:40
@alamb alamb marked this pull request as ready for review May 11, 2025 11:01
@alamb
Copy link
Contributor Author

alamb commented May 11, 2025

I ran this several times on @zhuqi-lucas 's PRs and all in all it is looking quite good and I think will be a useful addition

Copy link
Contributor

@zhuqi-lucas zhuqi-lucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @alamb and the result is also reasonable for
#7454 (comment)

Because, the result for me here is compared the Unified select PR with the main branch(And no parquet filter pushdown).

#7454 (comment)

So when we improve most of the regression for filter push down compared to no pushdown, it may also cause some regression to the original default push down, we can improve it further.

And the sync is no change because we still don't implement the sync version for the improvement PR. The improvement is for async.

///
/// [ClickBench queries]: https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/queries.sql
/// [Apache DataFusion]: https://datafusion.apache.org/
struct Query {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good! Thank you @alamb

@alamb
Copy link
Contributor Author

alamb commented May 11, 2025

@zhuqi-lucas or @Dandandan or @etseidl , could I possibly trouble one of you for a review of this PR?

I think it is ready for a review

/// Return a map from `column_names` in `filter_columns` to the index in the schema
fn column_indices(schema: &SchemaDescriptor, column_names: &Vec<&str>) -> Vec<usize> {
let fields = schema.root_schema().get_fields();
let mut indicices = vec![];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut indicices = vec![];
let mut indices = vec![];

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you -- fixed in b9782b6

@alamb
Copy link
Contributor Author

alamb commented May 12, 2025

Thank you @Dandandan 🙏

I'll plan to merge this tomorrow unless anyone else would like more time to review. I am also happy to make changes in subsequent PRs as well

@alamb
Copy link
Contributor Author

alamb commented May 13, 2025

let's gogogogogo

@alamb alamb merged commit 1f15130 into apache:main May 13, 2025
17 checks passed
@etseidl
Copy link
Contributor

etseidl commented May 13, 2025

Sorry @alamb, I intended to review, but between vacation and pestilence contracted while on vacation 😷 I haven't had the bandwidth 😢. Hopefully I'll be back on line this week 🤞

@alamb
Copy link
Contributor Author

alamb commented May 13, 2025

Sorry @alamb, I intended to review, but between vacation and pestilence contracted while on vacation 😷 I haven't had the bandwidth 😢. Hopefully I'll be back on line this week 🤞

No worries and hope you feel better soon!

@alamb alamb deleted the alamb/clickbench_filter_benchmark branch May 13, 2025 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

arrow_reader_row_filter benchmark doesn't capture page cache improvements
4 participants