-
Notifications
You must be signed in to change notification settings - Fork 932
Add arrow_reader_clickbench
benchmark
#7470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fef38a7
to
85fff8d
Compare
This benchmark is now looking pretty nice -- it tests just the parquet reading and has all the query predicate patterns. Tomorrow I need to finish adding all the other query patterns and give it a final polish. |
23f726a
to
7ca86a7
Compare
I have all the query patterns now, but some of the queries fail because the predicates refer to the wrong columns. I will work on fixing that. |
55aa92e
to
b9cac68
Compare
6ad44dd
to
0446630
Compare
I ran this several times on @zhuqi-lucas 's PRs and all in all it is looking quite good and I think will be a useful addition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you @alamb and the result is also reasonable for
#7454 (comment)
Because, the result for me here is compared the Unified select PR with the main branch(And no parquet filter pushdown).
So when we improve most of the regression for filter push down compared to no pushdown, it may also cause some regression to the original default push down, we can improve it further.
And the sync is no change because we still don't implement the sync version for the improvement PR. The improvement is for async.
/// | ||
/// [ClickBench queries]: https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/queries.sql | ||
/// [Apache DataFusion]: https://datafusion.apache.org/ | ||
struct Query { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good! Thank you @alamb
@zhuqi-lucas or @Dandandan or @etseidl , could I possibly trouble one of you for a review of this PR? I think it is ready for a review |
/// Return a map from `column_names` in `filter_columns` to the index in the schema | ||
fn column_indices(schema: &SchemaDescriptor, column_names: &Vec<&str>) -> Vec<usize> { | ||
let fields = schema.root_schema().get_fields(); | ||
let mut indicices = vec![]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let mut indicices = vec![]; | |
let mut indices = vec![]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you -- fixed in b9782b6
Thank you @Dandandan 🙏 I'll plan to merge this tomorrow unless anyone else would like more time to review. I am also happy to make changes in subsequent PRs as well |
let's gogogogogo |
Sorry @alamb, I intended to review, but between vacation and pestilence contracted while on vacation 😷 I haven't had the bandwidth 😢. Hopefully I'll be back on line this week 🤞 |
No worries and hope you feel better soon! |
Which issue does this PR close?
Rationale for this change
We are trying to improve the performance of row filter application in the Parquet arrow reader and part of that is a benchmark that we can use to guide optimization efforts.
Mostly I want to be able to approve improvements to the reader that will not regress other queries.
However, as discussed in #7428 the
arrow_reader_row_filter
microbenchmark doesn't currently reflect the actual performance we see in our end to end application (DataFusion).Thus, we think we need to create a benchmark that uses the actual ClickBench dataset with appropriate filtering
What changes are included in this PR?
arrow_reader_clickbench
benchmarkThis benchmark tests applying the actual clickbench filters (and column materialization) to
hits_1.parquet
(one of the data files in ClickBench)If we find additional discrepancies in performance we can increase the benchmark further.
Are there any user-facing changes?
New benchmark, and hopefully thus improved filter / projection performance, no actual code hanges
TODO