[EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456
Comments
I just spoke with @XiangpengHao -- from my perspective the current status is:
Fascinatingly, ClickHouse recently published a blog post about their parquet pushdown work: https://clickhouse.com/blog/clickhouse-and-parquet-a-foundation-for-fast-lakehouse-analytics. Possibly even more interesting is that they link to a master's thesis from Peter Boncz's group about how to quickly evaluate predicates during Parquet decoding: https://homepages.cwi.nl/~boncz/msc/2018-BoudewijnBraams.pdf. This thesis directly addresses some of the work we are considering (though they only consider selection masks (bitmasks) and selection vectors (selected indices)).
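The two predicate-result representations from the thesis can be illustrated with a small self-contained sketch (toy types for illustration only; arrow-rs itself uses `BooleanArray` / `RowSelection`): a selection mask stores one boolean per row, while a selection vector stores only the indices of selected rows, so the better choice depends on selectivity.

```rust
// Toy illustration of the two selection representations discussed in the
// thesis: a bitmask (one bool per row) vs. a selection vector (row indices).
// These are NOT arrow-rs types.

/// Convert a selection mask to a selection vector of row indices.
fn mask_to_indices(mask: &[bool]) -> Vec<usize> {
    mask.iter()
        .enumerate()
        .filter_map(|(i, &keep)| keep.then_some(i))
        .collect()
}

/// Convert a selection vector back to a mask over `len` rows.
fn indices_to_mask(indices: &[usize], len: usize) -> Vec<bool> {
    let mut mask = vec![false; len];
    for &i in indices {
        mask[i] = true;
    }
    mask
}

fn main() {
    let mask = vec![true, false, false, true, true];
    let indices = mask_to_indices(&mask);
    assert_eq!(indices, vec![0, 3, 4]);
    assert_eq!(indices_to_mask(&indices, 5), mask);
    // A mask costs O(num_rows) regardless of selectivity; a selection
    // vector costs O(num_selected), so it wins for selective filters.
    println!("{indices:?}");
}
```

This cost asymmetry is why the thesis (and this issue) treats the representation choice as selectivity-dependent rather than fixed.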
@zhuqi-lucas has a great insight in #7454 -- namely that instead of a two pass algorithm (evaluate the predicates, then decode the selected rows in a second pass), the work can be combined. The current flow goes something like:
The current PR starts heading down a slightly modified flow, where the RowSelection and RowFilters are not combined. I think a combined solution would look something like:
Decoding proceeds like:
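A combined decode-and-filter pass might be sketched roughly as follows (hypothetical toy types standing in for parquet's `RowSelection` and `RowFilter`; this is not the actual arrow-rs implementation): the decoder walks the skip/select runs and evaluates the predicate while decoding, so no fully-decoded intermediate batch is materialized.

```rust
// Toy model of a combined decode-and-filter pass. `Run` stands in for
// parquet's RowSelection (alternating skip/select runs); the predicate is
// evaluated while decoding instead of in a separate second pass.

#[derive(Clone, Copy)]
enum Run {
    Select(usize),
    Skip(usize),
}

/// Decode `col` according to `selection`, keeping only rows where `pred`
/// holds. One pass: skipped rows are never decoded, and filtered-out rows
/// are never copied into an intermediate batch.
fn decode_filtered(col: &[i64], selection: &[Run], pred: impl Fn(i64) -> bool) -> Vec<i64> {
    let mut out = Vec::new();
    let mut pos = 0;
    for run in selection {
        match *run {
            Run::Skip(n) => pos += n,
            Run::Select(n) => {
                for &v in &col[pos..pos + n] {
                    if pred(v) {
                        out.push(v);
                    }
                }
                pos += n;
            }
        }
    }
    out
}

fn main() {
    let col = vec![1, 5, 2, 8, 3, 9];
    // Skip 2 rows, select the remaining 4, keep values > 2.
    let sel = [Run::Skip(2), Run::Select(4)];
    let result = decode_filtered(&col, &sel, |v| v > 2);
    assert_eq!(result, vec![8, 3, 9]);
    println!("{result:?}");
}
```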
@zhuqi-lucas and I have been working on various strategies / structures to make the filtering faster. I believe we now have enough evidence to proceed with a more sophisticated implementation. Specifically:
Thus my next steps will be:
Perhaps then @zhuqi-lucas can help port the hybrid Filter/RowSelection to the ReadPlan.
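One way to picture a "hybrid" Filter/RowSelection is an enum holding either a bitmask or select ranges, choosing whichever is cheaper for the observed selection shape. The types and the switching heuristic below are hypothetical illustrations, not the actual `ReadPlan` implementation:

```rust
// Hypothetical sketch of a hybrid selection: long contiguous selections are
// kept as ranges, fragmented ones as a bitmask. Not an arrow-rs API.

enum HybridSelection {
    /// One bool per row; good when selected rows are scattered.
    Mask(Vec<bool>),
    /// Half-open selected ranges; good when selections are long runs.
    Ranges(Vec<std::ops::Range<usize>>),
}

impl HybridSelection {
    /// Build from a mask, switching representation when runs are long.
    fn from_mask(mask: Vec<bool>) -> Self {
        let mut ranges = Vec::new();
        let mut start = None;
        for (i, &keep) in mask.iter().enumerate() {
            match (keep, start) {
                (true, None) => start = Some(i),
                (false, Some(s)) => {
                    ranges.push(s..i);
                    start = None;
                }
                _ => {}
            }
        }
        if let Some(s) = start {
            ranges.push(s..mask.len());
        }
        // Made-up heuristic: few long runs => ranges beat a bitmask.
        let selected: usize = ranges.iter().map(|r| r.len()).sum();
        if ranges.len() * 8 < selected {
            HybridSelection::Ranges(ranges)
        } else {
            HybridSelection::Mask(mask)
        }
    }

    fn count(&self) -> usize {
        match self {
            HybridSelection::Mask(m) => m.iter().filter(|&&b| b).count(),
            HybridSelection::Ranges(rs) => rs.iter().map(|r| r.len()).sum(),
        }
    }
}

fn main() {
    // One run of 100 selected rows => stored as ranges.
    let mut mask = vec![false; 10];
    mask.extend(vec![true; 100]);
    let sel = HybridSelection::from_mask(mask);
    assert!(matches!(&sel, HybridSelection::Ranges(_)));
    assert_eq!(sel.count(), 100);
}
```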
Thank you @alamb, I will help port the hybrid Filter/RowSelection to the ReadPlan to get better performance without changing any public interfaces, and also help with testing and review.
Thank you so much @zhuqi-lucas -- it is great working with you.
Thank you @alamb, I tried the draft POC PR for the adaptive predicate pushdown based read plan today, and it shows good results:
Update: I am quite happy with how reusing filtered results is coming along. I also had some thoughts on reducing the buffering required here: #6692 (comment)
Status update: the high level plan to improve performance has two parts:
My main concern about reusing the result of filtering is memory usage, and I think it is important to keep that usage to a minimum. We also need to implement more sophisticated logic when there are multiple predicates. My next steps are:
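One common way to bound work and memory with multiple predicates is to evaluate them sequentially, so each later predicate only sees rows that already passed the earlier ones. The sketch below is an illustrative toy (hypothetical names, not the arrow-rs `ReadPlan` logic):

```rust
// Sketch of sequential multi-predicate evaluation: each predicate is only
// evaluated on rows that survived the previous ones, so filtered-out rows
// are never re-evaluated or re-buffered. Hypothetical code, not arrow-rs.

fn apply_predicates(rows: Vec<i64>, preds: &[Box<dyn Fn(i64) -> bool>]) -> Vec<i64> {
    preds.iter().fold(rows, |remaining, pred| {
        // Only `remaining` rows are touched at each step; this is the
        // benefit of combining selections between predicates.
        remaining.into_iter().filter(|&v| pred(v)).collect()
    })
}

fn main() {
    let preds: Vec<Box<dyn Fn(i64) -> bool>> =
        vec![Box::new(|v| v > 10), Box::new(|v| v % 2 == 0)];
    let out = apply_predicates(vec![4, 12, 15, 20, 9], &preds);
    assert_eq!(out, vec![12, 20]);
}
```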
Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When evaluating filters on data stored in parquet, you can:
1. Use the `with_row_filter` API to apply predicates during the scan
2. Decode the columns and apply the `filter` kernel afterwards

Currently, it is faster to use `with_row_filter` for some predicates and `filter` for others. In DataFusion we have a configuration setting to choose between the strategies (`filter_pushdown`, see apache/datafusion#3463), but that is a bad UX, as it means the user must somehow know which strategy to choose, and the best strategy changes.
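Why the best strategy flips can be seen with a back-of-the-envelope cost model. The numbers and the fragmentation-overhead factor below are made up purely for illustration; real costs depend on encodings, column widths, and selection shape:

```rust
// Illustrative (made-up) cost model for why neither strategy always wins.
// decode: cost to decode one row of the payload columns.
// pred: cost to evaluate the predicate on one row.
// frag_overhead: penalty for decoding scattered rows instead of long runs.

fn pushdown_cost(rows: f64, sel: f64, decode: f64, pred: f64, frag_overhead: f64) -> f64 {
    // Evaluate the predicate on every row, decode only the selected ones,
    // but pay a per-row penalty when the selection is fragmented.
    rows * pred + rows * sel * decode * frag_overhead
}

fn decode_then_filter_cost(rows: f64, decode: f64, pred: f64) -> f64 {
    // Decode every row, then evaluate the predicate and filter.
    rows * decode + rows * pred
}

fn main() {
    let (rows, decode, pred) = (1_000_000.0_f64, 1.0, 0.2);
    // Selective filter (1% of rows pass): pushdown decodes far less and wins.
    assert!(pushdown_cost(rows, 0.01, decode, pred, 4.0) < decode_then_filter_cost(rows, decode, pred));
    // Non-selective, fragmented filter (90% pass): pushdown loses, which is
    // the regime this issue is about.
    assert!(pushdown_cost(rows, 0.9, decode, pred, 4.0) > decode_then_filter_cost(rows, decode, pred));
}
```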
In general the queries that are slower when `with_row_filter` is used:

More Background:

The predicates are provided as a `RowFilter` (see docs for more details)

Describe the solution you'd like
I would like the evaluation of predicates in `RowFilter` (aka pushed down predicates) to never be worse than decoding the columns first and then filtering them with the `filter` kernel. We have added a benchmark (#7401), which hopefully can help measure progress.
Describe alternatives you've considered

This goal will likely require several changes to the codebase. Here are some options:
- `skip` from `RowSelector` #7450
- `RowSelection::and_then` is slow #7458
- `Arc<dyn Array>` in parquet record batch reader #4864