POC: Sketch out cached filter result API #7513
Conversation
```rust
    filters: Vec<BooleanArray>,
}

impl CachedPredicateResultBuilder {
```
Nice, this is very clear to get the cached result!
```rust
    /// TODO: potentially incrementally build the result of the predicate
    /// evaluation without holding all the batches in memory. See
    /// <https://github.com/apache/arrow-rs/issues/6692>
    in_progress_arrays: Vec<Box<dyn InProgressArray>>,
```
Thank you @alamb,

Does this mean that `in_progress_arrays` is not the final result from which we generate the final batch?

For example:
- Predicate `a > 1` => `in_progress_array_a` is filtered by `a > 1`
- Predicate `b > 2` => `in_progress_array_b` is filtered by `b > 2` and also by `a > 1`, but we don't update `in_progress_array_a`
This is an excellent question.

What I was thinking is that `CachedPredicateResult` would manage the "currently" applied predicate. So in the case where there are multiple predicates, I was thinking of a `CachedPredicateResult::merge` method which could take the result of filtering by `a` and apply the result of filtering by `b`.

We can then put the heuristics / logic for if/when we materialize the filters into `CachedPredicateResult`.

But that is sort of speculation at this point -- I don't have it all worked out yet. My plan is to get far enough to show this structure works and can improve performance, and then I'll work on the trickier logic of applying multiple filters.
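The merge idea can be sketched with plain boolean masks (a minimal model; `merge_masks` is a hypothetical name and the real `CachedPredicateResult::merge` API is not designed yet). The key point is that the mask for `b` is evaluated only on the rows that already passed `a`, so merging has to expand it back onto the original row positions:

```rust
/// Combine a mask over all rows (`mask_a`) with a mask that was evaluated
/// only on the rows that passed `mask_a` (`mask_b_on_a`), producing a
/// combined mask over all original rows.
fn merge_masks(mask_a: &[bool], mask_b_on_a: &[bool]) -> Vec<bool> {
    // one entry in `mask_b_on_a` per row that survived `mask_a`
    assert_eq!(mask_a.iter().filter(|&&k| k).count(), mask_b_on_a.len());
    let mut b_iter = mask_b_on_a.iter();
    mask_a
        .iter()
        // `&&` short-circuits, so `b_iter` is only advanced for surviving rows
        .map(|&passed_a| passed_a && *b_iter.next().unwrap())
        .collect()
}

fn main() {
    // 5 rows; `a > 1` keeps rows 1, 3, 4
    let mask_a = vec![false, true, false, true, true];
    // `b > 2` evaluated on those 3 surviving rows keeps the first and last
    let mask_b = vec![true, false, true];
    assert_eq!(
        merge_masks(&mask_a, &mask_b),
        vec![false, true, false, false, true]
    );
}
```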
> `CachedPredicateResult::merge` method which could take the result of filtering `a` and apply the result of filtering by `b`

Great idea!

> But that is sort of speculation at this point -- I don't have it all worked out yet

Sure, I will continue to review. Thank you @alamb!
I tested this branch using a query that filters and selects the same column (NOTE it is critical to NOT use …):

`cargo bench --features="arrow async" --bench arrow_reader_clickbench -- Q24`

Here are the benchmark results: 30ms --> 22ms (25% faster).

I realize this branch currently uses more memory (to buffer the filter results), but I think the additional memory growth can be limited with a setting.
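A setting like that could look like this hypothetical sketch (the `CachedFilterBuffer` name, the `memory_limit` field, and the byte-counting scheme are all assumptions for illustration, not code from this PR):

```rust
/// Buffers filter results up to a configurable memory limit; once the limit
/// would be exceeded, callers fall back to not caching (e.g. re-decoding).
struct CachedFilterBuffer {
    memory_limit: usize,
    buffered_bytes: usize,
    batches: Vec<Vec<u8>>, // stand-in for buffered record batches
}

impl CachedFilterBuffer {
    fn new(memory_limit: usize) -> Self {
        Self { memory_limit, buffered_bytes: 0, batches: Vec::new() }
    }

    /// Try to cache a batch; returns false if the limit would be exceeded.
    fn try_buffer(&mut self, batch: Vec<u8>) -> bool {
        if self.buffered_bytes + batch.len() > self.memory_limit {
            return false;
        }
        self.buffered_bytes += batch.len();
        self.batches.push(batch);
        true
    }
}

fn main() {
    let mut buf = CachedFilterBuffer::new(10);
    assert!(buf.try_buffer(vec![0; 6])); // fits under the 10-byte limit
    assert!(!buf.try_buffer(vec![0; 6])); // 6 + 6 would exceed the limit
    assert_eq!(buf.buffered_bytes, 6);
}
```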
Amazing result! I think this will be the perfect approach instead of a page cache, because page caching can have cache misses, but this PR will always cache the result!
Thanks -- I think one potential problem is that the cached results may consume too much memory (but I will try to handle that shortly). I think we should proceed with starting to merge some refactorings; I left some suggestions here:
It makes sense! Thank you @alamb.
🤖: Benchmark completed
It seems there is a regression for Q36/Q37.
Yes, I agree -- I will figure out why.
f1f7103
to
a0e4b29
Compare
I did some profiling:

`samply record target/release/deps/arrow_reader_clickbench-aef15514767c9665 --bench arrow_reader_clickbench/sync/Q36`

Basically, the issue is that calling … I added some printlns and it seems like we have 181k rows in total that pass, but the number of buffers is crazy (I think this is related to concat not compacting the ByteViewArray). Working on this...
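To illustrate the buffer blowup: in a view-array model, each element is a view into a shared buffer, and a naive concat carries along every source buffer whether or not its bytes are still referenced. The sketch below is a simplified model with hypothetical names (`ViewArray`, `concat`, `compact`), not the actual arrow-rs `ByteViewArray` implementation:

```rust
use std::rc::Rc;

struct ViewArray {
    views: Vec<(usize, usize, usize)>, // (buffer index, offset, len)
    buffers: Vec<Rc<Vec<u8>>>,         // shared data buffers
}

/// Naive concat: collects all views and ALL source buffers, so the buffer
/// count grows with every input array even if most bytes are unused.
fn concat(arrays: &[ViewArray]) -> ViewArray {
    let mut views = Vec::new();
    let mut buffers: Vec<Rc<Vec<u8>>> = Vec::new();
    for a in arrays {
        let base = buffers.len();
        buffers.extend(a.buffers.iter().cloned()); // keeps every source buffer alive
        views.extend(a.views.iter().map(|&(b, o, l)| (base + b, o, l)));
    }
    ViewArray { views, buffers }
}

/// "Compaction": copy only the referenced bytes into one fresh buffer.
fn compact(a: &ViewArray) -> ViewArray {
    let mut data = Vec::new();
    let mut views = Vec::new();
    for &(b, o, l) in &a.views {
        let start = data.len();
        data.extend_from_slice(&a.buffers[b][o..o + l]);
        views.push((0, start, l));
    }
    ViewArray { views, buffers: vec![Rc::new(data)] }
}

fn main() {
    let a = ViewArray {
        views: vec![(0, 0, 2)],
        buffers: vec![Rc::new(b"hi there".to_vec())], // view uses only "hi"
    };
    let b = ViewArray {
        views: vec![(0, 0, 3)],
        buffers: vec![Rc::new(b"bye now".to_vec())], // view uses only "bye"
    };
    let merged = concat(&[a, b]);
    assert_eq!(merged.buffers.len(), 2); // every source buffer is retained

    let compacted = compact(&merged);
    assert_eq!(compacted.buffers.len(), 1); // only referenced bytes survive
    assert_eq!(&compacted.buffers[0][..], b"hibye");
}
```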
🤖: Benchmark completed
Well, that is looking quite a bit better. I am now working on a way to reduce the buffering requirements (it will require incremental concat'ing).
Amazing result @alamb, it looks pretty cool!
Ok, I reworked a bunch of the code in this PR so it is now structured to use an `IncrementalRecordBatchBuilder`. I will continue working on this tomorrow; now I need to go do other things and reviews, etc.
Rework to be in terms of IncrementalRecordBatchBuilder
Draft until:
Which issue does this PR close?

- Related to `ReadPlan` to encapsulate the calculation of what parquet rows to decode #7502

Rationale for this change

I am trying to sketch out enough of a cached filter result API to show performance improvements. Once I have done that, I will start proposing how to break it up into smaller PRs.
What changes are included in this PR?
Are there any user-facing changes?