-
Notifications
You must be signed in to change notification settings - Fork 928
Introduce ReadPlan
to encapsulate the calculation of what parquet rows to decode
#7502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is a great improvement, thank you @alamb !
|
ReadPlan
to encapsulate the calculation of what rows to decodeReadPlan
to encapsulate the calculation of what parquet rows to decode
Thanks @zhuqi-lucas -- I would like to see if I can make a POC that shows I can use this structure to make the performance of the reader faster, which I hope to do over the next day or two |
Update: I started on a POC of using this structure in this PR (the idea is to store filtered results on the ReadPlan) |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
1 similar comment
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
1 similar comment
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
1 similar comment
🤖 |
@@ -679,38 +680,32 @@ impl<T: ChunkReader + 'static> ParquetRecordBatchReaderBuilder<T> { | |||
}; | |||
|
|||
let mut filter = self.filter; | |||
let mut selection = self.selection; | |||
let mut plan_builder = ReadPlanBuilder::new(batch_size).with_selection(self.selection); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The high level idea is to avoid manipulating the RowSelection
directly in the APIs and instead manipulate them via ReadPlanBuilder
/ ReadPlan
The reason to encapsulate them in a new struct is to make it feasible to add additional complexity such as cached filter results and better filter row representations
This PR simply moves code around -- it does not intended to change any of the actual logic yet
@@ -814,9 +807,10 @@ impl ParquetRecordBatchReader { | |||
/// simplify error handling with `?` | |||
fn next_inner(&mut self) -> Result<Option<RecordBatch>> { | |||
let mut read_records = 0; | |||
match self.selection.as_mut() { | |||
let batch_size = self.batch_size(); | |||
match self.read_plan.selection_mut() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is in future PRs we will change this control loop so it can use BooleanArray
when that is a more efficient representation. Specifically, I think we can follow the model @zhuqi-lucas did in #7454
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb , the POC for the adaptive predicate pushdown based this PR is good:
|
||
/// Returns `true` if `selection` is `None` or selects some rows | ||
pub(crate) fn selects_any(selection: Option<&RowSelection>) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code is moved into ReadPlanBuilder
I think this PR is ready for review. I think @zhuqi-lucas has reviewed it though may want to do so again Maybe @XiangpengHao , @Dandandan or @etseidl could be interested in reviewing as well |
🤖: Benchmark completed Details
|
🤖 |
🤖: Benchmark completed Details
|
🤖 |
🤖 |
🤖: Benchmark completed Details
|
🤖 |
🤖: Benchmark completed Details
|
🤖 |
🤖: Benchmark completed Details
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good as far as I can tell. Can't wait for the sequel 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM Thank you @alamb !
Co-authored-by: Ed Seidl <[email protected]>
Thank you for the reviews @zhuqi-lucas and @etseidl I'll plan to merge this tomorrow unless anyone else would like time to review |
onwards! |
Which issue does this PR close?
Rationale for this change
We are trying to improve the performance of filter pushdown in the Parquet reader
This is currently challening for at least a few reasons:
RowSelection
,RowFilter
, limit, offset).This makes the complexity of trying to implement more sophisticated strategies for iteration very hard (see @zhuqi-lucas 's comment here for example #7454 (comment))
The high level plan to improve filter performance is a combination of
RowFilter
(aka Adaptive Parquet Predicate Pushdown Evaluation #5523)This I would like a place to put all "what rows to read" information so we can implement more sophisticated strategies without getting too hung up.
Here is an example of how this code is used when caching results:
What changes are included in this PR?
RowSelection
manipulation code intoReadPlan
and `ReadPlanBuilderIn addition to preparing for faster filter pushdown, I also happen to think this makes the code easier to understand so it would be a useful change in its own right
Are there any user-facing changes?
No. This is entirely an internal code reorganization