-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Minor: Extract parquet row group pruning code into its own module #4160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @Ted-Jiang @tustvold and @thinkharderdev
(this is a end of day refactor PR that took time but not brain power)
@@ -17,43 +17,36 @@ | |||
|
|||
//! Execution plan for reading Parquet files | |||
|
|||
use arrow::datatypes::SchemaRef; | |||
use fmt::Debug; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this file is still pretty massive, but after this PR it is down to under 2000 lines so that is progress I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! before you split. I open this file in Clion it always lagging 😂. Thanks for your work @alamb 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also hope that some day I can split more of this code (file_format) into its own crate -- but one thing at a time !
@@ -435,7 +423,7 @@ impl FileOpener for ParquetOpener { | |||
// Row group pruning: attempt to skip entire row_groups | |||
// using metadata on the row groups | |||
let file_metadata = builder.metadata(); | |||
let row_groups = prune_row_groups( | |||
let row_groups = row_groups::prune_row_groups( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole point of the PR is to move this function (and its tests) into its own module
predicate: Option<PruningPredicate>, | ||
metrics: &ParquetFileMetrics, | ||
) -> Vec<usize> { | ||
// TODO: Columnar pruning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what here Columnar pruning
means 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it means the page index pruning you have implemented -- I 'll remove this comment in a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this is really necessary ❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Benchmark runs are scheduled for baseline = 9c24a79 and contender = 4d621c5. 4d621c5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
re #3463
Rationale for this change
This makes the row group pruning code match the rest of the pushdown rules in DataFusion (filtering and page indexing)
What changes are included in this PR?
Move a bunch of code out of the parquet.rs module into
row_groups.rs
Are these changes tested?
Covered by existing tests
Are there any user-facing changes?
No it just moves code around. No functional change is intended