-
Notifications
You must be signed in to change notification settings - Fork 1.5k
WIP Prototype DataPage extraction API #10843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// ```no_run | ||
/// tood | ||
/// ``` | ||
pub fn data_page_mins<I>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one way the API could look like
.map(|rg_index| &page_index[*rg_index][parquet_column_index]); | ||
|
||
// Get an iterator of the native index type depending on data type | ||
match data_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to sort out how to make this look better (perhaps by following the lead of the row group iterators that make special iterators for each underlying parquet statistics datatype and then write the relevant code for converting them all to arrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this approach, which works fine. I'd simply refactor whats mentioned in the comment - and extract data type specific iterators (like we already have for the row groups statistics). What do you think @alamb, instead of using array builder?
pub(crate) fn min_page_statistics<'a, I: Iterator<Item = Option<&'a Index>>>(
data_type: Option<&DataType>,
iterator: I,
) -> Result<ArrayRef> {
// Extract this into data type specific iterator e.g. MinInt64PageStatisticsIterator
let iter = iterator.flat_map(|opt_index| match data_type {
Some(DataType::Int64) => match opt_index {
Some(Index::INT64(native_index)) => native_index
.indexes
.iter()
.map(|x| x.min)
.collect::<Vec<_>>(),
_ => vec![None],
},
// other data_types
_ => todo!(),
});
Ok(Arc::new(Int64Array::from_iter(iter)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an array builder is likely a better approach.
} | ||
} | ||
|
||
/// Defines a test case for statistics extraction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of a wierd test change, but I was trying to setup a pattern where we didn't have to change all the tests at once
@@ -1724,7 +1808,7 @@ async fn test_boolean() { | |||
.build() | |||
.await; | |||
|
|||
Test { | |||
TestBoth { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By changing a test to TestBoth
it will test both row group indexes and data page indexes
pub fn data_page_mins<I>( | ||
&self, | ||
page_index: &ParquetColumnIndex, | ||
row_group_indexes: I, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll guess one reason why we want to pass in the row_group_indexes
is due to the iteration over the row_group_indexes from the access_plan here.
We cannot assume we need all indices since access_plan does filter based on should_scan()
or not.
Is this correct? If it is, then this was the missing piece in my prototype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll guess one reason why we want to pass in the
row_group_indexes
is due to the iteration over the row_group_indexes from the access_plan here.We cannot assume we need all indices since access_plan does filter based on
should_scan()
or not. Is this correct? If it is, then this was the missing piece in my prototype.
Yes I think that is correct
I am still not super thrilled with this interface (mostly because it is different than row_group_mins
etc that takes an interator directly.
However, I couldn't figure out how to make the types workout for making an iterator over Vec ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had this interface in some version of my prototype, where it would take an iterator over all row_groups directly. However, this cannot be easily integrated with the existing code and the access_plan.row_group_indices. Perhaps, once the StatisticsConverter is fully used in page_filter.rs
we can change the interface?
I think this is now inorporated into #10852 |
Which issue does this PR close?
Part of #10806
Rationale for this change
@marvinlanhenke asked some excellent questions on #10806 (comment)
I needed to try out a few things myself so I figured rather than
What changes are included in this PR?
StatisticsExtractor
API, and a somewhat jenky implementationarrow_statistics
tests to have an option to test data page statisticsAre these changes tested?
Yes
Are there any user-facing changes?