WIP Prototype DataPage extraction API #10843

alamb · 2024-06-09T21:28:03Z

Which issue does this PR close?

Part of #10806

Rationale for this change

@marvinlanhenke asked some excellent questions on #10806 (comment)

I needed to try out a few things myself so I figured rather than

What changes are included in this PR?

Sketch out a StatisticsExtractor API, and a somewhat janky implementation
Update the arrow_statistics tests to have an option to test data page statistics

Are these changes tested?

Yes

Are there any user-facing changes?

alamb · 2024-06-09T21:28:47Z

datafusion/core/src/datasource/physical_plan/parquet/statistics.rs

+    /// ```no_run
+    /// tood
+    /// ```
+    pub fn data_page_mins<I>(


This is one way the API could look

alamb · 2024-06-09T21:30:18Z

datafusion/core/src/datasource/physical_plan/parquet/statistics.rs

+            .map(|rg_index| &page_index[*rg_index][parquet_column_index]);
+
+        // Get an iterator of the native index type depending on data type
+        match data_type {


We need to sort out how to make this look better (perhaps by following the lead of the row group iterators, which define a special iterator for each underlying parquet statistics datatype, and then writing the relevant code for converting them all to arrow).

I tried this approach, which works fine. I'd simply refactor what's mentioned in the comment and extract data type specific iterators (like we already have for the row group statistics). What do you think @alamb, instead of using an array builder?

pub(crate) fn min_page_statistics<'a, I: Iterator<Item = Option<&'a Index>>>(
    data_type: Option<&DataType>,
    iterator: I,
) -> Result<ArrayRef> {
    // Extract this into data type specific iterator e.g. MinInt64PageStatisticsIterator
    let iter = iterator.flat_map(|opt_index| match data_type {
        Some(DataType::Int64) => match opt_index {
            Some(Index::INT64(native_index)) => native_index
                .indexes
                .iter()
                .map(|x| x.min)
                .collect::<Vec<_>>(),
            _ => vec![None],
        },
        // other data_types
        _ => todo!(),
    });

    Ok(Arc::new(Int64Array::from_iter(iter)))
}

I think an array builder is likely a better approach.
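For comparison, here is a rough std-only sketch of what the builder-style variant might look like for the Int64 case. Everything in it is a simplified, hypothetical stand-in: `PageStats`, `Index` and `Int64Builder` are not the real parquet or arrow types (the stand-in builder only borrows the `append_option` / `append_null` method names from arrow's primitive builders), and the result is a plain `Vec<Option<i64>>` rather than an `ArrayRef`.

```rust
/// Hypothetical stand-in for the statistics of a single data page.
#[derive(Debug, Clone, PartialEq)]
pub struct PageStats {
    pub min: Option<i64>,
}

/// Hypothetical stand-in for the column index of one row group.
#[derive(Debug, Clone, PartialEq)]
pub enum Index {
    /// No usable page statistics for this column chunk.
    None,
    /// One entry per data page.
    Int64(Vec<PageStats>),
}

/// Hypothetical stand-in for an arrow-style builder: collects nullable values in order.
#[derive(Default)]
pub struct Int64Builder {
    values: Vec<Option<i64>>,
}

impl Int64Builder {
    pub fn append_option(&mut self, v: Option<i64>) {
        self.values.push(v);
    }

    pub fn append_null(&mut self) {
        self.values.push(None);
    }

    pub fn finish(self) -> Vec<Option<i64>> {
        self.values
    }
}

/// Appends one value per data page, or a single null when a row group
/// has no matching index (same fallback as `vec![None]` above).
pub fn min_page_statistics<'a, I>(iterator: I) -> Vec<Option<i64>>
where
    I: Iterator<Item = Option<&'a Index>>,
{
    let mut builder = Int64Builder::default();
    for opt_index in iterator {
        match opt_index {
            Some(Index::Int64(pages)) => {
                for page in pages {
                    builder.append_option(page.min);
                }
            }
            _ => builder.append_null(),
        }
    }
    builder.finish()
}
```

The difference from the `flat_map` version is that no intermediate `Vec` is allocated per row group; values go straight into the builder.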

alamb · 2024-06-09T21:30:44Z

datafusion/core/tests/parquet/arrow_statistics.rs

    }
 }

-/// Defines a test case for statistics extraction


This is kind of a weird test change, but I was trying to set up a pattern where we didn't have to change all the tests at once

alamb · 2024-06-09T21:31:07Z

datafusion/core/tests/parquet/arrow_statistics.rs

@@ -1724,7 +1808,7 @@ async fn test_boolean() {
    .build()
    .await;

-    Test {
+    TestBoth {


By changing a test to TestBoth it will test both row group indexes and data page indexes
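A minimal sketch of how such a wrapper could work, under stated assumptions: the `Level` enum, the `extract` closure, and the single `expected_min` field are all hypothetical simplifications (the real `Test` struct in arrow_statistics.rs reads a parquet file and checks more than mins), and it assumes the expected values are the same at both levels.

```rust
/// Which kind of statistics a check runs against (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Level {
    RowGroup,
    DataPage,
}

/// Simplified stand-in for the existing per-level test case.
pub struct Test {
    pub expected_min: Vec<Option<i64>>,
}

impl Test {
    /// Extracts the mins at `level` and compares them to the expectation.
    pub fn run_level<F>(&self, level: Level, extract: &F)
    where
        F: Fn(Level) -> Vec<Option<i64>>,
    {
        assert_eq!(extract(level), self.expected_min, "min mismatch at {:?}", level);
    }
}

/// Same fields as `Test`, but runs the check at both levels.
pub struct TestBoth {
    pub expected_min: Vec<Option<i64>>,
}

impl TestBoth {
    /// Returns the levels that were checked, in order.
    pub fn run<F>(self, extract: F) -> Vec<Level>
    where
        F: Fn(Level) -> Vec<Option<i64>>,
    {
        let test = Test { expected_min: self.expected_min };
        let levels = vec![Level::RowGroup, Level::DataPage];
        for level in &levels {
            test.run_level(*level, &extract);
        }
        levels
    }
}
```

With this shape, an existing test opts in by renaming `Test` to `TestBoth`, and untouched tests keep checking row groups only.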

marvinlanhenke · 2024-06-10T04:25:05Z

datafusion/core/src/datasource/physical_plan/parquet/statistics.rs

+    pub fn data_page_mins<I>(
+        &self,
+        page_index: &ParquetColumnIndex,
+        row_group_indexes: I,


I guess one reason we want to pass in the row_group_indexes is the iteration over the row_group_indexes from the access_plan here.

We cannot assume we need all indices, since access_plan filters row groups based on should_scan().
Is this correct? If it is, then this was the missing piece in my prototype.

Yes I think that is correct

I am still not super thrilled with this interface (mostly because it is different from row_group_mins etc., which take an iterator directly).

However, I couldn't figure out how to make the types work out for making an iterator over Vec ...

I had this interface in some version of my prototype, where it would take an iterator over all row_groups directly. However, this cannot be easily integrated with the existing code and the access_plan.row_group_indices. Perhaps, once the StatisticsConverter is fully used in page_filter.rs we can change the interface?
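To illustrate the point about the access plan, here is a small std-only sketch of why the index-based interface is convenient. The types are hypothetical stand-ins: `PageIndex` is modeled as nested `Vec`s indexed as `[row_group][column]` (mirroring the `page_index[*rg_index][parquet_column_index]` lookup in the diff), each column index is reduced to its per-page mins, and `AccessPlan` is a toy version with only a `should_scan` flag per row group.

```rust
/// Hypothetical: per-page minimum values for one column chunk.
pub type ColumnIndex = Vec<Option<i64>>;

/// Hypothetical: column indexes addressed as `page_index[row_group][column]`.
pub type PageIndex = Vec<Vec<ColumnIndex>>;

/// Toy access plan: one scan flag per row group.
pub struct AccessPlan {
    pub scan: Vec<bool>,
}

impl AccessPlan {
    pub fn should_scan(&self, row_group: usize) -> bool {
        self.scan[row_group]
    }

    /// Indexes of the row groups that survive filtering.
    pub fn row_group_indexes(&self) -> Vec<usize> {
        (0..self.scan.len())
            .filter(|&i| self.should_scan(i))
            .collect()
    }
}

/// Returns the per-page mins of `column`, only for the selected row groups.
pub fn data_page_mins<'a, I>(
    page_index: &PageIndex,
    column: usize,
    row_group_indexes: I,
) -> Vec<Option<i64>>
where
    I: IntoIterator<Item = &'a usize>,
{
    row_group_indexes
        .into_iter()
        // Look up each selected row group, then flatten its pages.
        .flat_map(move |rg| page_index[*rg][column].iter().copied())
        .collect()
}
```

The caller hands over whatever subset of row group indexes the access plan kept, so skipped row groups never contribute pages to the output. An interface that takes an iterator over all row groups' indexes would need the caller to do this filtering first.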

alamb · 2024-06-11T01:21:22Z

I think this is now incorporated into #10852

alamb added 2 commits June 9, 2024 17:04

Sketch out API for datapage statistics + test

f0a6790

jenky implementation

4cd0540

github-actions bot added the core Core DataFusion crate label Jun 9, 2024

alamb mentioned this pull request Jun 9, 2024

Efficiently and correctly Extract Page Index statistics into ArrayRefs #10806

Closed

marvinlanhenke reviewed Jun 10, 2024

alamb closed this Jun 11, 2024
