Skip to content

WIP Prototype DataPage extraction API #10843

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jun 9, 2024

Which issue does this PR close?

Part of #10806

Rationale for this change

@marvinlanhenke asked some excellent questions on #10806 (comment)

I needed to try out a few things myself so I figured rather than

What changes are included in this PR?

  1. Sketch out a StatisticsExtractor API, and a somewhat jenky implementation
  2. Update the arrow_statistics tests to have an option to test data page statistics

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jun 9, 2024
/// ```no_run
/// tood
/// ```
pub fn data_page_mins<I>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one way the API could look like

.map(|rg_index| &page_index[*rg_index][parquet_column_index]);

// Get an iterator of the native index type depending on data type
match data_type {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to sort out how to make this look better (perhaps by following the lead of the row group iterators that make special iterators for each underlying parquet statistics datatype and then write the relevant code for converting them all to arrow

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this approach, which works fine. I'd simply refactor whats mentioned in the comment - and extract data type specific iterators (like we already have for the row groups statistics). What do you think @alamb, instead of using array builder?

pub(crate) fn min_page_statistics<'a, I: Iterator<Item = Option<&'a Index>>>(
    data_type: Option<&DataType>,
    iterator: I,
) -> Result<ArrayRef> {
    // Extract this into data type specific iterator e.g. MinInt64PageStatisticsIterator
    let iter = iterator.flat_map(|opt_index| match data_type {
        Some(DataType::Int64) => match opt_index {
            Some(Index::INT64(native_index)) => native_index
                .indexes
                .iter()
                .map(|x| x.min)
                .collect::<Vec<_>>(),
            _ => vec![None],
        },
        // other data_types
        _ => todo!(),
    });

    Ok(Arc::new(Int64Array::from_iter(iter)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an array builder is likely a better approach.

}
}

/// Defines a test case for statistics extraction
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of a wierd test change, but I was trying to setup a pattern where we didn't have to change all the tests at once

@@ -1724,7 +1808,7 @@ async fn test_boolean() {
.build()
.await;

Test {
TestBoth {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By changing a test to TestBoth it will test both row group indexes and data page indexes

pub fn data_page_mins<I>(
&self,
page_index: &ParquetColumnIndex,
row_group_indexes: I,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll guess one reason why we want to pass in the row_group_indexes is due to the iteration over the row_group_indexes from the access_plan here.

We cannot assume we need all indices since access_plan does filter based on should_scan() or not.
Is this correct? If it is, then this was the missing piece in my prototype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll guess one reason why we want to pass in the row_group_indexes is due to the iteration over the row_group_indexes from the access_plan here.

We cannot assume we need all indices since access_plan does filter based on should_scan() or not. Is this correct? If it is, then this was the missing piece in my prototype.

Yes I think that is correct

I am still not super thrilled with this interface (mostly because it is different than row_group_mins etc that takes an interator directly.

However, I couldn't figure out how to make the types workout for making an iterator over Vec ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this interface in some version of my prototype, where it would take an iterator over all row_groups directly. However, this cannot be easily integrated with the existing code and the access_plan.row_group_indices. Perhaps, once the StatisticsConverter is fully used in page_filter.rs we can change the interface?

@alamb alamb closed this Jun 11, 2024
@alamb
Copy link
Contributor Author

alamb commented Jun 11, 2024

I think this is now inorporated into #10852

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants