Poc for adaptive parquet predicate pushdown(bitmap/range) with page cache(3 data pages) #7454

zhuqi-lucas · 2025-04-29T04:39:22Z

Which issue does this PR close?

Related to Adaptive Parquet Predicate Pushdown Evaluation #5523
and Parquet decoder / decoded page Cache #7363

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

zhuqi-lucas · 2025-05-11T15:57:18Z

Only one CI test is still failing, and I still can't find the root cause:

#[tokio::test]
#[cfg(feature = "snap")]
async fn test_plaintext_footer_read_without_decryption() {
    crate::encryption_agnostic::read_plaintext_footer_file_without_decryption_properties_async()
        .await;
}

alamb · 2025-05-12T10:02:24Z

Looks sweet -- thank you @zhuqi-lucas -- I am going to start reviewing this PR to get a feel for its code and be ready to merge / get it ready

zhuqi-lucas · 2025-05-12T10:11:47Z

Thank you @alamb, I found the root cause of the failure above and am working on getting everything green.

alamb · 2025-05-12T10:40:00Z

You are the best @zhuqi-lucas -- I am reviewing this PR now

alamb

First of all, thank you so much @zhuqi-lucas

I really like your code in FilteredParquetRecordBatchReader -- the idea of combining the application of the RowFilter and the decoding of the projection into a single reader is, I think, a key insight and maybe points the way towards not decoding twice

After reviewing this code, it seems to me that a lot of work is done to use the same RowSelection structure for both

Skipping large contiguous chunks of rows (e.g row groups and entire pages)
Applying a RowFilter for filtering individual rows

I think RowSelection is well designed for the former, but quite bad for the latter (applying RowFilter)

As this PR starts down the path of separating the two concerns, I wonder if you have thought about pushing it even further? Something like keeping the results of the RowFilter only as BooleanArrays and then progressively decoding the remaining projections?

alamb · 2025-05-12T12:47:01Z

parquet/src/arrow/async_reader/arrow_reader.rs

+            let array = reader.consume_batch()?;
+
+            let filtered_array =
+                filter(&array, bitmap).map_err(|e| ParquetError::General(e.to_string()))?;


This is a very clever idea (to keep the filter and apply it to the next decoded batch)

alamb · 2025-05-12T12:47:50Z

parquet/src/arrow/async_reader/arrow_reader.rs

+        self.row_filter.take()
+    }
+
+    fn create_bitmap_from_ranges(&mut self, runs: &[RowSelector]) -> BooleanArray {


This code path is unfortunate -- converting from Bitmap --> RowSelection. I have some ideas about how this could be better if we avoided this particular code path.

Thank you @alamb, very good point. I was also thinking that if we can return the bitmap directly from the predicate filter, we will get better performance. I will try to make this improvement as well.

alamb · 2025-05-12T12:56:28Z

parquet/src/arrow/async_reader/arrow_reader.rs

+
+pub struct FilteredParquetRecordBatchReader {
+    batch_size: usize,
+    array_reader: Box<dyn ArrayReader>,


This is a really nice idea -- to have both array_readers and predicate_readers in the same structure. Pushed further, it could be used to avoid the second decode entirely without modifying RowSelection at all.

alamb · 2025-05-12T13:08:53Z

parquet/src/arrow/arrow_reader/selection.rs

+/// [`RowSelection`] is an enum that can be either a list of [`RowSelector`]s
+/// or a [`BooleanArray`] bitmap
+#[derive(Debug, Clone, PartialEq)]
+pub enum RowSelection {


Given the two different representations have different uses

Skip contiguous ranges (basically skip entire data pages)

filter out individual rows

I think we may be able to actually keep these as two separate structs rather than combining them into a single struct.

Good suggestion. It also felt strange to me here; we had better use two structs.
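To make the two-struct idea concrete, here is a minimal sketch using only the standard library. The names `Run`, `RangeSelection`, and `RowMask` are hypothetical stand-ins, not types from the parquet crate, and plain slices stand in for decoded column data. It only illustrates the division of labour: ranges decide what gets decoded, and a per-row mask filters what was decoded.

```rust
/// Hypothetical stand-in for a run of rows that is either skipped or selected
/// (modelled on the `RowSelector` discussed in this thread).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Coarse selection: contiguous runs, suited to skipping whole pages.
#[derive(Debug, Clone, PartialEq)]
struct RangeSelection {
    runs: Vec<Run>,
}

/// Fine selection: one flag per decoded row, suited to row-level filtering.
#[derive(Debug, Clone, PartialEq)]
struct RowMask {
    keep: Vec<bool>,
}

impl RangeSelection {
    /// "Decode" only the selected runs of `column`, skipping the rest.
    fn decode<T: Clone>(&self, column: &[T]) -> Vec<T> {
        let mut out = Vec::new();
        let mut offset = 0;
        for run in &self.runs {
            // Clamp so a selection longer than the column cannot panic.
            let end = (offset + run.row_count).min(column.len());
            if !run.skip {
                out.extend_from_slice(&column[offset..end]);
            }
            offset = end;
        }
        out
    }
}

impl RowMask {
    /// Keep only the rows whose flag is set; `rows` must match the mask length.
    fn apply<T: Clone>(&self, rows: &[T]) -> Vec<T> {
        assert_eq!(rows.len(), self.keep.len(), "mask length must match row count");
        rows.iter()
            .zip(&self.keep)
            .filter(|(_, keep)| **keep)
            .map(|(row, _)| row.clone())
            .collect()
    }
}
```

With a ten-row column, a `RangeSelection` that skips the first four rows hands six rows to the decoder, and a six-entry `RowMask` then filters those individually. Neither type needs to know about the other.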

zhuqi-lucas · 2025-05-12T13:51:18Z

Thank you @alamb for the review. I agree we can go further with this PR, and I will try to do it.

The key improvement that reduces the regression: when the average selection length is < 10, we fall back to reading all the rows and then filtering them, which is faster because it vectorizes better.

if total < 10 * select_count {
    // Bitmap branch
    let bitmap = self.create_bitmap_from_ranges(&runs);
    match self.array_reader.read_records(bitmap.len()) {
        Ok(_) => {}
        Err(e) => return Some(Err(e.into())),
    };
    mask_builder.append_buffer(bitmap.values());
    rows_accum += bitmap.true_count();
}

I agree create_bitmap_from_ranges has some overhead. If we can return the bitmap directly from the predicate filter, we will get better performance. I will try to make this improvement as well.
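For readers following along, the fallback heuristic and the range-to-bitmap conversion can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's implementation: `Run` is a hypothetical stand-in for `RowSelector`, a `Vec<bool>` stands in for the `BooleanArray`, and `total` and `select_count` are assumed to mean the total number of rows covered by the runs and the number of selecting runs, respectively.

```rust
/// Hypothetical stand-in for `RowSelector`: a run of skipped or selected rows.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Expand runs into one flag per row (true = selected).
fn bitmap_from_runs(runs: &[Run]) -> Vec<bool> {
    let total: usize = runs.iter().map(|r| r.row_count).sum();
    let mut bitmap = Vec::with_capacity(total);
    for run in runs {
        bitmap.extend(std::iter::repeat(!run.skip).take(run.row_count));
    }
    bitmap
}

/// Assumed reading of `total < 10 * select_count`: prefer the bitmap path when
/// the runs are short on average, i.e. fewer than 10 rows per selecting run.
fn prefer_bitmap(runs: &[Run]) -> bool {
    let total: usize = runs.iter().map(|r| r.row_count).sum();
    let select_count = runs.iter().filter(|r| !r.skip).count();
    total < 10 * select_count
}
```

Under these assumptions, many short alternating runs take the bitmap path (decode everything, then filter), while one long skip followed by one long select stays on the range path, where skipping is cheap.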

alamb · 2025-05-12T13:57:42Z

I wrote up an idea here: #7456 (comment) -- I think the design sketched out there (and inspired by this PR) would always be as good as or better than the current DataFusion approach of decode + filter, and thus we could turn it on by default

I wonder what you think? I would love to help / collaborate with you

alamb · 2025-05-12T13:58:55Z

I suggest we create a new PR for the next POC (unified filter + decoder). I am happy to do so but if you make one it might be easier to collaborate

zhuqi-lucas · 2025-05-12T14:02:23Z

Great! Thanks @alamb, I'd like to try this!

alamb · 2025-05-12T14:02:33Z

Let's do it!

alamb · 2025-05-12T14:08:44Z

As an aside, another thing that keeps coming up in these designs is this primitive:

Optimize take/filter/concat from multiple input arrays to a single large output array #6692

It will potentially show up here again too: when we need to build up a final array after applying filters at the moment the code will be forced to do filter followed by concat
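The difference between "filter followed by concat" and a fused primitive can be sketched with plain vectors. This is only an illustration of the idea behind #6692, not arrow-rs code: all function names are hypothetical, and slices of `T` and `bool` stand in for arrays and masks.

```rust
/// Append the rows of `values` whose mask flag is set to `out`.
fn filter_into<T: Copy>(out: &mut Vec<T>, values: &[T], mask: &[bool]) {
    assert_eq!(values.len(), mask.len(), "mask length must match row count");
    out.extend(
        values
            .iter()
            .zip(mask)
            .filter(|(_, keep)| **keep)
            .map(|(v, _)| *v),
    );
}

/// Two-step approach: one intermediate buffer per batch, then a concat copy.
fn filter_then_concat<T: Copy>(batches: &[(&[T], &[bool])]) -> Vec<T> {
    let filtered: Vec<Vec<T>> = batches
        .iter()
        .map(|&(values, mask)| {
            let mut out = Vec::new();
            filter_into(&mut out, values, mask);
            out
        })
        .collect();
    filtered.concat()
}

/// Fused approach: size the output once and write every batch straight into it.
fn filter_fused<T: Copy>(batches: &[(&[T], &[bool])]) -> Vec<T> {
    let capacity: usize = batches
        .iter()
        .map(|(_, mask)| mask.iter().filter(|keep| **keep).count())
        .sum();
    let mut out = Vec::with_capacity(capacity);
    for &(values, mask) in batches {
        filter_into(&mut out, values, mask);
    }
    out
}
```

Both functions return the same rows; the fused version avoids the per-batch intermediate buffers and the second copy.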

zhuqi-lucas · 2025-05-12T14:16:52Z

Maybe we can also do this at the low level in arrow, so that DataFusion can benefit from it?

alamb · 2025-05-12T18:52:56Z

100%

I am evaluating if I should try and work on this or not. I already feel spread quite thin

However, if you are going to take the lead on parquet predicate evaluation it might be a good option for me to work on while waiting to review your PRs 🤔

zhuqi-lucas · 2025-05-13T09:45:11Z

Hi @alamb, I tried today to put together a new PR for #7456 (comment), but it is hard for me to wrap up the new approach; it's more complex than I expected, so feel free to take it. I can also help review, optimize, and test. Thanks!

I was trying something like this, but it's hard for me to integrate the whole workflow and all the corner cases:

use arrow_arith::boolean::and;
use arrow_array::cast::AsArray;
use arrow_array::{Array, BooleanArray};
use arrow_buffer::BooleanBuffer;
use arrow_schema::ArrowError;
use arrow_select::concat::concat;
use parquet::arrow::arrow_reader::{ArrowPredicate, ParquetRecordBatchReader};

/// Reads batches from `filter_reader` until at least `batch_size` rows pass
/// every predicate, then returns one mask covering all rows that were read.
fn evaluate_predicate_batch(
    batch_size: usize,
    mut filter_reader: ParquetRecordBatchReader,
    mut predicates: Vec<Box<dyn ArrowPredicate>>,
) -> Result<BooleanArray, ArrowError> {
    // One mask per decoded batch
    let mut passing: Vec<BooleanArray> = Vec::new();
    let mut total_selected = 0;
    while total_selected < batch_size {
        match filter_reader.next() {
            Some(Ok(batch)) => {
                // Apply predicates sequentially and combine with AND
                let mut combined_mask: Option<BooleanArray> = None;

                for predicate in predicates.iter_mut() {
                    let mask = predicate.evaluate(batch.clone())?;
                    if mask.len() != batch.num_rows() {
                        return Err(ArrowError::ComputeError(format!(
                            "Predicate returned {} rows, expected {}",
                            mask.len(),
                            batch.num_rows()
                        )));
                    }
                    combined_mask = match combined_mask {
                        Some(prev) => Some(and(&prev, &mask)?),
                        None => Some(mask),
                    };
                }

                // With no predicates, every row in the batch passes
                let mask = combined_mask.unwrap_or_else(|| {
                    BooleanArray::new(BooleanBuffer::new_set(batch.num_rows()), None)
                });
                total_selected += mask.true_count();
                passing.push(mask);
            }
            Some(Err(e)) => return Err(e),
            None => break,
        }
    }

    // `concat` takes a slice of `&dyn Array` and returns an error on empty input
    let arrays: Vec<&dyn Array> = passing.iter().map(|m| m as &dyn Array).collect();
    let combined = concat(&arrays)?;
    Ok(combined.as_boolean().clone())
}

alamb · 2025-05-13T14:25:08Z

Thanks @zhuqi-lucas -- I have some ideas I will try out later today

alamb · 2025-05-13T19:55:47Z

I have been studying the code today -- I have a good idea on what I want to try

zhuqi-lucas · 2025-05-14T09:17:57Z

Thank you @alamb, I submitted a very rough draft PR today:

#7503

Still needs polish:

Right now I only output the projected columns, excluding the filter columns; we need to cache the filter column results so that they are also included in the final output.
We need to emit a vector of BooleanArrays; right now I just emit a Vec<RowSelection>.
We also need to support adaptively choosing between BooleanArray and RowSelection for the final emit.
More corner cases and testing, etc.

alamb · 2025-05-16T12:34:32Z

parquet/src/arrow/arrow_reader/mod.rs

            }
+            Some(RowSelection::BitMap(bitmap)) => {


BTW I hope to reuse some/all of this code (so it can iterate based on BitMap or RowSelection)

My idea is to switch this code to use a different structure than RowSelection (something like ResolvedRowSelection)

This control flow I think is very similar to what @tustvold describes in #5523

The remaining open question in my mind is what heuristics to use to decide when to use RowSelection/ranges and when to use BitMaps.

Thank you @alamb. I think the first adaptive case is this: if each select/skip run is very short and dense, for example < 10 rows, the testing results suggest we should use a bitmap. I can do more testing based on your read plan PR with the cache merged.

if total < 10 * select_count {
    // Bitmap branch
    let bitmap = self.create_bitmap_from_ranges(&runs);
    match self.array_reader.read_records(bitmap.len()) {
        Ok(_) => {}
        Err(e) => return Some(Err(e.into())),
    };
    mask_builder.append_buffer(bitmap.values());
    rows_accum += bitmap.true_count();
}

XiangpengHao and others added 30 commits January 7, 2025 10:17

update

cc6dd14

update

5837fc7

update

fec6313

update

948db87

poc reader

8c50d90

update

f5422ce

avoid recreating new buffers

dfdc1b6

update

3c526f8

bug fix

53f5fad

selective cache

56980de

clean up changes

4dd1b6b

clean up more and format

f8f983e

cleanup and add docs

882aaf1

switch to mutex instead of rwlock

c8bdbcf

revert irrelevant changes

cdb1d85

submodule

69720e5

update

a9550ab

rebase

be1435f

Merge remote-tracking branch 'upstream/main' into better-decoder

e4d9eb7

remove unrelated changes

21e015b

Merge remote-tracking branch 'upstream/main' into better-decoder

bbc3595

fix clippy

547fb46

make various ci improvements

05c8c8f

Merge remote-tracking branch 'apache/main' into better-decoder

314fda1

whitespace

c895dd2

Reduce some ugliness, avoid unwrap

3cf0a98

more factory

7b72f9d

lint

5bdf51a

Merge remote-tracking branch 'apache/main' into better-decoder

a77e1e7

Isolate reader cache more

90a55d5

Fix 3 data page testing

ceceb8e

zhuqi-lucas changed the title ~~Draft Poc for adaptive parquet predicate pushdown(bitmap/range) with page cache(3 data pages)~~ Poc for adaptive parquet predicate pushdown(bitmap/range) with page cache(3 data pages) May 11, 2025

zhuqi-lucas marked this pull request as ready for review May 11, 2025 14:42

Fix encrption error handling for new page cache logic

bc02e2a

Update parquet testing

a4065cc

Clean up code

269b396

alamb reviewed May 12, 2025

alamb mentioned this pull request May 12, 2025

[EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456

Open

7 tasks

alamb mentioned this pull request May 12, 2025

Add support for file row numbers in Parquet readers #7307

Open

alamb mentioned this pull request May 13, 2025

Introduce ReadPlan to encapsulate the calculation of what parquet rows to decode #7502

Merged

alamb mentioned this pull request May 14, 2025

Draft POC Unified filter decoder #7503

Draft

alamb reviewed May 16, 2025

This was referenced May 21, 2025

POC Adaptive predicate push down based read plan #7524

Open

Move Selection logic into ReadPlan builder #7537

Draft
