[EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456


Open
2 of 7 tasks
alamb opened this issue Apr 29, 2025 · 9 comments
Assignees
Labels
enhancement Any new improvement worthy of an entry in the changelog parquet Changes to the parquet crate

Comments

@alamb
Contributor

alamb commented Apr 29, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When evaluating filters on data stored in parquet, you can:

  1. Use the with_row_filter API to apply predicates during the scan
  2. Read the data and apply the predicate using the filter kernel afterwards
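
To make the two strategies concrete, here is a minimal self-contained sketch using plain Vecs in place of the real parquet/arrow APIs (`push_down` and `decode_then_filter` are illustrative names, not library functions):

```rust
/// Strategy 1: apply the predicate while scanning, so only passing
/// rows are ever materialized (models with_row_filter).
fn push_down(col: &[i32], pred: impl Fn(i32) -> bool) -> Vec<i32> {
    col.iter().copied().filter(|&v| pred(v)).collect()
}

/// Strategy 2: materialize every row first, build a boolean mask,
/// then apply it (models decode followed by the filter kernel).
fn decode_then_filter(col: &[i32], pred: impl Fn(i32) -> bool) -> Vec<i32> {
    let decoded: Vec<i32> = col.to_vec(); // full decode
    let mask: Vec<bool> = decoded.iter().map(|&v| pred(v)).collect();
    decoded
        .iter()
        .zip(&mask)
        .filter_map(|(&v, &keep)| keep.then_some(v))
        .collect()
}

fn main() {
    let column: Vec<i32> = (0..8).collect();
    let pred = |v: i32| v % 2 == 0; // a non-selective predicate: passes 50%
    // The two strategies must always agree on the result; they differ
    // only in how much intermediate data they materialize.
    assert_eq!(push_down(&column, pred), decode_then_filter(&column, pred));
    println!("{:?}", push_down(&column, pred)); // [0, 2, 4, 6]
}
```

The question in this issue is purely about which strategy is faster for a given predicate, since the results are identical by construction.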

Currently, with_row_filter is faster for some predicates and filter is faster for others. In DataFusion we have a configuration setting to choose between the strategies (filter_pushdown, see apache/datafusion#3463), but that is bad UX: the user must somehow know which strategy to choose, and the best strategy varies with the data and the predicate.

In general, the queries that are slower when with_row_filter is used are those where:

  1. The predicates are not very selective (e.g. they pass more than 1% of the rows)
  2. The filters are applied to columns which are also used in the query result (e.g. a filter column is also in the projection)

More Background:

The predicates are provided as a RowFilter (see the docs for more details)

RowFilter applies predicates in order, after decoding only the columns required. As predicates eliminate rows, fewer rows from subsequent columns may be required, thus potentially reducing IO and decode cost.
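
The ordering effect can be sketched with plain Vecs (`ordered_eval` is a hypothetical helper, not the RowFilter API): each predicate is evaluated only against the rows that survived the previous ones, so later columns decode fewer values:

```rust
/// Evaluate predicates in order, one per column, counting how many
/// values each column must decode. Returns (values decoded per column,
/// surviving row indices). Illustrative only.
fn ordered_eval(cols: &[Vec<i32>], preds: &[fn(i32) -> bool]) -> (Vec<usize>, Vec<usize>) {
    let mut selected: Vec<usize> = (0..cols[0].len()).collect();
    let mut decoded = Vec::new();
    for (col, pred) in cols.iter().zip(preds) {
        decoded.push(selected.len()); // only surviving rows are decoded
        selected.retain(|&i| pred(col[i]));
    }
    (decoded, selected)
}

fn main() {
    let col_a = vec![1, 50, 3, 80, 5, 90, 7, 60];
    let col_b = vec![9, 2, 9, 3, 9, 9, 9, 1];
    let preds: Vec<fn(i32) -> bool> = vec![|v| v > 10, |v| v < 5];
    let (decoded, rows) = ordered_eval(&[col_a, col_b], &preds);
    // The first predicate decodes all 8 values of col_a; only the 4
    // survivors need to be decoded from col_b.
    println!("decoded per column: {decoded:?}, surviving rows: {rows:?}");
}
```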

Describe the solution you'd like

I would like the evaluation of predicates in RowFilter (aka pushed down predicates) to never be worse than decoding the columns first and then filtering them with the filter kernel

We have added a benchmark (#7401), which can hopefully be used to track progress. Run it with:

cargo bench --all-features --bench arrow_reader_row_filter

Describe alternatives you've considered
This goal will likely require several changes to the codebase. Here are some options:

@alamb alamb added the enhancement Any new improvement worthy of an entry in the changelog label Apr 29, 2025
@alamb alamb added the parquet Changes to the parquet crate label Apr 29, 2025
@alamb
Contributor Author

alamb commented Apr 30, 2025

I just spoke with @XiangpengHao -- from my perspective the current status is:

  1. Parquet decoder / decoded page Cache #7363: blocked on getting some benchmark results that show the decoded page cache improves performance; Then we can proceed / merge the page cache change
  2. In parallel / afterwards, we can move on to working on a better representation for RowFilter (Adaptive Parquet Predicate Pushdown Evaluation #5523 / Consider removing skip from RowSelector #7450 / RowSelection::and_then is slow #7458)

@alamb
Contributor Author

alamb commented May 8, 2025

Fascinatingly, ClickHouse recently released a blog post about their parquet pushdown work:

https://clickhouse.com/blog/clickhouse-and-parquet-a-foundation-for-fast-lakehouse-analytics

Possibly even more interesting is that they link to a master's thesis from Peter Boncz's group about how to quickly evaluate predicates during Parquet Decoding: https://homepages.cwi.nl/~boncz/msc/2018-BoudewijnBraams.pdf

This thesis directly addresses some of the work we are considering (though it only considers Selection Masks (bitmasks) and Selection Vectors (selected indices)).

@alamb
Contributor Author

alamb commented May 12, 2025

@zhuqi-lucas has a great insight in #7454 -- namely that instead of a two-pass algorithm (evaluate the RowFilter to form a final RowSelection, then decode again using that selection) we can combine the filter application and decode steps (see #7454 (review))

The current flow goes something like:

  1. A set of array readers is created for the filter columns, using the provided RowSelection (this captures page pruning).
  2. The decoded batches are used to evaluate the RowFilter / ArrowPredicates, which produces a BooleanArray bitmap
  3. The "final" RowSelection is created by combining the existing RowSelection with the BooleanArrays
  4. A new set of array readers is created with the updated RowSelection

The current PR starts heading down a slightly modified flow, where the RowSelection and RowFilters are not combined.

I think a combined solution would look something like:

  1. Create Decoders for filter columns and projection (only) columns

Decoding proceeds like:

  1. read rows from the initial RowSelection (up to 8192 rows) from the filter columns, if any
  2. Apply any RowFilters to them (produces a BooleanArray)
  3. repeat 1-2 until at least 8192 (batch size) rows pass the filter. (This means we have a Vec<BooleanArray> with 8192 set bits, and a Vec of arrays for each filter column that is also a projection column)
  4. Then decode as many RecordBatches from the projection (only) columns using the initial RowSelection
  5. Apply the filters to each array to form the final output batch (in projection columns)
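
The buffering loop in steps 1-3 might look roughly like this (plain Vecs and hypothetical names; the real implementation works on Arrow arrays and a RowSelection):

```rust
/// Buffer predicate masks until at least `batch_size` rows pass the
/// filter (or the input is exhausted). Returns the buffered masks, the
/// number of input rows covered, and the number of rows passing.
fn buffer_until_full(
    filter_col: &[i32],
    pred: fn(i32) -> bool,
    chunk: usize,
    batch_size: usize,
) -> (Vec<Vec<bool>>, usize, usize) {
    let (mut masks, mut passing, mut pos) = (Vec::new(), 0, 0);
    while passing < batch_size && pos < filter_col.len() {
        // steps 1-2: decode one chunk of the filter column, apply the predicate
        let end = (pos + chunk).min(filter_col.len());
        let mask: Vec<bool> = filter_col[pos..end].iter().map(|&v| pred(v)).collect();
        passing += mask.iter().filter(|&&b| b).count();
        masks.push(mask);
        pos = end;
    }
    // steps 4-5 would now decode the projection-only columns for rows
    // [0, pos) and apply the buffered masks to them
    (masks, pos, passing)
}

fn main() {
    let col: Vec<i32> = (0..20).collect();
    // tiny chunk/batch sizes stand in for the real 8192
    let (masks, covered, passing) = buffer_until_full(&col, |v| v % 3 == 0, 8, 4);
    println!("{} chunks, {} rows covered, {} pass", masks.len(), covered, passing);
}
```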

@alamb
Contributor Author

alamb commented May 16, 2025

@zhuqi-lucas and I have been working on various strategies / structures to make the filtering faster. I believe we now have enough evidence to proceed with a more sophisticated implementation.

Specifically,

Thus my next steps will be:

  1. Create a few refactoring PRs that get the predicate code into shape

Perhaps then @zhuqi-lucas can help port the hybrid Filter/RowSelection to the ReadPlan to get better performance without changing any public interfaces

@zhuqi-lucas
Contributor

Thank you @alamb, I will help port the hybrid Filter/RowSelection to the ReadPlan to get better performance without changing any public interfaces, and also help with testing and review.

@alamb
Contributor Author

alamb commented May 16, 2025

Thank you @alamb, I will help port the hybrid Filter/RowSelection to the ReadPlan to get better performance without changing any public interfaces, and also help with testing and review.

Thank you so much @zhuqi-lucas -- it is great working with you

@zhuqi-lucas
Contributor

Thank you @alamb, I tried the draft POC for the adaptive predicate pushdown based read plan PR today, and it shows good results:

#7524 (comment)

@alamb
Contributor Author

alamb commented May 20, 2025

Update: I am quite happy with how reusing filtered results is coming along, see:

I also had some thoughts on reducing the buffering required here: #6692 (comment)

@alamb
Contributor Author

alamb commented May 21, 2025

Status update

The high level plan to improve performance has two parts:

  1. adaptive iteration / representation of filter results (basically Adaptive Parquet Predicate Pushdown Evaluation #5523)
  2. Caching the results of filtering when the column is used in the final projection (basically improve: reuse Arc<dyn Array> in parquet record batch reader. #4864)

My main concern about reusing the results of filtering is memory usage, and I think it is important to keep that usage to a minimum -- the current APIs (the filter and concat kernels) require a 2x memory overhead, so I think it is important to reduce that, as well as to add some way to limit memory consumption when buffering filter results.
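
A back-of-envelope model of where that overhead comes from, using plain Vecs to stand in for the filter and concat kernels (illustrative only): filtering copies the selected values to a new buffer while the source is still alive, and concatenating copies the filtered pieces again.

```rust
/// Returns rough peak value counts (during filter, during concat) for a
/// filter-then-concat pipeline over several batches. Illustrative only.
fn peak_values(batches: &[Vec<i64>], pred: fn(i64) -> bool) -> (usize, usize) {
    let source: usize = batches.iter().map(Vec::len).sum();
    // "filter kernel": copy selected values into new buffers
    let filtered: Vec<Vec<i64>> = batches
        .iter()
        .map(|b| b.iter().copied().filter(|&v| pred(v)).collect())
        .collect();
    let kept: usize = filtered.iter().map(Vec::len).sum();
    // "concat kernel": yet another full copy of the filtered data
    let concatenated: Vec<i64> = filtered.concat();
    // peak during filter: source + filtered both alive;
    // peak during concat: filtered pieces + output both alive
    (source + kept, kept + concatenated.len())
}

fn main() {
    let batches = vec![(0..1000).collect::<Vec<i64>>(), (0..1000).collect()];
    let (filter_peak, concat_peak) = peak_values(&batches, |v| v % 2 == 0);
    println!("peak values during filter: {filter_peak}, during concat: {concat_peak}");
}
```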

We also need to implement more sophisticated logic when there are multiple predicates.

My next steps are:

  1. Try to factor out the adaptive representation of filter results with @zhuqi-lucas
  2. Explore ways to reduce the memory overhead with caching results (this should help other APIs too).
