
Commit fec03ea

alamb, westonpace, and kylebarron authored
Improve documentation on implementing Parquet predicate pushdown (#7370)
* Improve documentation on implementing Parquet predicate pushdown

* Apply suggestions from code review

Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Kyle Barron <[email protected]>
1 parent 9a5d821 commit fec03ea

File tree

  • parquet/src/arrow/arrow_reader

1 file changed: +38 -1 lines changed

parquet/src/arrow/arrow_reader/mod.rs

Lines changed: 38 additions & 1 deletion
@@ -43,14 +43,51 @@ mod filter;
 mod selection;
 pub mod statistics;
 
-/// Builder for constructing parquet readers into arrow.
+/// Builder for constructing Parquet readers that decode into [Apache Arrow]
+/// arrays.
 ///
 /// Most users should use one of the following specializations:
 ///
 /// * synchronous API: [`ParquetRecordBatchReaderBuilder::try_new`]
 /// * `async` API: [`ParquetRecordBatchStreamBuilder::new`]
 ///
+/// # Features
+/// * Projection pushdown: [`Self::with_projection`]
+/// * Cached metadata: [`ArrowReaderMetadata::load`]
+/// * Offset skipping: [`Self::with_offset`] and [`Self::with_limit`]
+/// * Row group filtering: [`Self::with_row_groups`]
+/// * Range filtering: [`Self::with_row_selection`]
+/// * Row level filtering: [`Self::with_row_filter`]
+///
+/// # Implementing Predicate Pushdown
+///
+/// [`Self::with_row_filter`] permits filter evaluation *during* the decoding
+/// process, which is efficient and allows the most low level optimizations.
+///
+/// However, most Parquet based systems will apply filters at many steps prior
+/// to decoding such as pruning files, row groups and data pages. This crate
+/// provides the low level APIs needed to implement such filtering, but does not
+/// include any logic to actually evaluate predicates. For example:
+///
+/// * [`Self::with_row_groups`] for Row Group pruning
+/// * [`Self::with_row_selection`] for data page pruning
+/// * [`StatisticsConverter`] to convert Parquet statistics to Arrow arrays
+///
+/// The rationale for this design is that implementing predicate pushdown is a
+/// complex topic and varies significantly from system to system. For example
+///
+/// 1. Predicates supported (do you support predicates like prefix matching, user defined functions, etc)
+/// 2. Evaluating predicates on multiple files (with potentially different but compatible schemas)
+/// 3. Evaluating predicates using information from an external metadata catalog (e.g. Apache Iceberg or similar)
+/// 4. Interleaving fetching metadata, evaluating predicates, and decoding files
+///
+/// You can read more about this design in the [Querying Parquet with
+/// Millisecond Latency] Arrow blog post.
+///
 /// [`ParquetRecordBatchStreamBuilder::new`]: crate::arrow::async_reader::ParquetRecordBatchStreamBuilder::new
+/// [Apache Arrow]: https://arrow.apache.org/
+/// [`StatisticsConverter`]: statistics::StatisticsConverter
+/// [Querying Parquet with Millisecond Latency]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
 pub struct ArrowReaderBuilder<T> {
     pub(crate) input: T,

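The pruning APIs named in the new docs (row group filtering plus StatisticsConverter) are lower level: the caller evaluates the predicate against statistics and tells the builder what to skip. A sketch of that flow, under the same illustrative assumptions (a hypothetical data.parquet with an Int64 column named "x" and the predicate `x > 10`):

    use std::fs::File;

    use arrow::array::{ArrayRef, AsArray};
    use arrow::datatypes::Int64Type;
    use parquet::arrow::arrow_reader::statistics::StatisticsConverter;
    use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Hypothetical input: an Int64 column named "x", pruned with `x > 10`
        let file = File::open("data.parquet")?;
        let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

        // Convert the per row group statistics for "x" into an Arrow array
        // (one max value per row group)
        let converter =
            StatisticsConverter::try_new("x", builder.schema(), builder.parquet_schema())?;
        let maxes: ArrayRef = converter.row_group_maxes(builder.metadata().row_groups())?;
        let max_values = maxes.as_primitive::<Int64Type>();

        // Keep only row groups whose max may exceed 10; a missing max (None)
        // means the statistics are absent, so the row group must be kept
        let keep: Vec<usize> = max_values
            .iter()
            .enumerate()
            .filter(|(_, max)| max.map_or(true, |m| m > 10))
            .map(|(i, _)| i)
            .collect();

        let reader = builder.with_row_groups(keep).build()?;
        for batch in reader {
            println!("read {} rows", batch?.num_rows());
        }
        Ok(())
    }

A production system would typically also consult the page index and pass the resulting RowSelection to with_row_selection, so entire data pages are skipped before decoding begins.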