Parquet: Add ability to project rowid in parquet reader #7444

thinkharderdev · 2025-04-26T13:05:06Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Add a method

impl<T> ArrowReaderBuilder<T> {
    pub fn with_rowid(self, field_name: impl Into<String>) -> Self {...}
}

that projects a column named field_name into the reader's output, containing the offset of each row within the parquet file.

Describe the solution you'd like

A prototype implementation can be found here: coralogix@3d4a09f

If this seems like something we can merge upstream, I can create a PR against master in the upstream repo.

Describe alternatives you've considered

Not do it :)

Additional context

I'm trying to implement something like https://clickhouse.com/blog/clickhouse-gets-lazier-and-faster-introducing-lazy-materialization in a way that does not require re-scanning metadata or re-scanning fields that have already been read and decoded.

The basic idea is that you have a parquet file with some projections and a TopK sort on some (ideally small) subset of those projections. So you can:

1. Read the columns required for the TopK sort along with their row offsets.
2. Build the TopK and discard everything else.
3. Use the rowids from the TopK rows to build a RowSelection for the remaining columns.
4. Read the remaining columns using that RowSelection.
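As a rough illustration of the third step, here is a minimal sketch of turning a set of row offsets into alternating skip/select runs. It uses only the standard library: `Run` and `row_ids_to_runs` are hypothetical names made up for this example, with `Run` standing in for the kind of skip/select entries a RowSelection is built from. It is not the prototype's code.

```rust
/// Hypothetical stand-in for one entry of a row selection:
/// either skip `n` rows or select `n` rows.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Run {
    Skip(usize),
    Select(usize),
}

/// Convert file-level row offsets (in any order, e.g. as they come out of a
/// TopK heap) into alternating skip/select runs starting at row 0.
pub fn row_ids_to_runs(mut row_ids: Vec<u64>) -> Vec<Run> {
    // The runs must be in file order, and each row may appear only once.
    row_ids.sort_unstable();
    row_ids.dedup();

    let mut runs = Vec::new();
    let mut next: u64 = 0; // first row not yet covered by any run
    let mut selected: usize = 0; // length of the select run being built

    for id in row_ids {
        if id > next {
            // A gap: flush the pending select run, then skip the gap.
            if selected > 0 {
                runs.push(Run::Select(selected));
                selected = 0;
            }
            runs.push(Run::Skip((id - next) as usize));
        }
        selected += 1;
        next = id + 1;
    }
    if selected > 0 {
        runs.push(Run::Select(selected));
    }
    runs
}
```

For example, the offsets 7, 2, 3, 10 would become: skip 2, select 2, skip 3, select 1, skip 2, select 1. Rows after the last selected offset are simply not covered, which assumes the consumer treats the end of the run list as "skip the rest".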

The current implementation of the parquet reader can't support this if you are pushing row filters down to the scan, since the position of a row in the output of the scan in step 1 will not match the offset of that row in the file.

But it is relatively straightforward to keep track of the offsets during scan and just return them.
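To make the mismatch and the fix concrete, here is a toy sketch in plain Rust with no parquet types involved. `scan_with_row_ids` is a hypothetical helper that scans an in-memory column in batches, applies a pushed-down predicate, and carries a running file offset so that every surviving row is returned together with its original position. The real reader is more involved; this only shows the bookkeeping.

```rust
/// Scan `column` in batches of `batch_size`, keep only the rows that pass
/// `predicate`, and return each surviving value paired with its offset in
/// the original column (the "rowid").
pub fn scan_with_row_ids<T: Copy>(
    column: &[T],
    batch_size: usize,
    predicate: impl Fn(&T) -> bool,
) -> Vec<(u64, T)> {
    let mut out = Vec::new();
    let mut file_offset: u64 = 0; // offset of the first row of the current batch

    // `chunks` panics on a size of 0, so clamp to at least one row per batch.
    for batch in column.chunks(batch_size.max(1)) {
        for (i, value) in batch.iter().enumerate() {
            if predicate(value) {
                // Record the position in the file, not the position in `out`.
                out.push((file_offset + i as u64, *value));
            }
        }
        // Advance by the full batch length, including filtered-out rows.
        file_offset += batch.len() as u64;
    }
    out
}
```

Scanning the column [5, 12, 7, 30, 1, 18] with the predicate "greater than 10" yields three rows at output positions 0, 1 and 2, but their file offsets are 1, 3 and 5. Only the latter can be used to build a selection over the remaining columns.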


etseidl · 2025-04-28T20:28:30Z

Is this related to #7307?

thinkharderdev · 2025-04-28T21:41:55Z

> Is this related to #7307?

Looks like it's the same thing, didn't see that one

thinkharderdev · 2025-04-28T21:42:18Z

Duplicate of #7299

thinkharderdev added the enhancement label Apr 26, 2025

thinkharderdev marked this as a duplicate of #7299 Apr 28, 2025

thinkharderdev closed this as completed Apr 28, 2025
