Parquet: Add ability to project rowid in parquet reader #7444

Closed
thinkharderdev opened this issue Apr 26, 2025 · 3 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@thinkharderdev
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Add a method

impl<T> ArrowReaderBuilder<T> {
    pub fn with_rowid(self, field_name: impl Into<String>) -> Self {...}
}

that projects an additional column named field_name into the reader's output, containing each row's offset within the Parquet file.
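To make the intended behaviour concrete, here is a minimal std-only sketch. `ToyReaderBuilder` and `Batch` are hypothetical stand-ins invented for illustration; they are not the real `ArrowReaderBuilder` or Arrow `RecordBatch`, and the prototype commit may work differently. The sketch only shows the shape of the output: each batch gains one extra column whose values are the rows' offsets in the file.

```rust
/// Hypothetical stand-in for a decoded batch: (column name, values) pairs.
#[derive(Debug, PartialEq)]
pub struct Batch {
    pub columns: Vec<(String, Vec<i64>)>,
}

/// Toy builder used only for illustration; NOT the parquet crate's API.
pub struct ToyReaderBuilder {
    data: Vec<i64>,              // a single "value" column, in file order
    batch_size: usize,           // rows per emitted batch
    rowid_field: Option<String>, // name of the projected row-id column, if any
}

impl ToyReaderBuilder {
    pub fn new(data: Vec<i64>, batch_size: usize) -> Self {
        Self { data, batch_size: batch_size.max(1), rowid_field: None }
    }

    /// Mirrors the proposed signature: request a row-id column by name.
    pub fn with_rowid(mut self, field_name: impl Into<String>) -> Self {
        self.rowid_field = Some(field_name.into());
        self
    }

    /// "Reads" the file, appending the row-id column to every batch.
    pub fn build(self) -> Vec<Batch> {
        let mut batches = Vec::new();
        let mut offset = 0i64; // file offset of the first row in the batch
        for chunk in self.data.chunks(self.batch_size) {
            let mut columns = vec![("value".to_string(), chunk.to_vec())];
            if let Some(name) = &self.rowid_field {
                let ids: Vec<i64> = (offset..offset + chunk.len() as i64).collect();
                columns.push((name.clone(), ids));
            }
            offset += chunk.len() as i64;
            batches.push(Batch { columns });
        }
        batches
    }
}
```

For example, `ToyReaderBuilder::new(vec![10, 20, 30], 2).with_rowid("row_id").build()` yields two batches whose `row_id` columns hold the offsets 0, 1 and 2 respectively.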

Describe the solution you'd like

A prototype implementation can be found here: coralogix@3d4a09f

If this seems like something we can merge upstream, I can create a PR against master in the upstream repo.

Describe alternatives you've considered

Not do it :)

Additional context

I'm trying to implement something like https://clickhouse.com/blog/clickhouse-gets-lazier-and-faster-introducing-lazy-materialization in a way that does not require re-scanning metadata or re-scanning fields that have already been read and decoded.

The basic idea is that you have a scan over a parquet file with some projected columns and a TopK sort on some (ideally small) subset of those columns. You can then:

  1. Read the columns required for the TopK sort along with their row offsets
  2. Build the TopK and discard everything else
  3. Use the row IDs from the TopK rows to build a RowSelection for the remaining columns
  4. Read the remaining columns using that RowSelection.
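The four steps above can be sketched in plain Rust. This is an illustration under stated assumptions, not the prototype's code: `Run` is a hypothetical stand-in for the parquet crate's skip/select row selectors, columns are modelled as slices, and `top_k`, `selection_from_row_ids` and `read_selected` are names made up for this example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Stand-in for a row selector: a run of rows to skip or to read.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Run {
    pub row_count: usize,
    pub skip: bool,
}

/// Steps 1-2: keep the `k` largest (sort key, row id) pairs, discard the rest.
pub fn top_k(keys: &[(i64, usize)], k: usize) -> Vec<(i64, usize)> {
    let mut heap = BinaryHeap::with_capacity(k + 1); // min-heap via `Reverse`
    for &entry in keys {
        heap.push(Reverse(entry));
        if heap.len() > k {
            heap.pop(); // evict the smallest key seen so far
        }
    }
    let mut out: Vec<(i64, usize)> = heap.into_iter().map(|Reverse(e)| e).collect();
    out.sort_by(|a, b| b.cmp(a)); // largest key first
    out
}

/// Step 3: turn row ids into alternating skip/select runs in file order.
pub fn selection_from_row_ids(mut row_ids: Vec<usize>) -> Vec<Run> {
    row_ids.sort_unstable();
    row_ids.dedup();
    let mut runs = Vec::new();
    let mut next = 0usize; // first file row not yet covered by a run
    let mut i = 0;
    while i < row_ids.len() {
        let start = row_ids[i];
        let mut end = start + 1;
        i += 1;
        // Extend the run while the ids are consecutive.
        while i < row_ids.len() && row_ids[i] == end {
            end += 1;
            i += 1;
        }
        if start > next {
            runs.push(Run { row_count: start - next, skip: true });
        }
        runs.push(Run { row_count: end - start, skip: false });
        next = end;
    }
    runs
}

/// Step 4: read only the selected rows of another column.
pub fn read_selected<T: Clone>(column: &[T], runs: &[Run]) -> Vec<T> {
    let mut out = Vec::new();
    let mut pos = 0usize;
    for run in runs {
        let end = (pos + run.row_count).min(column.len());
        if !run.skip {
            out.extend_from_slice(&column[pos..end]);
        }
        pos = end;
    }
    out
}
```

Note that the rows from step 4 come back in file order rather than TopK order, so a caller would still need to reorder them by row ID to match the sorted result.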

The current implementation of the parquet reader can't support this when row filters are pushed down to the scan, because a row's position in the output of the scan in step 1 no longer matches its offset in the file.

But it is relatively straightforward to keep track of the offsets during the scan and just return them.
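The mismatch, and the bookkeeping that fixes it, can be shown with a small std-only sketch. `scan_with_row_ids` is a hypothetical helper over an in-memory slice, assuming a simple per-row predicate as the pushed-down filter; it is not how the parquet reader is actually structured.

```rust
/// Scans `values` in batches, keeps the rows that pass `predicate`, and
/// returns each surviving value together with its offset in the toy "file".
pub fn scan_with_row_ids<F>(values: &[i64], batch_size: usize, predicate: F) -> Vec<(usize, i64)>
where
    F: Fn(i64) -> bool,
{
    let mut out = Vec::new();
    let mut file_offset = 0usize; // offset of the first row of the current batch
    for batch in values.chunks(batch_size.max(1)) {
        for (i, &v) in batch.iter().enumerate() {
            if predicate(v) {
                out.push((file_offset + i, v));
            }
        }
        // Advance by the number of rows scanned, not the number emitted.
        file_offset += batch.len();
    }
    out
}
```

Scanning `[4, 11, 6, 15, 20, 2]` with the filter `v > 10` emits three rows at output positions 0, 1 and 2, but their file offsets are 1, 3 and 4. Only the latter are usable for building a selection over the remaining columns.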

@thinkharderdev thinkharderdev added the enhancement Any new improvement worthy of a entry in the changelog label Apr 26, 2025
@etseidl
Contributor

etseidl commented Apr 28, 2025

Is this related to #7307?

@thinkharderdev
Contributor Author

> Is this related to #7307?

Looks like it's the same thing; I didn't see that one.

@thinkharderdev
Contributor Author

Duplicate of #7299

@thinkharderdev thinkharderdev marked this as a duplicate of #7299 Apr 28, 2025