Return file row number in Parquet readers #7299

Open
jkylling opened this issue Mar 16, 2025 · 2 comments · May be fixed by #7307
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@jkylling

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Deletion vectors in the Delta Lake and Iceberg table formats are defined in terms of row numbers within individual Parquet files. To filter out the rows that a deletion vector marks as deleted, we need to know the file row number of each row read by the Arrow Parquet reader.
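
To make the use case concrete, here is a minimal sketch of how a consumer could apply a deletion vector once each row carries its file row number. The names keep_mask and apply_mask are hypothetical helpers, not part of arrow-rs, delta-rs or iceberg-rs, and a HashSet stands in for whatever bitmap representation a real table format uses.

```rust
use std::collections::HashSet;

/// Returns one flag per row: `true` if the row survives, `false` if its
/// file row number is listed in the deletion vector.
/// (Hypothetical helper; a `HashSet` stands in for the real bitmap type.)
fn keep_mask(file_row_numbers: &[u64], deleted: &HashSet<u64>) -> Vec<bool> {
    file_row_numbers
        .iter()
        .map(|row| !deleted.contains(row))
        .collect()
}

/// Applies a keep mask to one column of values, dropping deleted rows.
fn apply_mask<T: Clone>(values: &[T], mask: &[bool]) -> Vec<T> {
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone())
        .collect()
}
```

With an Arrow record batch, the same idea would presumably be expressed as a boolean filter over the batch, but that only works if the reader supplies the row numbers in the first place.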

Describe the solution you'd like

The Arrow Parquet reader should optionally return a column containing the row number of each row. We could add a method ArrowReaderBuilder::with_row_numbers(self, with_row_numbers: bool) -> Self that configures the Arrow Parquet reader to add an extra column named row_number to its schema. Alternatively, the method could be ArrowReaderBuilder::with_row_number_column(self, with_row_numbers: Option<String>) -> Self, which would make the column name configurable. This column contains the row number within the file.
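
As a rough illustration of the opt-in shape of this proposal, the sketch below uses a toy builder. ToyReaderBuilder and output_schema are hypothetical stand-ins written for this example only; they are not the arrow-rs ArrowReaderBuilder, and the real implementation may look quite different.

```rust
/// Hypothetical stand-in for a reader builder; NOT the arrow-rs type.
#[derive(Debug)]
struct ToyReaderBuilder {
    columns: Vec<String>,
    row_number_column: Option<String>,
}

impl ToyReaderBuilder {
    fn new(columns: &[&str]) -> Self {
        Self {
            columns: columns.iter().map(|c| c.to_string()).collect(),
            // Disabled by default, so existing callers are unaffected.
            row_number_column: None,
        }
    }

    /// Mirrors the configurable-name variant of the proposal:
    /// `None` leaves the schema untouched, `Some(name)` opts in.
    fn with_row_number_column(mut self, name: Option<String>) -> Self {
        self.row_number_column = name;
        self
    }

    /// Output schema: the projected columns, plus the row number
    /// column at the end if it was enabled.
    fn output_schema(&self) -> Vec<String> {
        let mut schema = self.columns.clone();
        if let Some(name) = &self.row_number_column {
            schema.push(name.clone());
        }
        schema
    }
}
```

Calling ToyReaderBuilder::new(&["id", "value"]).with_row_number_column(Some("row_number".to_string())) yields a schema with row_number appended, while a builder that never calls the method keeps the original two columns.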

Describe alternatives you've considered

There is a corresponding issue in DataFusion, apache/datafusion#13261. It considers an alternative using primary keys and existing SQL primitives, but this comes with a performance penalty. The consensus on that issue is:

> I agree with the assessment that the information must be coming from the file reader itself.

That is, the Arrow Parquet reader.
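
One way to see why the reader is the natural place for this: once row groups are skipped, a downstream operator that simply counts the rows it receives can no longer recover their positions in the file. The sketch below derives file row numbers from per-row-group row counts; file_row_numbers is a hypothetical helper for illustration, not an arrow-rs function.

```rust
/// File row numbers for the rows of the selected row groups.
///
/// `row_group_sizes[i]` is the number of rows in row group `i`;
/// `selected` lists the row groups that are actually read.
/// (Hypothetical helper, written only to illustrate the reasoning.)
fn file_row_numbers(row_group_sizes: &[u64], selected: &[usize]) -> Vec<u64> {
    // First file row number of each row group: running sum of preceding sizes.
    let mut first_rows = Vec::with_capacity(row_group_sizes.len());
    let mut next = 0u64;
    for size in row_group_sizes {
        first_rows.push(next);
        next += size;
    }

    // Each selected row group contributes a contiguous range of row numbers.
    selected
        .iter()
        .flat_map(|&rg| first_rows[rg]..first_rows[rg] + row_group_sizes[rg])
        .collect()
}
```

For a file with row groups of 3, 2 and 4 rows where only the first and last are read, this gives 0, 1, 2, 5, 6, 7, 8, whereas counting the delivered rows would give 0 through 6 and point a deletion vector at the wrong rows.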

Additional context

Please see apache/datafusion#13261 for the corresponding issue in DataFusion. There is also a discussion in DataFusion about adding system/metadata columns in apache/datafusion#14057, through which this additional file row number column could be exposed. However, supporting deletion vectors in delta-rs or iceberg-rs does not require system/metadata columns, since their DataFusion-based readers use the DataFusion ParquetSource directly to construct the execution plans for the scans of their TableProviders.

@jkylling jkylling added the enhancement Any new improvement worthy of an entry in the changelog label Mar 16, 2025
@alamb
Contributor

alamb commented Mar 18, 2025

I think adding this to the reader seems reasonable to me if there is a way to:

  1. Opt in (don't slow down reading if the row number isn't needed)
  2. the API is reasonable / doesn't make the code "too" complicated (I realize this is a subjective judgement)

@jkylling jkylling linked a pull request Mar 18, 2025 that will close this issue
@jkylling
Author

> I think adding this to the reader seems reasonable to me if there is a way to:
>
>   1. Opt in (don't slow down reading if the row number isn't needed)
>   2. the API is reasonable / doesn't make the code "too" complicated (I realize this is a subjective judgement)

I've started on this in #7307. Please let me know if you think the approach is reasonable.
