@friendlymatthew
Contributor

Which issue does this PR close?

Rationale for this change

This PR adds support for storing row group indices as a virtual column, allowing users to determine which row group each row originated from.

The usage pattern is quite simple, something like:

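use std::sync::Arc;

use arrow_schema::{DataType, Field};
use parquet::arrow::RowGroupIndex;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

// The extension type marks the field as a virtual column; the name is up to the caller.
let row_group_index_field = Arc::new(
    Field::new("row_group_index", DataType::Int64, false)
        .with_extension_type(RowGroupIndex)
);

let options = ArrowReaderOptions::new()
    .with_virtual_columns(vec![row_group_index_field])?;

let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?
    .build()?;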
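
To make the expected column contents concrete, here is a minimal stdlib-only sketch of what such a virtual column would hold, assuming every row simply carries the zero-based index of the row group it came from. The function name `row_group_index_column` and its input (a slice of per-row-group row counts) are hypothetical and are not part of the parquet crate.

```rust
/// Hypothetical model, not the parquet crate's implementation: expand
/// per-row-group row counts into the value each row would carry in a
/// row group index column.
fn row_group_index_column(rows_per_group: &[usize]) -> Vec<i64> {
    rows_per_group
        .iter()
        .enumerate()
        // Repeat the row group's index once for every row it contains.
        .flat_map(|(idx, &num_rows)| std::iter::repeat(idx as i64).take(num_rows))
        .collect()
}
```

For example, row groups with 2, 0, and 3 rows would yield the values 0, 0, 2, 2, 2 under this model.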

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 8, 2026
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/virtual-row-group-index branch 3 times, most recently from c8e3eba to 073e3d5 Compare January 8, 2026 16:04
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/virtual-row-group-index branch from 073e3d5 to 0c9e12d Compare January 8, 2026 16:08
@friendlymatthew
Contributor Author

@Jefffrey if you have some bandwidth, I'd be curious to get your thoughts

@friendlymatthew
Contributor Author

cc @alamb

@Jefffrey
Contributor

Jefffrey commented Jan 9, 2026

> @Jefffrey if you have some bandwidth, I'd be curious to get your thoughts

I'll see if I can take a look at this PR soon, though I will say it's been a while since I looked at the parquet codebase 😅

@adriangb
Contributor

adriangb commented Jan 9, 2026

This would be sweet! I like the idea of using an extension type as a marker, and letting the caller customize the column name.

@alamb
Contributor

alamb commented Jan 9, 2026

I think the extension type marker was pioneered by @jkylling and @vustef

I will also try and review this sooner rather than later

Contributor

@scovich scovich left a comment

Nice! (a few nits)

    row_groups: impl Iterator<Item = &'a RowGroupMetaData>,
) -> Result<Self> {
    // build mapping from ordinal to row group index
    // this is O(M) where M is the total number of row groups in the file
Contributor

nit: I would expect m < n, so this is O(n) where n is the total row groups, and the loop below is O(m) where m is the number of selected row groups?
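
To illustrate the two costs being discussed, here is a stdlib-only sketch, assuming the ordinal-to-index mapping is a hash map built over every row group in the file and then probed once per selected row group. The function `selected_row_group_indices` and its plain `i16` ordinals are hypothetical stand-ins, not the PR's actual code.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: resolve each selected row group's ordinal to its
/// position in the file. Returns `None` if a selected ordinal is unknown.
fn selected_row_group_indices(
    all_ordinals: &[i16],
    selected_ordinals: &[i16],
) -> Option<Vec<usize>> {
    // O(n): one pass over every row group in the file.
    let ordinal_to_index: HashMap<i16, usize> = all_ordinals
        .iter()
        .enumerate()
        .map(|(index, &ordinal)| (ordinal, index))
        .collect();

    // O(m): one lookup per selected row group.
    selected_ordinals
        .iter()
        .map(|ordinal| ordinal_to_index.get(ordinal).copied())
        .collect()
}
```

With m selected row groups out of n total, the build step dominates whenever m < n, which is the naming the comment above suggests for the code comment.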

fn read_records(&mut self, batch_size: usize) -> Result<usize> {
    let starting_len = self.buffered_indices.len();
    self.buffered_indices
        .extend((&mut self.remaining_indices).take(batch_size));
Contributor

nit: I know this is how the row index reader did it, but since that code merged I learned that Iterator::by_ref is a thing.

Suggested change
-        .extend((&mut self.remaining_indices).take(batch_size));
+        .extend(self.remaining_indices.by_ref().take(batch_size));

It's not shorter, but does seem more readable?
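
For reference, a small stdlib-only sketch of the `Iterator::by_ref` pattern: `take` normally consumes the iterator it is called on, while `by_ref()` borrows it mutably so the remaining items stay available for the next batch. The function `drain_in_batches` and the use of a plain range as the index source are hypothetical, chosen only to keep the example self-contained.

```rust
/// Hypothetical sketch: pull items from one iterator in fixed-size batches.
fn drain_in_batches(total: usize, batch_size: usize) -> Vec<Vec<usize>> {
    let mut remaining = 0..total;
    let mut batches = Vec::new();
    loop {
        let mut buffered = Vec::new();
        // `by_ref()` yields `&mut Range<usize>`, so `take` advances
        // `remaining` without moving it out of this scope.
        buffered.extend(remaining.by_ref().take(batch_size));
        if buffered.is_empty() {
            break;
        }
        batches.push(buffered);
    }
    batches
}
```

Writing `(&mut remaining).take(batch_size)` behaves the same way; `by_ref()` only spells out the intent.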

(more below)

Comment on lines +51 to +57
if metadata.is_some_and(str::is_empty) {
    Ok("")
} else {
    Err(ArrowError::InvalidArgumentError(
        "Virtual column extension type expects an empty string as metadata".to_owned(),
    ))
}
Contributor

nit: is a match simpler?

Suggested change
- if metadata.is_some_and(str::is_empty) {
-     Ok("")
- } else {
-     Err(ArrowError::InvalidArgumentError(
-         "Virtual column extension type expects an empty string as metadata".to_owned(),
-     ))
- }
+ match metadata {
+     Some("") => Ok(""),
+     _ => Err(ArrowError::InvalidArgumentError(
+         "Virtual column extension type expects an empty string as metadata".to_owned(),
+     )),
+ }

or even

Suggested change
- if metadata.is_some_and(str::is_empty) {
-     Ok("")
- } else {
-     Err(ArrowError::InvalidArgumentError(
-         "Virtual column extension type expects an empty string as metadata".to_owned(),
-     ))
- }
+ if let Some("") = metadata {
+     return Ok("");
+ }
+ Err(ArrowError::InvalidArgumentError(
+     "Virtual column extension type expects an empty string as metadata".to_owned(),
+ ))
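
As a quick way to compare the three shapes, here is a stdlib-only sketch, assuming `metadata` has type `Option<&str>` (which the `is_some_and(str::is_empty)` call implies). With that type the literal pattern is written `Some("")`. The function names and the `String` error are hypothetical simplifications standing in for `ArrowError`.

```rust
// Hypothetical error message shared by all three variants.
const MSG: &str = "Virtual column extension type expects an empty string as metadata";

/// Original shape: boolean check followed by if/else.
fn check_if_else(metadata: Option<&str>) -> Result<&'static str, String> {
    if metadata.is_some_and(str::is_empty) {
        Ok("")
    } else {
        Err(MSG.to_owned())
    }
}

/// First alternative: a two-arm match on the literal pattern.
fn check_match(metadata: Option<&str>) -> Result<&'static str, String> {
    match metadata {
        Some("") => Ok(""),
        _ => Err(MSG.to_owned()),
    }
}

/// Second alternative: early return on success, error as the fallthrough.
fn check_if_let(metadata: Option<&str>) -> Result<&'static str, String> {
    if let Some("") = metadata {
        return Ok("");
    }
    Err(MSG.to_owned())
}
```

All three accept only `Some("")` and reject both `None` and any non-empty string, so the choice is purely about readability.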

Labels

parquet Changes to the parquet crate

Development

Successfully merging this pull request may close these issues.

[Parquet] Support returning RowGroupIndex as a column

5 participants