-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[parquet] Add row group index virtual column #9117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[parquet] Add row group index virtual column #9117
Conversation
c8e3eba to
073e3d5
Compare
073e3d5 to
0c9e12d
Compare
|
@Jefffrey if you have some bandwidth, I'd be curious to get your thoughts |
|
cc @alamb |
I'll see if I can take a look at this PR soon, though I will say it's been a while since I looked at the parquet codebase 😅 |
|
This would be sweet! I like the idea of using an extension type as a marker, and letting the caller customize the column name. |
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! (a few nits)
| row_groups: impl Iterator<Item = &'a RowGroupMetaData>, | ||
| ) -> Result<Self> { | ||
| // build mapping from ordinal to row group index | ||
| // this is O(M) where M is the total number of row groups in the file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would expect m < n, so this is O(n) where n is the total row groups, and the loop below is O(m) where m is the number of selected row groups?
| fn read_records(&mut self, batch_size: usize) -> Result<usize> { | ||
| let starting_len = self.buffered_indices.len(); | ||
| self.buffered_indices | ||
| .extend((&mut self.remaining_indices).take(batch_size)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I know this is how the row index reader did it, but since that code merged I learned that Iterator::by_ref is a thing.
| .extend((&mut self.remaining_indices).take(batch_size)); | |
| .extend(self.remaining_indices.by_ref().take(batch_size)); |
It's not shorter, but does seem more readable?
(more below)
| if metadata.is_some_and(str::is_empty) { | ||
| Ok("") | ||
| } else { | ||
| Err(ArrowError::InvalidArgumentError( | ||
| "Virtual column extension type expects an empty string as metadata".to_owned(), | ||
| )) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is a match simpler?
| if metadata.is_some_and(str::is_empty) { | |
| Ok("") | |
| } else { | |
| Err(ArrowError::InvalidArgumentError( | |
| "Virtual column extension type expects an empty string as metadata".to_owned(), | |
| )) | |
| } | |
| match metadata { | |
| Some(&"") => Ok(""), | |
| _ => Err(ArrowError::InvalidArgumentError( | |
| "Virtual column extension type expects an empty string as metadata".to_owned(), | |
| )), | |
| } |
or even
| if metadata.is_some_and(str::is_empty) { | |
| Ok("") | |
| } else { | |
| Err(ArrowError::InvalidArgumentError( | |
| "Virtual column extension type expects an empty string as metadata".to_owned(), | |
| )) | |
| } | |
| if let Some(&"") = metadata { | |
| return Ok(""); | |
| }; | |
| Err(ArrowError::InvalidArgumentError( | |
| "Virtual column extension type expects an empty string as metadata".to_owned(), | |
| )) |
Which issue does this PR close?
Rationale for this change
This PR adds support for storing row group indices as a virtual column, allowing users to determine which row group each row originated from
The usage pattern is quite simple, something like: