Add support for file row numbers in Parquet readers #7307
Open: jkylling wants to merge 9 commits into apache:main from jkylling:feature/parquet-reader-row-numbers
f93d36e Add support for file row numbers in Parquet readers (jkylling)
e485c0b Add Apache license header to row_number.rs (jkylling)
2a62009 Run cargo format (jkylling)
fb5126f Change with_row_number_column to take impl Into<String> (jkylling)
5350728 Change Option<String> -> Option<&str> in build_array_reader (jkylling)
188f350 Replace ParquetError::RowGroupMetaDataMissingRowNumber with General (jkylling)
37a9d83 Split test_create_array_reader test into two (jkylling)
41e38fe first_row_number -> first_row_index (jkylling)
1a1e6b6 Simplify RowNumberReader with iterators (jkylling)
@@ -0,0 +1,84 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

use crate::arrow::array_reader::ArrayReader;
use crate::errors::{ParquetError, Result};
use crate::file::metadata::RowGroupMetaData;
use arrow_array::{ArrayRef, Int64Array};
use arrow_schema::DataType;
use std::any::Any;
use std::sync::Arc;

pub(crate) struct RowNumberReader {
    buffered_row_numbers: Vec<i64>,
    remaining_row_numbers: std::iter::Flatten<std::vec::IntoIter<std::ops::Range<i64>>>,
}

impl RowNumberReader {
    pub(crate) fn try_new<'a>(
        row_groups: impl Iterator<Item = &'a RowGroupMetaData>,
    ) -> Result<Self> {
        let ranges = row_groups
            .map(|rg| {
                let first_row_number = rg.first_row_index().ok_or(ParquetError::General(
                    "Row group missing row number".to_string(),
                ))?;
                Ok(first_row_number..first_row_number + rg.num_rows())
            })
            .collect::<Result<Vec<_>>>()?;
        Ok(Self {
            buffered_row_numbers: Vec::new(),
            remaining_row_numbers: ranges.into_iter().flatten(),
        })
    }
}

impl ArrayReader for RowNumberReader {
    fn read_records(&mut self, batch_size: usize) -> Result<usize> {
        let starting_len = self.buffered_row_numbers.len();
        self.buffered_row_numbers
            .extend((&mut self.remaining_row_numbers).take(batch_size));
        Ok(self.buffered_row_numbers.len() - starting_len)
    }

    fn skip_records(&mut self, num_records: usize) -> Result<usize> {
        // TODO: Use advance_by when it stabilizes to improve performance
        Ok((&mut self.remaining_row_numbers).take(num_records).count())
    }

    fn as_any(&self) -> &dyn Any {
        self
    }

    fn get_data_type(&self) -> &DataType {
        &DataType::Int64
    }

    fn consume_batch(&mut self) -> Result<ArrayRef> {
        Ok(Arc::new(Int64Array::from_iter(
            self.buffered_row_numbers.drain(..),
        )))
    }

    fn get_def_levels(&self) -> Option<&[i16]> {
        None
    }

    fn get_rep_levels(&self) -> Option<&[i16]> {
        None
    }
}
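For readers following the diff, here is a minimal, dependency-free sketch of the iterator technique the reader relies on: one `Range<i64>` per row group, flattened into a single stream that is consumed in batches or skipped. The `RowGroup` and `RowNumbers` types below are hypothetical stand-ins for illustration only; they are not the types in this PR, and errors are plain `String`s rather than `ParquetError`.

```rust
use std::iter::Flatten;
use std::ops::Range;
use std::vec::IntoIter;

/// Hypothetical stand-in for the two pieces of row group metadata used here.
struct RowGroup {
    first_row_index: Option<i64>,
    num_rows: i64,
}

/// Hypothetical stand-in mirroring the shape of the reader in the diff.
struct RowNumbers {
    buffered: Vec<i64>,
    remaining: Flatten<IntoIter<Range<i64>>>,
}

impl RowNumbers {
    /// Builds one range per row group; fails if any group lacks a first row index.
    fn try_new(row_groups: &[RowGroup]) -> Result<Self, String> {
        let ranges = row_groups
            .iter()
            .map(|rg| -> Result<Range<i64>, String> {
                let first = rg
                    .first_row_index
                    .ok_or_else(|| "Row group missing row number".to_string())?;
                Ok(first..first + rg.num_rows)
            })
            .collect::<Result<Vec<_>, String>>()?;
        Ok(Self {
            buffered: Vec::new(),
            remaining: ranges.into_iter().flatten(),
        })
    }

    /// Buffers up to `batch_size` row numbers and returns how many were added.
    fn read_records(&mut self, batch_size: usize) -> usize {
        let start = self.buffered.len();
        self.buffered.extend((&mut self.remaining).take(batch_size));
        self.buffered.len() - start
    }

    /// Discards up to `n` row numbers and returns how many were discarded.
    fn skip_records(&mut self, n: usize) -> usize {
        (&mut self.remaining).take(n).count()
    }

    /// Hands back everything buffered so far and leaves the buffer empty.
    fn consume_batch(&mut self) -> Vec<i64> {
        std::mem::take(&mut self.buffered)
    }
}
```

Taking from `&mut self.remaining` rather than from the iterator by value is what lets each call resume where the previous one stopped, including across row group boundaries.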
Maybe a crazy idea, but wouldn't the implementation be simpler (and more flexible) with a RowNumber extension type? Then users could do e.g. [...] and build_primitive_reader could just check for it, no matter where in the schema it hides, instead of implicitly adding an extra column to the schema?
Update: I don't think raw parquet types support metadata, so this may not be an option.
This would simplify usage of the feature. Having to keep track of the additional row number column is quite cumbersome in clients of this API. One option could be to extend ParquetFieldType with an additional row number type and add it based on the extension type in ArrowReaderMetadata::with_supplied_metadata? @etseidl @alamb what do you think about this approach?
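To make the idea in this thread concrete, here is a rough, self-contained sketch of choosing a reader per field based on an extension marker instead of appending an implicit column. Everything in it is an assumption for illustration: `Field`, `FieldKind`, `classify`, `plan_readers`, and the extension name `example.row_number` are hypothetical and are not part of this PR or the parquet crate. The only real convention used is that Arrow stores an extension type's name in field metadata under the `ARROW:extension:name` key.

```rust
use std::collections::HashMap;

/// Field metadata key under which Arrow records an extension type's name.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";
/// Hypothetical extension name marking a row number column (an assumption).
const ROW_NUMBER_EXTENSION: &str = "example.row_number";

/// Hypothetical, simplified field: a name plus string metadata.
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

/// Hypothetical reader choice, loosely modelled on extending a field type enum.
#[derive(Debug, PartialEq)]
enum FieldKind {
    Primitive,
    RowNumber,
}

/// Picks the row number kind only when the field carries the marker.
fn classify(field: &Field) -> FieldKind {
    match field.metadata.get(EXTENSION_NAME_KEY) {
        Some(name) if name == ROW_NUMBER_EXTENSION => FieldKind::RowNumber,
        _ => FieldKind::Primitive,
    }
}

/// Classifies every field in place, wherever the marked field sits.
fn plan_readers(fields: &[Field]) -> Vec<(&str, FieldKind)> {
    fields
        .iter()
        .map(|f| (f.name.as_str(), classify(f)))
        .collect()
}
```

With this shape the caller decides the column's name and position through the supplied schema, so there is no extra trailing column to track separately.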