[C++][Parquet] Raise an error when reading Parquet data with invalid repetition levels #45185

Closed
adamreeve opened this issue Jan 7, 2025 · 1 comment

Comments

@adamreeve
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

When looking into #45073 I found that Arrow doesn't raise an error when reading data with invalid repetition levels into Arrow list arrays.

The encryption test files included an int64 list column with leaf values equal to i * 1,000,000,000,000, where i is the leaf-value index. The repetition level was set to 1 for even leaf indices and 0 for odd indices, so the first repetition level was 1. That is invalid, because a repetition level of 0 marks the start of a new record and the first value in a column must start one. PyArrow nevertheless reads this file without raising an error, and the first leaf value (0) is silently skipped:

pyarrow.Table
int64_field: list<int64_field: int64 not null> not null
  child 0, int64_field: int64 not null
----
int64_field: [[[1000000000000,2000000000000],[3000000000000,4000000000000],...,[97000000000000,98000000000000],[99000000000000]]]
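The level pattern described above can be reproduced without Parquet at all. The following is a minimal sketch in plain Python, not Arrow's actual record reader: `assemble_lists` and its `strict` flag are hypothetical names, and the sketch assumes a required list of required elements with a maximum repetition level of 1, so definition levels are ignored. In the lenient mode, leaf values that arrive before any list has been opened are dropped, which happens to match the output shown above.

```python
def assemble_lists(values, rep_levels, strict=True):
    """Group leaf values into lists using repetition levels (max level 1).

    Hypothetical helper for illustration only. A level of 0 starts a new
    list and a level of 1 appends to the current list.
    """
    if len(values) != len(rep_levels):
        raise ValueError("values and rep_levels must have the same length")
    if strict and rep_levels and rep_levels[0] != 0:
        raise ValueError(
            f"first repetition level must be 0, got {rep_levels[0]}"
        )

    rows = []
    for value, level in zip(values, rep_levels):
        if level == 0:
            rows.append([value])      # start a new list
        elif rows:
            rows[-1].append(value)    # continue the current list
        # else: lenient mode with no list open yet, so the value is dropped
    return rows


# Same pattern as the test file: 100 leaf values, level 1 for even
# indices and level 0 for odd indices.
values = [i * 1_000_000_000_000 for i in range(100)]
rep_levels = [1 if i % 2 == 0 else 0 for i in range(100)]

lenient = assemble_lists(values, rep_levels, strict=False)
print(lenient[:2])
# [[1000000000000, 2000000000000], [3000000000000, 4000000000000]]
print(lenient[-1])
# [99000000000000]
```

Calling `assemble_lists(values, rep_levels)` with the default `strict=True` raises a `ValueError` instead, which is the behaviour this issue asks for when reading into Arrow lists.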

I wouldn't expect an error when reading the raw values and repetition levels with the lower-level Parquet C++ API, but I think reading this data as an Arrow list should raise an error.

Component(s)

C++, Parquet

@adamreeve adamreeve self-assigned this Jan 7, 2025
mapleFU added a commit that referenced this issue Mar 28, 2025
… when delimiting records (#45186)

### Rationale for this change

See #45185. Invalid repetition levels would previously only cause a fatal error in debug builds. 

### What changes are included in this PR?

Replaces an existing `ARROW_DCHECK_EQ` on the repetition level with a check that raises an exception in release builds too.
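The difference between a debug-only check and an always-on check can be illustrated with a rough Python analogy; this is not the Arrow C++ change itself. Python's `assert` is removed when code is compiled with optimization, much as a debug-only check is compiled out of release builds, whereas an explicit `raise` survives. The function names and the `load_checks` helper below are hypothetical.

```python
CHECK_SOURCE = '''
def debug_only_check(first_rep_level):
    # Analogous to a debug-only check: removed when asserts are disabled.
    assert first_rep_level == 0, "first repetition level must be 0"
    return first_rep_level


def always_on_check(first_rep_level):
    # Analogous to the new behaviour: an error in every build mode.
    if first_rep_level != 0:
        raise ValueError("first repetition level must be 0")
    return first_rep_level
'''


def load_checks(optimize):
    """Compile the checks with the given optimization level.

    optimize=0 keeps assert statements ("debug build");
    optimize=1 strips them ("release build").
    """
    namespace = {}
    code = compile(CHECK_SOURCE, "<rep_level_checks>", "exec",
                   optimize=optimize)
    exec(code, namespace)
    return namespace


debug_build = load_checks(optimize=0)
release_build = load_checks(optimize=1)

# The assert-based check lets the invalid level through in "release" mode.
print(release_build["debug_only_check"](1))  # 1

# The explicit check rejects it in both modes.
for build in (debug_build, release_build):
    try:
        build["always_on_check"](1)
    except ValueError as exc:
        print("rejected:", exc)
```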

### Are these changes tested?

Yes, using a new example file (apache/parquet-testing#67)

### Are there any user-facing changes?

Yes, reading columns with invalid repetition levels as Arrow arrays will now raise an exception.
* GitHub Issue: #45185

Lead-authored-by: Adam Reeve <[email protected]>
Co-authored-by: mwish <[email protected]>
Signed-off-by: mwish <[email protected]>
@mapleFU mapleFU added this to the 20.0.0 milestone Mar 28, 2025
@mapleFU
Member

mapleFU commented Mar 28, 2025

Issue resolved by pull request 45186
#45186

@mapleFU mapleFU closed this as completed Mar 28, 2025