Binary columns do not receive truncated statistics #5037

**Describe the bug**
#4389 introduced truncation of the min/max values written to the column index for binary columns, since those values may be arbitrarily large. As noted there, this matches parquet-mr's behaviour for shortening these values.

However, the min/max values in the column statistics are still written untruncated. This differs from parquet-mr, where the statistics are truncated too: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L715

**To Reproduce**
There is a test in https://github.com/delta-io/delta-rs/issues/1805 that demonstrates this. More generally, write a parquet file with a binary column containing long values and observe that the stats for that column are not truncated.

**Expected behavior**
Matching parquet-mr, the statistics should be truncated as well.

**Additional context**
Found this when looking into https://github.com/delta-io/delta-rs/issues/1805. delta-rs uses the column stats to serialize into the delta log, which leads to very bloated entries.

I think it is sufficient to call `truncate_min_value`/`truncate_max_value` when creating the column metadata here: https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L858-L859 but I don't know enough about the internals of arrow to know whether that change is correct.
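
For reference, here is a minimal standalone sketch of what truncating the min/max bounds means for a binary value. It is not the arrow-rs implementation: `truncate_min` and `truncate_max` are hypothetical stand-ins, the 64-byte limit is an arbitrary choice for illustration, and the sketch assumes plain unsigned byte-wise ordering. The min only needs a prefix, since a prefix always sorts at or before the original. The max needs the prefix to be incremented so that it still sorts at or after the original.

```rust
/// Hypothetical helper (not the arrow-rs function): a prefix of a byte
/// string always sorts at or before the original, so it is a valid lower bound.
fn truncate_min(value: &[u8], max_len: usize) -> Vec<u8> {
    value[..value.len().min(max_len)].to_vec()
}

/// Hypothetical helper (not the arrow-rs function): take the prefix, then
/// increment it so that it sorts after the original value. Returns `None`
/// when no shorter upper bound exists (the prefix is empty or all 0xFF).
fn truncate_max(value: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if value.len() <= max_len {
        // Already short enough: the value is its own upper bound.
        return Some(value.to_vec());
    }
    let mut prefix = value[..max_len].to_vec();
    // Drop trailing 0xFF bytes (they cannot be incremented), then bump
    // the last remaining byte.
    while let Some(&last) = prefix.last() {
        if last == 0xFF {
            prefix.pop();
        } else {
            *prefix.last_mut().unwrap() += 1;
            return Some(prefix);
        }
    }
    None
}

fn main() {
    // A 1000-byte value, standing in for a long binary column entry.
    let value = vec![b'a'; 1000];
    let min = truncate_min(&value, 64);
    let max = truncate_max(&value, 64).expect("prefix is not all 0xFF");

    // The truncated bounds still bracket the original value.
    assert!(min.as_slice() <= value.as_slice());
    assert!(max.as_slice() >= value.as_slice());
    println!("min: {} bytes, max: {} bytes", min.len(), max.len());
}
```

With bounds like these, a consumer such as delta-rs would serialize 64 bytes per bound instead of the full value, at the cost of the statistics no longer being exact.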
