Binary columns do not receive truncated statistics #5037

Closed
@emcake

Describe the bug
#4389 introduced truncation on column indices for binary columns, where the min/max values for a binary column may be arbitrarily large. As noted, this matches the behaviour in parquet-mr for shortening columns.

However, the value in the statistics is written un-truncated. This differs from the behaviour of parquet-mr where the statistics are truncated too: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L715

To Reproduce
There is a test in delta-io/delta-rs#1805 which demonstrates this. In general, write a parquet file with a long binary column and observe that the statistics for that column are not truncated.

Expected behavior
Matching parquet-mr, the statistics should be truncated as well.

Additional context
Found this when looking into delta-io/delta-rs#1805. delta-rs uses the column stats to serialize into the delta log, which leads to very bloated entries.

I think it is sufficient to call truncate_min_value/truncate_max_value when creating the column metadata here: https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L858-L859. However, I don't know enough about the internals of arrow to know if that change is correct.
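
For illustration, here is a minimal, standard-library-only sketch of what truncating a min/max pair could look like. The helpers `truncate_min` and `truncate_max` are hypothetical names made up for this example; they are not the functions in the parquet crate, and the real implementation may differ (for instance in how it treats UTF-8 strings). The idea is that a plain prefix is always a valid lower bound, while an upper bound needs the last byte of the prefix incremented.

```rust
/// Hypothetical helper: returns a lower bound of at most `len` bytes.
/// A prefix of a byte string always sorts <= the full string,
/// so plain truncation is enough for the min value.
fn truncate_min(value: &[u8], len: usize) -> Vec<u8> {
    value[..value.len().min(len)].to_vec()
}

/// Hypothetical helper: returns an upper bound of at most `len` bytes,
/// or `None` when no shorter upper bound exists (the prefix is all 0xFF).
fn truncate_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        // Already short enough; the value is its own upper bound.
        return Some(value.to_vec());
    }
    let mut prefix = value[..len].to_vec();
    // Drop trailing 0xFF bytes (they cannot be incremented), then
    // increment the last remaining byte so the result sorts above `value`.
    while let Some(&last) = prefix.last() {
        if last == 0xFF {
            prefix.pop();
        } else {
            *prefix.last_mut().unwrap() += 1;
            return Some(prefix);
        }
    }
    None
}

fn main() {
    // A long binary value, standing in for an oversized column min/max.
    let long_value = vec![b'a'; 1000];

    let min = truncate_min(&long_value, 8);
    let max = truncate_max(&long_value, 8).expect("prefix is not all 0xFF");

    // The truncated pair must still bracket the original value.
    assert!(min.as_slice() <= long_value.as_slice());
    assert!(max.as_slice() >= long_value.as_slice());
    println!("min len = {}, max len = {}", min.len(), max.len());
}
```

This sketch treats values as raw bytes, so a truncated string may no longer be valid UTF-8. When `truncate_max` returns `None`, a writer would presumably have to fall back to the untruncated value.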

Labels: enhancement, help wanted, parquet
