fix: Zstd canonicalize with >2GB decompressed buffers #6989

Merged
robert3005 merged 1 commit into vortex-data:develop from SumedhArani:zstd-canonical on Mar 17, 2026

Conversation

@SumedhArani (Contributor) commented Mar 16, 2026

Fixes a panic in `reconstruct_views` when a ZstdArray's decompressed string data exceeds ~2 GiB. The function builds `BinaryView` structs that use u32 offsets into the decompressed buffer; when the buffer is larger than `i32::MAX` bytes, the offset cast panics.
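To make the failure mode concrete, here is a minimal sketch of a checked offset conversion. The helper `view_offset` is hypothetical and is not the code in `reconstruct_views`; it only assumes that offsets must fit in a signed 32-bit range and shows where a position just past that limit stops being representable.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not Vortex's API): convert a byte position in the
/// decompressed buffer into a view offset, enforcing the signed 32-bit limit.
fn view_offset(position: usize) -> Option<u32> {
    // `i32::try_from` fails for any position above i32::MAX (2^31 - 1).
    i32::try_from(position).ok().map(|v| v as u32)
}

fn main() {
    let limit = i32::MAX as usize;
    println!("{:?}", view_offset(limit)); // Some(2147483647)
    println!("{:?}", view_offset(limit + 1)); // None
}
```

Code that unwraps such a conversion instead of handling the `None` case would panic as soon as a value starts beyond the limit.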

This is the same class of bug fixed for VarBin/FSST in #5961. Zstd was not addressed at the time because its decompressed buffer interleaves u32 length prefixes with string data, unlike VarBin/FSST, which have a separate lengths array.
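The interleaved layout can be illustrated with a small parser. This is a sketch under assumptions: the function name `value_ranges` is made up for illustration, and little-endian length prefixes are assumed since the byte order is not stated here.

```rust
use std::ops::Range;

/// Size of one length prefix in bytes.
const PREFIX: usize = 4;

/// Hypothetical parser: walk a buffer laid out as
/// `[u32 len][bytes][u32 len][bytes]...` and return the byte range of each
/// value. Little-endian prefixes are an assumption of this sketch.
fn value_ranges(buf: &[u8]) -> Option<Vec<Range<usize>>> {
    let mut ranges = Vec::new();
    let mut pos = 0usize;
    while pos < buf.len() {
        // Read the 4-byte length prefix; bail out if it is truncated.
        let raw = buf.get(pos..pos + PREFIX)?;
        let mut prefix = [0u8; PREFIX];
        prefix.copy_from_slice(raw);
        let len = u32::from_le_bytes(prefix) as usize;

        // The value's bytes follow the prefix directly.
        let start = pos + PREFIX;
        let end = start.checked_add(len)?;
        if end > buf.len() {
            return None;
        }
        ranges.push(start..end);
        pos = end;
    }
    Some(ranges)
}

fn main() {
    // Two values, "hi" and "abc", each preceded by its length.
    let buf = [2u8, 0, 0, 0, b'h', b'i', 3, 0, 0, 0, b'a', b'b', b'c'];
    println!("{:?}", value_ranges(&buf));
}
```

Because each value's position depends on every prefix before it, the buffer cannot be cut at an arbitrary byte; a split has to land exactly where a prefix begins.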

The fix splits the decompressed buffer at value boundaries (zero-copy via `ByteBuffer::slice`) when a segment approaches `i32::MAX` bytes, and uses BinaryView's native `buffer_index` field to reference the correct segment. The `i32::MAX` limit follows the convention established in #5961, based on the Arrow spec's treatment of BinaryView offsets as logically signed. The existing `pub const MAX_BUFFER_LEN` from `vortex_array::arrays::varbinview::build_views` is reused rather than redefining the limit locally.
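The splitting idea can be sketched independently of Vortex's types. Everything below is illustrative: `ViewRef`, `plan_segments`, and `encode` are hypothetical names, plain byte ranges stand in for `ByteBuffer::slice`, little-endian prefixes are assumed, and the limit is a parameter so the example can use a tiny value in place of `MAX_BUFFER_LEN`.

```rust
use std::ops::Range;

const PREFIX: usize = 4;

/// Hypothetical stand-in for the part of a view that locates its bytes.
#[derive(Debug, PartialEq)]
struct ViewRef {
    buffer_index: u32,
    offset: u32,
    len: u32,
}

/// Hypothetical encoder: writes each value as `[u32 LE length][bytes]`.
fn encode(values: &[&str]) -> Vec<u8> {
    let mut buf = Vec::new();
    for v in values {
        buf.extend_from_slice(&(v.len() as u32).to_le_bytes());
        buf.extend_from_slice(v.as_bytes());
    }
    buf
}

/// Walk the interleaved buffer and cut it into segments at value boundaries
/// so that offsets stay small. Returns the segment ranges and one view per
/// value. A single value longer than the limit still gets its own segment.
fn plan_segments(buf: &[u8], max_segment_len: usize) -> (Vec<Range<usize>>, Vec<ViewRef>) {
    let mut segments = Vec::new();
    let mut views = Vec::new();
    let mut segment_start = 0usize;
    let mut pos = 0usize;

    while pos + PREFIX <= buf.len() {
        let mut prefix = [0u8; PREFIX];
        prefix.copy_from_slice(&buf[pos..pos + PREFIX]);
        let len = u32::from_le_bytes(prefix) as usize;
        let start = pos + PREFIX;
        let end = start + len;
        assert!(end <= buf.len(), "truncated value");

        // Close the current segment just before this value's prefix if
        // including the value would push the segment past the limit.
        if end - segment_start > max_segment_len && pos > segment_start {
            segments.push(segment_start..pos);
            segment_start = pos;
        }

        views.push(ViewRef {
            buffer_index: segments.len() as u32,
            offset: (start - segment_start) as u32,
            len: len as u32,
        });
        pos = end;
    }
    segments.push(segment_start..pos);
    (segments, views)
}

fn main() {
    let buf = encode(&["aaaa", "bb", "cccccc"]);
    let (segments, views) = plan_segments(&buf, 16);
    for v in &views {
        // Resolve each view against its own segment.
        let seg = &buf[segments[v.buffer_index as usize].clone()];
        let bytes = &seg[v.offset as usize..(v.offset + v.len) as usize];
        println!("{:?} -> {}", v, String::from_utf8_lossy(bytes));
    }
}
```

With a limit of 16 bytes the third value no longer fits in the first segment, so it starts a second one and its offset restarts from the beginning of that segment. No bytes are copied; the segments are ranges over the same underlying buffer.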

Signed-off-by: Sumedh Arani <sumedh@langchain.dev>

@SumedhArani force-pushed the zstd-canonical branch 2 times, most recently from 22d91ed to 61913f6 on March 16, 2026 22:27
@SumedhArani (Contributor, Author) commented:

@a10y, I would appreciate it if you could take a look at this. Thanks!

@a10y added the changelog/fix (A bug fix) label Mar 16, 2026
@robert3005 merged commit b825593 into vortex-data:develop Mar 17, 2026
59 of 60 checks passed
@SumedhArani deleted the zstd-canonical branch March 17, 2026 15:08