fix: Zstd canonicalize with >2GB decompressed buffers#6989
Merged
robert3005 merged 1 commit into vortex-data:develop on Mar 17, 2026
@a10y, I would appreciate it if you could take a look at this. Thanks!
a10y approved these changes on Mar 16, 2026
Fixes a panic in `reconstruct_views` when a ZstdArray's decompressed string data exceeds ~2 GiB. The function builds BinaryView structs that use u32 offsets into the decompressed buffer; when the buffer is larger than i32::MAX, the offset cast panics.

This is the same class of bug fixed for VarBin/FSST in vortex-data#5961, but Zstd was not addressed because its decompressed buffer interleaves u32 length prefixes with string data (unlike VarBin/FSST, which have a separate lengths array).

The fix splits the decompressed buffer at value boundaries (zero-copy via `ByteBuffer::slice`) when approaching i32::MAX, using BinaryView's native `buffer_index` field to reference the correct segment. The i32::MAX limit matches the convention established in vortex-data#5961, per the Arrow spec, that BinaryView offsets are logically signed.

Signed-off-by: Sumedh Arani <sumedh@langchain.dev>
Fixes a panic in `reconstruct_views` when a `ZstdArray`'s decompressed string data exceeds ~2 GiB. The function builds `BinaryView` structs that use `u32` offsets into the decompressed buffer; when the buffer is larger than `i32::MAX`, the offset cast panics.

This is the same class of bug fixed for VarBin/FSST in #5961, but Zstd was not addressed because its decompressed buffer interleaves `u32` length prefixes with string data (unlike VarBin/FSST, which have a separate lengths array).

The fix splits the decompressed buffer at value boundaries (zero-copy via `ByteBuffer::slice`) when approaching `i32::MAX`, using `BinaryView`'s native `buffer_index` field to reference the correct segment. The `i32::MAX` limit matches the convention established in #5961, per the Arrow spec, that `BinaryView` offsets are logically signed. The existing `pub const MAX_BUFFER_LEN` from `vortex_array::arrays::varbinview::build_views` is reused rather than redefining the limit locally.

Signed-off-by: Sumedh Arani <sumedh@langchain.dev>
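For readers who want to see the splitting idea in isolation, here is a minimal, dependency-free sketch. It is not the code from this PR: `ViewRef`, `encode`, `split_at_value_boundaries`, and `resolve` are hypothetical names, the length prefix is assumed to be little-endian, and segments are modelled as index ranges into the original buffer instead of zero-copy `ByteBuffer::slice` results. The limit is a parameter so the behaviour can be exercised with tiny inputs; the real fix uses `MAX_BUFFER_LEN`.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the fields of a view that matter here:
/// which segment holds the value, where it starts in that segment, and its length.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ViewRef {
    buffer_index: u32,
    offset: u32,
    len: u32,
}

/// Builds a buffer of `[u32 length][value bytes]` records.
/// Little-endian prefixes are an assumption made for this sketch.
fn encode(values: &[&str]) -> Vec<u8> {
    let mut buf = Vec::new();
    for v in values {
        buf.extend_from_slice(&(v.len() as u32).to_le_bytes());
        buf.extend_from_slice(v.as_bytes());
    }
    buf
}

/// Walks the interleaved records and cuts the buffer into segments that end
/// only at record boundaries, starting a new segment whenever the next record
/// would push the current one past `max_segment_len`.
/// A single record longer than the limit still gets its own (oversized) segment.
fn split_at_value_boundaries(
    buf: &[u8],
    max_segment_len: usize,
) -> (Vec<Range<usize>>, Vec<ViewRef>) {
    let mut segments = Vec::new();
    let mut views = Vec::new();
    let mut seg_start = 0usize;
    let mut pos = 0usize;

    while pos < buf.len() {
        assert!(pos + 4 <= buf.len(), "truncated length prefix");
        let len =
            u32::from_le_bytes([buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]]) as usize;
        let value_start = pos + 4;
        let value_end = value_start + len;
        assert!(value_end <= buf.len(), "truncated value");

        // Close the current segment before this record if it would overflow the limit.
        if value_end - seg_start > max_segment_len && pos > seg_start {
            segments.push(seg_start..pos);
            seg_start = pos;
        }

        // Offsets are relative to the segment, so they stay below the limit.
        views.push(ViewRef {
            buffer_index: segments.len() as u32,
            offset: (value_start - seg_start) as u32,
            len: len as u32,
        });
        pos = value_end;
    }

    if pos > seg_start {
        segments.push(seg_start..pos);
    }
    (segments, views)
}

/// Looks a value back up through its segment, as a reader of the views would.
fn resolve<'a>(buf: &'a [u8], segments: &[Range<usize>], view: ViewRef) -> &'a [u8] {
    let seg = &segments[view.buffer_index as usize];
    let start = seg.start + view.offset as usize;
    &buf[start..start + view.len as usize]
}
```

Calling `split_at_value_boundaries(&encode(&["aaa", "bb", "cccc"]), 14)` puts the first two records in one segment and the third in a second segment, with the third view's offset measured from the start of that second segment. Cutting only at record boundaries means no value ever straddles two segments, which is what lets each view carry a single `buffer_index`.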