GH-49641: [C++] Fix Lz4HadoopCodec to split large blocks for Hadoop compatibility#49642
clee704 wants to merge 1 commit into apache:main
Conversation
Thanks for submitting this PR. Just for the record: I would not consider this a critical fix, as this is just working around a bug/limitation in another Parquet implementation.
pitrou left a comment
Thanks! LGTM overall, a suggestion below.
By the way @clee704, since this is about generating new files, why not use the newer LZ4_RAW which completely solves the Hadoop compatibility problem?
Thanks for the review @pitrou!
Agreed — updated the description accordingly.
Good suggestion — we're planning to switch our caller to LZ4_RAW.
…doop compatibility

Arrow's Lz4HadoopCodec::Compress writes the entire input as a single Hadoop-framed LZ4 block. Hadoop's Lz4Decompressor allocates a fixed 256 KiB output buffer per block (IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE), so any block whose decompressed size exceeds 256 KiB causes an LZ4Exception on the JVM reader. This is a read failure, not data corruption: the compressed bytes are valid, but Hadoop-based JVM readers cannot decompress them.

Fix: split input into blocks of at most 256 KiB uncompressed, each with its own [decompressed_size, compressed_size] big-endian prefix, matching Hadoop's BlockCompressorStream behavior. Arrow's reader (TryDecompressHadoop) already handles multiple blocks.

Closes apache#49641.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rationale
Arrow's Lz4HadoopCodec::Compress writes the entire input as a single Hadoop-framed LZ4 block. Hadoop's Lz4Decompressor uses a fixed 256 KiB output buffer per block (IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE_DEFAULT), so blocks decompressing to more than 256 KiB cause an LZ4Exception on JVM readers.

We hit this when writing Parquet dictionary pages larger than 256 KiB with LZ4 compression. The file was written successfully, but a JVM reader (parquet-mr + Hadoop) could not decompress the dictionary page. Note this is a read failure, not data corruption: Arrow's own C++ reader handles the file fine.
PARQUET-1878 added the Hadoop-compatible codec with single-block output. ARROW-11301 updated the reader to handle Hadoop's multi-block format; this PR updates the writer to match.
We're also planning to switch our caller to LZ4_RAW (which avoids this entirely), but it seemed worth fixing LZ4_HADOOP since it's a public codec intended for Hadoop compatibility.

What changes are included in this PR?
Split input into blocks of ≤ 256 KiB in Lz4HadoopCodec::Compress and update MaxCompressedLen for per-block prefix overhead. Arrow's reader (TryDecompressHadoop) already handles multiple blocks. No behavioral change for data ≤ 256 KiB: it still produces a single block, identical to the previous output.

Are these changes tested?
Yes. TestCodecLZ4Hadoop.MultiBlockRoundtrip tests the compress→decompress round-trip, block size limits, and MaxCompressedLen sufficiency for sizes from 0 to 1 MiB. The block size check fails without the fix and passes with it.

Are there any user-facing changes?
Parquet files written with LZ4_HADOOP compression containing pages larger than 256 KiB will now be readable by JVM-based Parquet readers. No change for files with pages ≤ 256 KiB.

AI-generated code disclosure
This fix was developed with the assistance of an AI coding assistant (GitHub Copilot). The author has reviewed and verified all changes, including validating the fix with standalone tests that confirm the old code fails and the new code passes.