GH-49641: [C++] Fix Lz4HadoopCodec to split large blocks for Hadoop compatibility#49642
clee704 wants to merge 1 commit into apache:main
Conversation
Thanks for submitting this PR. Just for the record: I would not consider this a critical fix, as this is just working around a bug/limitation in another Parquet implementation.
pitrou left a comment
Thanks! LGTM overall, a suggestion below.
By the way @clee704, since this is about generating new files, why not use the newer LZ4_RAW which completely solves the Hadoop compatibility problem?
Thanks for the review @pitrou!
Agreed — updated the description accordingly.
Good suggestion — we're planning to switch our caller to LZ4_RAW.
…doop compatibility

Arrow's Lz4HadoopCodec::Compress writes the entire input as a single Hadoop-framed LZ4 block. Hadoop's Lz4Decompressor allocates a fixed 256 KiB output buffer per block (IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE), so any block whose decompressed size exceeds 256 KiB causes an LZ4Exception on the JVM reader. This is a read failure, not data corruption: the compressed bytes are valid, but Hadoop-based JVM readers cannot decompress them.

Fix: split input into blocks of at most 256 KiB uncompressed, each with its own [decompressed_size, compressed_size] big-endian prefix, matching Hadoop's BlockCompressorStream behavior. Arrow's reader (TryDecompressHadoop) already handles multiple blocks.

Closes apache#49641.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rationale
Arrow's Lz4HadoopCodec::Compress writes the entire input as a single Hadoop-framed LZ4 block. Hadoop's Lz4Decompressor uses a fixed 256 KiB output buffer per block (IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE_DEFAULT), so blocks decompressing to more than 256 KiB cause an LZ4Exception on JVM readers.

We hit this when writing Parquet dictionary pages larger than 256 KiB with LZ4 compression. The file was written successfully, but a JVM reader (parquet-mr + Hadoop) could not decompress the dictionary page. Note this is a read failure, not data corruption: Arrow's own C++ reader handles the file fine.
PARQUET-1878 added the Hadoop-compatible codec with single-block output. ARROW-11301 updated the reader to handle Hadoop's multi-block format; this PR updates the writer to match.
We're also planning to switch our caller to LZ4_RAW (which avoids this entirely), but it seemed worth fixing LZ4_HADOOP since it's a public codec intended for Hadoop compatibility.

What changes are included in this PR?
Split input into blocks of ≤ 256 KiB in Lz4HadoopCodec::Compress and update MaxCompressedLen for per-block prefix overhead. Arrow's reader (TryDecompressHadoop) already handles multiple blocks. No behavioral change for data ≤ 256 KiB: it still produces a single block, identical to the previous output.

Are these changes tested?
Yes. TestCodecLZ4Hadoop.MultiBlockRoundtrip tests the compress→decompress round-trip, block size limits, and MaxCompressedLen sufficiency for sizes from 0 to 1 MiB. The block size check fails without the fix and passes with it.

Are there any user-facing changes?
Parquet files written with LZ4_HADOOP compression containing pages larger than 256 KiB will now be readable by JVM-based Parquet readers. No change for files with pages ≤ 256 KiB.

AI-generated code disclosure
This fix was developed with the assistance of an AI coding assistant (GitHub Copilot). The author has reviewed and verified all changes, including validating the fix with standalone tests that confirm the old code fails and the new code passes.