perf(io): Avoid per-entry KeyValue allocation in HFileDataBlock.seekTo #19021

Open
wombatu-kun wants to merge 3 commits into apache:master from wombatu-kun:perf-io-hfile-seekto-buffer-compare

@wombatu-kun wombatu-kun commented Jun 16, 2026

Contributor

Describe the issue this Pull Request addresses

The native HFile reader's HFileDataBlock.seekTo is the hottest inner loop on the metadata-table read path (record-level index, bloom filter and column-stats point lookups, which run on essentially every write). For each entry it scanned, it allocated a KeyValue and its Key only to compare the entry key against the lookup key and to compute the stride to the next entry. That produced two short-lived objects per scanned entry and avoidable GC pressure under point-lookup workloads.

Summary and Changelog

HFileDataBlock.seekTo now compares the entry key directly against the backing block buffer and computes the stride from the on-disk length fields, instead of materializing a KeyValue/Key for every scanned entry. A KeyValue is materialized only on an exact match. For the "in range" and end-of-block cases the cursor is pointed at the previous offset and the read is deferred, which getKeyValue() already performs lazily. The lookup key may be a UTF8StringKey, so its polymorphic content accessors are used for the comparison. No other class is touched and the original Option-based cursor is unchanged.
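To illustrate the idea described above, here is a minimal, self-contained sketch of a buffer-direct seek. It is not the code in this PR: the entry layout (`[keyLength:int][valueLength:int][key][value]`), the `BufferSeekSketch`, `Seek` and `Result` names, and the constant values are assumptions made for the example, and the real block format carries more fields than shown here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: a simplified block layout, not Hudi's actual classes.
final class BufferSeekSketch {
  // Assumed entry layout: [keyLength:int][valueLength:int][key bytes][value bytes].
  static final int KEY_LENGTH_LENGTH = 4;
  static final int KEY_OFFSET = 8;

  enum Result { BEFORE_FIRST_KEY, FOUND, IN_RANGE }

  /** Result code plus the entry offset the cursor should point at (-1 if none). */
  static final class Seek {
    final Result result;
    final int cursorOffset;

    Seek(Result result, int cursorOffset) {
      this.result = result;
      this.cursorOffset = cursorOffset;
    }
  }

  /** Reads a big-endian int straight from the backing array. */
  static int readInt(byte[] buf, int off) {
    return ((buf[off] & 0xff) << 24) | ((buf[off + 1] & 0xff) << 16)
        | ((buf[off + 2] & 0xff) << 8) | (buf[off + 3] & 0xff);
  }

  /** Lexicographic comparison of two byte ranges as unsigned bytes, without allocating. */
  static int compareRanges(byte[] a, int aOff, int aLen, byte[] b, int bOff, int bLen) {
    int n = Math.min(aLen, bLen);
    for (int i = 0; i < n; i++) {
      int x = a[aOff + i] & 0xff;
      int y = b[bOff + i] & 0xff;
      if (x != y) {
        return x - y;
      }
    }
    return aLen - bLen;
  }

  /** Scans entries in place; only the returned Seek object is allocated. */
  static Seek seekTo(byte[] block, byte[] lookup) {
    int offset = 0;
    int previous = -1;
    while (offset < block.length) {
      int entryKeyLength = readInt(block, offset);
      int entryValueLength = readInt(block, offset + KEY_LENGTH_LENGTH);
      int comp = compareRanges(block, offset + KEY_OFFSET, entryKeyLength, lookup, 0, lookup.length);
      if (comp == 0) {
        return new Seek(Result.FOUND, offset);
      }
      if (comp > 0) {
        break; // entry key is past the lookup key: stop at the previous entry
      }
      previous = offset;
      offset += KEY_OFFSET + entryKeyLength + entryValueLength;
    }
    // "In range" and end-of-block both leave the cursor on the previous entry.
    return previous < 0 ? new Seek(Result.BEFORE_FIRST_KEY, -1) : new Seek(Result.IN_RANGE, previous);
  }

  /** Test helper: serializes {key, value} string pairs into the assumed layout. */
  static byte[] buildBlock(String[][] entries) {
    int size = 0;
    for (String[] e : entries) {
      size += KEY_OFFSET + e[0].getBytes(StandardCharsets.UTF_8).length
          + e[1].getBytes(StandardCharsets.UTF_8).length;
    }
    ByteBuffer buf = ByteBuffer.allocate(size);
    for (String[] e : entries) {
      byte[] k = e[0].getBytes(StandardCharsets.UTF_8);
      byte[] v = e[1].getBytes(StandardCharsets.UTF_8);
      buf.putInt(k.length).putInt(v.length).put(k).put(v);
    }
    return buf.array();
  }

  public static void main(String[] args) {
    byte[] block = buildBlock(new String[][] {{"a", "1"}, {"c", "2"}, {"e", "3"}});
    for (String lookup : new String[] {"0", "c", "d", "z"}) {
      Seek s = seekTo(block, lookup.getBytes(StandardCharsets.UTF_8));
      System.out.println(lookup + " -> " + s.result + " @ " + s.cursorOffset);
    }
  }
}
```

In this sketch the per-entry work is two integer reads and one range comparison. Materializing an entry object is left to the caller, which mirrors the lazy read that the description attributes to getKeyValue().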

Impact

No public API or on-disk format change. Lower-allocation, faster point lookups on the metadata-table read path. JMH microbenchmark over an uncompressed HFile fixture (5000 entries, 625 sorted point lookups; forks(0); gc.alloc.rate.norm):

| Workload | Metric | Before | After | Delta |
|---|---|---|---|---|
| Point lookups | allocation (B/op) | 677,729 | 363,721 | -46% |
| Point lookups | throughput (ops/ms) | 5.25 | 6.16 | +17% |
| Full scan (not on the seekTo path) | allocation (B/op) | 643,705 | 643,681 | unchanged |

Risk Level

low. The change is confined to one method, preserves all seekTo return codes and the cursor's lazy-read semantics, and is exercised by the existing HFile reader suite (point, prefix, non-unique and fake-first-key seeks, sequential reads, empty file, and HBase read/write compatibility). The full hudi-io module test suite (101 tests) and checkstyle pass.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

HFileDataBlock.seekTo materialized a KeyValue (and its Key) for every entry it scanned, only to compare the entry key and compute the stride to the next entry. On the metadata-table read path (record-level index, bloom filter and column-stats point lookups) this is the hottest inner loop, allocating two short-lived objects per scanned entry.

This compares the entry key directly against the backing block buffer and computes the stride from the on-disk length fields, materializing a KeyValue only on an exact match. The "in range" and end-of-block cases point the cursor at the previous offset and defer the read, which getKeyValue() already performs lazily. The lookup key may be a UTF8StringKey, so its polymorphic content accessors are used for the comparison. No on-disk format or public API change.

JMH microbenchmark over an uncompressed HFile fixture (5000 entries, 625 sorted point lookups), forks(0), gc.alloc.rate.norm and throughput:

point lookups: 677,729 -> 363,721 B/op (-46%), 5.25 -> 6.16 ops/ms (+17%)
full scan (seekTo is not on that path): 643,705 -> 643,681 B/op (unchanged)
@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Jun 16, 2026

@hudi-agent hudi-agent left a comment

Contributor

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR avoids per-entry KeyValue/Key allocations in HFileDataBlock.seekTo by comparing entry keys directly against the backing buffer and computing the stride from the on-disk length fields, materializing a KeyValue only on an exact match. I traced the buffer layout, comparison semantics, stride computation, and the deferred-read fallback through HFileCursor, HFileReaderImpl.getKeyValue, and HFileDataBlock.next, and the behavior matches the original across all paths (exact match, in-range, before-block-first-key, end-of-block, and empty-loop). No correctness issues found. There is one naming suggestion on the inlined readInt calls in the inline comments; the rest of the diff is clean. Please take a look, and this should then be ready for a Hudi committer or PMC member to take it from here.

cc @yihua

  }
  long increment =
-     (long) KEY_OFFSET + (long) kv.getKeyLength() + (long) kv.getValueLength()
+     (long) KEY_OFFSET + (long) IOUtils.readInt(byteBuff, relativeOffset)

Contributor

🤖 nit: could you extract these two readInt calls into named locals (e.g. entryKeyLength / entryValueLength) before computing increment? The original kv.getKeyLength() / kv.getValueLength() were self-documenting; the raw offset reads here require tracing back to the layout comment ~30 lines earlier to understand what each one returns.

- AI-generated; verify before applying. React 👍/👎 to flag quality.
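A rough sketch of what the suggested named locals could look like, written against a plain `java.nio.ByteBuffer` rather than the PR's `IOUtils.readInt`. The `StrideSketch` class, the constant values (4 and 8), and the two-field entry layout are assumptions for illustration; any trailing per-entry bytes in the real format are omitted.

```java
import java.nio.ByteBuffer;

// Illustrative only: shows the naming suggestion, not the PR's actual code.
final class StrideSketch {
  static final int KEY_LENGTH_LENGTH = 4; // assumed width of the key-length field
  static final int KEY_OFFSET = 8;        // assumed: key-length field + value-length field

  /** Distance in bytes from the entry at relativeOffset to the next entry. */
  static long strideAt(ByteBuffer byteBuff, int relativeOffset) {
    // Named locals document which on-disk field each read returns.
    int entryKeyLength = byteBuff.getInt(relativeOffset);
    int entryValueLength = byteBuff.getInt(relativeOffset + KEY_LENGTH_LENGTH);
    return (long) KEY_OFFSET + (long) entryKeyLength + (long) entryValueLength;
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(32);
    buf.putInt(3).putInt(5).put(new byte[8]); // entry 0: 3-byte key, 5-byte value
    buf.putInt(2).putInt(6).put(new byte[8]); // entry 1: 2-byte key, 6-byte value
    long first = strideAt(buf, 0);
    long second = strideAt(buf, (int) first);
    System.out.println("strides: " + first + ", " + second);
  }
}
```

Both reads use absolute offsets, so the buffer position is never mutated while scanning.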

Contributor Author

Done 35c1441

@hudi-agent hudi-agent left a comment

Contributor

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR removes per-entry KeyValue allocation in HFileDataBlock.seekTo by comparing the entry key directly against the backing buffer and computing the stride from the on-disk length fields, materializing a KeyValue only on exact match and deferring otherwise to the cursor's lazy read. I traced the offset arithmetic (KEY_OFFSET + KEY_LENGTH_LENGTH matches the KeyValue layout), the comparison semantics (equivalent to compareKeys via IOUtils.compareTo), and the comp > 0 / end-of-block paths (lazy materialization through HFileReaderImpl.getKeyValue() preserves behavior). No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

// key may be a UTF8StringKey, so use the polymorphic content accessors (no 2-byte prefix).
byte[] lookupBytes = key.getBytes();
int lookupContentOffset = key.getContentOffset();
int lookupContentLength = key.getContentLength();

Contributor

Can we add some unit tests to validate this?

Contributor Author

Done 82abf7d. Added a focused test in TestHFileReader that writes a small-block-size HFile (several entries per data block) and validates seekTo across SEEK_TO_FOUND, SEEK_TO_IN_RANGE, SEEK_TO_BEFORE_FILE_FIRST_KEY, and SEEK_TO_EOF; the FOUND/IN_RANGE cases land mid-block so they exercise the new buffer-direct key comparison and the deferred cursor read.
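For readers who want to reason about which return code each lookup should produce, the expectations can be modeled independently of the reader. The following is only a sketch under stated assumptions: it assumes a non-empty file with unique, sorted ASCII keys and a fresh cursor at the start of the file, and the `SeekOracleSketch` class and `SeekResult` enum are hypothetical stand-ins for the SEEK_TO_* constants, not part of TestHFileReader.

```java
import java.util.Arrays;

// Illustrative oracle: derives the expected seek outcome from a sorted key list.
final class SeekOracleSketch {
  enum SeekResult { BEFORE_FILE_FIRST_KEY, FOUND, IN_RANGE, EOF }

  static SeekResult expected(String[] sortedKeys, String lookup) {
    int idx = Arrays.binarySearch(sortedKeys, lookup);
    if (idx >= 0) {
      return SeekResult.FOUND; // exact match
    }
    int insertionPoint = -idx - 1; // index of the first key greater than lookup
    if (insertionPoint == 0) {
      return SeekResult.BEFORE_FILE_FIRST_KEY; // lookup sorts before every key
    }
    if (insertionPoint == sortedKeys.length) {
      return SeekResult.EOF; // lookup sorts after the last key
    }
    return SeekResult.IN_RANGE; // lookup falls between two existing keys
  }

  public static void main(String[] args) {
    String[] keys = {"key-010", "key-020", "key-030", "key-040"};
    for (String lookup : new String[] {"key-005", "key-020", "key-025", "key-999"}) {
      System.out.println(lookup + " -> " + expected(keys, lookup));
    }
  }
}
```

An oracle like this could drive a parameterized test, with each lookup's expected code compared against what seekTo actually returns.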

@hudi-agent hudi-agent left a comment

Contributor

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR avoids per-entry KeyValue/Key allocation in HFileDataBlock.seekTo by comparing the entry key directly against the backing buffer and computing the stride from the on-disk length fields, materializing a KeyValue only on exact match. Layout, comparison semantics, and cursor lazy-read behavior all line up with the original code, and the new test exercises both buffer-direct comparison and the deferred-cursor in-range path across multiple data blocks. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@yihua yihua left a comment

Contributor

Blocked on my review, as this is a core change to the file format.

@hudi-bot

Collaborator

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@voonhous voonhous added the area:performance Performance optimizations label Jun 17, 2026
Labels

area:performance Performance optimizations size:S PR with lines of changes in (10, 100]

7 participants