perf(io): Avoid per-entry KeyValue allocation in HFileDataBlock.seekTo#19021
perf(io): Avoid per-entry KeyValue allocation in HFileDataBlock.seekTo#19021wombatu-kun wants to merge 3 commits into
Conversation
HFileDataBlock.seekTo materialized a KeyValue (and its Key) for every entry it scanned, only to compare the entry key and compute the stride to the next entry. On the metadata-table read path (record-level index, bloom filter and column-stats point lookups) this is the hottest inner loop, allocating two short-lived objects per scanned entry. This compares the entry key directly against the backing block buffer and computes the stride from the on-disk length fields, materializing a KeyValue only on an exact match. The "in range" and end-of-block cases point the cursor at the previous offset and defer the read, which getKeyValue() already performs lazily. The lookup key is a UTF8StringKey, so its polymorphic content accessors are used for the comparison. No on-disk format or public API change. JMH microbenchmark over an uncompressed HFile fixture (5000 entries, 625 sorted point lookups), forks(0), gc.alloc.rate.norm and throughput: point lookups: 677,729 -> 363,721 B/op (-46%), 5.25 -> 6.16 ops/ms (+17%) full scan (seekTo is not on that path): 643,705 -> 643,681 B/op (unchanged)
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR avoids per-entry KeyValue/Key allocations in HFileDataBlock.seekTo by comparing entry keys directly against the backing buffer and computing the stride from the on-disk length fields, materializing a KeyValue only on an exact match. I traced the buffer layout, comparison semantics, stride computation, and the deferred-read fallback through HFileCursor, HFileReaderImpl.getKeyValue, and HFileDataBlock.next, and the behavior matches the original across all paths (exact match, in-range, before-block-first-key, end-of-block, and empty-loop). No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One naming suggestion on the inlined readInt calls; rest of the diff is clean.
cc @yihua
| } | ||
| long increment = | ||
| (long) KEY_OFFSET + (long) kv.getKeyLength() + (long) kv.getValueLength() | ||
| (long) KEY_OFFSET + (long) IOUtils.readInt(byteBuff, relativeOffset) |
There was a problem hiding this comment.
🤖 nit: could you extract these two readInt calls into named locals (e.g. entryKeyLength / entryValueLength) before computing increment? The original kv.getKeyLength() / kv.getValueLength() were self-documenting; the raw offset reads here require tracing back to the layout comment ~30 lines earlier to understand what each one returns.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
… locals in HFileDataBlock.seekTo
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR removes per-entry KeyValue allocation in HFileDataBlock.seekTo by comparing the entry key directly against the backing buffer and computing the stride from the on-disk length fields, materializing a KeyValue only on exact match and deferring otherwise to the cursor's lazy read. I traced the offset arithmetic (KEY_OFFSET + KEY_LENGTH_LENGTH matches the KeyValue layout), the comparison semantics (equivalent to compareKeys via IOUtils.compareTo), and the comp > 0 / end-of-block paths (lazy materialization through HFileReaderImpl.getKeyValue() preserves behavior). No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
| // key may be a UTF8StringKey, so use the polymorphic content accessors (no 2-byte prefix). | ||
| byte[] lookupBytes = key.getBytes(); | ||
| int lookupContentOffset = key.getContentOffset(); | ||
| int lookupContentLength = key.getContentLength(); |
There was a problem hiding this comment.
can we add some UTs to validate this.
There was a problem hiding this comment.
Done 82abf7d. Added a focused test in TestHFileReader that writes a small-block-size HFile (several entries per data block) and validates seekTo across SEEK_TO_FOUND, SEEK_TO_IN_RANGE, SEEK_TO_BEFORE_FILE_FIRST_KEY, and SEEK_TO_EOF; the FOUND/IN_RANGE cases land mid-block so they exercise the new buffer-direct key comparison and the deferred cursor read.
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR avoids per-entry KeyValue/Key allocation in HFileDataBlock.seekTo by comparing the entry key directly against the backing buffer and computing the stride from the on-disk length fields, materializing a KeyValue only on exact match. Layout, comparison semantics, and cursor lazy-read behavior all line up with the original code, and the new test exercises both buffer-direct comparison and the deferred-cursor in-range path across multiple data blocks. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
yihua
left a comment
There was a problem hiding this comment.
Blocked on my review as the core change to the file format
Describe the issue this Pull Request addresses
The native HFile reader's
HFileDataBlock.seekTois the hottest inner loop on the metadata-table read path (record-level index, bloom filter and column-stats point lookups, which run on essentially every write). For each entry it scanned it allocated aKeyValueand itsKeyjust to compare the entry key against the lookup key and to compute the stride to the next entry, producing two short-lived objects per scanned entry and avoidable GC pressure under point-lookup workloads.Summary and Changelog
HFileDataBlock.seekTonow compares the entry key directly against the backing block buffer and computes the stride from the on-disk length fields, instead of materializing aKeyValue/Keyfor every scanned entry. AKeyValueis materialized only on an exact match. For the "in range" and end-of-block cases the cursor is pointed at the previous offset and the read is deferred, whichgetKeyValue()already performs lazily. The lookup key may be aUTF8StringKey, so its polymorphic content accessors are used for the comparison. No other class is touched and the originalOption-based cursor is unchanged.Impact
No public API or on-disk format change. Lower-allocation, faster point lookups on the metadata-table read path. JMH microbenchmark over an uncompressed HFile fixture (5000 entries, 625 sorted point lookups;
forks(0);gc.alloc.rate.norm):Risk Level
low. The change is confined to one method, preserves all seekTo return codes and the cursor's lazy-read semantics, and is exercised by the existing HFile reader suite (point, prefix, non-unique and fake-first-key seeks, sequential reads, empty file, and HBase read/write compatibility). The full hudi-io module test suite (101 tests) and checkstyle pass.
Documentation Update
none
Contributor's checklist