fix(record-index-bootstrap): sort record index keys by UTF-8 bytes to match HFile sorting#18941
fix(record-index-bootstrap): sort record index keys by UTF-8 bytes to match HFile sorting#18941nada-attia wants to merge 2 commits into
Conversation
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! The PR adds a UTF-8 byte-wise comparator and applies it to the metadata-table bulk-insert partitioners, the HFile log-block sort in HoodieAppendHandle, and the single-slice / prefix lookup paths in HoodieBackedTableMetadata. One concern worth checking is whether the multi-slice record-index lookup path also needs the same fix — see the inline comment. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small Javadoc direction note in StringUtils; otherwise the changes are well-structured and the comments explaining the HFile ordering rationale are genuinely useful.
| * Compares two strings by their unsigned UTF-8 byte order. See {@link #UTF8_LEXICOGRAPHIC_COMPARATOR}. | ||
| */ | ||
| public static int compareUtf8Bytes(String s1, String s2) { | ||
| byte[] b1 = getUTF8Bytes(s1); |
There was a problem hiding this comment.
🤖 nit: the Javadoc on compareUtf8Bytes just redirects to the constant, but this method is called directly at several sites — so a reader stopping here wants the full "why" (HFile UTF-8 ordering, not UTF-16). Could you move the detailed explanation onto the method and have the constant's Javadoc say something like "Serializable wrapper around {@link #compareUtf8Bytes}; see that method for the full rationale"?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
There was a problem hiding this comment.
🤖 Line 323: Could you confirm this multi-slice path is also covered? mapGroupsByKey → rangeBasedRepartitionForEachKey uses ConditionalRangePartitioner.CompositeKeyComparator, which calls String.compareTo on the value (the encoded key). So inside processFunction (line 309-321), sortedKeys arrives in UTF-16 order, keysList preserves that order, and lookupRecordsItr → Predicates.in → extractKeys → RecordByKeyIterator.seekTo then does the same forward-only HFile seek on UTF-16-ordered keys. This is the common RLI lookup path (any RLI table with >1 file group), so for tables with non-ASCII keys the same readRecordIndex silent-miss bug described in the PR would still happen here. @yihua does this need a separate fix (e.g. resort keysList with UTF8_LEXICOGRAPHIC_COMPARATOR inside the processFunction, or a UTF-8-aware comparator path through mapGroupsByKey)?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for adding the regression test! The new testRecordIndexBootstrapWithBinaryRecordKeys exercises the exact failure mode the PR fixes — binary keys whose UTF-16 char order is the reverse of their UTF-8 byte order — and would have caught the original "Added a key not lexically larger than previous" bug. Two prior comments still appear open: the StringUtils.java Javadoc nit, and the multi-slice HoodieBackedTableMetadata readRecordIndex concern (this test pins the RI to 1 file group via withRecordIndexFileGroupCount(1, 1), so it only exercises the single-slice path that was fixed; the mapGroupsByKey multi-slice path is not covered here). No new issues in the test itself. Please take a look at the still-open inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #18941 +/- ##
============================================
- Coverage 68.20% 68.19% -0.01%
+ Complexity 29458 29443 -15
============================================
Files 2542 2542
Lines 142545 142556 +11
Branches 17778 17783 +5
============================================
+ Hits 97218 97219 +1
- Misses 37316 37321 +5
- Partials 8011 8016 +5
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Describe the issue this Pull Request addresses
The metadata table record index is persisted as HFiles, which order keys by their raw UTF-8 bytes (
HoodieHBaseKVComparator/ HBaseCellComparatorImpl). However, several code paths sort record keys withString.compareTo(UTF-16 char order). For ASCII keys the two orderings are identical, but for binary / non-ASCII record keys they diverge (e.g. supplementary characters, whose UTF-16 surrogate units sort below high BMP code points, but whose UTF-8 lead byte0xF0sorts above0xEE/0xEF). As a result:java.io.IOException: Added a key not lexically larger than previous.readRecordIndex) silently miss the diverging half of keys, because the HFile reader seeks forward without rewinding.JIRA: HUDI-8898
Summary and Changelog
StringUtils: addUTF8_LEXICOGRAPHIC_COMPARATORandcompareUtf8Bytes— an unsigned UTF-8 byte-wise comparison that matches the ordering HFiles enforce.SparkHoodieMetadataBulkInsertPartitionerandJavaHoodieMetadataBulkInsertPartitioner(base-file write path).HFILE_DATA_BLOCKlog-block records by UTF-8 bytes inHoodieAppendHandle(MOR log write path).HoodieBackedTableMetadata(getRecordsByKeyPrefixesand the single-slice key set) so forward-only HFile seeks resolve every key.Impact
No public API or storage-format change. Fixes correctness for tables whose record keys contain non-ASCII / binary characters when the metadata record index is enabled. ASCII keys are unaffected (UTF-8 byte order equals
Stringorder for ASCII).Risk Level
low. For ASCII / BMP-only keys the new comparator produces ordering identical to
String.compareTo; only keys containing supplementary characters change order, and that change is precisely the HFile-required ordering. The comparator uses unsigned byte comparison to matchCellComparatorImplexactly, and never collapses distinct keys to equal (UTF-8 is a bijection).Documentation Update
none
Contributor's checklist