fix(record-index-bootstrap): sort record index keys by UTF-8 bytes to match HFile sorting #18941

Open
nada-attia wants to merge 2 commits into apache:master from nada-attia:nada.attia/ri-bootstrap-binary-keys-oss

@nada-attia

Describe the issue this Pull Request addresses

The metadata table record index is persisted as HFiles, which order keys by their raw UTF-8 bytes (HoodieHBaseKVComparator / HBase CellComparatorImpl). However, several code paths sort record keys with String.compareTo, which compares UTF-16 code units. For ASCII keys the two orderings are identical, but for binary / non-ASCII record keys they can diverge: a supplementary character is encoded in UTF-16 as surrogate units (0xD800-0xDFFF), which sort below the high BMP code points U+E000-U+FFFF, whereas its UTF-8 lead byte (0xF0 or above) sorts above their lead bytes 0xEE/0xEF. As a result:

  • Record index bootstrap / writes fail with java.io.IOException: Added a key not lexically larger than previous.
  • Record index point lookups (readRecordIndex) silently miss the diverging half of keys, because the HFile reader seeks forward without rewinding.
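
To make the divergence concrete, here is a minimal sketch (not code from this PR) that compares one high BMP character against one supplementary character under both orderings. The class and helper names are hypothetical, and `Arrays.compareUnsigned` assumes Java 9 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: shows where UTF-16 char order and UTF-8 byte order disagree. */
class Utf16VsUtf8OrderDemo {

  /** Hypothetical helper: unsigned lexicographic comparison of the UTF-8 encodings. */
  static int compareUtf8(String a, String b) {
    return Arrays.compareUnsigned(
        a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    String highBmp = "\uFF5E";             // U+FF5E, UTF-8 bytes EF BD 9E
    String supplementary = "\uD83D\uDE00"; // U+1F600, UTF-8 bytes F0 9F 98 80

    // UTF-16: high surrogate 0xD83D < 0xFF5E, so the supplementary key sorts first.
    System.out.println(supplementary.compareTo(highBmp) < 0);    // prints true
    // UTF-8: lead byte 0xF0 > 0xEF, so the supplementary key sorts last.
    System.out.println(compareUtf8(supplementary, highBmp) > 0); // prints true
  }
}
```

A writer that emits these two keys in String.compareTo order would hand them to an HFile in descending byte order, which is the situation the IOException above reports.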

JIRA: HUDI-8898

Summary and Changelog

  • StringUtils: add UTF8_LEXICOGRAPHIC_COMPARATOR and compareUtf8Bytes — an unsigned UTF-8 byte-wise comparison that matches the ordering HFiles enforce.
  • Sort metadata-table bulk-insert records by UTF-8 bytes in SparkHoodieMetadataBulkInsertPartitioner and JavaHoodieMetadataBulkInsertPartitioner (base-file write path).
  • Sort HFILE_DATA_BLOCK log-block records by UTF-8 bytes in HoodieAppendHandle (MOR log write path).
  • Sort record-index lookup keys by UTF-8 bytes in HoodieBackedTableMetadata (getRecordsByKeyPrefixes and the single-slice key set) so forward-only HFile seeks resolve every key.
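
For readers without the diff open, the following is a rough sketch of what an unsigned UTF-8 byte-wise comparison can look like. It is an assumption-laden stand-in, not the PR's implementation: the class `Utf8KeyOrder` and the method `sortedByUtf8` are hypothetical, and only the name `compareUtf8Bytes` is borrowed from the changelog.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the helper the PR adds; not the PR's actual code. */
class Utf8KeyOrder {

  /** Compares two strings by the unsigned bytes of their UTF-8 encodings. */
  static int compareUtf8Bytes(String s1, String s2) {
    byte[] b1 = s1.getBytes(StandardCharsets.UTF_8);
    byte[] b2 = s2.getBytes(StandardCharsets.UTF_8);
    int common = Math.min(b1.length, b2.length);
    for (int i = 0; i < common; i++) {
      // Mask to 0..255 so bytes >= 0x80 compare as larger than ASCII bytes.
      int diff = (b1[i] & 0xFF) - (b2[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    // A key that is a strict prefix of another sorts first.
    return b1.length - b2.length;
  }

  /** Returns a copy of the keys sorted in UTF-8 byte order. */
  static List<String> sortedByUtf8(List<String> keys) {
    List<String> copy = new ArrayList<>(keys);
    copy.sort(Utf8KeyOrder::compareUtf8Bytes);
    return copy;
  }

  public static void main(String[] args) {
    // Resulting order: "a", "b", U+FF5E, U+1F600.
    List<String> keys = Arrays.asList("\uD83D\uDE00", "b", "\uFF5E", "a");
    System.out.println(sortedByUtf8(keys));
  }
}
```

Encoding both strings on every comparison is simple but allocates; a production comparator might avoid that, which this sketch does not attempt.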

Impact

No public API or storage-format change. Fixes correctness for tables whose record keys contain non-ASCII / binary characters when the metadata record index is enabled. ASCII keys are unaffected (UTF-8 byte order equals String order for ASCII).

Risk Level

low. For ASCII / BMP-only keys the new comparator produces ordering identical to String.compareTo; only keys containing supplementary characters change order, and that change is precisely the HFile-required ordering. The comparator uses unsigned byte comparison to match CellComparatorImpl exactly, and never collapses distinct keys to equal, since distinct well-formed strings always have distinct UTF-8 encodings.
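
The unsigned comparison matters because Java's `byte` is signed. As a small illustration (hypothetical class and method names, not taken from the PR), a signed comparison would place the UTF-8 lead byte of "é" below ASCII "z", breaking even the BMP-only agreement with String.compareTo:

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: why the byte comparison has to be unsigned. */
class SignedVsUnsignedByteDemo {

  /** Naive comparison that treats bytes as signed values (-128..127). */
  static int signedCompare(byte x, byte y) {
    return Byte.compare(x, y);
  }

  /** Comparison that treats bytes as unsigned values (0..255). */
  static int unsignedCompare(byte x, byte y) {
    return Integer.compare(Byte.toUnsignedInt(x), Byte.toUnsignedInt(y));
  }

  public static void main(String[] args) {
    byte z = "z".getBytes(StandardCharsets.UTF_8)[0];         // 0x7A
    byte lead = "\u00E9".getBytes(StandardCharsets.UTF_8)[0]; // 0xC3, lead byte of U+00E9

    System.out.println(signedCompare(lead, z) < 0);   // prints true: 0xC3 is -61 when signed
    System.out.println(unsignedCompare(lead, z) > 0); // prints true: 195 > 122
    System.out.println("\u00E9".compareTo("z") > 0);  // prints true: agrees with unsigned order
  }
}
```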

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@nada-attia nada-attia changed the title fix(HUDI-8898): sort record index keys by UTF-8 bytes to match HFile … fix(record-index-bootstrap): sort record index keys by UTF-8 bytes to match HFile … Jun 8, 2026
@nada-attia nada-attia changed the title fix(record-index-bootstrap): sort record index keys by UTF-8 bytes to match HFile … fix(record-index-bootstrap): sort record index keys by UTF-8 bytes to match HFile sorting Jun 8, 2026

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR adds a UTF-8 byte-wise comparator and applies it to the metadata-table bulk-insert partitioners, the HFile log-block sort in HoodieAppendHandle, and the single-slice / prefix lookup paths in HoodieBackedTableMetadata. One concern worth checking is whether the multi-slice record-index lookup path also needs the same fix — see the inline comment. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small Javadoc direction note in StringUtils; otherwise the changes are well-structured and the comments explaining the HFile ordering rationale are genuinely useful.

* Compares two strings by their unsigned UTF-8 byte order. See {@link #UTF8_LEXICOGRAPHIC_COMPARATOR}.
*/
public static int compareUtf8Bytes(String s1, String s2) {
byte[] b1 = getUTF8Bytes(s1);

🤖 nit: the Javadoc on compareUtf8Bytes just redirects to the constant, but this method is called directly at several sites — so a reader stopping here wants the full "why" (HFile UTF-8 ordering, not UTF-16). Could you move the detailed explanation onto the method and have the constant's Javadoc say something like "Serializable wrapper around {@link #compareUtf8Bytes}; see that method for the full rationale"?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
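
One possible shape for that Javadoc split, sketched under assumptions: the class name `Utf8ComparatorDocs` is hypothetical, the method body is a placeholder using `Arrays.compareUnsigned` (Java 9+) rather than the PR's code, and only the two member names come from the PR description.

```java
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

/** Sketch only: a hypothetical class illustrating the suggested Javadoc layout. */
class Utf8ComparatorDocs {

  /**
   * Serializable wrapper around {@link #compareUtf8Bytes}; see that method for the full rationale.
   */
  static final Comparator<String> UTF8_LEXICOGRAPHIC_COMPARATOR =
      (Comparator<String> & Serializable) Utf8ComparatorDocs::compareUtf8Bytes;

  /**
   * Compares two strings by their unsigned UTF-8 byte order.
   *
   * <p>HFiles order keys by raw UTF-8 bytes, whereas {@link String#compareTo} orders by
   * UTF-16 code units. The two disagree when supplementary characters are compared with
   * code points in U+E000 to U+FFFF, so keys destined for an HFile are compared here.
   */
  static int compareUtf8Bytes(String s1, String s2) {
    // Placeholder body for the sketch; the PR's implementation may differ.
    return Arrays.compareUnsigned(
        s1.getBytes(StandardCharsets.UTF_8), s2.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    System.out.println(UTF8_LEXICOGRAPHIC_COMPARATOR instanceof Serializable); // prints true
  }
}
```

The intersection cast keeps the method reference serializable, which is presumably why a constant exists alongside the static method.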

🤖 Line 323: Could you confirm this multi-slice path is also covered? mapGroupsByKey → rangeBasedRepartitionForEachKey uses ConditionalRangePartitioner.CompositeKeyComparator, which calls String.compareTo on the value (the encoded key). So inside processFunction (lines 309-321), sortedKeys arrives in UTF-16 order, keysList preserves that order, and lookupRecordsItr → Predicates.in → extractKeys → RecordByKeyIterator.seekTo then does the same forward-only HFile seek on UTF-16-ordered keys. This is the common RLI lookup path (any RLI table with >1 file group), so for tables with non-ASCII keys the same readRecordIndex silent-miss bug described in the PR would still happen here. @yihua does this need a separate fix (e.g. re-sort keysList with UTF8_LEXICOGRAPHIC_COMPARATOR inside the processFunction, or a UTF-8-aware comparator path through mapGroupsByKey)?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
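
The silent-miss mechanism can be reproduced with a toy model. The sketch below is not Hudi's HFile reader: `ForwardOnlySeekDemo` and its `lookup` method are hypothetical, and the cursor logic only mimics a reader that seeks forward and never rewinds. `Arrays.compareUnsigned` assumes Java 9 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a forward-only reader; not Hudi's HFile reader. */
class ForwardOnlySeekDemo {

  static int compareUtf8(String a, String b) {
    return Arrays.compareUnsigned(
        a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));
  }

  /** Looks up each key in the given order, never moving the cursor backwards. */
  static List<String> lookup(List<String> stored, List<String> lookupKeys) {
    List<String> found = new ArrayList<>();
    int cursor = 0;
    for (String key : lookupKeys) {
      // Advance while the stored entry is smaller (in byte order) than the requested key.
      while (cursor < stored.size() && compareUtf8(stored.get(cursor), key) < 0) {
        cursor++;
      }
      if (cursor < stored.size() && stored.get(cursor).equals(key)) {
        found.add(key);
      }
    }
    return found;
  }

  public static void main(String[] args) {
    String highBmp = "\uFF5E";             // U+FF5E
    String supplementary = "\uD83D\uDE00"; // U+1F600
    List<String> stored = Arrays.asList(highBmp, supplementary); // UTF-8 byte order

    List<String> utf16Order = new ArrayList<>(stored);
    utf16Order.sort(String::compareTo);               // supplementary first
    List<String> utf8Order = new ArrayList<>(stored);
    utf8Order.sort(ForwardOnlySeekDemo::compareUtf8); // highBmp first

    System.out.println(lookup(stored, utf16Order).size()); // prints 1: one key is missed
    System.out.println(lookup(stored, utf8Order).size());  // prints 2
  }
}
```

In this model, re-sorting the lookup keys with a UTF-8 byte comparator before seeking is enough to recover every key, which is the first of the two options raised above.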

@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label Jun 8, 2026

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for adding the regression test! The new testRecordIndexBootstrapWithBinaryRecordKeys exercises the exact failure mode the PR fixes — binary keys whose UTF-16 char order is the reverse of their UTF-8 byte order — and would have caught the original "Added a key not lexically larger than previous" bug. Two prior comments still appear open: the StringUtils.java Javadoc nit, and the multi-slice HoodieBackedTableMetadata readRecordIndex concern (this test pins the RI to 1 file group via withRecordIndexFileGroupCount(1, 1), so it only exercises the single-slice path that was fixed; the mapGroupsByKey multi-slice path is not covered here). No new issues in the test itself. Please take a look at the still-open inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.

@hudi-bot

hudi-bot commented Jun 8, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.19%. Comparing base (aac975c) to head (5833d07).

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18941      +/-   ##
============================================
- Coverage     68.20%   68.19%   -0.01%     
+ Complexity    29458    29443      -15     
============================================
  Files          2542     2542              
  Lines        142545   142556      +11     
  Branches      17778    17783       +5     
============================================
+ Hits          97218    97219       +1     
- Misses        37316    37321       +5     
- Partials       8011     8016       +5     
Flag | Coverage Δ
common-and-other-modules | 44.68% <93.75%> (+<0.01%) ⬆️
hadoop-mr-java-client | 44.77% <100.00%> (+0.03%) ⬆️
spark-client-hadoop-common | 48.05% <100.00%> (+<0.01%) ⬆️
spark-java-tests | 48.71% <93.75%> (-0.07%) ⬇️
spark-scala-tests | 44.86% <93.75%> (+<0.01%) ⬆️
utilities | 37.29% <93.75%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
...in/java/org/apache/hudi/io/HoodieAppendHandle.java | 77.87% <100.00%> (ø)
...adata/JavaHoodieMetadataBulkInsertPartitioner.java | 88.88% <100.00%> (ø)
...data/SparkHoodieMetadataBulkInsertPartitioner.java | 96.55% <100.00%> (ø)
...pache/hudi/metadata/HoodieBackedTableMetadata.java | 82.74% <100.00%> (+0.09%) ⬆️
.../java/org/apache/hudi/common/util/StringUtils.java | 74.12% <100.00%> (+1.73%) ⬆️

... and 9 files with indirect coverage changes
