fix(record-index-bootstrap): sort record index keys by UTF-8 bytes to match HFile sorting by nada-attia · Pull Request #18941 · apache/hudi

nada-attia · 2026-06-08T15:35:35Z

Describe the issue this Pull Request addresses

The metadata table record index is persisted as HFiles, which order keys by their raw UTF-8 bytes (HoodieHBaseKVComparator / HBase CellComparatorImpl). However, several code paths sort record keys with String.compareTo (UTF-16 code-unit order). For ASCII keys the two orderings are identical, but for binary / non-ASCII record keys they can diverge: a supplementary character is encoded in UTF-16 as surrogate units (0xD800-0xDFFF), which sort below the high BMP code points U+E000-U+FFFF, while its UTF-8 lead byte (0xF0-0xF4) sorts above their lead bytes 0xEE/0xEF. As a result:

Record index bootstrap / writes fail with java.io.IOException: Added a key not lexically larger than previous.
Record index point lookups (readRecordIndex) silently miss the diverging half of keys, because the HFile reader seeks forward without rewinding.
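
To make the divergence concrete, here is a minimal, self-contained sketch (not code from this PR; the class and helper names are hypothetical). It compares one high BMP character with one supplementary character and prints the leading UTF-16 unit and the leading UTF-8 byte of each, which is where the two orderings disagree.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows why String.compareTo and UTF-8 byte order disagree.
class Utf16VsUtf8Order {

  // First UTF-16 code unit, which is what String.compareTo looks at first.
  static int firstUtf16Unit(String s) {
    return s.charAt(0);
  }

  // First UTF-8 byte as an unsigned value (0..255), which is what a
  // byte-wise comparator looks at first.
  static int firstUtf8Byte(String s) {
    return s.getBytes(StandardCharsets.UTF_8)[0] & 0xFF;
  }

  public static void main(String[] args) {
    String bmp = "\uFF5E";        // U+FF5E,  UTF-8: EF BD 9E
    String supp = "\uD83D\uDE00"; // U+1F600, UTF-8: F0 9F 98 80

    // UTF-16 view: 0xFF5E > 0xD83D, so the BMP key sorts AFTER the supplementary key.
    System.out.printf("UTF-16 units: %04X vs %04X%n", firstUtf16Unit(bmp), firstUtf16Unit(supp));
    System.out.println("String.compareTo sign: " + Integer.signum(bmp.compareTo(supp))); // 1

    // UTF-8 view: 0xEF < 0xF0, so the BMP key sorts BEFORE the supplementary key.
    System.out.printf("UTF-8 lead bytes: %02X vs %02X%n", firstUtf8Byte(bmp), firstUtf8Byte(supp));
  }
}
```

A writer that receives these two keys in String.compareTo order would hand the HFile the supplementary key first and the BMP key second, which is the "not lexically larger than previous" case described above.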

JIRA: HUDI-8898

Summary and Changelog

StringUtils: add UTF8_LEXICOGRAPHIC_COMPARATOR and compareUtf8Bytes — an unsigned UTF-8 byte-wise comparison that matches the ordering HFiles enforce.
Sort metadata-table bulk-insert records by UTF-8 bytes in SparkHoodieMetadataBulkInsertPartitioner and JavaHoodieMetadataBulkInsertPartitioner (base-file write path).
Sort HFILE_DATA_BLOCK log-block records by UTF-8 bytes in HoodieAppendHandle (MOR log write path).
Sort record-index lookup keys by UTF-8 bytes in HoodieBackedTableMetadata (getRecordsByKeyPrefixes and the single-slice key set) so forward-only HFile seeks resolve every key.
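
For readers who want a feel for what an unsigned UTF-8 byte-wise comparison looks like, the following is a rough sketch under stated assumptions. It is not the implementation in this PR: the class name, the constant name UTF8_ORDER, and the helper sortedByUtf8 are hypothetical, and it calls String.getBytes directly rather than Hudi's getUTF8Bytes. The point it illustrates is the `& 0xFF` mask, since Java bytes are signed and a plain byte comparison would place every non-ASCII lead byte before ASCII.

```java
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; names other than compareUtf8Bytes are hypothetical.
class Utf8ByteOrder {

  // Compares two strings by the unsigned value of their UTF-8 bytes.
  static int compareUtf8Bytes(String s1, String s2) {
    byte[] b1 = s1.getBytes(StandardCharsets.UTF_8);
    byte[] b2 = s2.getBytes(StandardCharsets.UTF_8);
    int common = Math.min(b1.length, b2.length);
    for (int i = 0; i < common; i++) {
      // Mask to 0..255: (byte) 0xEF is -17 as a signed byte and would
      // otherwise compare as smaller than 'a' (0x61).
      int diff = (b1[i] & 0xFF) - (b2[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    // Equal common prefix: the shorter byte sequence sorts first.
    return b1.length - b2.length;
  }

  // Serializable so the comparator could be shipped to executors (an assumption
  // about how such a constant would be used, not a statement about the PR).
  static final Comparator<String> UTF8_ORDER =
      (Comparator<String> & Serializable) Utf8ByteOrder::compareUtf8Bytes;

  // Returns a copy of the keys sorted in UTF-8 byte order.
  static List<String> sortedByUtf8(List<String> keys) {
    List<String> copy = new ArrayList<>(keys);
    copy.sort(UTF8_ORDER);
    return copy;
  }

  public static void main(String[] args) {
    List<String> keys = Arrays.asList("\uFF5E", "a", "\uD83D\uDE00");
    for (String key : sortedByUtf8(keys)) {
      // Print code points instead of glyphs so the output is terminal-independent.
      System.out.printf("U+%04X%n", key.codePointAt(0));
    }
  }
}
```

Sorting the three sample keys this way yields U+0061, U+FF5E, U+1F600, whereas the natural String order would put U+1F600 before U+FF5E.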

Impact

No public API or storage-format change. Fixes correctness for tables whose record keys contain non-ASCII / binary characters when the metadata record index is enabled. ASCII keys are unaffected (UTF-8 byte order equals String order for ASCII).

Risk Level

low. For ASCII / BMP-only keys the new comparator produces an ordering identical to String.compareTo; only keys containing supplementary characters change order, and that change is precisely the ordering HFiles require. The comparator uses unsigned byte comparison to match CellComparatorImpl exactly, and never collapses distinct keys to equal (UTF-8 encoding is injective on well-formed Unicode strings).

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR adds a UTF-8 byte-wise comparator and applies it to the metadata-table bulk-insert partitioners, the HFile log-block sort in HoodieAppendHandle, and the single-slice / prefix lookup paths in HoodieBackedTableMetadata. One concern worth checking is whether the multi-slice record-index lookup path also needs the same fix — see the inline comment. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small Javadoc direction note in StringUtils; otherwise the changes are well-structured and the comments explaining the HFile ordering rationale are genuinely useful.

hudi-agent · 2026-06-08T15:46:38Z

+   * Compares two strings by their unsigned UTF-8 byte order. See {@link #UTF8_LEXICOGRAPHIC_COMPARATOR}.
+   */
+  public static int compareUtf8Bytes(String s1, String s2) {
+    byte[] b1 = getUTF8Bytes(s1);


🤖 nit: the Javadoc on compareUtf8Bytes just redirects to the constant, but this method is called directly at several sites — so a reader stopping here wants the full "why" (HFile UTF-8 ordering, not UTF-16). Could you move the detailed explanation onto the method and have the constant's Javadoc say something like "Serializable wrapper around {@link #compareUtf8Bytes}; see that method for the full rationale"?

AI-generated; verify before applying. React 👍/👎 to flag quality.

hudi-agent · 2026-06-08T15:46:39Z

🤖 Line 323: Could you confirm this multi-slice path is also covered? mapGroupsByKey → rangeBasedRepartitionForEachKey uses ConditionalRangePartitioner.CompositeKeyComparator, which calls String.compareTo on the value (the encoded key). So inside processFunction (lines 309-321), sortedKeys arrives in UTF-16 order, keysList preserves that order, and lookupRecordsItr → Predicates.in → extractKeys → RecordByKeyIterator.seekTo then does the same forward-only HFile seek on UTF-16-ordered keys. This is the common RLI lookup path (any RLI table with >1 file group), so for tables with non-ASCII keys the same readRecordIndex silent-miss bug described in the PR would still happen here. @yihua does this need a separate fix (e.g. re-sort keysList with UTF8_LEXICOGRAPHIC_COMPARATOR inside the processFunction, or a UTF-8-aware comparator path through mapGroupsByKey)?

AI-generated; verify before applying. React 👍/👎 to flag quality.
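
The silent-miss behaviour discussed in this comment can be reproduced with a small simulation. This is a simplified model, not Hudi or HBase code: the "file" is an in-memory list already in file order, the cursor only ever moves forward, and the class and method names are hypothetical. To keep the sketch short it assumes single-character keys, so comparing the first code point stands in for UTF-8 byte order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified model of a forward-only seek over keys stored in file order.
class ForwardOnlySeekDemo {

  // Stand-in for UTF-8 byte order; valid here only because every sample key
  // is a single well-formed code point.
  static final Comparator<String> FILE_ORDER =
      Comparator.comparingInt((String s) -> s.codePointAt(0));

  // Looks up each key in turn. The cursor advances to the first stored entry
  // that is >= the key in file order and is never rewound.
  static List<String> lookup(List<String> stored, List<String> keys) {
    List<String> found = new ArrayList<>();
    int cursor = 0;
    for (String key : keys) {
      while (cursor < stored.size() && FILE_ORDER.compare(stored.get(cursor), key) < 0) {
        cursor++;
      }
      if (cursor < stored.size() && FILE_ORDER.compare(stored.get(cursor), key) == 0) {
        found.add(key);
      }
    }
    return found;
  }

  public static void main(String[] args) {
    // File order: U+0061, U+FF5E, U+1F600.
    List<String> stored = Arrays.asList("a", "\uFF5E", "\uD83D\uDE00");

    // Lookup keys in String.compareTo order: U+0061, U+1F600, U+FF5E.
    List<String> utf16Sorted = new ArrayList<>(stored);
    utf16Sorted.sort(Comparator.naturalOrder());
    System.out.println("found with UTF-16-sorted keys: " + lookup(stored, utf16Sorted).size()); // 2

    // Lookup keys in file order: every key is found.
    List<String> fileSorted = new ArrayList<>(stored);
    fileSorted.sort(FILE_ORDER);
    System.out.println("found with file-order keys: " + lookup(stored, fileSorted).size()); // 3
  }
}
```

With UTF-16-sorted lookup keys the cursor passes U+FF5E while seeking to U+1F600, so the later request for U+FF5E returns nothing and no error is raised.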

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for adding the regression test! The new testRecordIndexBootstrapWithBinaryRecordKeys exercises the exact failure mode the PR fixes — binary keys whose UTF-16 char order is the reverse of their UTF-8 byte order — and would have caught the original "Added a key not lexically larger than previous" bug. Two prior comments still appear open: the StringUtils.java Javadoc nit, and the multi-slice HoodieBackedTableMetadata readRecordIndex concern (this test pins the RI to 1 file group via withRecordIndexFileGroupCount(1, 1), so it only exercises the single-slice path that was fixed; the mapGroupsByKey multi-slice path is not covered here). No new issues in the test itself. Please take a look at the still-open inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.

hudi-bot · 2026-06-08T17:49:01Z

CI report:

5833d07 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

codecov-commenter · 2026-06-08T18:12:24Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.19%. Comparing base (aac975c) to head (5833d07).

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18941      +/-   ##
============================================
- Coverage     68.20%   68.19%   -0.01%     
+ Complexity    29458    29443      -15     
============================================
  Files          2542     2542              
  Lines        142545   142556      +11     
  Branches      17778    17783       +5     
============================================
+ Hits          97218    97219       +1     
- Misses        37316    37321       +5     
- Partials       8011     8016       +5

Flag	Coverage Δ
common-and-other-modules	`44.68% <93.75%> (+<0.01%)`	⬆️
hadoop-mr-java-client	`44.77% <100.00%> (+0.03%)`	⬆️
spark-client-hadoop-common	`48.05% <100.00%> (+<0.01%)`	⬆️
spark-java-tests	`48.71% <93.75%> (-0.07%)`	⬇️
spark-scala-tests	`44.86% <93.75%> (+<0.01%)`	⬆️
utilities	`37.29% <93.75%> (+0.01%)`	⬆️

Files with missing lines	Coverage Δ
...in/java/org/apache/hudi/io/HoodieAppendHandle.java	`77.87% <100.00%> (ø)`
...adata/JavaHoodieMetadataBulkInsertPartitioner.java	`88.88% <100.00%> (ø)`
...data/SparkHoodieMetadataBulkInsertPartitioner.java	`96.55% <100.00%> (ø)`
...pache/hudi/metadata/HoodieBackedTableMetadata.java	`82.74% <100.00%> (+0.09%)`	⬆️
.../java/org/apache/hudi/common/util/StringUtils.java	`74.12% <100.00%> (+1.73%)`	⬆️

... and 9 files with indirect coverage changes

fix(HUDI-8898): sort record index keys by UTF-8 bytes to match HFile …

be22fd8

…ordering

nada-attia changed the title ~~fix(HUDI-8898): sort record index keys by UTF-8 bytes to match HFile …~~ fix(record-index-bootstrap): sort record index keys by UTF-8 bytes to match HFile … Jun 8, 2026

nada-attia changed the title ~~fix(record-index-bootstrap): sort record index keys by UTF-8 bytes to match HFile …~~ fix(record-index-bootstrap): sort record index keys by UTF-8 bytes to match HFile sorting Jun 8, 2026

hudi-agent reviewed Jun 8, 2026

test(HUDI-8898): add record index bootstrap test for binary record keys

5833d07

github-actions Bot added the size:M PR with lines of changes in (100, 300] label Jun 8, 2026

hudi-agent reviewed Jun 8, 2026

nada-attia wants to merge 2 commits into apache:master from nada-attia:nada.attia/ri-bootstrap-binary-keys-oss

4 participants
