Skip to content

fix(spark): runtime-merge full file groups for MOR incremental queries (#18943)#19005

Open
ad1happy2go wants to merge 2 commits into
apache:masterfrom
ad1happy2go:issue-18943-partial-update-incremental-query
Open

fix(spark): runtime-merge full file groups for MOR incremental queries (#18943)#19005
ad1happy2go wants to merge 2 commits into
apache:masterfrom
ad1happy2go:issue-18943-partial-update-incremental-query

Conversation

@ad1happy2go

Copy link
Copy Markdown
Collaborator

Describe the issue this Pull Request addresses

Closes #18943.

Incremental queries on a MOR table can return incorrect rows. Two manifestations:

  • Partial updates: only the changed columns come back (the other columns are null/garbled), because a partial-update log block holds only the changed columns and the base-file row is dropped before the runtime merge.
  • EVENT_TIME_ORDERING (even without partial updates): a window write with a lower ordering value can surface even though the existing higher-ordering version should win.

Snapshot and read-optimized queries are correct; only the incremental path is affected. Because SQL MERGE INTO on MOR writes partial log blocks by default (hoodie.spark.sql.merge.into.partial.updates), this hits the common case.

Summary and Changelog

For each file group touched in the incremental window, the read now loads the full file slice (base file + log files) and runs the standard file-group reader merge, then filters the merged output to the window by commit time. The fix is unconditional — no new config and no write-path change.

  • MergeOnReadIncrementalRelationV2: build the incremental file-system view from the (modified) partition listing (metadata-table-aware) so each slice carries its base file, scoped back to the file groups actually touched in the window; bound the view timeline to the window end and close the view after use.
  • HoodieFileGroupReaderBasedFileFormat: for incremental merging file groups, do not push the commit-time span filter into the file reads — apply it on the merged output instead. Also set an InstantRange bounded to the window end on the reader context, so base records and log blocks committed after the window are not merged in (a record updated again after the window must be returned with its value as of the window end, not its latest value).
  • HoodieReaderContext: add setInstantRange so the read path can bound the merge inputs to the query window.
  • Tests in TestPartialUpdateForMergeInto.

Scope: only the V2 relation listing changed (table version 8+); the format-level gate covers both V1 and V2 reads. No code was copied.

Impact

Incremental queries on MOR tables now return correct, fully-merged rows. Performance: the incremental file listing now builds a metadata-table-aware file-system view over the modified partitions (instead of a window-files-only view) and scopes it to the touched file groups — this can enumerate more file groups on wide partitions, but only the touched groups are read and merged.

Risk Level

medium — this changes the MOR incremental read path. Mitigated by:

  • Snapshot and read-optimized read paths are unchanged; the change is gated on incremental + merging file groups.
  • Broad new test coverage in TestPartialUpdateForMergeInto: partial-update incremental (commit/event-time ordering, avro/parquet, partitioned), non-partial event-time ordering, window bounds, insert+update in one window, multiple partial updates to one key, exclusion of commits after the window end, post-compaction merge, and a COW non-regression.
  • Verified locally on Spark 4.

Documentation Update

none — bug fix, no new config or user-facing API change.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

apache#18943)

Incremental queries on a MOR table could return incorrect rows because the read
resolved each changed record only from the files written within the incremental
window, without a correct runtime merge against the rest of the file group:

- EVENT_TIME_ORDERING (no partial updates): a record written in the window may
  carry a lower ordering value than the version already in the base/earlier log,
  so the existing version should win. A window-only view cannot determine the
  winner and could surface the losing write.
- Partial updates: a window log block holds only the changed columns, so the
  unchanged columns must be filled in from the base file; otherwise the row comes
  back partial/garbled. SQL MERGE INTO on MOR writes partial blocks by default
  (hoodie.spark.sql.merge.into.partial.updates), so this hits the common case.

Snapshot and read-optimized queries were already correct; the bug was isolated
to the incremental file listing plus commit-time filter pushdown.

Fix: for each file group touched in the incremental window, read its base file
plus its log files (the full file slice) and run the standard file-group reader
merge, then filter the merged output to the incremental window by commit time.
Unconditional - no new config and no write-path change.

- MergeOnReadIncrementalRelationV2: build the incremental file system view from
  the (modified) partition listing (metadata-table-aware) so each slice carries
  its base file, scoped back to the file groups actually touched in the window;
  bound the view timeline to the window end and close the view after use.
- HoodieFileGroupReaderBasedFileFormat: for incremental merging file groups, do
  not push the commit-time span filter into the file reads; apply it on the
  merged output instead. Also set an InstantRange bounded to the window end on
  the reader context so base records and log blocks committed after the window
  are not merged in (a record updated again after the window must be returned
  with its value as of the window end, not its latest value).
- HoodieReaderContext: add setInstantRange so the read path can bound the merge
  inputs to the query window.

Scope: only the V2 relation listing changed (table version 8+); the format-level
gate covers both V1 and V2 reads.

Tests in TestPartialUpdateForMergeInto cover partial-update incremental
(commit/event-time ordering, avro/parquet, partitioned), non-partial event-time
ordering, window bounds, insert+update in one window, multiple partial updates
to one key, exclusion of later commits, post-compaction merge, and COW
non-regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Jun 15, 2026
@ad1happy2go ad1happy2go force-pushed the issue-18943-partial-update-incremental-query branch 2 times, most recently from cb0b1e9 to 6bfaf46 Compare June 15, 2026 11:01
@ad1happy2go ad1happy2go marked this pull request as ready for review June 16, 2026 05:58
…queries

The incremental file listing introduced in apache#18943 builds a fresh,
metadata-aware file-system view scoped to the touched file groups. This
fixed runtime-merge correctness but silently dropped the fail-early
contract verified by TestIncrementalReadWithFullTableScan: when the
full-table-scan fallback is disabled and files referenced by the window
have been removed (e.g. by cleaning), the query used to throw a
read-time FileNotFoundException; with the fresh listing it instead
returned an empty/partial result because that listing no longer sees the
missing files.

Restore the prior behavior: extract the missing-files check into
hasMissingAffectedFiles (reused by fullTableScan), and when the fallback
is off but files are missing, list slices directly from the recorded
affected files (the pre-apache#18943 path) so the scan surfaces the
file-not-found error pointing at
hoodie.datasource.read.incr.fallback.fulltablescan.enable. The
merge-correct listing is still used whenever the window's files are
present, so the partial-update incremental tests are unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ad1happy2go ad1happy2go force-pushed the issue-18943-partial-update-incremental-query branch from 6bfaf46 to e54305f Compare June 16, 2026 06:14
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 75.00000% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.62%. Comparing base (86d1650) to head (e54305f).
⚠️ Report is 29 commits behind head on master.

Files with missing lines Patch % Lines
...parquet/HoodieFileGroupReaderBasedFileFormat.scala 71.87% 1 Missing and 8 partials ⚠️
...apache/hudi/MergeOnReadIncrementalRelationV2.scala 77.27% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #19005      +/-   ##
============================================
- Coverage     68.25%   67.62%   -0.63%     
- Complexity    29509    29799     +290     
============================================
  Files          2542     2562      +20     
  Lines        142632   145152    +2520     
  Branches      17789    18346     +557     
============================================
+ Hits          97354    98165     +811     
- Misses        37271    38764    +1493     
- Partials       8007     8223     +216     
Flag Coverage Δ
common-and-other-modules 44.75% <0.00%> (-0.03%) ⬇️
hadoop-mr-java-client 44.69% <0.00%> (+0.01%) ⬆️
spark-client-hadoop-common 48.29% <0.00%> (+0.23%) ⬆️
spark-java-tests 48.31% <75.00%> (-0.49%) ⬇️
spark-scala-tests 44.48% <58.92%> (-0.36%) ⬇️
utilities 37.24% <57.14%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...apache/hudi/common/engine/HoodieReaderContext.java 95.12% <100.00%> (+0.12%) ⬆️
...apache/hudi/MergeOnReadIncrementalRelationV2.scala 64.34% <77.27%> (+2.80%) ⬆️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala 82.84% <71.87%> (-1.17%) ⬇️

... and 142 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hudi-bot

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR fixes a real correctness gap in the MOR incremental read path — partial updates and event-time-ordering windows now go through a proper runtime merge against the full file slice, with the commit-time window applied post-merge and an InstantRange bound to keep later log blocks out. The approach (full slice listing → bounded InstantRange → post-merge commit-time predicate) is consistent with the rest of the file group reader, and the test matrix is genuinely thorough (commit-time/event-time ordering, partitioned tables, window bounds, insert+update mix, multi-update same key, later-commit exclusion, post-compaction merge, COW non-regression). A couple of small things worth a second look in the inline comments — mostly perf and a question about V1 scope. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small naming and duplication nits below; the overall structure and documentation are clear and well-motivated.

} else {
Some(new JPredicate[InternalRow] {
override def test(row: InternalRow): Boolean =
!row.isNullAt(idx) && allowedCommitTimes.contains(row.getUTF8String(idx).toString)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This row.getUTF8String(idx).toString allocates a Java String for every merged row in the file group. Would it be worth pre-converting allowedCommitTimes to Set[UTF8String] once and comparing UTF8String directly to skip the per-row allocation? Could matter on wide incremental windows.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

} finally {
fsView.close()
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Just to double-check the scope: the PR description says only the V2 listing changed, but HoodieFileGroupReaderBasedFileFormat is the read path for both V1 and V2. V1's listing still builds the view from affectedFilesInCommits only (no pre-window base file), so the runtime merge sees only log records and the partial-update / EVENT_TIME_ORDERING bugs remain on V1. Is that intentional (V1 = table-version < 8, deprecated path), or should V1 listing get the same treatment in a follow-up?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed V2-only and intentional: the commit message scopes the listing fix to table version 8+. V1 (MergeOnReadIncrementalRelationV1.scala:114-115 and :136) still builds the window-only HoodieTableFileSystemView(metaClient, timeline, affectedFilesInCommits), so a MOR incremental read on a table-version < 8 table retains the partial-update / event-time gap. If table-version < 8 incremental MOR is still supported, this is worth a tracked follow-up rather than leaving it implicit.


fallbackToFullTableScan && (startInstantArchived
|| affectedFilesInCommits.asScala.exists(fileStatus => !metaClient.getStorage.exists(fileStatus.getPath)))
fallbackToFullTableScan && (startInstantArchived || hasMissingAffectedFiles)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 With fallbackToFullTableScan=false, hasMissingAffectedFiles is now also evaluated in collectFileSplits/listFileSplits to choose between the new and legacy paths, which triggers an exists() per affected file at plan time even when before it was short-circuited. Could be slow for wide windows on object stores (one head request per file). Lazy val avoids repeat eval, so probably fine, but worth confirming this matches the intent.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

engineContext, metaClient, fileIndex.metadataConfig, timeline.findInstantsBeforeOrEquals(latestCommit))
try {
partitionPaths.flatMap { relativePartitionPath =>
fsView.getLatestMergedFileSlicesBeforeOrOn(relativePartitionPath, latestCommit).iterator().asScala

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: fs conventionally means FileSystem/storage throughout Hudi — using it for a FileSlice lambda here could trip up a reader. Could you rename it to slice or fileSlice?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

None
} else {
readSchema.getFieldIndex(HoodieRecord.COMMIT_TIME_METADATA_FIELD).flatMap { idx =>
val allowedCommitTimes: Set[String] = requiredFilters.collect {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the case In(attr, values) if attr == HoodieRecord.COMMIT_TIME_METADATA_FIELD => values.filter(_ != null).map(_.toString) pattern is extracted verbatim in both incrementalWindowEnd (line 397) and here. Have you considered pulling it into a small private helper like commitTimesFromFilters(filters: Seq[Filter]): Set[String] to keep the two callers in sync if the filter shape ever changes?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

ad1happy2go added a commit to ad1happy2go/hudi that referenced this pull request Jun 17, 2026
apache#18943) (apache#19005)

* fix(spark): runtime-merge full file groups for MOR incremental queries (apache#18943)
* fix(spark): preserve fail-early on missing files for MOR incremental queries
* fix(spark): null-safe InstantRange nullable boundary so the window-end-only bound used by the runtime merge does not NPE

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* so that base/log files written by later commits are not visible. Without this bound a record
* updated again after the window would be merged with those later log files and its merged commit
* time would fall outside the window, dropping the in-window change from the result.
*/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This affectedFileGroupIds filter is the only thing scoping the metadata-aware partition view back to the touched file groups, but no test covers a modified partition that holds an untouched sibling file group. testPartialUpdateIncrementalQueryPartitioned writes one updated record per partition, so with MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT=0 each partition is a single file group and this filter is a no-op - removing it would fail no test. Suggest a case that lands multiple file groups in one partition and updates only one, asserting the untouched sibling is excluded. Output stays correct via the post-merge commit-time filter either way, so this guards the scoping/perf path.

* of silently returning an empty/partial result from a fresh listing that no longer sees those
* files.
*/
private def legacyAffectedFileSlices(partitionPaths: Seq[String], latestCommit: String): Seq[FileSlice] = {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

legacyAffectedFileSlices and collectIncrementalFileSlices share the same partitionPaths.flatMap { p => view.getLatestMergedFileSlicesBeforeOrOn(p, latestCommit).iterator().asScala [.filter(...)] } body wrapped in try/finally view.close(), differing only in how the view is built and whether the file-group filter applies. Consider a private helper taking the view plus a FileSlice => Boolean predicate that owns the try/finally close, so the two call sites cannot drift on the close() contract.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Partial updates do not support incremental query

5 participants