Skip to content

[HUDI-18139] fix(spark): make post-sync Spark catalog cache refresh best-effort#18975

Open
ad1happy2go wants to merge 2 commits into
apache:masterfrom
ad1happy2go:HUDI-18139-catalog-refresh-database
Open

[HUDI-18139] fix(spark): make post-sync Spark catalog cache refresh best-effort#18975
ad1happy2go wants to merge 2 commits into
apache:masterfrom
ad1happy2go:HUDI-18139-catalog-refresh-database

Conversation

@ad1happy2go

@ad1happy2go ad1happy2go commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Describe the issue this Pull Request addresses

After a successful write + meta-sync, HoodieSparkSqlWriter.metaSync refreshes the Spark catalog relation cache for the synced table so later reads in the same session see the new data. That refresh re-resolves the table by name, which for a Hudi table eagerly reads its .hoodie metadata from storage.

Reported in #18139: with AWS Glue sync, the write and sync succeed, but the job then fails with AccessDenied while refreshing a same-named table that belongs to the default database (pointing at an unrelated, inaccessible bucket).

Two problems, reproduced on Spark 3.5 / 4.0:

  • An unqualified table name resolves against the session's current/default database, so a same-named table there is read by mistake. The name is already qualified with the sync database (hoodie.datasource.hive_sync.database); this change preserves that.
  • The refresh was unguarded, so any failure (a transient catalog error, or a same-named table backed by storage the writer cannot access) propagated and failed an already-committed, already-synced write.

Summary and Changelog

Cache invalidation after a write is purely a local-session optimization and must never fail a committed write. This change makes the post-sync Spark catalog cache refresh best-effort.

  • Extract the invalidation into HoodieSparkSqlWriterInternal.refreshSparkCatalogTableCache.
  • Keep the database-qualified name so a same-named table in another database - in particular the session's current/default database - is never resolved and refreshed by mistake.
  • Wrap each table's refresh in a NonFatal try/catch that logs at WARN and continues, with an outer guard around the database lookup.
  • Add TestSparkCatalogCacheRefresh covering both the wrong-database resolution and the swallow-on-failure paths.

Impact

A failure while refreshing the Spark catalog cache after a successful write+sync no longer fails the write. The only behavioral change: if a refresh genuinely fails, a subsequent read in the same session may serve stale cached data (logged at WARN); the committed table is unaffected.

Risk Level

low

Behavior change is limited to the post-sync, best-effort catalog cache invalidation step.

Documentation Update

None required.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable

🤖 Generated with Claude Code

…139)

After a successful write + meta-sync, Hudi refreshes the Spark catalog
relation cache for the synced table so later reads in the same session see
the new data. That refresh re-resolves the table by name, which for a Hudi
table eagerly reads its .hoodie metadata from storage.

Two problems, reproduced on Spark 3.5/4.0:
- An unqualified table name resolves against the session's current/`default`
  database, so a same-named table there (pointing at unrelated/inaccessible
  storage) is read by mistake. The name is already qualified with the sync
  database; this keeps that behavior.
- The refresh was unguarded, so any failure (a transient catalog error, or a
  same-named table backed by storage the writer cannot access) propagated and
  failed an already-committed, already-synced write.

Extract the invalidation into HoodieSparkSqlWriterInternal.refreshSparkCatalog
TableCache, keep the database-qualified name, and wrap each table's refresh in
a NonFatal try/catch that logs and continues. Cache invalidation is purely a
local-session optimization and must never fail a committed write.

Add TestSparkCatalogCacheRefresh covering both: refresh targets the sync db
(not a broken same-named default table) and a failing refresh does not throw.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR makes the post-sync Spark catalog cache refresh best-effort by wrapping per-table refreshes in NonFatal try/catch and preserving the database-qualified name, so a transient catalog error or a same-named table in another database no longer fails an already-committed write. The change is narrowly scoped, well-documented, and the regression tests cover both the wrong-database resolution and the swallow-on-failure paths. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label Jun 14, 2026
…fresh

Order org.apache.commons before org.apache.spark (same 3rdParty group,
alphabetical) and drop the blank line splitting the group, clearing the two
scalastyle violations that failed hudi-spark_2.12 at the compile phase and
cascaded the CI matrix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the fix! This makes the post-sync Spark catalog cache refresh best-effort and ensures the database-qualified name is used so a same-named table in default is never refreshed by mistake. The error handling layers (per-table inner catch + outer database-lookup catch, both NonFatal) look correct, and the tests cover both the wrong-database and swallow-on-failure paths. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 58.82353% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.63%. Comparing base (91f341f) to head (d98d640).
⚠️ Report is 144 commits behind head on master.

Files with missing lines Patch % Lines
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala 58.82% 4 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18975      +/-   ##
============================================
- Coverage     68.07%   67.63%   -0.44%     
- Complexity    28943    29771     +828     
============================================
  Files          2519     2562      +43     
  Lines        140664   145168    +4504     
  Branches      17428    18337     +909     
============================================
+ Hits          95757    98190    +2433     
- Misses        37043    38754    +1711     
- Partials       7864     8224     +360     
Flag Coverage Δ
common-and-other-modules 44.77% <0.00%> (+0.43%) ⬆️
hadoop-mr-java-client 44.70% <ø> (-0.24%) ⬇️
spark-client-hadoop-common 48.11% <ø> (-0.32%) ⬇️
spark-java-tests 48.30% <11.76%> (-0.34%) ⬇️
spark-scala-tests 44.48% <58.82%> (-0.28%) ⬇️
utilities 37.22% <11.76%> (-0.46%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala 78.59% <58.82%> (+0.12%) ⬆️

... and 413 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hudi-bot

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

// `default.refresh_t`. A buggy (unqualified) refresh would resolve `default.refresh_t`
// and throw here.
HoodieSparkSqlWriterInternal.refreshSparkCatalogTableCache(spark, syncDb, Seq(tableName))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case cannot catch a regression to an unqualified refresh. refreshSparkCatalogTableCache wraps each refresh in a NonFatal catch, so even if the name resolved unqualified to the broken default.refresh_t, the error would be logged and swallowed, not thrown - the call returns normally and the test still passes. That makes the comment "A buggy (unqualified) refresh would resolve default.refresh_t and throw here" inaccurate, and leaves case (1) doing the same no-throw check as case (2). To actually guard the qualified name, assert the target observably - e.g. pass a spied SparkSession and verify catalog.refreshTable is invoked with the syncDb-qualified name and never the bare table name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants