[HUDI-18139] fix(spark): make post-sync Spark catalog cache refresh best-effort by ad1happy2go · Pull Request #18975 · apache/hudi

ad1happy2go · 2026-06-11T07:43:40Z

Describe the issue this Pull Request addresses

After a successful write + meta-sync, HoodieSparkSqlWriter.metaSync refreshes the Spark catalog relation cache for the synced table so later reads in the same session see the new data. That refresh re-resolves the table by name, which for a Hudi table eagerly reads its .hoodie metadata from storage.

Reported in #18139: with AWS Glue sync, the write and sync succeed, but the job then fails with AccessDenied while refreshing a same-named table that belongs to the default database (pointing at an unrelated, inaccessible bucket).

Two problems, reproduced on Spark 3.5 / 4.0:

An unqualified table name resolves against the session's current/default database, so a same-named table there is read by mistake. The name is already qualified with the sync database (hoodie.datasource.hive_sync.database); this change preserves that.
The refresh was unguarded, so any failure (a transient catalog error, or a same-named table backed by storage the writer cannot access) propagated and failed an already-committed, already-synced write.

Summary and Changelog

Cache invalidation after a write is purely a local-session optimization and must never fail a committed write. This change makes the post-sync Spark catalog cache refresh best-effort.

Extract the invalidation into HoodieSparkSqlWriterInternal.refreshSparkCatalogTableCache.
Keep the database-qualified name so a same-named table in another database - in particular the session's current/default database - is never resolved and refreshed by mistake.
Wrap each table's refresh in a NonFatal try/catch that logs at WARN and continues, with an outer guard around the database lookup.
Add TestSparkCatalogCacheRefresh covering both the wrong-database resolution and the swallow-on-failure paths.

Impact

A failure while refreshing the Spark catalog cache after a successful write+sync no longer fails the write. The only behavioral change: if a refresh genuinely fails, a subsequent read in the same session may serve stale cached data (logged at WARN); the committed table is unaffected.

Risk Level

low

Behavior change is limited to the post-sync, best-effort catalog cache invalidation step.

Documentation Update

None required.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable

🤖 Generated with Claude Code

…139) After a successful write + meta-sync, Hudi refreshes the Spark catalog relation cache for the synced table so later reads in the same session see the new data. That refresh re-resolves the table by name, which for a Hudi table eagerly reads its .hoodie metadata from storage. Two problems, reproduced on Spark 3.5/4.0: - An unqualified table name resolves against the session's current/`default` database, so a same-named table there (pointing at unrelated/inaccessible storage) is read by mistake. The name is already qualified with the sync database; this keeps that behavior. - The refresh was unguarded, so any failure (a transient catalog error, or a same-named table backed by storage the writer cannot access) propagated and failed an already-committed, already-synced write. Extract the invalidation into HoodieSparkSqlWriterInternal.refreshSparkCatalog TableCache, keep the database-qualified name, and wrap each table's refresh in a NonFatal try/catch that logs and continues. Cache invalidation is purely a local-session optimization and must never fail a committed write. Add TestSparkCatalogCacheRefresh covering both: refresh targets the sync db (not a broken same-named default table) and a failing refresh does not throw. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR makes the post-sync Spark catalog cache refresh best-effort by wrapping per-table refreshes in NonFatal try/catch and preserving the database-qualified name, so a transient catalog error or a same-named table in another database no longer fails an already-committed write. The change is narrowly scoped, well-documented, and the regression tests cover both the wrong-database resolution and the swallow-on-failure paths. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

…fresh Order org.apache.commons before org.apache.spark (same 3rdParty group, alphabetical) and drop the blank line splitting the group, clearing the two scalastyle violations that failed hudi-spark_2.12 at the compile phase and cascaded the CI matrix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the fix! This makes the post-sync Spark catalog cache refresh best-effort and ensures the database-qualified name is used so a same-named table in default is never refreshed by mistake. The error handling layers (per-table inner catch + outer database-lookup catch, both NonFatal) look correct, and the tests cover both the wrong-database and swallow-on-failure paths. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

codecov-commenter · 2026-06-15T04:09:34Z

Codecov Report

❌ Patch coverage is 58.82353% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.63%. Comparing base (91f341f) to head (d98d640).
⚠️ Report is 144 commits behind head on master.

Files with missing lines	Patch %	Lines
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala	58.82%	4 Missing and 3 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18975      +/-   ##
============================================
- Coverage     68.07%   67.63%   -0.44%     
- Complexity    28943    29771     +828     
============================================
  Files          2519     2562      +43     
  Lines        140664   145168    +4504     
  Branches      17428    18337     +909     
============================================
+ Hits          95757    98190    +2433     
- Misses        37043    38754    +1711     
- Partials       7864     8224     +360

Flag	Coverage Δ
common-and-other-modules	`44.77% <0.00%> (+0.43%)`	⬆️
hadoop-mr-java-client	`44.70% <ø> (-0.24%)`	⬇️
spark-client-hadoop-common	`48.11% <ø> (-0.32%)`	⬇️
spark-java-tests	`48.30% <11.76%> (-0.34%)`	⬇️
spark-scala-tests	`44.48% <58.82%> (-0.28%)`	⬇️
utilities	`37.22% <11.76%> (-0.46%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala	`78.59% <58.82%> (+0.12%)`	⬆️

... and 413 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

hudi-bot · 2026-06-15T04:25:45Z

CI report:

d98d640 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

wombatu-kun · 2026-06-15T06:16:24Z

+        //     `default.refresh_t`. A buggy (unqualified) refresh would resolve `default.refresh_t`
+        //     and throw here.
+        HoodieSparkSqlWriterInternal.refreshSparkCatalogTableCache(spark, syncDb, Seq(tableName))
+


This case cannot catch a regression to an unqualified refresh. refreshSparkCatalogTableCache wraps each refresh in a NonFatal catch, so even if the name resolved unqualified to the broken default.refresh_t, the error would be logged and swallowed, not thrown - the call returns normally and the test still passes. That makes the comment "A buggy (unqualified) refresh would resolve default.refresh_t and throw here" inaccurate, and leaves case (1) doing the same no-throw check as case (2). To actually guard the qualified name, assert the target observably - e.g. pass a spied SparkSession and verify catalog.refreshTable is invoked with the syncDb-qualified name and never the bare table name.

hudi-agent reviewed Jun 14, 2026

View reviewed changes

github-actions Bot added the size:M PR with lines of changes in (100, 300] label Jun 14, 2026

hudi-agent reviewed Jun 15, 2026

View reviewed changes

wombatu-kun reviewed Jun 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[HUDI-18139] fix(spark): make post-sync Spark catalog cache refresh best-effort#18975

[HUDI-18139] fix(spark): make post-sync Spark catalog cache refresh best-effort#18975
ad1happy2go wants to merge 2 commits into
apache:masterfrom
ad1happy2go:HUDI-18139-catalog-refresh-database

ad1happy2go commented Jun 11, 2026 •

edited

Loading

Uh oh!

hudi-agent left a comment

Uh oh!

hudi-agent left a comment

Uh oh!

codecov-commenter commented Jun 15, 2026

Uh oh!

hudi-bot commented Jun 15, 2026

Uh oh!

wombatu-kun Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

ad1happy2go commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Jun 15, 2026

Codecov Report

Uh oh!

hudi-bot commented Jun 15, 2026

CI report:

Uh oh!

wombatu-kun Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ad1happy2go commented Jun 11, 2026 •

edited

Loading