[HUDI-18139] fix(spark): make post-sync Spark catalog cache refresh best-effort#18975
ad1happy2go wants to merge 2 commits into
…139)

After a successful write + meta-sync, Hudi refreshes the Spark catalog relation cache for the synced table so later reads in the same session see the new data. That refresh re-resolves the table by name, which for a Hudi table eagerly reads its .hoodie metadata from storage.

Two problems, reproduced on Spark 3.5/4.0:
- An unqualified table name resolves against the session's current/`default` database, so a same-named table there (pointing at unrelated/inaccessible storage) is read by mistake. The name is already qualified with the sync database; this keeps that behavior.
- The refresh was unguarded, so any failure (a transient catalog error, or a same-named table backed by storage the writer cannot access) propagated and failed an already-committed, already-synced write.

Extract the invalidation into HoodieSparkSqlWriterInternal.refreshSparkCatalogTableCache, keep the database-qualified name, and wrap each table's refresh in a NonFatal try/catch that logs and continues. Cache invalidation is purely a local-session optimization and must never fail a committed write.

Add TestSparkCatalogCacheRefresh covering both: refresh targets the sync db (not a broken same-named default table) and a failing refresh does not throw.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR makes the post-sync Spark catalog cache refresh best-effort by wrapping per-table refreshes in NonFatal try/catch and preserving the database-qualified name, so a transient catalog error or a same-named table in another database no longer fails an already-committed write. The change is narrowly scoped, well-documented, and the regression tests cover both the wrong-database resolution and the swallow-on-failure paths. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
…fresh

Order org.apache.commons before org.apache.spark (same 3rdParty group, alphabetical) and drop the blank line splitting the group. This clears the two scalastyle violations that failed hudi-spark_2.12 at the compile phase and cascaded across the CI matrix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the fix! This makes the post-sync Spark catalog cache refresh best-effort and ensures the database-qualified name is used so a same-named table in default is never refreshed by mistake. The error handling layers (per-table inner catch + outer database-lookup catch, both NonFatal) look correct, and the tests cover both the wrong-database and swallow-on-failure paths. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18975 +/- ##
============================================
- Coverage 68.07% 67.63% -0.44%
- Complexity 28943 29771 +828
============================================
Files 2519 2562 +43
Lines 140664 145168 +4504
Branches 17428 18337 +909
============================================
+ Hits 95757 98190 +2433
- Misses 37043 38754 +1711
- Partials 7864 8224 +360
// `default.refresh_t`. A buggy (unqualified) refresh would resolve `default.refresh_t`
// and throw here.
HoodieSparkSqlWriterInternal.refreshSparkCatalogTableCache(spark, syncDb, Seq(tableName))
This case cannot catch a regression to an unqualified refresh. refreshSparkCatalogTableCache wraps each refresh in a NonFatal catch, so even if the name resolved unqualified to the broken default.refresh_t, the error would be logged and swallowed, not thrown - the call returns normally and the test still passes. That makes the comment "A buggy (unqualified) refresh would resolve default.refresh_t and throw here" inaccurate, and leaves case (1) doing the same no-throw check as case (2). To actually guard the qualified name, assert the target observably - e.g. pass a spied SparkSession and verify catalog.refreshTable is invoked with the syncDb-qualified name and never the bare table name.
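To illustrate the reviewer's point, here is a minimal sketch that runs without Spark. `refreshBestEffort` is a hypothetical helper shaped like the one under review, and its `refresh` parameter is an assumed stand-in for the catalog refresh call; neither is the PR's actual code. Because the helper swallows failures, the check records the name passed to `refresh` and asserts on that, instead of relying on whether the call throws.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

object QualifiedNameCheck {
  // Hypothetical helper: qualifies each table with `db` and swallows
  // non-fatal failures, so callers never see an exception.
  def refreshBestEffort(db: String, tables: Seq[String], refresh: String => Unit): Unit =
    tables.foreach { table =>
      try refresh(s"$db.$table")
      catch { case NonFatal(_) => () } // swallowed: a no-throw assertion proves nothing
    }

  // Records every name handed to `refresh`, even though each call fails.
  def recordedTargets(db: String, tables: Seq[String]): Seq[String] = {
    val seen = ArrayBuffer.empty[String]
    refreshBestEffort(db, tables, name => { seen += name; throw new IllegalStateException("boom") })
    seen.toSeq
  }
}
```

With this shape, a regression to the bare table name changes the recorded value and fails the assertion, which a plain no-throw check would not detect. In a real Spark test the recording function would presumably be replaced by a spied session, as the comment suggests.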
Describe the issue this Pull Request addresses
After a successful write + meta-sync, `HoodieSparkSqlWriter.metaSync` refreshes the Spark catalog relation cache for the synced table so later reads in the same session see the new data. That refresh re-resolves the table by name, which for a Hudi table eagerly reads its `.hoodie` metadata from storage.

Reported in #18139: with AWS Glue sync, the write and sync succeed, but the job then fails with `AccessDenied` while refreshing a same-named table that belongs to the `default` database (pointing at an unrelated, inaccessible bucket).

Two problems, reproduced on Spark 3.5 / 4.0:
- An unqualified table name resolves against the session's current/`default` database, so a same-named table there is read by mistake. The name is already qualified with the sync database (`hoodie.datasource.hive_sync.database`); this change preserves that.
- The refresh was unguarded, so any failure (a transient catalog error, or a same-named table backed by storage the writer cannot access) propagated and failed an already-committed, already-synced write.

Summary and Changelog
Cache invalidation after a write is purely a local-session optimization and must never fail a committed write. This change makes the post-sync Spark catalog cache refresh best-effort.
- Extract the invalidation into `HoodieSparkSqlWriterInternal.refreshSparkCatalogTableCache`.
- Keep the database-qualified name, so that a same-named table in another database - e.g. the `default` database - is never resolved and refreshed by mistake.
- Wrap each table's refresh in a `NonFatal` try/catch that logs at WARN and continues, with an outer guard around the database lookup.
- Add `TestSparkCatalogCacheRefresh` covering both the wrong-database resolution and the swallow-on-failure paths.

Impact
A failure while refreshing the Spark catalog cache after a successful write+sync no longer fails the write. The only behavioral change: if a refresh genuinely fails, a subsequent read in the same session may serve stale cached data (logged at WARN); the committed table is unaffected.
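The two guard layers described in the summary could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `lookupDb` and `refresh` are hypothetical stand-ins for the database lookup and the catalog refresh, injected so the example runs without Spark, and the return value (the names that were not refreshed) is added here only to make the behavior observable.

```scala
import scala.util.control.NonFatal

object BestEffortRefresh {
  // Returns the names that could NOT be refreshed; never throws for non-fatal errors.
  def refreshAll(lookupDb: () => String, tables: Seq[String], refresh: String => Unit): Seq[String] =
    try {
      val db = lookupDb() // outer guard covers the database lookup
      tables.flatMap { table =>
        val qualified = s"$db.$table" // always database-qualified
        try {
          refresh(qualified)
          None
        } catch {
          case NonFatal(e) =>
            // Inner guard: log and continue with the next table.
            Console.err.println(s"WARN failed to refresh $qualified: ${e.getMessage}")
            Some(qualified)
        }
      }
    } catch {
      case NonFatal(e) =>
        // Lookup failed: nothing was refreshed, but the committed write is unaffected.
        Console.err.println(s"WARN skipping cache refresh: ${e.getMessage}")
        tables
    }
}
```

Matching on `NonFatal` rather than `Throwable` lets fatal JVM errors such as `OutOfMemoryError` still propagate, while ordinary catalog or storage exceptions are logged and skipped.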
Risk Level
low
Behavior change is limited to the post-sync, best-effort catalog cache invalidation step.
Documentation Update
None required.
Contributor's checklist
🤖 Generated with Claude Code