feat(hive-sync): batch and parallelize JDBC partition operations by nsivabalan · Pull Request #18986 · apache/hudi

nsivabalan · 2026-06-12T04:02:00Z

Summary

Adds opt-in JDBCConnectionPool for JDBC partition sync (issue #18331)
Fans batched ALTER statements across N pooled java.sql.Connection instances
Default off — existing behavior unchanged unless hoodie.datasource.hive_sync.batching.enabled=true

Note: stacked on #18984 (which is stacked on #18983). The diff currently includes both upstream commits. Once #18983 and #18984 merge, this PR will rebase cleanly onto master and the diff will shrink to just the JDBC change (~581 added / 3 removed). Reviewing the top commit f2b079109192 in isolation gives the JDBC-only delta.

Design constraint

JDBC Connection is not thread-safe but is cheap to construct (one TCP socket to HiveServer2). So the design mirrors #18983's IMetaStoreClientPool more than #18984's HiveDriverPool — simple borrow/return with no thread-binding. Each pooled connection is its own HiveServer2 session.
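
The borrow/return shape described above can be sketched roughly as follows. This is a minimal illustration, not the PR's JDBCConnectionPool: the class name `SimpleBorrowPool` and its methods are hypothetical, and the pool is generic so that it can be exercised without a HiveServer2 (in the JDBC case the pooled type would be java.sql.Connection).

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

/** Hypothetical minimal borrow/return pool; an assumption, not the PR's class. */
final class SimpleBorrowPool<T> {
  // Idle resources; a borrower blocks here when the pool is exhausted.
  private final BlockingQueue<T> idle;

  SimpleBorrowPool(List<T> resources) {
    if (resources.isEmpty()) {
      throw new IllegalArgumentException("pool size must be at least 1");
    }
    this.idle = new ArrayBlockingQueue<>(resources.size(), false, resources);
  }

  /** Borrows one resource, runs the work on it, and always returns it to the pool. */
  <R> R withResource(Function<T, R> work) {
    T resource;
    try {
      resource = idle.take(); // blocks until some other borrower returns a resource
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for a pooled resource", e);
    }
    try {
      return work.apply(resource);
    } finally {
      idle.add(resource); // capacity equals pool size, so this cannot overflow
    }
  }

  int idleCount() {
    return idle.size();
  }
}
```

Because a resource is held by exactly one borrower at a time, a non-thread-safe object can be used from many threads without binding it to any particular thread.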

What's actually new

The batching gaps for JDBC were small:

ADD: already batched in QueryBasedDDLExecutor.constructAddPartitions (parent class).
DROP: already batched in JDBCExecutor.constructDropPartitions.
TOUCH: became batched in #18984's constructPartitionAlterStatements change.
SET_LOCATION (UPDATE): one statement per partition. Hive SQL cannot batch this (no multi-partition SET LOCATION syntax).

So the win for JDBC is pure parallel execution of the existing batches and statements, not new batching.
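
To make the SET_LOCATION case concrete, here is a rough sketch of emitting one ALTER statement per partition. The class `SetLocationStatements`, the table name, and the paths are hypothetical; the PR's actual statement construction lives in constructPartitionAlterStatements and may quote or qualify identifiers differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; illustrates why SET LOCATION yields one statement per partition. */
final class SetLocationStatements {
  /**
   * @param table table name (quoted with backticks here as an assumption)
   * @param specToLocation partition spec such as dt='2024-01-01' mapped to its new location
   */
  static List<String> build(String table, Map<String, String> specToLocation) {
    List<String> sqls = new ArrayList<>();
    for (Map.Entry<String, String> entry : specToLocation.entrySet()) {
      // Hive accepts a single PARTITION clause per SET LOCATION, so nothing can be merged.
      sqls.add("ALTER TABLE `" + table + "` PARTITION (" + entry.getKey()
          + ") SET LOCATION '" + entry.getValue() + "'");
    }
    return sqls;
  }
}
```

With N changed partitions this produces N independent statements, which is why spreading them across connections is the only available speed-up.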

Hive 2.x quirk handling

Same Hive 2.x USE database quirk we encountered in #18984: ALTER PARTITION SET LOCATION ignores db.tbl qualifiers and silently uses the connection's current database. Each pooled connection is its own HiveServer2 session, so each one needs its own USE database before any partition ALTER. The pool exposes runOnEachConnection(List<String>) for this; JDBCExecutor.runSQLs peels off any leading USE statements from the SQL list and broadcasts them to every pooled connection, then fans the rest round-robin.
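
The peel-and-broadcast step can be sketched as below. This is an assumption-laden illustration: `UseStatementPlanner` and its methods are hypothetical names, and for simplicity the sketch prepends the USE statements to each connection's work list, whereas the PR broadcasts them through runOnEachConnection before the fan-out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical planner; not the PR's JDBCExecutor.runSQLs. */
final class UseStatementPlanner {
  /** Counts the USE statements at the head of the list. */
  static int leadingUseCount(List<String> sqls) {
    int count = 0;
    while (count < sqls.size()
        && sqls.get(count).trim().toLowerCase(Locale.ROOT).startsWith("use ")) {
      count++;
    }
    return count;
  }

  /** Returns one work list per connection: every leading USE, then a round-robin share. */
  static List<List<String>> plan(List<String> sqls, int connections) {
    int useCount = leadingUseCount(sqls);
    List<String> broadcast = sqls.subList(0, useCount);
    List<String> rest = sqls.subList(useCount, sqls.size());

    List<List<String>> perConnection = new ArrayList<>();
    for (int c = 0; c < connections; c++) {
      // Each connection is its own session, so each one needs the USE statements.
      perConnection.add(new ArrayList<>(broadcast));
    }
    for (int i = 0; i < rest.size(); i++) {
      perConnection.get(i % connections).add(rest.get(i));
    }
    return perConnection;
  }
}
```

For example, planning `["USE db", "A", "B", "C"]` over two connections gives `["USE db", "A", "C"]` and `["USE db", "B"]`.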

This is done lazily on first dispatch (not at pool construction) because the database may not yet exist when the pool is built — HoodieHiveSyncClient constructs the pool before HiveSyncTool.syncHoodieTable() runs createDatabase. (I learned this the hard way during testing.)

Design invariant (same as #18983 / #18984)

Pool clients only do partition-row statements.
The session Connection on JDBCExecutor continues to handle table-row work: createDatabase, createTable, updateTableComments, schema evolution.
JDBCBasedMetadataOperator (used as the Thrift-incompatibility fallback) continues to use the same session Connection — unchanged.

Configs

No new configs — reuses everything from #18983:

Key	Default
`hoodie.datasource.hive_sync.batching.enabled`	`false`
`hoodie.datasource.hive_sync.batching.threads`	`4`
`hoodie.datasource.hive_sync.batch_num`	`1000`
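
As a rough illustration of opting in, the three keys above could be set on a java.util.Properties object. The helper class `HiveSyncBatchingProps` is hypothetical, and how the properties reach the sync tool depends on the writer or tool in use, which this sketch does not cover.

```java
import java.util.Properties;

/** Hypothetical helper that only assembles the keys listed in the table above. */
final class HiveSyncBatchingProps {
  static Properties batchingEnabled(int threads, int batchNum) {
    Properties props = new Properties();
    // Opt in; the default is false, which keeps the sequential behavior.
    props.setProperty("hoodie.datasource.hive_sync.batching.enabled", "true");
    // Pool size (default 4).
    props.setProperty("hoodie.datasource.hive_sync.batching.threads", Integer.toString(threads));
    // Partitions per batch (existing config, default 1000).
    props.setProperty("hoodie.datasource.hive_sync.batch_num", Integer.toString(batchNum));
    return props;
  }
}
```

Calling `HiveSyncBatchingProps.batchingEnabled(3, 3)` mirrors the threads=3, batch_num=3 combination used by the end-to-end test in the test plan.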

Test plan

mvn compile on hudi-sync/hudi-hive-sync — clean, 0 Checkstyle violations, 0 RAT issues
mvn test on hudi-sync/hudi-hive-sync — 314 tests, 0 failures, 0 errors
TestJDBCConnectionPool (new, 8 tests) — borrow/return, concurrent-borrow bounding, idempotent close, exhaustion blocking, size validation, executor lifecycle
TestHiveSyncTool#testJDBCSyncWithBatchingEnabled (new) — end-to-end JDBC sync with batching on against the embedded HiveServer2 (10 + 4 partitions, batch_num=3, threads=3)
Existing 305 tests across all three sync modes pass unchanged
Manual benchmark on a ~1k-partition table (planned before flipping default; not blocking)

Stack so far

#18983: HMS pool
#18984: HiveQL pool (stacked on #18983)
#18986 (this PR): JDBC pool (stacked on #18984)

After all three merge, the perf gap reported in #18331 should be closed for all sync modes.

Related: #18331

🤖 Generated with Claude Code

Hive sync partition operations on HMS today serialize through a single IMetaStoreClient and ship entire partition lists in a single Thrift call for TOUCH/UPDATE. For large tables (~2k partitions) this is ~5-9x slower than parallel implementations (see apache#18331). The biggest contributors are (1) one giant alter_partitions call for UPDATE/TOUCH, and (2) per-partition Thrift round-trips for DROP, all sequential.

This change introduces an opt-in IMetaStoreClientPool gated behind hoodie.datasource.hive_sync.batching.enabled (default false). When on, HMSDDLExecutor splits ADD / UPDATE / TOUCH / DROP into batches of hoodie.datasource.hive_sync.batch_num (existing config, default 1000) and fans them out across a pool of RetryingMetaStoreClient instances sized by hoodie.datasource.hive_sync.batching.threads (default 4).

Design invariant: only partition-row operations go through the pool. Table-row operations (createTable, alter_table, last-commit-time-synced, writer-version, table-comments) stay on the existing session client, so there is no lost-update risk on table parameters. The sync flow remains serial-parallel-serial (phase 1: table setup, phase 2: parallel partition fan-out, phase 3: table finalization).

Sequential fallback is preserved when the flag is off or when HIVE_SYNC_USE_SPARK_CATALOG is on (incompatible with the pool's direct RetryingMetaStoreClient.getProxy path).

Tests:
- TestIMetaStoreClientPool covers borrow/return, concurrent borrows, close idempotency.
- TestHiveSyncTool.testHMSSyncWithBatchingEnabled exercises end-to-end sync against the embedded HMS with batching on.

Related: apache#18331
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
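
The split-then-fan-out pattern in this commit message can be sketched in plain Java as follows. `BatchFanOut` is a hypothetical name and the sketch uses a generic fixed thread pool; the actual change routes each batch through a pooled RetryingMetaStoreClient, which is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical sketch of batching plus parallel execution; not HMSDDLExecutor itself. */
final class BatchFanOut {
  /** Splits items into consecutive batches of at most batchSize elements. */
  static <T> List<List<T>> toBatches(List<T> items, int batchSize) {
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < items.size(); start += batchSize) {
      int end = Math.min(start + batchSize, items.size());
      batches.add(new ArrayList<>(items.subList(start, end)));
    }
    return batches;
  }

  /** Runs op once per batch on a fixed-size thread pool and waits for all batches. */
  static <T> void runInParallel(List<List<T>> batches, int threads, Consumer<List<T>> op) {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (List<T> batch : batches) {
        futures.add(executor.submit(() -> op.accept(batch)));
      }
      for (Future<?> future : futures) {
        try {
          future.get(); // surfaces the first failure instead of swallowing it
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while waiting for a batch", e);
        } catch (ExecutionException e) {
          throw new IllegalStateException("batch failed", e.getCause());
        }
      }
    } finally {
      executor.shutdown();
    }
  }
}
```

With 10 partitions and a batch size of 3, `toBatches` yields four batches of sizes 3, 3, 3, and 1.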

Follow-up to apache#18983 (HMS parallelism). Applies the equivalent treatment to the HiveQL sync mode (hoodie.datasource.hive_sync.mode=hiveql).

HiveQL had two issues that this change addresses:

1. Batching gaps in QueryBasedDDLExecutor.constructPartitionAlterStatements: TOUCH concatenated every partition into one giant ALTER TABLE ... TOUCH PARTITION (...) PARTITION (...) ... statement; SET_LOCATION (UPDATE) emitted one statement per partition. ADD was already batched.
2. Sequential SQL execution in HiveQueryDDLExecutor.updateHiveSQLs: even when batches existed, they ran in a single for-loop on one Hive Driver.

This change introduces HiveDriverPool, an eager pool of single-thread executors each owning a Hive Driver bound to a shared SessionState. Gated behind the existing hoodie.datasource.hive_sync.batching.enabled flag (default off) and sized by hoodie.datasource.hive_sync.batching.threads (default 4), with no new configs.

Design notes:
- Hive's Driver and SessionState are thread-bound. SessionState.start() attaches to the calling thread's ThreadLocal. The pool gives each slot its own dedicated worker thread so the Driver stays valid for that thread's lifetime. Bootstrap, dispatch, and close all run on the bound thread.
- SessionState is shared across workers (lazily constructed once), because each worker calls SessionState.start(sharedState) on its own thread to attach. Constructing one SessionState per worker triggered race conditions in Hive's resource-directory machinery on macOS.
- TOUCH is now batched by HIVE_BATCH_SYNC_PARTITION_NUM. SET_LOCATION remains one statement per partition (Hive SQL doesn't support multi-partition SET LOCATION) but is now fanned out across workers.
- Hive 2.x's ALTER PARTITION SET LOCATION ignores db.table qualifiers and silently uses the connection's current database, so the leading USE database statement is load-bearing. The pool peels it off and runs it on every worker via runOnEachWorker() before fanning the rest out.

Tests:
- TestHiveDriverPool: bootstrap, dispatch round-robin, error propagation, concurrent-borrow bounding, close idempotency.
- TestHiveSyncTool.testHiveQLSyncWithBatchingEnabled: end-to-end with batching.enabled=true, threads=3, batch_num=3 against embedded HMS.
- TestHiveSyncTool.testHiveQLTouchPartitionsWithBatching: exercises the batched TOUCH path specifically.
- Full hudi-hive-sync suite: 305 passed, 0 failures, 0 errors.

Related: apache#18331
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
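
The thread-binding idea, one dedicated worker thread per slot with round-robin dispatch, can be sketched without any Hive classes. `ThreadBoundWorkers` and its methods are hypothetical; the real HiveDriverPool additionally bootstraps a Driver and attaches the shared SessionState on each worker thread, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of thread-bound slots; not the PR's HiveDriverPool. */
final class ThreadBoundWorkers {
  private final List<ExecutorService> workers = new ArrayList<>();
  private final AtomicInteger next = new AtomicInteger();

  ThreadBoundWorkers(int size) {
    for (int i = 0; i < size; i++) {
      // One single-thread executor per slot: every task for a slot runs on the same thread.
      workers.add(Executors.newSingleThreadExecutor(runnable -> {
        Thread thread = new Thread(runnable);
        thread.setDaemon(true); // do not keep the JVM alive if close() is skipped
        return thread;
      }));
    }
  }

  /** Submits a task to the next slot in round-robin order. */
  <R> Future<R> dispatch(Callable<R> task) {
    int slot = Math.floorMod(next.getAndIncrement(), workers.size());
    return workers.get(slot).submit(task);
  }

  /** Waits for a result, converting checked exceptions to unchecked ones. */
  static <R> R await(Future<R> future) {
    try {
      return future.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for a worker", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("worker task failed", e.getCause());
    }
  }

  void close() {
    for (ExecutorService worker : workers) {
      worker.shutdown();
    }
  }
}
```

Any state that must live in a ThreadLocal can be set up by the first task on each slot and will still be attached when later tasks run on that slot.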

Follow-up to apache#18983 (HMS pool) and apache#18984 (HiveQL Driver pool). Applies the equivalent treatment to JDBC sync mode.

The JDBC executor's partition phase ran every SQL statement sequentially on a single shared java.sql.Connection. ADD and DROP were already batched in the parent QueryBasedDDLExecutor (and in JDBCExecutor.constructDropPartitions); TOUCH became batched in apache#18984. The only batching gap that remains is SET_LOCATION (UPDATE), which Hive SQL fundamentally cannot batch (no multi-partition SET LOCATION syntax). So the win for JDBC is pure parallel execution.

This change introduces JDBCConnectionPool, a simple borrow/return pool of N java.sql.Connection instances. JDBC Connection is not thread-safe but is cheap to construct (one TCP socket to HiveServer2), so the design mirrors IMetaStoreClientPool more than HiveDriverPool; no thread-binding is needed. Gated behind the existing hoodie.datasource.hive_sync.batching.enabled flag (default false) and sized by hoodie.datasource.hive_sync.batching.threads (default 4), with no new configs.

Design notes:
- JDBCExecutor keeps its session Connection for table-row work (createDatabase, createTable, schema evolution, updateTableComments). JDBCBasedMetadataOperator continues to use that session Connection, unchanged.
- The pool is purely additive for partition fan-out. Pool clients only run partition-row statements.
- Hive 2.x's ALTER PARTITION SET LOCATION ignores db.table qualifiers and silently uses the connection's current database. The pool's runOnEachConnection() broadcasts the leading USE database statement to every connection before partition fan-out, same pattern as HiveDriverPool.runOnEachWorker() from apache#18984. This is run lazily on first dispatch, not at pool construction (the database may not exist yet at construction time).

Tests:
- TestJDBCConnectionPool: borrow/return, concurrent-borrow bounding, close idempotency, exhaustion blocking, size validation.
- TestHiveSyncTool.testJDBCSyncWithBatchingEnabled: end-to-end JDBC sync with batching.enabled=true, threads=3, batch_num=3 against the embedded HiveServer2.
- Full hudi-hive-sync suite: 314 passed, 0 failures.

Related: apache#18331
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hudi-bot · 2026-06-12T05:41:04Z

CI report:

f2b0791 Azure: SUCCESS


codecov-commenter · 2026-06-13T01:06:53Z

Codecov Report

❌ Patch coverage is 68.36735% with 155 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.25%. Comparing base (8933224) to head (f2b0791).
⚠️ Report is 14 commits behind head on master.

Files with missing lines	Patch %	Lines
.../org/apache/hudi/hive/util/JDBCConnectionPool.java	51.48%	46 Missing and 3 partials ⚠️
...java/org/apache/hudi/hive/util/HiveDriverPool.java	74.19%	29 Missing and 3 partials ⚠️
...in/java/org/apache/hudi/hive/ddl/JDBCExecutor.java	49.05%	21 Missing and 6 partials ⚠️
...rg/apache/hudi/hive/util/IMetaStoreClientPool.java	72.05%	16 Missing and 3 partials ⚠️
...org/apache/hudi/hive/ddl/HiveQueryDDLExecutor.java	54.16%	6 Missing and 5 partials ⚠️
.../java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java	84.12%	7 Missing and 3 partials ⚠️
...ava/org/apache/hudi/hive/HoodieHiveSyncClient.java	77.41%	6 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18986      +/-   ##
============================================
- Coverage     68.26%   68.25%   -0.02%     
- Complexity    29513    29564      +51     
============================================
  Files          2542     2545       +3     
  Lines        142627   142987     +360     
  Branches      17788    17850      +62     
============================================
+ Hits          97369    97593     +224     
- Misses        37253    37369     +116     
- Partials       8005     8025      +20

Flag	Coverage Δ
common-and-other-modules	`44.84% <68.36%> (+0.06%)`	⬆️
hadoop-mr-java-client	`44.75% <ø> (+<0.01%)`	⬆️
spark-client-hadoop-common	`48.06% <ø> (+0.01%)`	⬆️
spark-java-tests	`48.59% <2.44%> (-0.18%)`	⬇️
spark-scala-tests	`44.66% <5.71%> (-0.18%)`	⬇️
utilities	`37.11% <5.71%> (-0.15%)`	⬇️


Files with missing lines	Coverage Δ
...ava/org/apache/hudi/hive/HiveSyncConfigHolder.java	`99.21% <100.00%> (+0.08%)`	⬆️
...rg/apache/hudi/hive/ddl/QueryBasedDDLExecutor.java	`87.12% <100.00%> (+1.07%)`	⬆️
...ava/org/apache/hudi/hive/HoodieHiveSyncClient.java	`50.75% <77.41%> (+1.82%)`	⬆️
.../java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java	`80.86% <84.12%> (+0.51%)`	⬆️
...org/apache/hudi/hive/ddl/HiveQueryDDLExecutor.java	`62.63% <54.16%> (-3.04%)`	⬇️
...rg/apache/hudi/hive/util/IMetaStoreClientPool.java	`72.05% <72.05%> (ø)`
...in/java/org/apache/hudi/hive/ddl/JDBCExecutor.java	`64.42% <49.05%> (-8.77%)`	⬇️
...java/org/apache/hudi/hive/util/HiveDriverPool.java	`74.19% <74.19%> (ø)`
.../org/apache/hudi/hive/util/JDBCConnectionPool.java	`51.48% <51.48%> (ø)`

... and 63 files with indirect coverage changes


nsivabalan and others added 3 commits June 11, 2026 16:54

nsivabalan mentioned this pull request Jun 12, 2026

[IMPROVEMENT] Hive Sync partition operations lack batching and parallelism, causing 4x-9x slowdown for large tables #18331

Open

github-actions Bot added the size:XL PR with lines of changes > 1000 label Jun 12, 2026

nsivabalan marked this pull request as draft June 12, 2026 15:12

nsivabalan wants to merge 3 commits into apache:master from nsivabalan:jdbc-parallelize-calls
