
feat(hive-sync): batch and parallelize JDBC partition operations#18986

Draft
nsivabalan wants to merge 3 commits into apache:master from nsivabalan:jdbc-parallelize-calls



@nsivabalan nsivabalan commented Jun 12, 2026

Contributor

Summary

  • Adds opt-in JDBCConnectionPool for JDBC partition sync (issue #18331)
  • Fans batched ALTER statements across N pooled java.sql.Connection instances
  • Default off — existing behavior unchanged unless hoodie.datasource.hive_sync.batching.enabled=true

Note: stacked on #18984 (which is stacked on #18983). The diff currently includes both upstream commits. Once #18983 and #18984 merge, this PR will rebase cleanly onto master and the diff will shrink to just the JDBC change (~581 added / 3 removed). Reviewing the top commit f2b079109192 in isolation gives the JDBC-only delta.

Design constraint

JDBC Connection is not thread-safe but is cheap to construct (one TCP socket to HiveServer2). So the design mirrors #18983's IMetaStoreClientPool more than #18984's HiveDriverPool — simple borrow/return with no thread-binding. Each pooled connection is its own HiveServer2 session.
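
To make the borrow/return shape concrete, here is a minimal sketch of such a pool. It is not the JDBCConnectionPool from this PR: the class name `BorrowReturnPool`, the method names, and the `Supplier`-based factory are all hypothetical, and the sketch is generic over any `AutoCloseable` so it can be exercised without a HiveServer2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

/** Hypothetical sketch of a borrow/return pool; names are illustrative only. */
final class BorrowReturnPool<T extends AutoCloseable> implements AutoCloseable {
  private final BlockingQueue<T> idle;          // resources currently free to borrow
  private final List<T> all = new ArrayList<>(); // every resource, for close()
  private boolean closed;

  BorrowReturnPool(int size, Supplier<T> factory) {
    if (size <= 0) {
      throw new IllegalArgumentException("pool size must be positive");
    }
    idle = new ArrayBlockingQueue<>(size);
    for (int i = 0; i < size; i++) {
      T resource = factory.get();
      all.add(resource);
      idle.add(resource);
    }
  }

  /** Blocks until a resource is free; the caller owns it exclusively until giveBack. */
  T borrow() throws InterruptedException {
    return idle.take();
  }

  /** Returns a previously borrowed resource to the idle queue. */
  void giveBack(T resource) {
    idle.offer(resource);
  }

  /** Closes every pooled resource once; later calls are no-ops. */
  @Override
  public synchronized void close() throws Exception {
    if (closed) {
      return;
    }
    closed = true;
    for (T resource : all) {
      resource.close();
    }
  }
}
```

Because a borrowed resource is only ever visible to the borrowing thread, the non-thread-safe `Connection` is never shared, and no thread affinity is required. With JDBC, the factory would wrap `DriverManager.getConnection(...)` and translate the checked `SQLException`.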

What's actually new

The batching gaps for JDBC were small:

  • ADD — already batched in QueryBasedDDLExecutor.constructAddPartitions (parent class).
  • DROP — already batched in JDBCExecutor.constructDropPartitions.
  • TOUCH — became batched in #18984's constructPartitionAlterStatements change.
  • SET_LOCATION (UPDATE) — one statement per partition. Hive SQL cannot batch this (no multi-partition SET LOCATION syntax).

So the win for JDBC is pure parallel execution of the existing batches and statements, not new batching.
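
As an illustration of "pure parallel execution", the sketch below runs an already-built list of statements on a fixed number of worker threads and waits for all of them. The class `ParallelStatements` and the `Consumer<String>` runner are assumptions made for the example; the PR's JDBCExecutor dispatches through its connection pool instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical sketch: parallel execution of pre-batched statements. */
final class ParallelStatements {
  /** Runs every statement on a fixed pool of {@code threads} workers and waits for all. */
  static void runAll(List<String> statements, int threads, Consumer<String> runner)
      throws InterruptedException, ExecutionException {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (String sql : statements) {
        futures.add(executor.submit(() -> runner.accept(sql)));
      }
      for (Future<?> future : futures) {
        future.get(); // rethrows a worker failure as ExecutionException
      }
    } finally {
      executor.shutdownNow();
    }
  }
}
```

The statements themselves are unchanged; only the loop that executes them becomes concurrent, which is why SET_LOCATION benefits even though it cannot be batched.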

Hive 2.x quirk handling

Same Hive 2.x USE database quirk we encountered in #18984: ALTER PARTITION SET LOCATION ignores db.tbl qualifiers and silently uses the connection's current database. Each pooled connection is its own HiveServer2 session, so each one needs its own USE database before any partition ALTER. The pool exposes runOnEachConnection(List<String>) for this; JDBCExecutor.runSQLs peels off any leading USE statements from the SQL list and broadcasts them to every pooled connection, then fans the rest round-robin.

This is done lazily on first dispatch (not at pool construction) because the database may not yet exist when the pool is built — HoodieHiveSyncClient constructs the pool before HiveSyncTool.syncHoodieTable() runs createDatabase. (I learned this the hard way during testing.)
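
The peel-and-broadcast step can be sketched as a pure function over the SQL list. This is a simplified illustration, not the code in JDBCExecutor.runSQLs: the class `UseStatementSplitter` and its methods are hypothetical, and it precomputes a per-connection plan rather than dispatching lazily as the PR does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: broadcast leading USE statements, round-robin the rest. */
final class UseStatementSplitter {
  /** Counts the leading statements that start with USE (case-insensitive). */
  static int countLeadingUse(List<String> sqls) {
    int i = 0;
    while (i < sqls.size() && sqls.get(i).trim().regionMatches(true, 0, "USE ", 0, 4)) {
      i++;
    }
    return i;
  }

  /** Every connection gets the USE prefix; remaining statements are spread round-robin. */
  static List<List<String>> plan(List<String> sqls, int connections) {
    int split = countLeadingUse(sqls);
    List<List<String>> perConnection = new ArrayList<>();
    for (int c = 0; c < connections; c++) {
      perConnection.add(new ArrayList<>(sqls.subList(0, split)));
    }
    for (int i = split; i < sqls.size(); i++) {
      perConnection.get((i - split) % connections).add(sqls.get(i));
    }
    return perConnection;
  }
}
```

Running the USE prefix on every connection is what keeps each HiveServer2 session pointed at the right database before any partition ALTER reaches it.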

Design invariant (same as #18983 / #18984)

  • Pool clients only do partition-row statements.
  • The session Connection on JDBCExecutor continues to handle table-row work: createDatabase, createTable, updateTableComments, schema evolution.
  • JDBCBasedMetadataOperator (used as the Thrift-incompatibility fallback) continues to use the same session Connection — unchanged.

Configs

No new configs — reuses everything from #18983:

| Key | Default |
|---|---|
| hoodie.datasource.hive_sync.batching.enabled | false |
| hoodie.datasource.hive_sync.batching.threads | 4 |
| hoodie.datasource.hive_sync.batch_num | 1000 |
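
For reference, opting in might look like the following properties, shown here as a plain `java.util.Properties` object. The helper class `HiveSyncBatchingProps` is hypothetical, the two `batching.*` keys only exist once #18983 is merged, and the values 3/3 simply mirror the end-to-end test in the test plan rather than a recommendation.

```java
import java.util.Properties;

/** Hypothetical helper that assembles the opt-in settings described above. */
final class HiveSyncBatchingProps {
  static Properties jdbcBatchingProps() {
    Properties props = new Properties();
    // Select the JDBC sync mode.
    props.setProperty("hoodie.datasource.hive_sync.mode", "jdbc");
    // Opt in to pooled, parallel partition sync (default: false).
    props.setProperty("hoodie.datasource.hive_sync.batching.enabled", "true");
    // Number of pooled connections (default: 4).
    props.setProperty("hoodie.datasource.hive_sync.batching.threads", "3");
    // Partitions per batched statement (default: 1000).
    props.setProperty("hoodie.datasource.hive_sync.batch_num", "3");
    return props;
  }
}
```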

Test plan

  • mvn compile on hudi-sync/hudi-hive-sync — clean, 0 Checkstyle violations, 0 RAT issues
  • mvn test on hudi-sync/hudi-hive-sync: 314 tests, 0 failures, 0 errors
  • TestJDBCConnectionPool (new, 8 tests) — borrow/return, concurrent-borrow bounding, idempotent close, exhaustion blocking, size validation, executor lifecycle
  • TestHiveSyncTool#testJDBCSyncWithBatchingEnabled (new) — end-to-end JDBC sync with batching on against the embedded HiveServer2 (10 + 4 partitions, batch_num=3, threads=3)
  • Existing 305 tests across all three sync modes pass unchanged
  • Manual benchmark on a ~1k-partition table (planned before flipping default; not blocking)

Stack so far

After all three merge, the perf gap reported in #18331 should be closed for all sync modes.

Related: #18331

🤖 Generated with Claude Code

nsivabalan and others added 3 commits June 11, 2026 16:54
Hive sync partition operations on HMS today serialize through a single
IMetaStoreClient and ship entire partition lists in a single Thrift call
for TOUCH/UPDATE. For large tables (~2k partitions) this is ~5-9x slower
than parallel implementations (see apache#18331). The biggest contributors are
(1) one giant alter_partitions call for UPDATE/TOUCH, and (2) per-
partition Thrift round-trips for DROP, all sequential.

This change introduces an opt-in IMetaStoreClientPool gated behind
hoodie.datasource.hive_sync.batching.enabled (default false). When on,
HMSDDLExecutor splits ADD / UPDATE / TOUCH / DROP into batches of
hoodie.datasource.hive_sync.batch_num (existing config, default 1000)
and fans them out across a pool of RetryingMetaStoreClient instances
sized by hoodie.datasource.hive_sync.batching.threads (default 4).
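
A minimal sketch of the batch-splitting step, assuming a hypothetical helper `PartitionBatcher.split` (HMSDDLExecutor's actual splitting code may differ):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split a partition list into batches of at most batchSize. */
final class PartitionBatcher {
  static <T> List<List<T>> split(List<T> items, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < items.size(); start += batchSize) {
      int end = Math.min(start + batchSize, items.size());
      // Copy the window so each batch is independent of the source list.
      batches.add(new ArrayList<>(items.subList(start, end)));
    }
    return batches;
  }
}
```

With 10 partitions and a batch size of 3, this yields four batches of sizes 3, 3, 3, and 1, each of which can then be handed to a separate pooled client.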

Design invariant: only partition-row operations go through the pool.
Table-row operations (createTable, alter_table, last-commit-time-synced,
writer-version, table-comments) stay on the existing session client, so
there is no lost-update risk on table parameters. The sync flow remains
serial-parallel-serial (phase 1: table setup, phase 2: parallel
partition fan-out, phase 3: table finalization).

Sequential fallback is preserved when the flag is off or when
HIVE_SYNC_USE_SPARK_CATALOG is on (incompatible with the pool's direct
RetryingMetaStoreClient.getProxy path).

Tests: TestIMetaStoreClientPool covers borrow/return, concurrent
borrows, close idempotency. TestHiveSyncTool.testHMSSyncWithBatchingEnabled
exercises end-to-end sync against the embedded HMS with batching on.

Related: apache#18331

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Follow-up to apache#18983 (HMS parallelism). Applies the equivalent treatment to
the HiveQL sync mode (hoodie.datasource.hive_sync.mode=hiveql).

HiveQL had two issues that this change addresses:

1. Batching gaps in QueryBasedDDLExecutor.constructPartitionAlterStatements:
   TOUCH concatenated every partition into one giant ALTER TABLE ... TOUCH
   PARTITION (...) PARTITION (...) ... statement; SET_LOCATION (UPDATE)
   emitted one statement per partition. ADD was already batched.

2. Sequential SQL execution in HiveQueryDDLExecutor.updateHiveSQLs: even
   when batches existed, they ran in a single for-loop on one Hive Driver.

This change introduces HiveDriverPool, an eager pool of single-thread
executors each owning a Hive Driver bound to a shared SessionState.
Gated behind the existing hoodie.datasource.hive_sync.batching.enabled
flag (default off) and sized by hoodie.datasource.hive_sync.batching.threads
(default 4) — no new configs.

Design notes:
- Hive's Driver and SessionState are thread-bound. SessionState.start()
  attaches to the calling thread's ThreadLocal. The pool gives each slot
  its own dedicated worker thread so the Driver stays valid for that
  thread's lifetime. Bootstrap, dispatch, and close all run on the bound
  thread.
- SessionState is shared across workers (lazily constructed once),
  because each worker calls SessionState.start(sharedState) on its own
  thread to attach. Constructing one SessionState per worker triggered
  race conditions in Hive's resource-directory machinery on macOS.
- TOUCH is now batched by HIVE_BATCH_SYNC_PARTITION_NUM. SET_LOCATION
  remains one statement per partition (Hive SQL doesn't support
  multi-partition SET LOCATION) but is now fanned out across workers.
- Hive 2.x's ALTER PARTITION SET LOCATION ignores db.table qualifiers
  and silently uses the connection's current database, so the leading
  USE database statement is load-bearing. The pool peels it off and
  runs it on every worker via runOnEachWorker() before fanning the
  rest out.
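
The thread-binding idea in these notes can be sketched with one single-thread executor per slot, so that anything attached to a slot's thread stays valid for every task sent to that slot. This is an illustration only: `ThreadBoundWorkers` is a hypothetical class, it carries no Hive Driver or SessionState, and only the method name `runOnEachWorker` is borrowed from the notes above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: each slot is a dedicated single-thread executor. */
final class ThreadBoundWorkers implements AutoCloseable {
  private final List<ExecutorService> slots = new ArrayList<>();
  private final AtomicInteger next = new AtomicInteger();

  ThreadBoundWorkers(int size) {
    for (int i = 0; i < size; i++) {
      slots.add(Executors.newSingleThreadExecutor());
    }
  }

  /** Submits a task to one slot, chosen round-robin. */
  <R> Future<R> dispatch(Callable<R> task) {
    int slot = Math.floorMod(next.getAndIncrement(), slots.size());
    return slots.get(slot).submit(task);
  }

  /** Runs the same task once on every slot's thread and waits for all of them. */
  void runOnEachWorker(Runnable task) throws InterruptedException, ExecutionException {
    List<Future<?>> futures = new ArrayList<>();
    for (ExecutorService slot : slots) {
      futures.add(slot.submit(task));
    }
    for (Future<?> future : futures) {
      future.get();
    }
  }

  /** Stops accepting work; tasks already queued still run. */
  @Override
  public void close() {
    for (ExecutorService slot : slots) {
      slot.shutdown();
    }
  }
}
```

In a real pool, bootstrap (for example attaching thread-local session state) and teardown would be submitted to each slot in the same way, so they also run on the bound thread.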

Tests:
- TestHiveDriverPool: bootstrap, dispatch round-robin, error
  propagation, concurrent-borrow bounding, close idempotency.
- TestHiveSyncTool.testHiveQLSyncWithBatchingEnabled: end-to-end with
  batching.enabled=true, threads=3, batch_num=3 against embedded HMS.
- TestHiveSyncTool.testHiveQLTouchPartitionsWithBatching: exercises
  the batched TOUCH path specifically.
- Full hudi-hive-sync suite: 305 passed, 0 failures, 0 errors.

Related: apache#18331

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Follow-up to apache#18983 (HMS pool) and apache#18984 (HiveQL Driver pool). Applies
the equivalent treatment to JDBC sync mode.

The JDBC executor's partition phase ran every SQL statement sequentially
on a single shared java.sql.Connection. ADD and DROP were already batched
in the parent QueryBasedDDLExecutor (and in JDBCExecutor.constructDropPartitions);
TOUCH became batched in apache#18984. The only batching gap that remains is
SET_LOCATION (UPDATE), which Hive SQL fundamentally cannot batch (no
multi-partition SET LOCATION syntax). So the win for JDBC is pure parallel
execution.

This change introduces JDBCConnectionPool, a simple borrow/return pool of
N java.sql.Connection instances. JDBC Connection is not thread-safe but
is cheap to construct (one TCP socket to HiveServer2), so the design
mirrors IMetaStoreClientPool more than HiveDriverPool — no thread-binding
needed.

Gated behind the existing hoodie.datasource.hive_sync.batching.enabled
flag (default false) and sized by hoodie.datasource.hive_sync.batching.threads
(default 4) — no new configs.

Design notes:
- JDBCExecutor keeps its session Connection for table-row work
  (createDatabase, createTable, schema evolution, updateTableComments).
  JDBCBasedMetadataOperator continues to use that session Connection,
  unchanged.
- The pool is purely additive for partition fan-out. Pool clients only
  run partition-row statements.
- Hive 2.x's ALTER PARTITION SET LOCATION ignores db.table qualifiers
  and silently uses the connection's current database. The pool's
  runOnEachConnection() broadcasts the leading USE database statement
  to every connection before partition fan-out, same pattern as
  HiveDriverPool.runOnEachWorker() from apache#18984. This is run lazily on
  first dispatch, not at pool construction (the database may not exist
  yet at construction time).

Tests:
- TestJDBCConnectionPool: borrow/return, concurrent-borrow bounding,
  close idempotency, exhaustion blocking, size validation.
- TestHiveSyncTool.testJDBCSyncWithBatchingEnabled: end-to-end JDBC
  sync with batching.enabled=true, threads=3, batch_num=3 against the
  embedded HiveServer2.
- Full hudi-hive-sync suite: 314 passed, 0 failures.

Related: apache#18331

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hudi-bot

Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@nsivabalan nsivabalan marked this pull request as draft June 12, 2026 15:12
@codecov-commenter


Codecov Report

❌ Patch coverage is 68.36735% with 155 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.25%. Comparing base (8933224) to head (f2b0791).
⚠️ Report is 14 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../org/apache/hudi/hive/util/JDBCConnectionPool.java | 51.48% | 46 Missing and 3 partials ⚠️ |
| ...java/org/apache/hudi/hive/util/HiveDriverPool.java | 74.19% | 29 Missing and 3 partials ⚠️ |
| ...in/java/org/apache/hudi/hive/ddl/JDBCExecutor.java | 49.05% | 21 Missing and 6 partials ⚠️ |
| ...rg/apache/hudi/hive/util/IMetaStoreClientPool.java | 72.05% | 16 Missing and 3 partials ⚠️ |
| ...org/apache/hudi/hive/ddl/HiveQueryDDLExecutor.java | 54.16% | 6 Missing and 5 partials ⚠️ |
| .../java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java | 84.12% | 7 Missing and 3 partials ⚠️ |
| ...ava/org/apache/hudi/hive/HoodieHiveSyncClient.java | 77.41% | 6 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18986      +/-   ##
============================================
- Coverage     68.26%   68.25%   -0.02%     
- Complexity    29513    29564      +51     
============================================
  Files          2542     2545       +3     
  Lines        142627   142987     +360     
  Branches      17788    17850      +62     
============================================
+ Hits          97369    97593     +224     
- Misses        37253    37369     +116     
- Partials       8005     8025      +20     
| Flag | Coverage Δ |
|---|---|
| common-and-other-modules | 44.84% <68.36%> (+0.06%) ⬆️ |
| hadoop-mr-java-client | 44.75% <ø> (+<0.01%) ⬆️ |
| spark-client-hadoop-common | 48.06% <ø> (+0.01%) ⬆️ |
| spark-java-tests | 48.59% <2.44%> (-0.18%) ⬇️ |
| spark-scala-tests | 44.66% <5.71%> (-0.18%) ⬇️ |
| utilities | 37.11% <5.71%> (-0.15%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| ...ava/org/apache/hudi/hive/HiveSyncConfigHolder.java | 99.21% <100.00%> (+0.08%) ⬆️ |
| ...rg/apache/hudi/hive/ddl/QueryBasedDDLExecutor.java | 87.12% <100.00%> (+1.07%) ⬆️ |
| ...ava/org/apache/hudi/hive/HoodieHiveSyncClient.java | 50.75% <77.41%> (+1.82%) ⬆️ |
| .../java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java | 80.86% <84.12%> (+0.51%) ⬆️ |
| ...org/apache/hudi/hive/ddl/HiveQueryDDLExecutor.java | 62.63% <54.16%> (-3.04%) ⬇️ |
| ...rg/apache/hudi/hive/util/IMetaStoreClientPool.java | 72.05% <72.05%> (ø) |
| ...in/java/org/apache/hudi/hive/ddl/JDBCExecutor.java | 64.42% <49.05%> (-8.77%) ⬇️ |
| ...java/org/apache/hudi/hive/util/HiveDriverPool.java | 74.19% <74.19%> (ø) |
| .../org/apache/hudi/hive/util/JDBCConnectionPool.java | 51.48% <51.48%> (ø) |

... and 63 files with indirect coverage changes


Labels

size:XL PR with lines of changes > 1000


3 participants