feat(hive-sync): batch and parallelize HMS partition operations by nsivabalan · Pull Request #18983 · apache/hudi

nsivabalan · 2026-06-11T23:55:25Z

Summary

Adds opt-in IMetaStoreClientPool for HMS partition sync (issue #18331)
Batches ADD/TOUCH/UPDATE/DROP using the existing hoodie.datasource.hive_sync.batch_num (default 1000)
Fans batches out across N pooled RetryingMetaStoreClient instances via a fixed-size executor
Default off — existing behavior unchanged unless hoodie.datasource.hive_sync.batching.enabled=true
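
To make the batching and fan-out concrete, here is a minimal sketch using only the JDK. It is not the PR's `HMSDDLExecutor` code: the class name `BatchFanOutSketch` and the methods `toBatches` and `fanOut` are hypothetical, and the per-batch operation is a plain callback standing in for a pooled metastore call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical illustration only; names are not taken from the PR.
class BatchFanOutSketch {

  // Split items into consecutive batches of at most batchSize elements.
  static <T> List<List<T>> toBatches(List<T> items, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int i = 0; i < items.size(); i += batchSize) {
      int end = Math.min(i + batchSize, items.size());
      batches.add(new ArrayList<>(items.subList(i, end)));
    }
    return batches;
  }

  // Submit one task per batch to a fixed-size executor and wait for all of them.
  static <T> void fanOut(List<T> items, int batchSize, int threads, Consumer<List<T>> op) {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (List<T> batch : toBatches(items, batchSize)) {
        futures.add(executor.submit(() -> op.accept(batch)));
      }
      for (Future<?> future : futures) {
        future.get();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for batches", e);
    } catch (ExecutionException e) {
      throw new RuntimeException("Batch failed", e.getCause());
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    List<String> partitions = new ArrayList<>();
    for (int day = 1; day <= 10; day++) {
      partitions.add("day=" + day);
    }
    // 10 partitions with a batch size of 3 give 4 batches, run on 3 worker threads.
    fanOut(partitions, 3, 3,
        batch -> System.out.println(Thread.currentThread().getName() + " -> " + batch));
  }
}
```

The order in which batches print is not deterministic, since the executor schedules them across threads.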

Design invariant

Only partition-row operations (add_partitions, alter_partitions, dropPartition, getPartition) go through the pool. Table-row operations (createTable, alter_table, updateLastCommitTimeSynced, updateHoodieWriterVersion, updateTableComments) stay on the existing session IMetaStoreClient held by HoodieHiveSyncClient. The sync flow is therefore serial → parallel → serial:

Table setup (single client)
Partition fan-out across the pool
Table finalization (single client)

This eliminates lost-update risk on Table.parameters (the read-modify-write pattern used by updateLastCommitTimeSynced and updateHoodieWriterVersion).
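
The serial → parallel → serial shape can be sketched as follows. This is an illustration under assumptions, not the PR's code: `ThreePhaseSyncSketch` and its event strings are hypothetical, and the table and partition operations are replaced by entries in an event log. The point is that `invokeAll` acts as a barrier, so table finalization only starts after every partition batch has finished.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of the three-phase ordering; no real metastore calls.
class ThreePhaseSyncSketch {

  static List<String> sync(List<String> partitionBatches, int threads) {
    List<String> events = Collections.synchronizedList(new ArrayList<>());

    // Phase 1: table-row work on the single session client (simulated).
    events.add("table-setup");

    // Phase 2: partition-row work fans out across worker threads.
    ExecutorService workers = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<Void>> tasks = new ArrayList<>();
      for (String batch : partitionBatches) {
        tasks.add(() -> {
          events.add("partition-batch:" + batch);
          return null;
        });
      }
      // invokeAll returns only after every task has completed.
      for (Future<Void> future : workers.invokeAll(tasks)) {
        future.get();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted during partition fan-out", e);
    } catch (ExecutionException e) {
      throw new RuntimeException("Partition batch failed", e.getCause());
    } finally {
      workers.shutdown();
    }

    // Phase 3: table-row work again on the single session client (simulated).
    events.add("table-finalization");
    return events;
  }

  public static void main(String[] args) {
    List<String> batches = new ArrayList<>();
    batches.add("b0");
    batches.add("b1");
    batches.add("b2");
    // The middle entries may appear in any order; the first and last are fixed.
    System.out.println(sync(batches, 2));
  }
}
```

Because only one thread ever touches the table-level steps, a read-modify-write on table parameters cannot interleave with another writer inside the same sync run.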

New configs

| Key | Default | Purpose |
| --- | --- | --- |
| `hoodie.datasource.hive_sync.batching.enabled` | `false` | Master feature flag |
| `hoodie.datasource.hive_sync.batching.threads` | `4` | Pool size + worker thread count |

Reuses the existing hoodie.datasource.hive_sync.batch_num for batch sizing (no new batch-size config).
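
As a rough illustration of turning the feature on, the three keys above can be collected into a `java.util.Properties` object. The helper class `HiveSyncBatchingConfigSketch` is hypothetical, and how the properties reach the writer (write options, a properties file, and so on) depends on the deployment and is not shown. The values `3` and `3` mirror the end-to-end test described in the test plan rather than recommended production settings.

```java
import java.util.Properties;

// Hypothetical helper; only the three config keys come from this PR description.
class HiveSyncBatchingConfigSketch {

  static Properties batchingProps(int threads, int batchNum) {
    Properties props = new Properties();
    // Master feature flag (default false).
    props.setProperty("hoodie.datasource.hive_sync.batching.enabled", "true");
    // Pool size and worker thread count (default 4).
    props.setProperty("hoodie.datasource.hive_sync.batching.threads", String.valueOf(threads));
    // Existing batch-size config, reused for batch sizing (default 1000).
    props.setProperty("hoodie.datasource.hive_sync.batch_num", String.valueOf(batchNum));
    return props;
  }

  public static void main(String[] args) {
    batchingProps(3, 3).forEach((key, value) -> System.out.println(key + "=" + value));
  }
}
```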

Scope

HMS executor only — matches the branch name hms-parallelize-calls. HiveQL and JDBC executor parallelism is deferred (per-thread SessionState/Driver and JDBC Connection pools are bigger changes; gating on benchmarks of this PR first).

Failure semantics

Today (sequential): if batch N fails, batches N+1..K never run, so the partitions already applied form a predictable prefix.
New (parallel): in-flight batches run to completion; the first exception is rethrown and the remaining errors are logged at WARN. Re-running sync is already idempotent (add_partitions(list, ifNotExists=true), partitionExists guard before dropPartition, idempotent alter_partitions), so partial-state retry behavior matches today's.
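
One way to get the "let everything finish, throw the first error, log the rest" behavior is sketched below. It is an assumption-laden illustration, not the PR's implementation: `FailureSemanticsSketch` and `runAll` are hypothetical names, `java.util.logging` stands in for whatever logger the project uses, and "first" here means first in submission order, which may differ from first in wall-clock time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration of the parallel failure semantics described above.
class FailureSemanticsSketch {
  private static final Logger LOG = Logger.getLogger("FailureSemanticsSketch");

  static void runAll(List<Runnable> batchTasks, int threads) {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    Throwable first = null;
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (Runnable task : batchTasks) {
        futures.add(executor.submit(task));
      }
      // Wait on every future, even after a failure, so in-flight batches complete.
      for (Future<?> future : futures) {
        try {
          future.get();
        } catch (ExecutionException e) {
          if (first == null) {
            first = e.getCause();
          } else {
            LOG.log(Level.WARNING, "Additional partition batch failure", e.getCause());
          }
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for partition batches", e);
    } finally {
      executor.shutdown();
    }
    if (first != null) {
      throw new RuntimeException("Partition batch failed", first);
    }
  }

  public static void main(String[] args) {
    List<Runnable> tasks = new ArrayList<>();
    tasks.add(() -> System.out.println("batch 0 applied"));
    tasks.add(() -> { throw new IllegalStateException("batch 1 failed"); });
    tasks.add(() -> System.out.println("batch 2 applied"));
    try {
      runAll(tasks, 2);
    } catch (RuntimeException e) {
      System.out.println("sync failed: " + e.getCause().getMessage());
    }
  }
}
```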

Compatibility

When HIVE_SYNC_USE_SPARK_CATALOG=true, the pool path is skipped with a warning and we fall back to sequential — the reflection-built SparkCatalogMetaStoreClient isn't compatible with the direct RetryingMetaStoreClient.getProxy construction path.
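
The gating described here, together with the HMS-only scope, reduces to a small decision function. The sketch below is hypothetical (`PoolGateSketch.usePool` is not a method from the PR) and assumes the three inputs have already been read from the sync config.

```java
import java.util.logging.Logger;

// Hypothetical illustration of when the pooled path would be taken.
class PoolGateSketch {
  private static final Logger LOG = Logger.getLogger("PoolGateSketch");

  static boolean usePool(boolean batchingEnabled, boolean hmsMode, boolean useSparkCatalog) {
    if (!batchingEnabled || !hmsMode) {
      // Flag off or a non-HMS executor: keep the existing sequential behavior.
      return false;
    }
    if (useSparkCatalog) {
      LOG.warning("Batching requested but Spark catalog client is in use; falling back to sequential sync");
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(usePool(true, true, false));
    System.out.println(usePool(true, true, true));
  }
}
```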

Test plan

mvn compile on hudi-sync/hudi-hive-sync — clean, 0 Checkstyle violations, 0 RAT issues
mvn test on hudi-sync/hudi-hive-sync — 296 tests, 0 failures, 0 errors
TestIMetaStoreClientPool (new, 8 tests) — borrow/return on success, on failure, concurrent borrow bounded by pool size, idempotent close, executor lifecycle, size validation
TestHiveSyncTool#testHMSSyncWithBatchingEnabled (new) — end-to-end HMS sync with batching.enabled=true, threads=3, batch_num=3, 10 initial + 4 incremental partitions
Manual benchmark on a 2k-partition table (planned before flipping default; not blocking this PR since flag is opt-in)
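
For readers unfamiliar with the pool properties the unit tests exercise (borrow/return on success and on failure, a bound on concurrent borrows, idempotent close, size validation), a generic bounded pool can be sketched with a `BlockingQueue`. This is not `IMetaStoreClientPool`: the names `BoundedClientPoolSketch` and `FakeClient` are hypothetical, the client type is any `AutoCloseable`, and the close path is simplified (a client returned after `close()` is not closed here).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;
import java.util.function.Supplier;

// Stand-in client used only for this illustration.
class FakeClient implements AutoCloseable {
  int closeCalls;

  @Override
  public void close() {
    closeCalls++;
  }
}

// Hypothetical fixed-size pool; not the PR's IMetaStoreClientPool.
class BoundedClientPoolSketch<C extends AutoCloseable> implements AutoCloseable {
  private final BlockingQueue<C> idle;
  private final AtomicBoolean closed = new AtomicBoolean(false);

  BoundedClientPoolSketch(int size, Supplier<C> factory) {
    if (size <= 0) {
      throw new IllegalArgumentException("pool size must be positive");
    }
    idle = new ArrayBlockingQueue<>(size);
    for (int i = 0; i < size; i++) {
      idle.add(factory.get());
    }
  }

  // Borrow a client, run the action, and always return the client afterwards.
  <R> R run(Function<C, R> action) {
    if (closed.get()) {
      throw new IllegalStateException("pool is closed");
    }
    C client;
    try {
      client = idle.take(); // blocks when all clients are borrowed
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while borrowing a client", e);
    }
    try {
      return action.apply(client);
    } finally {
      idle.add(client); // returned on success and on failure
    }
  }

  int idleCount() {
    return idle.size();
  }

  // Closing twice is a no-op the second time.
  @Override
  public void close() {
    if (!closed.compareAndSet(false, true)) {
      return;
    }
    C client;
    while ((client = idle.poll()) != null) {
      try {
        client.close();
      } catch (Exception e) {
        // Best effort in this sketch: ignore close failures.
      }
    }
  }

  public static void main(String[] args) {
    try (BoundedClientPoolSketch<FakeClient> pool = new BoundedClientPoolSketch<>(2, FakeClient::new)) {
      String result = pool.run(client -> "borrowed one client");
      System.out.println(result + ", idle after return: " + pool.idleCount());
    }
  }
}
```

Because `take()` blocks when the queue is empty, at most `size` actions can hold a client at once, which is the bound the concurrent-borrow test checks for.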

Files touched

HiveSyncConfigHolder.java — 2 new ConfigProperty constants
HoodieHiveSyncClient.java — owns + closes the pool, builds it only for HMS mode with flag on
ddl/HMSDDLExecutor.java — new constructor accepting the pool, runBatches helper shared across all four partition methods
util/IMetaStoreClientPool.java — new; modeled on Iceberg's HiveClientPool but standalone (no Iceberg dep)
TestIMetaStoreClientPool.java — new; mock-based unit tests
TestHiveSyncTool.java — new end-to-end test method

Follow-ups (separate PRs)

Respond to issue #18331 ("[IMPROVEMENT] Hive Sync partition operations lack batching and parallelism, causing 4x-9x slowdown for large tables"), noting a minor inaccuracy: JDBC's DROP is already batched (JDBCExecutor.constructDropPartitions). The real JDBC gap is SET LOCATION (UPDATE), which is one statement per partition with no batching.
After benchmarks, decide whether HiveQL + JDBC parallelism is worth the additional per-thread SessionState/Connection complexity.

Related: #18331

🤖 Generated with Claude Code

Hive sync partition operations on HMS today serialize through a single IMetaStoreClient and ship entire partition lists in a single Thrift call for TOUCH/UPDATE. For large tables (~2k partitions) this is ~5-9x slower than parallel implementations (see apache#18331). The biggest contributors are (1) one giant alter_partitions call for UPDATE/TOUCH, and (2) per-partition Thrift round-trips for DROP, all sequential.

This change introduces an opt-in IMetaStoreClientPool gated behind hoodie.datasource.hive_sync.batching.enabled (default false). When on, HMSDDLExecutor splits ADD / UPDATE / TOUCH / DROP into batches of hoodie.datasource.hive_sync.batch_num (existing config, default 1000) and fans them out across a pool of RetryingMetaStoreClient instances sized by hoodie.datasource.hive_sync.batching.threads (default 4).

Design invariant: only partition-row operations go through the pool. Table-row operations (createTable, alter_table, last-commit-time-synced, writer-version, table-comments) stay on the existing session client, so there is no lost-update risk on table parameters. The sync flow remains serial-parallel-serial (phase 1: table setup, phase 2: parallel partition fan-out, phase 3: table finalization).

Sequential fallback is preserved when the flag is off or when HIVE_SYNC_USE_SPARK_CATALOG is on (incompatible with the pool's direct RetryingMetaStoreClient.getProxy path).

Tests: TestIMetaStoreClientPool covers borrow/return, concurrent borrows, close idempotency. TestHiveSyncTool.testHMSSyncWithBatchingEnabled exercises end-to-end sync against the embedded HMS with batching on.

Related: apache#18331

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-06-12T01:07:14Z

Codecov Report

❌ Patch coverage is 78.48101% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.27%. Comparing base (8933224) to head (71946a6).
⚠️ Report is 9 commits behind head on master.

Files with missing lines	Patch %	Lines
...rg/apache/hudi/hive/util/IMetaStoreClientPool.java	72.05%	16 Missing and 3 partials ⚠️
.../java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java	84.12%	7 Missing and 3 partials ⚠️
...ava/org/apache/hudi/hive/HoodieHiveSyncClient.java	66.66%	4 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18983      +/-   ##
============================================
+ Coverage     68.26%   68.27%   +0.01%     
- Complexity    29513    29561      +48     
============================================
  Files          2542     2543       +1     
  Lines        142627   142818     +191     
  Branches      17788    17810      +22     
============================================
+ Hits          97369    97516     +147     
- Misses        37253    37295      +42     
- Partials       8005     8007       +2

Flag	Coverage Δ
common-and-other-modules	`44.84% <78.48%> (+0.05%)`	⬆️
hadoop-mr-java-client	`44.77% <ø> (+0.02%)`	⬆️
spark-client-hadoop-common	`48.07% <ø> (+0.03%)`	⬆️
spark-java-tests	`48.75% <7.59%> (-0.02%)`	⬇️
spark-scala-tests	`44.82% <17.72%> (-0.03%)`	⬇️
utilities	`37.27% <7.59%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
...ava/org/apache/hudi/hive/HiveSyncConfigHolder.java	`99.21% <100.00%> (+0.08%)`	⬆️
...ava/org/apache/hudi/hive/HoodieHiveSyncClient.java	`49.48% <66.66%> (+0.55%)`	⬆️
.../java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java	`80.86% <84.12%> (+0.51%)`	⬆️
...rg/apache/hudi/hive/util/IMetaStoreClientPool.java	`72.05% <72.05%> (ø)`

... and 33 files with indirect coverage changes


hudi-bot · 2026-06-12T01:32:57Z

CI report:

71946a6 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) on Jun 12, 2026

nsivabalan marked this pull request as draft June 12, 2026 01:39

nsivabalan wants to merge 1 commit into apache:master from nsivabalan:hms-parallelize-calls

3 participants

Conversation

nsivabalan commented Jun 11, 2026

Summary

Design invariant

New configs

Scope

Failure semantics

Compatibility

Test plan

Files touched

Follow-ups (separate PRs)

Uh oh!

codecov-commenter commented Jun 12, 2026

Codecov Report

Uh oh!

hudi-bot commented Jun 12, 2026

CI report:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants