perf(spark): restore Catalyst fast-paths for SimpleKeyGen/Nonpartitioned bulk insert by nsivabalan · Pull Request #18990 · apache/hudi

nsivabalan · 2026-06-12T21:10:03Z

Describe the issue this Pull Request addresses

Summary and Changelog

HoodieDatasetBulkInsertHelper.prepareForBulkInsert was routing every key-generator through df.queryExecution.toRdd.mapPartitions(...), forcing an RDD round-trip and per-row reflection-based keygen invocation even for the common keygens where the record-key and partition-path values can be sourced directly from input columns.

This patch restores tiered dispatch:

Tier 1 — NonpartitionedKeyGenerator (single record-key field): emits col(rk).cast(String) + lit("") as Catalyst columns. No UDF, no toRdd round-trip.
Tier 2 — SimpleKeyGenerator (single record-key + single partition-path field, URL-encoding off, slash-separated dates off): emits col(rk).cast(String) and a partition-path expression mirroring PartitionPathFormatterBase#combine, including the handleEmpty -> __HIVE_DEFAULT_PARTITION__ substitution and hive-style <field>= prefixing.
Tier 3 — everything else (multi-field keys, ComplexKeyGenerator, TimestampBasedKeyGenerator, CustomKeyGenerator, SimpleKeyGenerator with URL-encode or slash-separated dates): anonymous functions.udf(...) over a struct of input columns calling the canonical BuiltinKeyGenerator.getRecordKey(Row) / getPartitionPath(Row). The UDFs are not registered against the SparkSession, so nothing leaks across writes.
Auto-record-key generation keeps the existing RDD path; it needs TaskContext.partitionId and a stateful per-task counter, which can't be expressed cleanly as a driver-side closure.

The Tier 3 UDF goes through the Row-overload keygen API which uses the canonical String formatter, so all three partition-formatter flags (hive-style, URL encode, slash-separated dates) remain honored for the keygens that fall through. The Tier 2 fast-path encodes only the default and hive-style flag subset (URL encoding has no efficient pure-Catalyst equivalent; the 1.2.0+ slash-separated branch exercises a separate code path we'd rather not encode twice).

New tests in TestHoodieDatasetBulkInsertHelper:

testKeyGenParityAgainstAvroGroundTruth (parameterized, 11 cases) — every supported keygen class plus the SimpleKeyGen flag combos (default / hive / slash / hive+slash / URL / hive+URL / Complex single+multi / TimestampBased / Custom). Each case asserts the helper's record-key and partition-path output matches BuiltinKeyGenerator's Avro path byte-for-byte.
testFastPathCastsNonStringRecordKey — Tier 1/2 must materialize the string form of a non-string record-key column (uses ts: long).
testFastPathAvoidsUdf — Tier 1/2 analyzed logical plans must not contain a ScalaUDF node (i.e. they actually benefit from Catalyst codegen).
testTier2EmptyPartitionValueSubstitutedWithHiveDefault — empty partition values resolve to __HIVE_DEFAULT_PARTITION__ under both default and hive-style flags.
testUdfPathRespectsDriverSessionTimezone — Tier 3 UDF picks up the driver's spark.sql.session.timeZone (guards against executor JVM default leakage on TimestampBasedKeyGenerator).

Impact

Performance: restores per-row Catalyst codegen for bulk inserts that use NonpartitionedKeyGenerator or SimpleKeyGenerator (with default or hive-style partitioning) — the most common configurations in practice. No behaviour change for the keygens that fall through to Tier 3; their output is byte-identical to the prior RDD path (and to the Avro ground truth, which the parity test now enforces).

No public API change. No config change. No storage format change.

Risk Level

Low. The change is contained to HoodieDatasetBulkInsertHelper.prepareForBulkInsert (Scala helper, no public API surface) and the parity test exhaustively checks every keygen + formatter combination against the canonical Avro keygen output. The Tier 3 fallback is the existing RDD-replaced UDF path, so any keygen the fast paths don't claim continues to use the same canonical formatter.

Documentation Update

None.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added

🤖 Generated with Claude Code

…ned bulk insert The post-apache#5470 HoodieDatasetBulkInsertHelper rewrite routed every keygen through df.queryExecution.toRdd.mapPartitions(...) which materialises rows and pays a per-row keygen reflection cost. Restore tiered dispatch so the two common keygen classes stay as Catalyst column projections: - Tier 1 (NonpartitionedKeyGenerator, single record-key field): col(rk).cast(String) + lit("") -- no UDF, no toRdd round-trip. - Tier 2 (SimpleKeyGenerator, single record-key + single partition-path, default URL/slash flags): col(rk).cast(String) plus a partition-path expression mirroring PartitionPathFormatterBase#combine including the handleEmpty -> __HIVE_DEFAULT_PARTITION__ substitution, with hive-style `<field>=` prefixing emitted as `concat(lit, ...)`. urlEncode and the 1.2.0+ slashSeparatedDatePartitioning flag drop to Tier 3. - Tier 3 (everything else): anonymous functions.udf(...) -- registered per call, not on the SparkSession, so nothing leaks across writes. - Auto-keygen keeps the existing RDD path (needs TaskContext partitionId and a stateful row counter -- no clean Catalyst expression). The Tier 3 UDF goes through BuiltinKeyGenerator.getRecordKey(Row) / getPartitionPath(Row), i.e. the canonical Avro-aligned formatters, so all three formatter flags (hive-style, URL encode, slash-separated dates) are honoured for the keygens that fall through. Test coverage in TestHoodieDatasetBulkInsertHelper: - testKeyGenParityAgainstAvroGroundTruth (11 parameterised cases) -- one per tier-relevant keygen + formatter-flag combo, compares the helper's record-key / partition-path output to BuiltinKeyGenerator's Avro path. - testFastPathCastsNonStringRecordKey -- Tier 1/2 cast a non-string record-key column (ts: long) to the expected string form. - testFastPathAvoidsUdf -- Tier 1/2 analyzed plans contain no ScalaUDF (i.e. they actually benefit from Catalyst codegen). - testTier2EmptyPartitionValueSubstitutedWithHiveDefault -- empty partition values resolve to __HIVE_DEFAULT_PARTITION__ under both default and hive-style flags. - testUdfPathRespectsDriverSessionTimezone -- the Tier 3 UDF path picks up the driver's spark.sql.session.timeZone (guards against executor JVM default leakage on TimestampBasedKeyGenerator). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hudi-bot · 2026-06-12T21:47:16Z

CI report:

97697ee Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

codecov-commenter · 2026-06-12T22:23:07Z

Codecov Report

❌ Patch coverage is 92.10526% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.94%. Comparing base (cb14ca7) to head (97697ee).
⚠️ Report is 26 commits behind head on master.

Files with missing lines	Patch %	Lines
...rg/apache/hudi/HoodieDatasetBulkInsertHelper.scala	92.10%	1 Missing and 5 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18990      +/-   ##
============================================
- Coverage     68.77%   67.94%   -0.83%     
- Complexity    29073    29249     +176     
============================================
  Files          2517     2542      +25     
  Lines        139807   142597    +2790     
  Branches      17210    17811     +601     
============================================
+ Hits          96148    96890     +742     
- Misses        35885    37716    +1831     
- Partials       7774     7991     +217

Flag	Coverage Δ
common-and-other-modules	`44.75% <0.00%> (+0.49%)`	⬆️
hadoop-mr-java-client	`44.76% <ø> (+0.07%)`	⬆️
spark-client-hadoop-common	`48.02% <0.00%> (-0.02%)`	⬇️
spark-java-tests	`47.64% <92.10%> (-1.66%)`	⬇️
spark-scala-tests	`44.51% <73.68%> (-0.71%)`	⬇️
utilities	`37.25% <55.26%> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...rg/apache/hudi/HoodieDatasetBulkInsertHelper.scala	`91.44% <92.10%> (-0.93%)`	⬇️

... and 220 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR restores Catalyst fast-paths for the two most common keygens (Nonpartitioned + Simple) and keeps everything else on an anonymous-UDF Tier 3, with a parity test against the Avro ground truth covering the supported flag combos. A couple of edge cases worth double-checking in the inline comments — mostly around the Tier 2 dispatch when partition columns aren't strings, and the class-name comparison missing the *AvroKeyGenerator variants. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small naming and assertion-style nits below.

hudi-agent · 2026-06-14T01:09:39Z

+
+    val tier2 = keyGeneratorClassName == classOf[SimpleKeyGenerator].getName &&
+      recordKeyFields.size == 1 && partitionPathFields.size == 1 &&
+      !urlEncode && !slashSep


🤖 The Tier 2 gate checks the keygen class and formatter flags but not the partition column's type. For a Timestamp partition column, col(field).cast(StringType) formats differently than the canonical Timestamp.toString() that combinePartitionPath uses (e.g. Spark's cast drops trailing zeros and uses session timezone, Timestamp.toString keeps fractional seconds and uses JVM default tz). Same concern applies to Tier 1 / the recordKeyCol cast. The parity test only exercises a String partition column so this divergence wouldn't be caught — could you confirm the type-mismatch case is intentional (matches the pre-#5470 fast path) and maybe add a StringType guard or test coverage for a Timestamp partition?

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

hudi-agent · 2026-06-14T01:09:39Z

+    val slashSep = typedProps.getBoolean(KeyGeneratorOptions.SLASH_SEPARATED_DATE_PARTITIONING.key,
+      KeyGeneratorOptions.SLASH_SEPARATED_DATE_PARTITIONING.defaultValue.toBoolean)
+
+    val tier1 = keyGeneratorClassName == classOf[NonpartitionedKeyGenerator].getName &&


🤖 The class-name comparison uses the user-supplied keyGeneratorClassName directly, so users who configured the Avro variant (SimpleAvroKeyGenerator / NonpartitionedAvroKeyGenerator) won't hit Tier 1/2 even though the factory converts them to the Spark variant at load time. Could compare against HoodieSparkKeyGeneratorFactory.convertToSparkKeyGenerator(keyGeneratorClassName) here. Not a correctness issue — Tier 3 still works for those — just a missed optimization for that config path.

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

hudi-agent · 2026-06-14T01:09:39Z

+        // Tier 1/2 do not reproduce the canonical keygen's null/empty record-key validation:
+        // a null record-key value passes through as SQL NULL instead of throwing HoodieKeyException.
+        // This matches the behaviour of the pre-existing fast paths in this helper prior to the
+        // RDD-based rewrite. Tier 3 retains the strict canonical behaviour.


🤖 The comment notes that Tier 1/2 pass NULL record-keys through instead of throwing, but the divergence also covers empty-string record keys — SimpleKeyGenerator.getRecordKey(Row) calls requireNonNullNonEmptyKey which throws on both null and empty. Worth mentioning explicitly so future readers don't assume empty strings still validate.

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

hudi-agent · 2026-06-14T01:09:39Z

+    Dataset<Row> result = HoodieDatasetBulkInsertHelper.prepareForBulkInsert(dataset, config,
+        new HoodieTableConfig(), new NonSortPartitionerWithRows(), "000000001");
+    String plan = result.queryExecution().analyzed().toString();
+    assertTrue(!plan.toLowerCase().contains("scalaudf"),


🤖 nit: assertTrue(!condition) reads a bit awkwardly here — assertFalse(plan.toLowerCase().contains("scalaudf"), ...) is more idiomatic and produces a slightly clearer failure message.

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

nsivabalan and others added 2 commits June 4, 2026 13:06

Attempting to revert the rdd usage in row writer code paths

d107503

github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Jun 12, 2026

voonhous added the area:performance Performance optimizations label Jun 13, 2026

hudi-agent reviewed Jun 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(spark): restore Catalyst fast-paths for SimpleKeyGen/Nonpartitioned bulk insert#18990

perf(spark): restore Catalyst fast-paths for SimpleKeyGen/Nonpartitioned bulk insert#18990
nsivabalan wants to merge 2 commits into
apache:masterfrom
nsivabalan:rowWriterPerfOptimization

nsivabalan commented Jun 12, 2026

Uh oh!

hudi-bot commented Jun 12, 2026

Uh oh!

codecov-commenter commented Jun 12, 2026

Uh oh!

hudi-agent left a comment

Uh oh!

hudi-agent Jun 14, 2026

Uh oh!

hudi-agent Jun 14, 2026

Uh oh!

hudi-agent Jun 14, 2026

Uh oh!

hudi-agent Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

nsivabalan commented Jun 12, 2026

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

hudi-bot commented Jun 12, 2026

CI report:

Uh oh!

codecov-commenter commented Jun 12, 2026

Codecov Report

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

hudi-agent Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

hudi-agent Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

hudi-agent Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

hudi-agent Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants