perf(kafka-connect): avoid per-record TopicPartition allocation in HoodieSinkTask.put by wombatu-kun · Pull Request #19018 · apache/hudi

wombatu-kun · 2026-06-16T07:30:12Z

Describe the issue this Pull Request addresses

HoodieSinkTask.put allocates a new TopicPartition(topic, partition) for every incoming record solely to look the record's participant up in the transactionParticipants map, then discards it. On a high-throughput sink this is one short-lived allocation per record. A JMH micro-benchmark confirms the allocation is real and is not eliminated by escape analysis.

Summary and Changelog

Maintain a secondary topic -> partition -> participant index alongside the existing transactionParticipants map, populated and cleared at the same lifecycle points (bootstrap, close, cleanup). Route records through this index in put() using a topic string lookup plus a partition int lookup (boxed Integer values from -128 to 127 are cached by the JVM by default, so small partition numbers box without allocating), which removes the per-record TopicPartition allocation. The primary TopicPartition-keyed map is unchanged and still used by the assignment loop, preCommit, and partition close.
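For readers who want the shape of the change without opening the diff, here is a minimal sketch of a topic -> partition -> participant index. It is not the code in this PR: the class name NestedIndexSketch, the Participant stand-in, and the method names register, unregister, and route are all hypothetical, and the real participant type and lifecycle hooks live in HoodieSinkTask.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names are hypothetical and not taken from the PR.
class NestedIndexSketch {

    // Stand-in for the real participant type used by the sink task.
    public static final class Participant {
        public final String name;

        public Participant(String name) {
            this.name = name;
        }
    }

    // Secondary index: topic -> partition -> participant.
    private final Map<String, Map<Integer, Participant>> byTopic = new HashMap<>();

    // Called wherever a participant is added to the primary map (e.g. bootstrap).
    public void register(String topic, int partition, Participant participant) {
        byTopic.computeIfAbsent(topic, t -> new HashMap<>()).put(partition, participant);
    }

    // Called wherever a participant is removed from the primary map (e.g. close).
    public void unregister(String topic, int partition) {
        Map<Integer, Participant> partitions = byTopic.get(topic);
        if (partitions != null) {
            partitions.remove(partition);
            if (partitions.isEmpty()) {
                byTopic.remove(topic);
            }
        }
    }

    // Per-record lookup: no composite key object is constructed here.
    // The int is autoboxed via Integer.valueOf, which returns cached
    // instances for values in [-128, 127] by default.
    public Participant route(String topic, int partition) {
        Map<Integer, Participant> partitions = byTopic.get(topic);
        return partitions == null ? null : partitions.get(partition);
    }

    public static void main(String[] args) {
        NestedIndexSketch index = new NestedIndexSketch();
        index.register("orders", 0, new Participant("orders-0"));
        Participant found = index.route("orders", 0);
        System.out.println(found == null ? "no participant" : found.name);
    }
}
```

The design choice being illustrated is that the lookup key is decomposed into values the record already carries (a String and an int), so nothing new has to be built per record.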

Impact

Performance only; no public API or behavior change. JMH micro-benchmark of routing one record to its participant (AverageTime mode, gc profiler):

| Metric (per record) | Baseline (new TopicPartition) | After (nested map) |
| --- | --- | --- |
| Time | 11.76 ns/op | 10.80 ns/op (-8%) |
| Allocations | 24 B/op | ~0 B/op |

This is a small per-record win; at high record rates it removes roughly 24 B of garbage per record on the put() path. Benchmark code is not included in this PR.

Risk Level

low

Behavior-preserving routing refactor: the secondary index mirrors transactionParticipants and is maintained at the same points, and its lookups are equivalent to the previous TopicPartition-keyed lookup and run on the single task thread. The full hudi-kafka-connect unit suite passes. HoodieSinkTask.put is not directly unit-tested, so the change is intentionally a minimal mirror of the existing lookup.
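The mirroring invariant described above could also be confined to a single class so that both maps are only ever mutated together. The following is a sketch of that idea under stated assumptions, not what this PR does: ParticipantRegistrySketch and TopicPartitionKey are hypothetical names, and TopicPartitionKey is a local stand-in for Kafka's TopicPartition so the example compiles without the Kafka client on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative sketch only; not part of the PR.
class ParticipantRegistrySketch<P> {

    // Local stand-in for org.apache.kafka.common.TopicPartition.
    public static final class TopicPartitionKey {
        final String topic;
        final int partition;

        public TopicPartitionKey(String topic, int partition) {
            this.topic = topic;
            this.partition = partition;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof TopicPartitionKey)) return false;
            TopicPartitionKey other = (TopicPartitionKey) o;
            return partition == other.partition && topic.equals(other.topic);
        }

        @Override
        public int hashCode() {
            return Objects.hash(topic, partition);
        }
    }

    private final Map<TopicPartitionKey, P> primary = new HashMap<>();
    private final Map<String, Map<Integer, P>> byTopic = new HashMap<>();

    // Every mutation touches both maps, so callers cannot update only one.
    public void put(String topic, int partition, P participant) {
        primary.put(new TopicPartitionKey(topic, partition), participant);
        byTopic.computeIfAbsent(topic, t -> new HashMap<>()).put(partition, participant);
    }

    public P remove(String topic, int partition) {
        P removed = primary.remove(new TopicPartitionKey(topic, partition));
        Map<Integer, P> partitions = byTopic.get(topic);
        if (partitions != null) {
            partitions.remove(partition);
            if (partitions.isEmpty()) {
                byTopic.remove(topic);
            }
        }
        return removed;
    }

    public void clear() {
        primary.clear();
        byTopic.clear();
    }

    // Hot path: nested lookup without building a composite key.
    public P route(String topic, int partition) {
        Map<Integer, P> partitions = byTopic.get(topic);
        return partitions == null ? null : partitions.get(partition);
    }

    // Cold path: lookup by composite key, as the existing callers do.
    public P get(TopicPartitionKey key) {
        return primary.get(key);
    }

    public int size() {
        return primary.size();
    }
}
```

Like the plain HashMap usage in the task, this sketch is not thread-safe and assumes all access happens on the single task thread.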

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

…odieSinkTask.put Route records through a secondary topic -> partition -> participant index instead of allocating a TopicPartition key per record. The index mirrors transactionParticipants and is maintained in bootstrap, close, and cleanup; the primary TopicPartition-keyed map is unchanged.

voonhous · 2026-06-16T09:13:32Z

The improvement here is marginal, and there is a real risk of future maintainers breaking the two-map sync, which is a maintainability cost rather than a present-day bug.

Not sure if we should merge this in IMO. Would love to get a second opinion on this.

hudi-bot · 2026-06-16T10:59:16Z

CI report:

46dfe10 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR removes a per-record TopicPartition allocation in HoodieSinkTask.put() by maintaining a secondary topic -> partition -> participant index that mirrors transactionParticipants at the same lifecycle points (bootstrap, close, cleanup). The two maps stay in sync, and the lookup runs on the single Connect task thread so HashMap remains safe. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

voonhous added the area:performance Performance optimizations label Jun 16, 2026

github-actions Bot added the size:S PR with lines of changes in (10, 100] label Jun 16, 2026

hudi-agent reviewed Jun 16, 2026

wombatu-kun wants to merge 1 commit into apache:master from wombatu-kun:perf/kafka-connect-avoid-per-record-topicpartition