Skip to content

perf(kafka-connect): avoid per-record TopicPartition allocation in HoodieSinkTask.put#19018

Open
wombatu-kun wants to merge 1 commit into
apache:masterfrom
wombatu-kun:perf/kafka-connect-avoid-per-record-topicpartition
Open

perf(kafka-connect): avoid per-record TopicPartition allocation in HoodieSinkTask.put#19018
wombatu-kun wants to merge 1 commit into
apache:masterfrom
wombatu-kun:perf/kafka-connect-avoid-per-record-topicpartition

Conversation

@wombatu-kun

Copy link
Copy Markdown
Contributor

Describe the issue this Pull Request addresses

HoodieSinkTask.put allocates a new TopicPartition(topic, partition) for every incoming record solely to look the record's participant up in the transactionParticipants map, then discards it. On a high-throughput sink this is one short-lived allocation per record. A JMH micro-benchmark confirms the allocation is real and is not eliminated by escape analysis.

Summary and Changelog

Maintain a secondary topic -> partition -> participant index alongside the existing transactionParticipants map, populated and cleared at the same lifecycle points (bootstrap, close, cleanup). Route records through this index in put() using a topic string lookup plus a partition int lookup (small ints are cached by the JVM), which removes the per-record TopicPartition allocation. The primary TopicPartition-keyed map is unchanged and still used by the assignment loop, preCommit, and partition close.

Impact

Performance only; no public API or behavior change. JMH micro-benchmark of routing one record to its participant (AverageTime mode, gc profiler):

Metric (per record) Baseline (new TopicPartition) After (nested map)
Time 11.76 ns/op 10.80 ns/op (-8%)
Allocations 24 B/op ~0 B/op

This is a small per-record win; at high record rates it removes roughly 24 B of garbage per record on the put() path. Benchmark code is not included in this PR.

Risk Level

low

Behavior-preserving routing refactor: the secondary index mirrors transactionParticipants and is maintained at the same points, and lookups are equivalent to the previous TopicPartition-keyed lookup, read on the single task thread. The full hudi-kafka-connect unit suite passes. HoodieSinkTask.put is not directly unit-tested, so the change is intentionally a minimal mirror of the existing lookup.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…odieSinkTask.put

Route records through a secondary topic -> partition -> participant index instead of allocating a TopicPartition key per record. The index mirrors transactionParticipants and is maintained in bootstrap, close, and cleanup; the primary TopicPartition-keyed map is unchanged.
@voonhous voonhous added the area:performance Performance optimizations label Jun 16, 2026
@voonhous

Copy link
Copy Markdown
Member

The improvement here is marginal and there is a real risk is future maintainers breaking the two-map sync, which is a maintainability cost rather than a present-day bug.

Not sure if we should merge this in IMO. Would love to get a second opinion on this.

@hudi-bot

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Jun 16, 2026

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR removes a per-record TopicPartition allocation in HoodieSinkTask.put() by maintaining a secondary topic -> partition -> participant index that mirrors transactionParticipants at the same lifecycle points (bootstrap, close, cleanup). The two maps stay in sync, and the lookup runs on the single Connect task thread so HashMap remains safe. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:performance Performance optimizations size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants