perf(kafka-connect): avoid per-record TopicPartition allocation in HoodieSinkTask.put #19018
…odieSinkTask.put Route records through a secondary topic -> partition -> participant index instead of allocating a TopicPartition key per record. The index mirrors transactionParticipants and is maintained in bootstrap, close, and cleanup; the primary TopicPartition-keyed map is unchanged.
The improvement here is marginal, and there is a real risk of future maintainers breaking the two-map sync, which is a maintainability cost rather than a present-day bug. Not sure we should merge this, IMO. Would love to get a second opinion on this.
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR removes a per-record TopicPartition allocation in HoodieSinkTask.put() by maintaining a secondary topic -> partition -> participant index that mirrors transactionParticipants at the same lifecycle points (bootstrap, close, cleanup). The two maps stay in sync, and the lookup runs on the single Connect task thread so HashMap remains safe. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
Describe the issue this Pull Request addresses
HoodieSinkTask.put allocates a new TopicPartition(topic, partition) for every incoming record solely to look the record's participant up in the transactionParticipants map, then discards it. On a high-throughput sink this is one short-lived allocation per record. A JMH micro-benchmark confirms the allocation is real and is not eliminated by escape analysis.
Summary and Changelog
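For illustration only, the two lookup shapes can be sketched as below. This is a hypothetical, dependency-free sketch and not the code in this PR: RoutingIndexSketch, TopicPartitionKey (a stand-in for Kafka's TopicPartition), register, unregister, and the use of String values in place of participant objects are all assumptions made so the example compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class RoutingIndexSketch {

    // Hypothetical stand-in for org.apache.kafka.common.TopicPartition.
    static final class TopicPartitionKey {
        final String topic;
        final int partition;

        TopicPartitionKey(String topic, int partition) {
            this.topic = topic;
            this.partition = partition;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof TopicPartitionKey)) {
                return false;
            }
            TopicPartitionKey other = (TopicPartitionKey) o;
            return partition == other.partition && topic.equals(other.topic);
        }

        @Override
        public int hashCode() {
            return Objects.hash(topic, partition);
        }
    }

    // Primary map, keyed by the composite key (participants are Strings here for brevity).
    private final Map<TopicPartitionKey, String> participants = new HashMap<>();
    // Secondary index: topic -> partition -> participant.
    private final Map<String, Map<Integer, String>> byTopic = new HashMap<>();

    // Both maps are updated together so they cannot drift apart.
    public void register(String topic, int partition, String participant) {
        participants.put(new TopicPartitionKey(topic, partition), participant);
        byTopic.computeIfAbsent(topic, t -> new HashMap<>())
               .put(Integer.valueOf(partition), participant);
    }

    public void unregister(String topic, int partition) {
        participants.remove(new TopicPartitionKey(topic, partition));
        Map<Integer, String> byPartition = byTopic.get(topic);
        if (byPartition != null) {
            byPartition.remove(Integer.valueOf(partition));
            if (byPartition.isEmpty()) {
                byTopic.remove(topic);
            }
        }
    }

    // Composite-key lookup: builds a new key object on every call.
    public String lookupWithKey(String topic, int partition) {
        return participants.get(new TopicPartitionKey(topic, partition));
    }

    // Index lookup: no key object is built. Integer.valueOf returns cached
    // instances for values in -128..127 by default.
    public String lookupWithIndex(String topic, int partition) {
        Map<Integer, String> byPartition = byTopic.get(topic);
        return byPartition == null ? null : byPartition.get(Integer.valueOf(partition));
    }
}
```

Funnelling every mutation through a single pair of methods, as register and unregister do in this sketch, is one way to limit the two-map sync risk raised in review; whether the PR structures its updates this way is not stated here.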
Maintain a secondary topic -> partition -> participant index alongside the existing transactionParticipants map, populated and cleared at the same lifecycle points (bootstrap, close, cleanup). Route records through this index in put() using a topic string lookup plus a partition int lookup (small ints are cached by the JVM), which removes the per-record TopicPartition allocation. The primary TopicPartition-keyed map is unchanged and still used by the assignment loop, preCommit, and partition close.
Impact
Performance only; no public API or behavior change. JMH micro-benchmark of routing one record to its participant (AverageTime mode, gc profiler):
This is a small per-record win; at high record rates it removes roughly 24 B of garbage per record on the put() path. Benchmark code is not included in this PR.
Risk Level
low
Behavior-preserving routing refactor: the secondary index mirrors transactionParticipants and is maintained at the same points, and lookups are equivalent to the previous TopicPartition-keyed lookup, read on the single task thread. The full hudi-kafka-connect unit suite passes. HoodieSinkTask.put is not directly unit-tested, so the change is intentionally a minimal mirror of the existing lookup.
Documentation Update
none