KAFKA-19478 [3/N]: Use heaps to discover the least loaded process #20172

Open - wants to merge 3 commits into trunk from jmh_benchmarks_third_opt

Conversation

lucasbru (Member) commented Jul 15, 2025

The original implementation uses a linear search to find the least
loaded process in O(n); we can replace this with look-ups in a heap in
O(log(n)), as described below.

Active tasks: For active tasks, we can do exactly the same assignment as
in the original algorithm by first building a heap (by load) of all
processes. When we assign a task, we pick the head off the heap, assign
the task to it, update the load, and re-insert it into the heap in
O(log(n)).
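
For illustration, here is a minimal sketch of that loop (ProcessState, memberWithLeastLoad, addTask, and TaskAssignorException appear elsewhere in this PR; activeTasks and the exact signatures are assumptions):

    PriorityQueue<ProcessState> processByLoad =
        new PriorityQueue<>(Comparator.comparingDouble(ProcessState::load));
    processByLoad.addAll(localState.processIdToState.values());

    for (final TaskId task : activeTasks) {
        // O(log(n)): take the least loaded process off the head of the heap
        final ProcessState leastLoaded = processByLoad.poll();
        if (leastLoaded == null) {
            throw new TaskAssignorException("No process available to assign active task " + task + ".");
        }
        // pick a member within that process, assign, and update the load
        final String member = memberWithLeastLoad(leastLoaded);
        leastLoaded.addTask(member, task, true);
        // O(log(n)): re-insert so the heap reflects the new load
        processByLoad.add(leastLoaded);
    }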

Standby tasks: For standby tasks, we cannot do this optimization
directly, because of the order in which we assign tasks:

  1. We first try to assign task A to a process that previously owned A.
  2. If we did not find such a process, we assign A to the least loaded
    node.
  3. We now try to assign task B to a process that previously owned B.
  4. If we did not find such a process, we assign B to the least loaded
    node.
    ...

The problem is that we cannot efficiently keep a heap (by load)
throughout this process, because finding and removing the process that
previously owned A (and B and ...) in the heap is O(n). We therefore need
to change the order of evaluation to be able to use a heap:

  1. Try to assign all tasks A, B, ... to a process that previously owned
    the task.
  2. Build a heap (by load) of all processes.
  3. Assign all remaining tasks to the least-loaded process that does not
    yet own the task (sketched below). Since at most NumStandbyReplicas
    processes already own the task, we can do this by removing up to
    NumStandbyReplicas processes from the top of the heap, each removal
    in O(log(n)), so we get O(log(NumProcesses)*NumStandbyReplicas) per task.
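
A minimal sketch of step 3, assuming the heap from step 2 (hasTask and memberWithLeastLoad are assumed helpers; the PR's actual code may differ):

    // Assign one remaining standby task to the least loaded process that
    // does not yet own it. At most numStandbyReplicas processes own the
    // task, so the loop pops at most numStandbyReplicas + 1 entries.
    final List<ProcessState> owners = new ArrayList<>();
    ProcessState target = null;
    while (!processByLoad.isEmpty()) {
        final ProcessState candidate = processByLoad.poll();
        if (candidate.hasTask(task)) {
            owners.add(candidate); // already owns the task; try the next one
        } else {
            target = candidate;
            break;
        }
    }
    if (target != null) {
        // assign the standby task and re-insert with the updated load
        target.addTask(memberWithLeastLoad(target), task, false);
        processByLoad.add(target);
    }
    processByLoad.addAll(owners); // restore the owners we popped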

Note that the change in order changes the resulting standby assignments
(although this difference does not show up in the existing unit tests).
I would argue that the new order of assignment will actually yield
better assignments, since the assignment will be more sticky, which has
the potential to reduce the amount of state we have to restore from the
changelog topic after assignments.

In our worst-performing benchmark, this improves the runtime by ~107x.

lucasbru added 2 commits July 21, 2025 14:42
lucasbru force-pushed the jmh_benchmarks_third_opt branch from 94707f5 to 50b029d on July 21, 2025 12:42
final double load = process.load();
boolean isLeastLoadedProcess = localState.processIdToState.values().stream()
.allMatch(p -> p.load() >= load);
return process.hasCapacity() || isLeastLoadedProcess;
}

private void updateHelpers(final Member member, final boolean isActive) {

lucasbru (Member Author):

I just inlined this function

@@ -301,21 +348,13 @@ private String errorMessage(final int numStandbyReplicas, final int i, final Tas
" of " + numStandbyReplicas + " standby tasks for task [" + task + "].";
}

-private boolean isLoadBalanced(final String processId) {
-final ProcessState process = localState.processIdToState.get(processId);
+private boolean isLoadBalanced(final ProcessState process) {

lucasbru (Member Author):

minor: passing the process in here saves us from looking it up in the hashmap again.

-updateHelpers(prevMember, true);
+ProcessState processState = localState.processIdToState.get(prevMember.processId);
+processState.addTask(prevMember.memberId, task, true);
+maybeUpdateTasksPerMember(processState.activeTaskCount());

lucasbru (Member Author):

Just inlining updateHelpers

@@ -115,7 +117,7 @@ private void initialize(final GroupSpec groupSpec, final TopologyDescriber topol
Set<Integer> partitionNoSet = entry.getValue();
for (int partitionNo : partitionNoSet) {
TaskId taskId = new TaskId(entry.getKey(), partitionNo);
-localState.standbyTaskToPrevMember.putIfAbsent(taskId, new HashSet<>());
+localState.standbyTaskToPrevMember.putIfAbsent(taskId, new ArrayList<>());

lucasbru (Member Author):

minor: no need to deduplicate here, so I'd rather just use arrays

it.remove();
}
}

// 3. assign any remaining unassigned tasks
PriorityQueue<ProcessState> processByLoad = new PriorityQueue<>(Comparator.comparingDouble(ProcessState::load));
processByLoad.addAll(localState.processIdToState.values());

lucasbru (Member Author):

Initial build of the priority queue by load

}

private boolean hasUnfulfilledQuota(final Member member) {
return localState.processIdToState.get(member.processId).memberToTaskCounts().get(member.memberId) < localState.tasksPerMember;
}

private void assignStandby(final Set<TaskId> standbyTasks, final int numStandbyReplicas) {
ArrayList<StandbyToAssign> toLeastLoaded = new ArrayList<>(standbyTasks.size() * numStandbyReplicas);

lucasbru (Member Author):

This ArrayList stores all standby tasks that we couldn't assign to a member that previously owned the task; these still need to be assigned to the "least loaded" node.
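
The diff does not show the fields of StandbyToAssign, so the following shape is only a guess at what such a deferred entry needs to carry:

    // Hypothetical shape of a deferred standby assignment (an assumption,
    // not the PR's actual code).
    private static class StandbyToAssign {
        final TaskId taskId;
        final int remainingReplicaCount; // replicas of this task still without a home
        StandbyToAssign(final TaskId taskId, final int remainingReplicaCount) {
            this.taskId = taskId;
            this.remainingReplicaCount = remainingReplicaCount;
        }
    }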

*
* @return Previous member with the least load that does not have the task, or null if no such member exists.
*/
private Member findPrevMemberWithLeastLoad(final ArrayList<Member> members, final TaskId taskId) {

lucasbru (Member Author):

findPrevMemberWithLeastLoad works very similarly to the old findMemberWithLeastLoad - that is, it does a linear search among a collection of candidates.

However, we no longer use it to find the least loaded node among all members - we use a priority queue for that.

It is only used to select the least loaded node among the members that previously owned the task. I replaced the Java Streams based iteration with a loop, since it's more efficient.
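
For comparison, a loop-based search of this kind might look as follows (only the signature appears in the diff above; the body and the hasTask helper are assumptions):

    private Member findPrevMemberWithLeastLoad(final ArrayList<Member> members, final TaskId taskId) {
        Member leastLoaded = null;
        double minLoad = Double.MAX_VALUE;
        for (final Member member : members) {
            final ProcessState process = localState.processIdToState.get(member.processId);
            // skip candidates whose process already hosts the task
            if (!process.hasTask(taskId) && process.load() < minLoad) {
                leastLoaded = member;
                minLoad = process.load();
            }
        }
        return leastLoaded;
    }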

if (streamsProtocolEnabled) {
props.put(StreamsConfig.GROUP_PROTOCOL_CONFIG, GroupProtocol.STREAMS.name().toLowerCase(Locale.getDefault()));
// decrease the session timeout so that we can trigger the rebalance soon after old client left closed
CLUSTER.setGroupSessionTimeout(appId, 10000);
CLUSTER.setGroupHeartbeatTimeout(appId, 1000);

lucasbru (Member Author):

The integration test previously set the session timeout and the heartbeat interval incorrectly for the new protocol. We need to set them on the group level.

This sometimes made the test flaky with the new assignment algorithm, since we iteratively cycle out the "oldest" member and tend to assign its tasks to the next-oldest member. Due to the high session timeout and heartbeat timeout, it could sometimes take too long for the new member to get the new tasks assigned before being cycled out as well.

I don't see a problem in the assignment logic here - actually, it seems useful to assign tasks to "old" members, since they are stable. We are just cycling out the tasks too quickly in this integration test.

@@ -401,6 +401,8 @@ public KafkaConsumer<byte[], byte[]> createConsumerAndSubscribeTo(final Map<Stri

private void addDefaultBrokerPropsIfAbsent(final Properties brokerConfig) {
brokerConfig.putIfAbsent(CleanerConfig.LOG_CLEANER_DEDUPE_BUFFER_SIZE_PROP, 2 * 1024 * 1024L);
brokerConfig.putIfAbsent(GroupCoordinatorConfig.STREAMS_GROUP_MIN_SESSION_TIMEOUT_MS_CONFIG, "100");

lucasbru (Member Author):

For integration tests, we want to use rather extreme values so that they run quickly.

@@ -99,7 +99,7 @@ private Properties streamsConfiguration(final boolean streamsProtocolEnabled) {
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.IntegerSerde.class);
if (streamsProtocolEnabled) {
streamsConfiguration.put(StreamsConfig.GROUP_PROTOCOL_CONFIG, GroupProtocol.STREAMS.name().toLowerCase(Locale.getDefault()));
-CLUSTER.setStandbyReplicas("app-" + safeTestName, 1);
+CLUSTER.setGroupStandbyReplicas("app-" + safeTestName, 1);

lucasbru (Member Author):

Just renamed

lucasbru requested review from bbejeck, Copilot and mjsax on July 21, 2025 13:07

Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR implements a heap-based optimization for the Kafka Streams task assignor to improve performance by replacing O(n) linear searches with O(log n) heap operations when finding the least loaded process. The optimization uses priority queues to efficiently manage process load ordering during both active and standby task assignment.

Key changes:

  • Replace linear search with heap-based lookups for finding least loaded processes
  • Refactor standby task assignment order to enable heap optimization
  • Add utility methods and data structures to support the heap-based approach

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

  • StickyTaskAssignor.java - Core algorithmic changes implementing heap-based task assignment with new data structures and methods
  • EmbeddedKafkaCluster.java - Add utility methods for configuring group timeouts and rename an existing method for consistency
  • StandbyTaskCreationIntegrationTest.java - Update method call to use the renamed setGroupStandbyReplicas method
  • SmokeTestDriverIntegrationTest.java - Refactor timeout configuration to use the new group-specific methods when the streams protocol is enabled

Comment on lines +203 to 207
throw new TaskAssignorException("No process available to assign active task {}." + task);
}
String member = memberWithLeastLoad(processWithLeastLoad);
if (member == null) {
log.error("Unable to assign active task {} to any member.", task);
throw new TaskAssignorException("No member available to assign active task {}." + task);

Copilot AI, Jul 21, 2025:

The error message format is incorrect. The placeholder '{}' is not being used properly with string concatenation. Should be either 'No process available to assign active task ' + task + '.' or use proper string formatting.
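
Applied to the first throw statement above, the suggested fix would read:

    throw new TaskAssignorException("No process available to assign active task " + task + ".");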


Copilot AI, Jul 21, 2025 (on the same lines):

The error message format is incorrect. The placeholder '{}' is not being used properly with string concatenation. Should be either 'No member available to assign active task ' + task + '.' or use proper string formatting.

…roup/streams/assignor/StickyTaskAssignor.java

Co-authored-by: Copilot <[email protected]>