
Conversation

@moritzmoe commented Sep 29, 2025

Currently, items are sorted into the priority queue btree based on their attributes in the following order:

  1. readyAt
  2. priority
  3. added counter

This leads to an issue where items requeued with a readyAt timestamp are always placed in the queue behind items that are already ready, even if they have a higher priority. When the requeued items become ready while lower priority items ahead of them are still being processed, the higher priority items are not handed out first despite being ready.

This is especially problematic during regular reconciles or initial start-ups, where higher priority items are created and then requeued with a readyAt timestamp. Once ready, these items must wait for lower priority items to be processed first, effectively not benefiting from the priority queue.


We therefore propose to adjust the mechanism for how items are sorted in the priority queue (the less() function) to always sort based on:

  1. priority
  2. readyAt
  3. added counter
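
For illustration, a minimal sketch of what the adjusted less() could look like. The item struct and field names are simplified stand-ins, not the actual priorityqueue types from this PR:

```go
package prioqueue

import "time"

// Simplified stand-in for a queue item; not the actual controller-runtime type.
type item struct {
	Priority     int
	ReadyAt      *time.Time // nil means "ready now"
	AddedCounter uint64
}

// less reports whether a sorts before b in the btree under the proposed
// ordering: priority first, then readyAt, then insertion order.
func less(a, b *item) bool {
	if a.Priority != b.Priority {
		return a.Priority > b.Priority // higher priority first
	}
	// Within a priority group, ready items (ReadyAt == nil) come first,
	// followed by not-yet-ready items ordered by their readyAt timestamp.
	if (a.ReadyAt == nil) != (b.ReadyAt == nil) {
		return a.ReadyAt == nil
	}
	if a.ReadyAt != nil && !a.ReadyAt.Equal(*b.ReadyAt) {
		return a.ReadyAt.Before(*b.ReadyAt)
	}
	return a.AddedCounter < b.AddedCounter // finally, insertion order
}
```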

Looking at a simple example with three items:

  • Foo = {readyAt: nil, priority: -100, addedCounter: 1}
  • Bar = {readyAt: nil, priority: -100, addedCounter: 2}
  • Prio = {readyAt: 1s, priority: 0, addedCounter: 3}

(In reality, the readyAt timestamp is an absolute point in time; for simplicity, the illustrations here use relative units of time.)

The current implementation based on readiness → priority → addedCounter would add the items to the queue in the following way:

Foo, Bar, Prio

Now suppose we have one controller that takes 2s to process an item. The queue would hand the items out in the following sequence: Foo → Bar → Prio. Notice that Bar is handed out before Prio, even though Prio became ready (at 1s) while Foo was still being processed.

With the adjusted sorting based on priority → readiness → addedCounter, the queue would have the following structure:

Prio, Foo, Bar

Since Prio would not be ready when the controller first asks for an item, the first item handed out would still be Foo, but the sequence would then be Foo → Prio → Bar: Prio, which has meanwhile become ready, is handed out before Bar because its priority is higher. A simplified version of this example is also part of this PR as the test case "returns high priority item that became ready before low priority items".
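
To make the walk-through above concrete, here is a small, self-contained toy simulation of a single worker that takes 2s per item (not the actual priorityqueue code; readyAt values are relative, as in the example). It reproduces both hand-out sequences:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Toy model of a queue item; ReadyAt is relative, 0 means "ready immediately".
type item struct {
	Name         string
	Priority     int
	ReadyAt      time.Duration
	AddedCounter uint64
}

// handOutOrder repeatedly hands out the first ready item according to less,
// assuming a single worker that takes 2s to process each item.
func handOutOrder(items []item, less func(a, b item) bool) []string {
	const processing = 2 * time.Second
	var order []string
	clock := time.Duration(0)
	for len(items) > 0 {
		sort.Slice(items, func(i, j int) bool { return less(items[i], items[j]) })
		picked := -1
		for i, it := range items {
			if it.ReadyAt <= clock {
				picked = i
				break
			}
		}
		if picked == -1 { // nothing ready yet: wait for the earliest readyAt
			clock = items[0].ReadyAt
			for _, it := range items[1:] {
				if it.ReadyAt < clock {
					clock = it.ReadyAt
				}
			}
			continue
		}
		order = append(order, items[picked].Name)
		items = append(items[:picked], items[picked+1:]...)
		clock += processing
	}
	return order
}

func main() {
	newItems := func() []item {
		return []item{
			{Name: "Foo", Priority: -100, AddedCounter: 1},
			{Name: "Bar", Priority: -100, AddedCounter: 2},
			{Name: "Prio", Priority: 0, ReadyAt: 1 * time.Second, AddedCounter: 3},
		}
	}
	// Current ordering: readyAt, then priority, then added counter.
	oldLess := func(a, b item) bool {
		if a.ReadyAt != b.ReadyAt {
			return a.ReadyAt < b.ReadyAt
		}
		if a.Priority != b.Priority {
			return a.Priority > b.Priority
		}
		return a.AddedCounter < b.AddedCounter
	}
	// Proposed ordering: priority, then readyAt, then added counter.
	newLess := func(a, b item) bool {
		if a.Priority != b.Priority {
			return a.Priority > b.Priority
		}
		if a.ReadyAt != b.ReadyAt {
			return a.ReadyAt < b.ReadyAt
		}
		return a.AddedCounter < b.AddedCounter
	}
	fmt.Println("current sorting: ", handOutOrder(newItems(), oldLess)) // [Foo Bar Prio]
	fmt.Println("proposed sorting:", handOutOrder(newItems(), newLess)) // [Foo Prio Bar]
}
```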


With this new sorting, higher priority items that are not yet ready may sit in the queue ahead of lower priority items that are ready. This requires some adjustments to the spin() function, which is responsible for handing out items to waiters.

Items are now effectively sorted into priority groups, each of which is internally sorted by readiness. This means we need to traverse the tree along the priority groups, check whether the first item of each group has become ready, and finally hand out the first ready item. For this purpose, we propose traversing the tree using a pivot item, which allows us to skip large chunks of the btree. The following diagram illustrates this traversal:

[Diagram: pivot traversal across the priority groups of the btree]

Each color (blue, yellow, and green) forms a group of items with the same priority, which is internally sorted by readiness. The pivot element starts at the first element in the tree—which, due to the new sorting, has the highest priority—checks it (1.1), and sets the nextReady timer to its time (1.2). Because all following items with the same priority have to become ready at the same time or later, the pivot element moves on to the next (yellow) priority group (1.3). The process then repeats: because the first item of this group becomes ready earlier, the nextReady timer is updated (2.2), and the pivot element moves to the next priority group, where the first element is ready and can be handed out to a controller (3.2). Using the pivot element, we avoid traversing the whole tree to find the first ready item while still being able to set a timer for the next ready item by looking at the first item of each priority group. This guarantees a fast tree ascend even with a full queue containing many high priority items with a readyAt timestamp.
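
A rough, self-contained sketch of how such a pivot-based traversal could look with the generic github.com/google/btree API. The item type, the firstReady helper name, and the exact skipping logic are illustrative assumptions, not the PR's actual spin() implementation:

```go
package prioqueue

import (
	"math"
	"time"

	"github.com/google/btree"
)

// Simplified stand-in for a queue item (same shape as in the comparator sketch).
type item struct {
	Priority     int
	ReadyAt      *time.Time // nil means "ready now"
	AddedCounter uint64
}

// firstReady walks the tree one priority group at a time, starting with the
// highest priority. It returns the first ready item it finds and, while
// skipping over groups whose head is not ready yet, the earliest readyAt
// seen so far (used to arm the nextReady timer).
func firstReady(tree *btree.BTreeG[*item], now time.Time) (ready *item, nextReady *time.Time) {
	// The pivot starts before everything: the smallest possible key of the
	// highest possible priority group.
	pivot := &item{Priority: math.MaxInt}

	for {
		// Peek at the first item at or after the pivot, i.e. the head of the
		// current priority group (or of the next non-empty one).
		var head *item
		tree.AscendGreaterOrEqual(pivot, func(i *item) bool {
			head = i
			return false // stop after the first item
		})
		if head == nil {
			return nil, nextReady // walked past the last priority group
		}

		if head.ReadyAt == nil || !head.ReadyAt.After(now) {
			return head, nextReady // head of this group is ready: hand it out
		}

		// Head of this group is not ready. All later items in the group become
		// ready at the same time or later, so remember this readyAt for the
		// nextReady timer and skip the whole group.
		if nextReady == nil || head.ReadyAt.Before(*nextReady) {
			t := *head.ReadyAt
			nextReady = &t
		}
		if head.Priority == math.MinInt {
			return nil, nextReady // no lower priority group can exist
		}
		pivot = &item{Priority: head.Priority - 1} // smallest key of the next group
	}
}
```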


The following line chart of the workqueue_depth metric shows how our change affects the handling of high priority items that are requeued while a large number of low priority items is being reconciled.

[Chart: workqueue_depth over time, before and after the change]

The green line shows 600 low priority items simulating the base load, while the yellow line shows 20 high priority items being requeued. The left-hand side shows the current implementation, where the high priority items become ready but are only processed after the low priority items; the right-hand side shows the updated implementation from this pull request.


In addition, we propose fixing a bug that occurs when a tree ascend is performed to keep the metrics updated while no waiters are present. This ascend bears the risk that, if a waiter becomes available during the ascend, the item at hand is handed out without considering its priority. To avoid this, we propose introducing a metricsAscend flag that ensures the current metrics-only ascend finishes before the next high priority item is handed out through a regular new ascend. (moved out of this PR)


linux-foundation-easycla bot commented Sep 29, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 29, 2025
@k8s-ci-robot (Contributor)

Hi @moritzmoe. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 29, 2025
@alvaroaleman (Member)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 29, 2025
@alvaroaleman alvaroaleman added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Sep 29, 2025
@k8s-ci-robot (Contributor)

@kstiehl: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kstiehl, moritzmoe
Once this PR has been reviewed and has the lgtm label, please assign vincepri for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ReadyAt: nil,
}

for {
Member

So the approach here is to iterate the priorities, starting with the highest until we find a ready item?

Especially with the metrics case, this feels complicated and hard to reason about to me. WDYT about having two btrees, one for not ready, one for ready, the first sorted by readyAt, the second by priority and we move items from the first to the second when they become ready?

Author

So the approach here is to iterate the priorities, starting with the highest until we find a ready item?

Exactly. Starting with the highest priority, we use the pivot item to skip from one priority group to the next in case the first item of the group is not ready.

WDYT about having two btrees, one for not ready, one for ready, the first sorted by readyAt, the second by priority and we move items from the first to the second when they become ready?

To me the single btree feels like the perfect data structure to handle the sorting of the queue based on readiness and priority at the same time. I think having two btrees would probably cause more memory allocations than necessary (adding, removing and moving items between trees) and not necessarily make the code less complex.

@alvaroaleman (Member) commented Oct 4, 2025

I don't really agree, because effectively we need two different sorting algorithms depending on whether an item is ready or not. Outside of that, we already have the problem that we need to update metrics when an item becomes ready, so having an explicit internal transition for that seems cleaner.
That being said, I don't currently have the time to deal with this, and I guess this change is making things more correct. Can you please add a short code comment above the pivot declaration explaining a) the problem (sorting is different depending on whether an item is ready or not) and b) the algorithm you implemented? This code is IMHO not super intuitive, and the explanation in the PR body won't be visible to future readers of the code.

@alvaroaleman (Member)

In addition, we propose fixing a bug that occurs when a tree ascend is performed to keep the metrics updated when no waiters are present. This ascend bears the risk that when a waiter becomes available during the ascend, the item at hand is handed out without considering its priority.

Could we do a separate PR for that, ideally with a test? We would need to interface out the atomic.Int64 and, in the test, insert one that returns 0 the first time and something non-zero the second time to simulate this case.

@moritzmoe (Author)

Could we do a separate PR for that, ideally with a test? We would need to interface out the atomic.Int64 and, in the test, insert one that returns 0 the first time and something non-zero the second time to simulate this case.

Yes, we removed the metricsAscend flag for now.
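
For the follow-up, a minimal sketch (with hypothetical names, not controller-runtime's actual code) of how the waiter count could be interfaced out so a test can return 0 on the first Load and something non-zero afterwards:

```go
package prioqueue

import "sync/atomic"

// counter is the small surface spin() would need from the waiter count.
type counter interface {
	Load() int64
	Add(delta int64) int64
}

// realCounter is the production implementation, backed by atomic.Int64.
type realCounter struct{ atomic.Int64 }

// fakeCounter returns preconfigured values on successive Load calls, e.g.
// 0 first ("no waiters, start a metrics-only ascend") and 1 afterwards
// ("a waiter appeared mid-ascend"), to reproduce the race in a test.
// Not safe for concurrent use; intended for a single-goroutine test.
type fakeCounter struct {
	loads []int64
	calls int
}

func (f *fakeCounter) Load() int64 {
	if len(f.loads) == 0 {
		return 0
	}
	i := f.calls
	if i >= len(f.loads) {
		i = len(f.loads) - 1
	}
	f.calls++
	return f.loads[i]
}

// Add is a no-op for the fake; the test only cares about Load.
func (f *fakeCounter) Add(delta int64) int64 { return delta }

var (
	_ counter = &realCounter{}
	_ counter = &fakeCounter{}
)
```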

@moritzmoe moritzmoe requested a review from alvaroaleman October 2, 2025 16:04
@sbueringer sbueringer mentioned this pull request Oct 3, 2025
@alvaroaleman alvaroaleman changed the title 🐛 fix priority queue sorting 🐛 Fix a bug where the priorityqueue would sometimes not correctly return high-priority items first Oct 4, 2025
@alvaroaleman alvaroaleman changed the title 🐛 Fix a bug where the priorityqueue would sometimes not correctly return high-priority items first 🐛 Fix a bug where the priorityqueue would sometimes not return high-priority items first Oct 5, 2025