
Conversation

@moritzmoe commented Sep 29, 2025

Currently, items are sorted into the priority queue btree based on their attributes in the following order:

  1. readyAt
  2. priority
  3. added counter

This leads to an issue where items requeued with a readyAt timestamp are always placed in the queue behind items that are already ready, even if they have a higher priority. When the requeued items become ready while lower priority items ahead of them are still being processed, the higher priority items are not handed out first despite being ready.

This is especially problematic during regular reconciles or initial start-ups, where higher priority items are created and then requeued with a readyAt timestamp. Once ready, these items must wait for lower priority items to be processed first, effectively not benefiting from the priority queue.


We therefore propose to adjust the mechanism for how items are sorted in the priority queue (the less() function) to always sort based on:

  1. priority
  2. readyAt
  3. added counter
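
For illustration, a minimal sketch of what the adjusted less() could look like. The item struct and field names are simplified stand-ins, not the actual priorityqueue types from this PR:

```go
package prioqueue

import "time"

// Simplified stand-in for a queue item; not the actual controller-runtime type.
type item struct {
	Priority     int
	ReadyAt      *time.Time // nil means "ready now"
	AddedCounter uint64
}

// less reports whether a sorts before b in the btree under the proposed
// ordering: priority first, then readyAt, then insertion order.
func less(a, b *item) bool {
	if a.Priority != b.Priority {
		return a.Priority > b.Priority // higher priority first
	}
	// Within a priority group, ready items (ReadyAt == nil) come first,
	// followed by not-yet-ready items ordered by their readyAt timestamp.
	if (a.ReadyAt == nil) != (b.ReadyAt == nil) {
		return a.ReadyAt == nil
	}
	if a.ReadyAt != nil && !a.ReadyAt.Equal(*b.ReadyAt) {
		return a.ReadyAt.Before(*b.ReadyAt)
	}
	return a.AddedCounter < b.AddedCounter // finally, insertion order
}
```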

Looking at a simple example with three items:

  • Foo = {readyAt: nil, priority: -100, addedCounter: 1}
  • Bar = {readyAt: nil, priority: -100, addedCounter: 2}
  • Prio = {readyAt: 1s, priority: 0, addedCounter: 3}

(In reality, the readyAt timestamp is an absolute point in time; for simplicity, the illustrations here use relative units of time.)

The current implementation based on readiness → priority → addedCounter would add the items to the queue in the following way:

Foo, Bar, Prio

Now suppose we have one controller that takes 2s to process an item. The queue would hand the items out in the following sequence: Foo → Bar → Prio. Notice that Bar is handed out before Prio, even though Prio became ready (at 1s) while Foo was still being processed.

With the adjusted sorting based on priority → readiness → addedCounter, the queue would have the following structure:

Prio, Foo, Bar

Since Prio would not be ready when the controller first asks for an item, the first item handed out would still be Foo, but the sequence would then be Foo → Prio → Bar: Prio, which has meanwhile become ready, is handed out before Bar because its priority is higher. A simplified version of this example is also part of this PR as the test case "returns high priority item that became ready before low priority items".
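
To make the walk-through above concrete, here is a small, self-contained toy simulation of a single worker that takes 2s per item (not the actual priorityqueue code; readyAt values are relative, as in the example). It reproduces both hand-out sequences:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Toy model of a queue item; ReadyAt is relative, 0 means "ready immediately".
type item struct {
	Name         string
	Priority     int
	ReadyAt      time.Duration
	AddedCounter uint64
}

// handOutOrder repeatedly hands out the first ready item according to less,
// assuming a single worker that takes 2s to process each item.
func handOutOrder(items []item, less func(a, b item) bool) []string {
	const processing = 2 * time.Second
	var order []string
	clock := time.Duration(0)
	for len(items) > 0 {
		sort.Slice(items, func(i, j int) bool { return less(items[i], items[j]) })
		picked := -1
		for i, it := range items {
			if it.ReadyAt <= clock {
				picked = i
				break
			}
		}
		if picked == -1 { // nothing ready yet: wait for the earliest readyAt
			clock = items[0].ReadyAt
			for _, it := range items[1:] {
				if it.ReadyAt < clock {
					clock = it.ReadyAt
				}
			}
			continue
		}
		order = append(order, items[picked].Name)
		items = append(items[:picked], items[picked+1:]...)
		clock += processing
	}
	return order
}

func main() {
	newItems := func() []item {
		return []item{
			{Name: "Foo", Priority: -100, AddedCounter: 1},
			{Name: "Bar", Priority: -100, AddedCounter: 2},
			{Name: "Prio", Priority: 0, ReadyAt: 1 * time.Second, AddedCounter: 3},
		}
	}
	// Current ordering: readyAt, then priority, then added counter.
	oldLess := func(a, b item) bool {
		if a.ReadyAt != b.ReadyAt {
			return a.ReadyAt < b.ReadyAt
		}
		if a.Priority != b.Priority {
			return a.Priority > b.Priority
		}
		return a.AddedCounter < b.AddedCounter
	}
	// Proposed ordering: priority, then readyAt, then added counter.
	newLess := func(a, b item) bool {
		if a.Priority != b.Priority {
			return a.Priority > b.Priority
		}
		if a.ReadyAt != b.ReadyAt {
			return a.ReadyAt < b.ReadyAt
		}
		return a.AddedCounter < b.AddedCounter
	}
	fmt.Println("current sorting: ", handOutOrder(newItems(), oldLess)) // [Foo Bar Prio]
	fmt.Println("proposed sorting:", handOutOrder(newItems(), newLess)) // [Foo Prio Bar]
}
```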


With this new sorting, higher priority items that are not yet ready may sit in the queue ahead of lower priority items that are ready. This requires some adjustments to the spin() function, which is responsible for handing out items to waiters.

Items are now effectively sorted into priority groups, each of which is internally sorted by readiness. This means we need to traverse the tree along the priority groups, check whether the first item of each group has become ready, and finally hand out the first ready item. For this purpose, we propose traversing the tree using a pivot item, which allows us to skip large chunks of the btree. The following diagram illustrates this traversal:

[Diagram: pivot traversal across the priority groups of the btree]

Each color (blue, yellow, and green) forms a group of items with the same priority, which is internally sorted by readiness. The pivot element starts at the first element in the tree—which, due to the new sorting, has the highest priority—checks it (1.1), and sets the nextReady timer to its time (1.2). Because all following items with the same priority have to become ready at the same time or later, the pivot element moves on to the next (yellow) priority group (1.3). The process then repeats: because the first item of this group becomes ready earlier, the nextReady timer is updated (2.2), and the pivot element moves to the next priority group, where the first element is ready and can be handed out to a controller (3.2). Using the pivot element, we avoid traversing the whole tree to find the first ready item while still being able to set a timer for the next ready item by looking at the first item of each priority group. This guarantees a fast tree ascend even with a full queue containing many high priority items with a readyAt timestamp.
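
A rough, self-contained sketch of how such a pivot-based traversal could look with the generic github.com/google/btree API. The item type, the firstReady helper name, and the exact skipping logic are illustrative assumptions, not the PR's actual spin() implementation:

```go
package prioqueue

import (
	"math"
	"time"

	"github.com/google/btree"
)

// Simplified stand-in for a queue item (same shape as in the comparator sketch).
type item struct {
	Priority     int
	ReadyAt      *time.Time // nil means "ready now"
	AddedCounter uint64
}

// firstReady walks the tree one priority group at a time, starting with the
// highest priority. It returns the first ready item it finds and, while
// skipping over groups whose head is not ready yet, the earliest readyAt
// seen so far (used to arm the nextReady timer).
func firstReady(tree *btree.BTreeG[*item], now time.Time) (ready *item, nextReady *time.Time) {
	// The pivot starts before everything: the smallest possible key of the
	// highest possible priority group.
	pivot := &item{Priority: math.MaxInt}

	for {
		// Peek at the first item at or after the pivot, i.e. the head of the
		// current priority group (or of the next non-empty one).
		var head *item
		tree.AscendGreaterOrEqual(pivot, func(i *item) bool {
			head = i
			return false // stop after the first item
		})
		if head == nil {
			return nil, nextReady // walked past the last priority group
		}

		if head.ReadyAt == nil || !head.ReadyAt.After(now) {
			return head, nextReady // head of this group is ready: hand it out
		}

		// Head of this group is not ready. All later items in the group become
		// ready at the same time or later, so remember this readyAt for the
		// nextReady timer and skip the whole group.
		if nextReady == nil || head.ReadyAt.Before(*nextReady) {
			t := *head.ReadyAt
			nextReady = &t
		}
		if head.Priority == math.MinInt {
			return nil, nextReady // no lower priority group can exist
		}
		pivot = &item{Priority: head.Priority - 1} // smallest key of the next group
	}
}
```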


The following line chart of the workqueue_depth metric shows how our change affects the handling of high priority items that are requeued while a large number of low priority items is being reconciled.

[Chart: workqueue_depth over time, before and after the change]

The green line shows 600 low priority items simulating the base load, while the yellow line shows 20 high priority items being requeued. The left-hand side shows the current implementation, where the high priority items become ready but are only processed after the low priority items; the right-hand side shows the updated implementation from this pull request.


In addition, we propose fixing a bug that occurs when a tree ascend is performed to keep the metrics updated while no waiters are present. This ascend bears the risk that, if a waiter becomes available during the ascend, the item at hand is handed out without considering its priority. To avoid this, we propose introducing a metricsAscend flag that ensures the current metrics-only ascend finishes before the next high priority item is handed out through a regular new ascend. (moved out of this PR)


linux-foundation-easycla bot commented Sep 29, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 29, 2025
@k8s-ci-robot (Contributor)

Hi @moritzmoe. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 29, 2025
@alvaroaleman (Member)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 29, 2025
@alvaroaleman alvaroaleman added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Sep 29, 2025
@k8s-ci-robot (Contributor)

@kstiehl: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kstiehl, moritzmoe
Once this PR has been reviewed and has the lgtm label, please assign vincepri for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ReadyAt: nil,
}

for {
Member

So the approach here is to iterate the priorities, starting with the highest until we find a ready item?

Especially with the metrics case, this feels complicated and hard to reason about to me. WDYT about having two btrees, one for not ready, one for ready, the first sorted by readyAt, the second by priority and we move items from the first to the second when they become ready?

Author

So the approach here is to iterate the priorities, starting with the highest until we find a ready item?

Exactly. Starting with the highest priority, we use the pivot item to skip from one priority group to the next in case the first item of the group is not ready.

WDYT about having two btrees, one for not ready, one for ready, the first sorted by readyAt, the second by priority and we move items from the first to the second when they become ready?

To me the single btree feels like the perfect data structure to handle the sorting of the queue based on readiness and priority at the same time. I think having two btrees would probably cause more memory allocations than necessary (adding, removing and moving items between trees) and not necessarily make the code less complex.

@alvaroaleman (Member) commented Oct 4, 2025

I don't really agree, because effectively we need two different sorting algorithms depending on whether an item is ready or not. Outside of that, we already have the problem that we need to update metrics when an item becomes ready, so having an explicit internal transition for that seems cleaner.
That being said, I don't currently have the time to deal with this, and I guess this change is making things more correct. Can you please add a short code comment above the pivot declaration explaining a) the problem (sorting is different depending on whether an item is ready or not) and b) the algorithm you implemented? This code is IMHO not super intuitive, and the explanation in the PR body won't be visible to future readers of the code.

@alvaroaleman (Member)

In addition, we propose fixing a bug that occurs when a tree ascend is performed to keep the metrics updated when no waiters are present. This ascend bears the risk that when a waiter becomes available during the ascend, the item at hand is handed out without considering its priority.

Could we do a separate PR for that, ideally with a test? We would need to interface out the atomic.Int64 and, in the test, insert one that returns 0 the first time and something non-zero the second time to simulate this case.

@moritzmoe (Author)

Could we do a separate PR for that, ideally with a test? We would need to interface out the atomic.Int64 and, in the test, insert one that returns 0 the first time and something non-zero the second time to simulate this case.

Yes, we removed the metricsAscend flag for now.
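
For the follow-up, a minimal sketch (with hypothetical names, not controller-runtime's actual code) of how the waiter count could be interfaced out so a test can return 0 on the first Load and something non-zero afterwards:

```go
package prioqueue

import "sync/atomic"

// counter is the small surface spin() would need from the waiter count.
type counter interface {
	Load() int64
	Add(delta int64) int64
}

// realCounter is the production implementation, backed by atomic.Int64.
type realCounter struct{ atomic.Int64 }

// fakeCounter returns preconfigured values on successive Load calls, e.g.
// 0 first ("no waiters, start a metrics-only ascend") and 1 afterwards
// ("a waiter appeared mid-ascend"), to reproduce the race in a test.
// Not safe for concurrent use; intended for a single-goroutine test.
type fakeCounter struct {
	loads []int64
	calls int
}

func (f *fakeCounter) Load() int64 {
	if len(f.loads) == 0 {
		return 0
	}
	i := f.calls
	if i >= len(f.loads) {
		i = len(f.loads) - 1
	}
	f.calls++
	return f.loads[i]
}

// Add is a no-op for the fake; the test only cares about Load.
func (f *fakeCounter) Add(delta int64) int64 { return delta }

var (
	_ counter = &realCounter{}
	_ counter = &fakeCounter{}
)
```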

@moritzmoe moritzmoe requested a review from alvaroaleman October 2, 2025 16:04
@sbueringer sbueringer mentioned this pull request Oct 3, 2025
@alvaroaleman alvaroaleman changed the title 🐛 fix priority queue sorting 🐛 Fix a bug where the priorityqueue would sometimes not correctly return high-priority items first Oct 4, 2025
@alvaroaleman alvaroaleman changed the title 🐛 Fix a bug where the priorityqueue would sometimes not correctly return high-priority items first 🐛 Fix a bug where the priorityqueue would sometimes not return high-priority items first Oct 5, 2025