-
Notifications
You must be signed in to change notification settings - Fork 1.2k
🐛 Fix a bug where the priorityqueue would sometimes not return high-priority items first #3330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
🐛 Fix a bug where the priorityqueue would sometimes not return high-priority items first #3330
Conversation
Co-authored-by: kstiehl <[email protected]>
Co-authored-by: kstiehl <[email protected]>
Co-authored-by: kstiehl <[email protected]>
Hi @moritzmoe. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/ok-to-test |
@kstiehl: changing LGTM is restricted to collaborators In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: kstiehl, moritzmoe The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
ReadyAt: nil, | ||
} | ||
|
||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the approach here is to iterate the priorities, starting with the highest until we find a ready item?
Especially with the metrics case, this feels complicated and hard to reason about to me. WDYT about having two btrees, one for not ready, one for ready, the first sorted by readAt, the second by priority and we move items from the first to the second when they become ready?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the approach here is to iterate the priorities, starting with the highest until we find a ready item?
exactly, starting with the highest priority we use the pivot item to skip from one priority to the next in case the first item of the priority is not ready.
WDYT about having two btrees, one for not ready, one for ready, the first sorted by readAt, the second by priority and we move items from the first to the second when they become ready?
To me the single btree feels like the perfect data structure to handle the sorting of the queue based on readiness and priority at the same time. I think having two btrees would probably cause more memory allocations than necessary (adding, removing and moving items between trees) and not necessarily make the code less complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really agree, because effectively we need two different sorting algorithms based on if an item is ready or not. Outside of that, we already have the problem that we need to update metrics if an item becomes ready, so having an explicit transition internally for that seems cleaner.
That being said, I don't currently have the time to deal with this and I guess this change is making things more correct. Can you please add a short code comment above the pivot declaration explaining a) the problem (sorting is different depending on if an item is ready or not) b) an explanation of the algo you implemented? This code is IMHO not super intuitive and the explanation on the PR body won't be visible to future readers of the code
Could we do a separate PR for that, ideally with test? We would need to interface out the atomic.Int64 and in the test insert one that returns 0 the first time, something not zero the second time to simulate this case |
yes, we removed the |
Currently, items are sorted into the priority queue btree based on their attributes in the following order:
This leads to an issue where items requeued with a
readyAt
timestamp are always added to the queue behind items that are ready, even if they have higher priority. When the requeued items become ready and lower priority items ahead of them are still being processed, higher priority items won't be processed with priority despite being ready.This is especially problematic during regular reconciles or initial start-ups where higher priority items are created and then requeued using a
readyAt
timestamp. These items must wait for lower priority items to be processed once they become ready, effectively not benefiting of the priority queue.We therefore propose to adjust the mechanism for how items are sorted in the priority queue (the
less()
function) to always sort based on:Looking at a simple example with three items:
Foo = {readyAt: nil, priority: -100, addedCounter: 1}
Bar = {readyAt: nil, priority: -100, addedCounter: 2}
Prio = {readyAt: 1s, priority: 0, addedCounter: 3}
(In reality, the
readyAt
timestamp is an absolute point in time; for simplicity, the illustrations here use relative units of time.)The current implementation based on readiness → priority → addedCounter would add the items to the queue in the following way:
Now suppose we have one controller taking 2s to process an item. The queue would hand out the items to the controller in the following sequence: Foo → Bar → Prio. Notice that
Bar
is handed out beforePrio
even thoughPrio
is already ready afterFoo
was processed for two seconds.With the adjusted sorting based on priority → readiness → addedCounter, the queue would have the following structure:
Since
Prio
would not be ready when the controller first asks for an item, the first item to be handed out would still beFoo
, but the sequence would then be Foo → Prio → Bar. Now,Prio
(which has meanwhile become ready) is handed out beforeBar
becausePrio
's priority is higher thanBar
's priority. A simplified version of this example is also illustrated as part of this PR via the test case: returns high priority item that became ready before low priority items.With this new sorting, items that are not ready with a higher priority might be in the queue before items that are ready with a lower priority. This requires some adjustments to the
spin()
function responsible for handing out items to waiters.Items are essentially sorted in priority groups, which are internally sorted by their readiness. This means that we need to be able to traverse the tree along the priority groups to check if the first item of each priority group has become ready and finally hand out the first ready item. For this purpose, we propose traversing the tree using a pivot item, effectively allowing us to skip large chunks of the binary tree. The following diagram illustrates this traversal:
Each color (blue, yellow, and green) forms a group of items with the same priority which are internally sorted based on their readiness. The pivot element starts at the first element in the tree—which due to the new sorting has the highest priority—checks it (1.1) and sets the
nextReady
timer (1.2) to its time. Because all the following items with the same priority have to be ready at the same time or later, the pivot element moves to the next (yellow) priority group (1.3). Here the process repeats: because the first item here is ready earlier, thenextReady
timer is updated (2.2) and the pivot element moves to the next priority group where the first element is ready and can be handed out to a controller (3.2). Using the pivot element, we avoid having to traverse the whole tree to find the first ready item while still being able to set a timer for the next ready item by looking at the first item of each priority group. This guarantees a fast tree ascend even with a full queue containing many high priority items with areadyAt
timestamp.The following line chart of the
workqueue_depth
metric shows how our change affects the handling of high priority items that are requeued during the reconciliation of large amounts of low priority items.The green line shows 600 low priority items simulating the base load while the yellow one shows 20 high priority items being requeued. On the left-hand side we see the current implementation where the items become ready but are only processed after the low priority items, and on the right-hand side the updated implementation from this pull request.
In addition, we propose fixing a bug that occurs when a tree ascend is performed to keep the metrics updated when no waiters are present. This ascend bears the risk that when a waiter becomes available during the ascend, the item at hand is handed out without considering its priority. To avoid this behavior, we propose introducing a(moved out of this PR)metricsAscend
flag that ensures the current ascend to update metrics is finished and the next item with high priority can be handed out through a regular new ascend.