-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the pc-kernel issue #2144
Fix the pc-kernel issue #2144
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, very nice!
I'm +1 on the SingleInstanceTask
-like abstraction. On the one hand I feel like it's not necessarily blocking for this PR, but on the other it might help with unit testing some of this which otherwise sounds nightmarish.
If you want a coroutine function not to be called multiple times concurrently with asyncio, you can now use the async_helpers.exclusive decorator. This will wrap the coroutine function in a lock. LP: #2084032 Signed-off-by: Olivier Gayot <[email protected]>
The SystemGetter.get() method essentially does 3 important things: * it mounts stuff in /var/lib/snapd/seed * it queries snapd (and we expect snapd to read contents from /var/lib/snapd/seed) * it then tries to cleanup by unmounting what it previously mounted Unfortunately, if we cancel this task in the middle, especially while it is querying snapd, the cleanup code will fail to unmount things because snapd is still using /var/lib/snapd/seed (cancelling the query does not mean that snapd will immediately release the resources - or even stop processing the query). Typically this results in a failure to unmount pc-kernel.snap with EBUSY. Effectively, if SystemGetter.get() gets cancelled, we would need to wait until the query to snapd finishes before we run the cleanup code. But the approach that this patch takes is to make sure SystemGetter.get() is not cancelled in the first place. SystemGetter.get() was subject to cancellation because it runs as part of _examine_systems - which is cancelled and restarted when the source changes. We now make sure that we shield SystemGetter.get() from cancellation when we execute it from _examine_systems. If _examine_system gets cancelled, SystemGetter.get() will now continue running in the background. This leads to another problem. If we cancel _examine_system and rerun it, we will end up with two concurrent calls to SystemGetter.get() (the new one + the one we want to finish). Obviously, things go sideways if this happens. To avoid the problem, we now use the async_helpers.exclusive decorator for SystemGetter.get() - so that only one task runs the critical section at a given time. LP: #2084032 Signed-off-by: Olivier Gayot <[email protected]>
Internally, SystemGetter.get(variation) expects that the variation exists for the currently configured source. However, because of the lock we added to SystemGetter.get(variation), it is now more likely that the configured source has changed by the time we start mounting stuff and querying snapd. When that happens, SystemGetter.get(variation) fails with KeyError ; because the variation is no longer present in the currently configured source. Fixed by making sure that we use the source handler that was configured when SystemGetter.get(variation) was first invoked. LP: #2084032 Signed-off-by: Olivier Gayot <[email protected]>
When calling a coroutine function that is decorated with "async_helpers.exclusive", one can now optionally pass an asyncio.Event. If it is specified, the coroutine will set the event after it starts executing the body of the coroutine (i.e. after it acquires the lock). This is useful if the coroutine function is normally not safe to cancel in the middle of its critical section. Using this mechanism, the parent can know if the task can still be cancelled or if it is too late. Signed-off-by: Olivier Gayot <[email protected]>
Now that SystemGetter.get is marked async_helpers.exclusive, when we change the source multiple times quickly, we create a "queue" of SystemGetter.get tasks. This can take a long time to consume, and really we only need the result of the last query. When _examine_systems gets cancelled, we now check if the SystemGetter.get task has successfully acquired the lock. If it hasn't, then we cancel it. LP: #2084032 Signed-off-by: Olivier Gayot <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kind of distressing that all this machinery is needed but I don't see any real scope for simplifications.
I have one very minor suggestion otherwise this LGTM! (and well done for fighting through it all!)
Initial problem
The
SystemGetter.get
method essentially does 3 important things:/var/lib/snapd/seed
/var/lib/snapd/seed
)Unfortunately, if we cancel this task in the middle, especially while it is querying snapd, the cleanup code will fail to unmount things because snapd is still using
/var/lib/snapd/seed
(cancelling the query does not mean that snapd will immediately release the resources - or even stop processing the query). Typically this results in a failure to unmount pc-kernel.snap with EBUSY.Effectively, if
SystemGetter.get
gets cancelled, we would need to wait until the query to snapd finishes before we run the cleanup code.But the approach that this PR takes is to make sure
SystemGetter.get
is not cancelled in the first place.SystemGetter.get
was subject to cancellation because it runs as part of_examine_systems
- which is cancelled and restarted when the source changes.We now make sure that we shield
SystemGetter.get
from cancellation when we execute it from_examine_systems
. If_examine_system
gets cancelled,SystemGetter.get
will now continue running in the background.Which leads to problem 2
If we cancel
_examine_system
and rerun it, we end up with two concurrent calls toSystemGetter.get
(the new one + the one we want to finish). Obviously, things go sideways if this happens.To avoid the problem, we now acquire a lock at the beginning of
SystemGetter.get
- so that only one task runs the critical section at a given time.Which leads to problem 3
Internally,
SystemGetter.get(variation)
expects that the variation exists for the currently configured source.However, because of the lock we added to
SystemGetter.get
, it is now more likely that the configured source has changed by the time we start mounting stuff and querying snapd.When that happens,
SystemGetter.get(variation)
fails withKeyError
; because the variation is no longer present in the currently configured source.Fixed by making sure that we use the source handler that was configured when
SystemGetter.get(variation)
was first invoked.This bug was already present but way more difficult to reproduce timing-wise.
Which leads to problem 4
With the lock added to
SystemGetter.get
, when we change the source multiple times quickly, we create a "queue" ofSystemGetter
.get tasks. This can take a long time to consume, and really we only care about the result of the last query.When
_examine_systems
gets cancelled, we now check if theSystemGetter.get
task has successfully acquired the lock. If it hasn't, then we cancel it.LP:#2084032