Fix the pc-kernel issue #2144

ogayot · 2025-01-22T20:07:44Z

Initial problem

The SystemGetter.get method essentially does 3 important things:

it mounts stuff in /var/lib/snapd/seed
it queries snapd (and expect it to read contents from /var/lib/snapd/seed)
it then tries to cleanup by unmounting what it previously mounted

Unfortunately, if we cancel this task in the middle, especially while it is querying snapd, the cleanup code will fail to unmount things because snapd is still using /var/lib/snapd/seed (cancelling the query does not mean that snapd will immediately release the resources - or even stop processing the query). Typically this results in a failure to unmount pc-kernel.snap with EBUSY.

umount: /var/lib/snapd/seed/snaps/pc-kernel_2010.snap: target is busy.

Effectively, if SystemGetter.get gets cancelled, we would need to wait until the query to snapd finishes before we run the cleanup code.

But the approach that this PR takes is to make sure SystemGetter.get is not cancelled in the first place.

SystemGetter.get was subject to cancellation because it runs as part of _examine_systems - which is cancelled and restarted when the source changes.

We now make sure that we shield SystemGetter.get from cancellation when we execute it from _examine_systems. If _examine_system gets cancelled, SystemGetter.get will now continue running in the background.

Which leads to problem 2

If we cancel _examine_system and rerun it, we end up with two concurrent calls to SystemGetter.get (the new one + the one we want to finish). Obviously, things go sideways if this happens.

To avoid the problem, we now acquire a lock at the beginning of SystemGetter.get - so that only one task runs the critical section at a given time.

Which leads to problem 3

Internally, SystemGetter.get(variation) expects that the variation exists for the currently configured source.

However, because of the lock we added to SystemGetter.get, it is now more likely that the configured source has changed by the time we start mounting stuff and querying snapd.

When that happens, SystemGetter.get(variation) fails with KeyError ; because the variation is no longer present in the currently configured source.

Fixed by making sure that we use the source handler that was configured when SystemGetter.get(variation) was first invoked.

This bug was already present but way more difficult to reproduce timing-wise.

Which leads to problem 4

With the lock added to SystemGetter.get, when we change the source multiple times quickly, we create a "queue" of SystemGetter.get tasks. This can take a long time to consume, and really we only care about the result of the last query.

When _examine_systems gets cancelled, we now check if the SystemGetter.get task has successfully acquired the lock. If it hasn't, then we cancel it.

LP:#2084032

dbungert

see note

subiquity/server/controllers/filesystem.py

Chris-Peterson444

Wow, very nice!

I'm +1 on the SingleInstanceTask-like abstraction. On the one hand I feel like it's not necessarily blocking for this PR, but on the other it might help with unit testing some of this which otherwise sounds nightmarish.

If you want a coroutine function not to be called multiple times concurrently with asyncio, you can now use the async_helpers.exclusive decorator. This will wrap the coroutine function in a lock. LP: #2084032 Signed-off-by: Olivier Gayot <[email protected]>

The SystemGetter.get() method essentially does 3 important things: * it mounts stuff in /var/lib/snapd/seed * it queries snapd (and we expect snapd to read contents from /var/lib/snapd/seed) * it then tries to cleanup by unmounting what it previously mounted Unfortunately, if we cancel this task in the middle, especially while it is querying snapd, the cleanup code will fail to unmount things because snapd is still using /var/lib/snapd/seed (cancelling the query does not mean that snapd will immediately release the resources - or even stop processing the query). Typically this results in a failure to unmount pc-kernel.snap with EBUSY. Effectively, if SystemGetter.get() gets cancelled, we would need to wait until the query to snapd finishes before we run the cleanup code. But the approach that this patch takes is to make sure SystemGetter.get() is not cancelled in the first place. SystemGetter.get() was subject to cancellation because it runs as part of _examine_systems - which is cancelled and restarted when the source changes. We now make sure that we shield SystemGetter.get() from cancellation when we execute it from _examine_systems. If _examine_system gets cancelled, SystemGetter.get() will now continue running in the background. This leads to another problem. If we cancel _examine_system and rerun it, we will end up with two concurrent calls to SystemGetter.get() (the new one + the one we want to finish). Obviously, things go sideways if this happens. To avoid the problem, we now use the async_helpers.exclusive decorator for SystemGetter.get() - so that only one task runs the critical section at a given time. LP: #2084032 Signed-off-by: Olivier Gayot <[email protected]>

Internally, SystemGetter.get(variation) expects that the variation exists for the currently configured source. However, because of the lock we added to SystemGetter.get(variation), it is now more likely that the configured source has changed by the time we start mounting stuff and querying snapd. When that happens, SystemGetter.get(variation) fails with KeyError ; because the variation is no longer present in the currently configured source. Fixed by making sure that we use the source handler that was configured when SystemGetter.get(variation) was first invoked. LP: #2084032 Signed-off-by: Olivier Gayot <[email protected]>

When calling a coroutine function that is decorated with "async_helpers.exclusive", one can now optionally pass an asyncio.Event. If it is specified, the coroutine will set the event after it starts executing the body of the coroutine (i.e. after it acquires the lock). This is useful if the coroutine function is normally not safe to cancel in the middle of its critical section. Using this mechanism, the parent can know if the task can still be cancelled or if it is too late. Signed-off-by: Olivier Gayot <[email protected]>

Now that SystemGetter.get is marked async_helpers.exclusive, when we change the source multiple times quickly, we create a "queue" of SystemGetter.get tasks. This can take a long time to consume, and really we only need the result of the last query. When _examine_systems gets cancelled, we now check if the SystemGetter.get task has successfully acquired the lock. If it hasn't, then we cancel it. LP: #2084032 Signed-off-by: Olivier Gayot <[email protected]>

mwhudson

It's kind of distressing that all this machinery is needed but I don't see any real scope for simplifications.

I have one very minor suggestion otherwise this LGTM! (and well done for fighting through it all!)

subiquitycore/async_helpers.py

ogayot force-pushed the pc-kernel branch from 5c91609 to bb0e7de Compare January 22, 2025 20:10

ogayot marked this pull request as ready for review January 22, 2025 20:10

dbungert reviewed Jan 23, 2025

View reviewed changes

subiquity/server/controllers/filesystem.py Show resolved Hide resolved

Chris-Peterson444 reviewed Jan 24, 2025

View reviewed changes

ogayot added 5 commits January 24, 2025 12:23

ogayot force-pushed the pc-kernel branch from bb0e7de to 9218456 Compare January 24, 2025 11:55

Chris-Peterson444 approved these changes Jan 24, 2025

View reviewed changes

dbungert approved these changes Jan 27, 2025

View reviewed changes

mwhudson approved these changes Jan 27, 2025

View reviewed changes

subiquitycore/async_helpers.py Show resolved Hide resolved

ogayot merged commit 23571a5 into canonical:main Jan 28, 2025
10 checks passed

ogayot deleted the pc-kernel branch January 28, 2025 08:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix the pc-kernel issue #2144

Fix the pc-kernel issue #2144

ogayot commented Jan 22, 2025

dbungert left a comment

Chris-Peterson444 left a comment

mwhudson left a comment

Fix the pc-kernel issue #2144

Fix the pc-kernel issue #2144

Conversation

ogayot commented Jan 22, 2025

dbungert left a comment

Choose a reason for hiding this comment

Chris-Peterson444 left a comment

Choose a reason for hiding this comment

mwhudson left a comment

Choose a reason for hiding this comment