Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Noble backport for pc kernel fix #2151

Merged
merged 5 commits into from
Jan 28, 2025
Merged

Conversation

ogayot
Copy link
Member

@ogayot ogayot commented Jan 28, 2025

Testing done

Although I haven't managed to reproduce the exact unmount issue myself using the 24.04.1 ISO, I was able to mess up the contents of /var/lib/snapd/seeds by changing the source multiple times. Things that could go wrong before the patch include:

  • The pc.snap or pc-kernel.snap staying mounted in /var/lib/snapd/seed/snaps forever
  • The enhanced-secureboot-desktop staying mounted in /var/lib/snapd/seed/systems forever
  • squashfs-es staying mounted forever
  • "wrong-size" errors upon restarting the installer

All these are fixed after applying the patches - which means the cleanup code is working properly. And therefore no pc-kernel unmount issue should occur.

If you want a coroutine function not to be called multiple times concurrently
with asyncio, you can now use the async_helpers.exclusive decorator. This will
wrap the coroutine function in a lock.

LP: #2084032

Signed-off-by: Olivier Gayot <[email protected]>
(cherry picked from commit 3f340c1)

Backported so it does not run the new unittest on older versions of Python -
since they do not support asyncio.Barrier.
The _get_system() method essentially does 3 important things:
 * it mounts stuff in /var/lib/snapd/seed
 * it queries snapd (and we expect snapd to read contents from /var/lib/snapd/seed)
 * it then tries to cleanup by unmounting what it previously mounted

Unfortunately, if we cancel this task in the middle, especially while it is
querying snapd, the cleanup code will fail to unmount things because snapd is
still using /var/lib/snapd/seed (cancelling the query does not mean that snapd
will immediately release the resources - or even stop processing the query).
Typically this results in a failure to unmount pc-kernel.snap with EBUSY.

Effectively, if _get_system() gets cancelled, we would need to wait until
the query to snapd finishes before we run the cleanup code.

But the approach that this patch takes is to make sure _get_system() is
not cancelled in the first place.

_get_system() was subject to cancellation because it runs as part of
_examine_systems - which is cancelled and restarted when the source changes.

We now make sure that we shield _get_system() from cancellation when we
execute it from _examine_systems. If _examine_system gets cancelled,
_get_system() will now continue running in the background.

This leads to another problem. If we cancel _examine_system and rerun it, we
will end up with two concurrent calls to _get_system() (the new one +
the one we want to finish). Obviously, things go sideways if this happens.

To avoid the problem, we now use the async_helpers.exclusive decorator for
_get_system() - so that only one task runs the critical section at a given
time.

LP: #2084032

Signed-off-by: Olivier Gayot <[email protected]>
(cherry picked from commit 1f94353)

Backported to use _get_system() instead of SystemGetter.get() from the original
patch.
Internally, _get_system(variation) expects that the variation exists for
the currently configured source.

However, because of the lock we added to _get_system(variation), it is now
more likely that the configured source has changed by the time we start
mounting stuff and querying snapd.

When that happens, _get_system(variation) fails with KeyError ; because
the variation is no longer present in the currently configured source.

Fixed by making sure that we use the source handler that was configured when
_get_system(variation) was first invoked.

LP: #2084032

Signed-off-by: Olivier Gayot <[email protected]>
(cherry picked from commit 65983be)

Backported to use _get_system() instead of SystemGetter.get() from the original
patch.
When calling a coroutine function that is decorated with
"async_helpers.exclusive", one can now optionally pass an asyncio.Event.

If it is specified, the coroutine will set the event after it starts executing
the body of the coroutine (i.e. after it acquires the lock).

This is useful if the coroutine function is normally not safe to cancel in the
middle of its critical section. Using this mechanism, the parent can know if
the task can still be cancelled or if it is too late.

Signed-off-by: Olivier Gayot <[email protected]>
(cherry picked from commit 99be092)
Now that _get_systems is marked async_helpers.exclusive, when we change the
source multiple times quickly, we create a "queue" of _get_systems tasks.
This can take a long time to consume, and really we only need the result of the
last query.

When _examine_systems gets cancelled, we now check if the _get_systems task
has successfully acquired the lock. If it hasn't, then we cancel it.

LP: #2084032

Signed-off-by: Olivier Gayot <[email protected]>
(cherry picked from commit 9218456)

Backported to use _get_system() instead of SystemGetter.get() from the original
patch.
@ogayot ogayot merged commit 0558731 into canonical:ubuntu/noble Jan 28, 2025
12 checks passed
@ogayot ogayot deleted the noble-pc-kernel branch January 28, 2025 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants