Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the pc-kernel issue #2144

Merged
merged 5 commits into from
Jan 28, 2025
Merged

Fix the pc-kernel issue #2144

merged 5 commits into from
Jan 28, 2025

Conversation

ogayot
Copy link
Member

@ogayot ogayot commented Jan 22, 2025

Initial problem

The SystemGetter.get method essentially does 3 important things:

  • it mounts stuff in /var/lib/snapd/seed
  • it queries snapd (and expect it to read contents from /var/lib/snapd/seed)
  • it then tries to cleanup by unmounting what it previously mounted

Unfortunately, if we cancel this task in the middle, especially while it is querying snapd, the cleanup code will fail to unmount things because snapd is still using /var/lib/snapd/seed (cancelling the query does not mean that snapd will immediately release the resources - or even stop processing the query). Typically this results in a failure to unmount pc-kernel.snap with EBUSY.

umount: /var/lib/snapd/seed/snaps/pc-kernel_2010.snap: target is busy.

Effectively, if SystemGetter.get gets cancelled, we would need to wait until the query to snapd finishes before we run the cleanup code.

But the approach that this PR takes is to make sure SystemGetter.get is not cancelled in the first place.

SystemGetter.get was subject to cancellation because it runs as part of _examine_systems - which is cancelled and restarted when the source changes.

We now make sure that we shield SystemGetter.get from cancellation when we execute it from _examine_systems. If _examine_system gets cancelled, SystemGetter.get will now continue running in the background.

Which leads to problem 2

If we cancel _examine_system and rerun it, we end up with two concurrent calls to SystemGetter.get (the new one + the one we want to finish). Obviously, things go sideways if this happens.

To avoid the problem, we now acquire a lock at the beginning of SystemGetter.get - so that only one task runs the critical section at a given time.

Which leads to problem 3

Internally, SystemGetter.get(variation) expects that the variation exists for the currently configured source.

However, because of the lock we added to SystemGetter.get, it is now more likely that the configured source has changed by the time we start mounting stuff and querying snapd.

When that happens, SystemGetter.get(variation) fails with KeyError ; because the variation is no longer present in the currently configured source.

Fixed by making sure that we use the source handler that was configured when SystemGetter.get(variation) was first invoked.

This bug was already present but way more difficult to reproduce timing-wise.

Which leads to problem 4

With the lock added to SystemGetter.get, when we change the source multiple times quickly, we create a "queue" of SystemGetter.get tasks. This can take a long time to consume, and really we only care about the result of the last query.

When _examine_systems gets cancelled, we now check if the SystemGetter.get task has successfully acquired the lock. If it hasn't, then we cancel it.

LP:#2084032

@ogayot ogayot marked this pull request as ready for review January 22, 2025 20:10
Copy link
Collaborator

@dbungert dbungert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see note

Copy link
Contributor

@Chris-Peterson444 Chris-Peterson444 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, very nice!

I'm +1 on the SingleInstanceTask-like abstraction. On the one hand I feel like it's not necessarily blocking for this PR, but on the other it might help with unit testing some of this which otherwise sounds nightmarish.

If you want a coroutine function not to be called multiple times concurrently
with asyncio, you can now use the async_helpers.exclusive decorator. This will
wrap the coroutine function in a lock.

LP: #2084032

Signed-off-by: Olivier Gayot <[email protected]>
The SystemGetter.get() method essentially does 3 important things:
 * it mounts stuff in /var/lib/snapd/seed
 * it queries snapd (and we expect snapd to read contents from /var/lib/snapd/seed)
 * it then tries to cleanup by unmounting what it previously mounted

Unfortunately, if we cancel this task in the middle, especially while it is
querying snapd, the cleanup code will fail to unmount things because snapd is
still using /var/lib/snapd/seed (cancelling the query does not mean that snapd
will immediately release the resources - or even stop processing the query).
Typically this results in a failure to unmount pc-kernel.snap with EBUSY.

Effectively, if SystemGetter.get() gets cancelled, we would need to wait until
the query to snapd finishes before we run the cleanup code.

But the approach that this patch takes is to make sure SystemGetter.get() is
not cancelled in the first place.

SystemGetter.get() was subject to cancellation because it runs as part of
_examine_systems - which is cancelled and restarted when the source changes.

We now make sure that we shield SystemGetter.get() from cancellation when we
execute it from _examine_systems. If _examine_system gets cancelled,
SystemGetter.get() will now continue running in the background.

This leads to another problem. If we cancel _examine_system and rerun it, we
will end up with two concurrent calls to SystemGetter.get() (the new one +
the one we want to finish). Obviously, things go sideways if this happens.

To avoid the problem, we now use the async_helpers.exclusive decorator for
SystemGetter.get() - so that only one task runs the critical section at a given
time.

LP: #2084032

Signed-off-by: Olivier Gayot <[email protected]>
Internally, SystemGetter.get(variation) expects that the variation exists for
the currently configured source.

However, because of the lock we added to SystemGetter.get(variation), it is now
more likely that the configured source has changed by the time we start
mounting stuff and querying snapd.

When that happens, SystemGetter.get(variation) fails with KeyError ; because
the variation is no longer present in the currently configured source.

Fixed by making sure that we use the source handler that was configured when
SystemGetter.get(variation) was first invoked.

LP: #2084032

Signed-off-by: Olivier Gayot <[email protected]>
When calling a coroutine function that is decorated with
"async_helpers.exclusive", one can now optionally pass an asyncio.Event.

If it is specified, the coroutine will set the event after it starts executing
the body of the coroutine (i.e. after it acquires the lock).

This is useful if the coroutine function is normally not safe to cancel in the
middle of its critical section. Using this mechanism, the parent can know if
the task can still be cancelled or if it is too late.

Signed-off-by: Olivier Gayot <[email protected]>
Now that SystemGetter.get is marked async_helpers.exclusive, when we change the
source multiple times quickly, we create a "queue" of SystemGetter.get tasks.
This can take a long time to consume, and really we only need the result of the
last query.

When _examine_systems gets cancelled, we now check if the SystemGetter.get task
has successfully acquired the lock. If it hasn't, then we cancel it.

LP: #2084032

Signed-off-by: Olivier Gayot <[email protected]>
Copy link
Collaborator

@mwhudson mwhudson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of distressing that all this machinery is needed but I don't see any real scope for simplifications.

I have one very minor suggestion otherwise this LGTM! (and well done for fighting through it all!)

@ogayot ogayot merged commit 23571a5 into canonical:main Jan 28, 2025
10 checks passed
@ogayot ogayot deleted the pc-kernel branch January 28, 2025 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants