Port tests to `test.thing` #156

allisonkarlitskaya · 2025-07-15T19:16:09Z

...probably still broken, but let's see how it goes.

cgwalters · 2025-07-15T22:28:35Z

Just skimming, test.thing looks quite cool but definitely overlaps with several other efforts/projects in this space.

allisonkarlitskaya · 2025-07-16T00:33:26Z

> Just skimming test.thing looks quite cool but definitely overlaps with several other efforts/projects this space.

This work is really being done mostly for the benefit of Cockpit. I've looked into some of the other solutions (mkosi qemu was the closest fit), but after a chat with Daan at devconf and a follow-up offline, our use case was different enough that I decided to try building my own thing...

The sheer amount of enablement work that I've been having to do in the distros is a bit of a hint to me that this approach is fairly new...

allisonkarlitskaya · 2025-07-16T00:33:54Z

> Just skimming test.thing looks quite cool but definitely overlaps with several other efforts/projects this space.

Did you have a suggestion for a comparable project I should be looking into?

allisonkarlitskaya · 2025-07-16T21:24:41Z

The ubuntu thing is a bit perplexing. I can't get it to fail locally and it was working on previous iterations of the PR (or at least failing with the error that it had before due to the mount API regressions).

I'm slowly starting to suspect that some update or other is causing some issues...

I often disable the `rm -rf tmp` line to make iterating quicker (by removing the need to pull and write previous layers and objects) but in order for that to work you still need to delete a couple of things.

Add a second line so that you can comment out the first one while you're working, like:

    # rm -rf tmp
    rm -rf tmp/efi tmp/sysroot/composefs/images

Signed-off-by: Allison Karlitskaya <[email protected]>

Run the "install packages into the base image" section as the first thing that happens, before the cfsctl binary gets copied in. This means that we have to regenerate the initramfs, which is duplicated work, but it enables caching of the package installation, which takes much longer and involves downloading updated packages (which drift with time).

This helps a lot with iterative local testing, and the bls images are the ones that I usually use for that.

Also enable package caching for Arch and Debian by adding the appropriate bind mounts. Use `apt-get` instead of CLI-API-unstable `apt`. Use the new `kernel-install add-all` verb where applicable.

We leave Ubuntu out of this change for now: I keep ending up with two initramfs files generated and I'm not sure how to work around the weird use of dracut there. We can circle back to this later.

Signed-off-by: Allison Karlitskaya <[email protected]>

This is similar logic to the previous commit doing the same for bls/, but we make an additional change by installing the kernel up-front. This helps caching, of course, but it also eliminates the "missing modules" problem that required us to force the modules we needed to be present in the initramfs. This is going to be important when we start using more modules like vsock support.

Signed-off-by: Allison Karlitskaya <[email protected]>

We need to avoid https://bugzilla.redhat.com/show_bug.cgi?id=2374928 but `semanage` seems to have a bug when invoked more than once in a container build, so we can't just add another invocation. Use a module instead, to work around both issues in one go.

Signed-off-by: Allison Karlitskaya <[email protected]>

We want to start running images with test.thing, so add the workarounds for improved ssh-vsock support. These are no longer necessary in arch and fedora-rawhide.

Unfortunately we can't put these in common/ because it's outside of the build context. Using an extra build context also seems not to work because symlinks aren't copied properly unless it's from the primary context.

Make a small fix to a comment in examples/uki/Containerfile that should have been cleaned up as part of a4cbd3e ("Update approach to handling boot resources").

Signed-off-by: Allison Karlitskaya <[email protected]>

Drop our dependency on cockpit-bots (checked out from its git repository and requiring libvirt and other heavy dependencies) and switch over to using test.thing (vendored) via pytest.

We no longer install ssh keys into the images: test.thing generates an ephemeral key on each run and feeds it into the guest.

Expand examples/README.md to describe how this is all intended to be used.

Adjust our github workflows appropriately. The systemd version on the runner isn't new enough to have systemd-ssh-proxy, so install our polyfill. We also need to make sure the vhost-vsock is accessible to the user in the same way as /dev/kvm.

Signed-off-by: Allison Karlitskaya <[email protected]>
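The device-access requirement in the commit message above can be checked before a test run. The following is only a minimal sketch: the helper name `inaccessible_devices` is hypothetical, and it assumes the vsock device node is `/dev/vhost-vsock`. It is not part of test.thing or the workflow changes in this PR.

```python
import os

# Device nodes the commit message says must be usable by the test user.
# The /dev/vhost-vsock path is an assumption about where the node lives.
REQUIRED_DEVICES = ("/dev/kvm", "/dev/vhost-vsock")


def inaccessible_devices(paths=REQUIRED_DEVICES):
    """Return the paths the current user cannot open for reading and writing."""
    return [p for p in paths if not os.access(p, os.R_OK | os.W_OK)]


if __name__ == "__main__":
    blocked = inaccessible_devices()
    if blocked:
        print("not accessible:", ", ".join(blocked))
    else:
        print("all required devices are accessible")
```

Running this on a CI runner before starting the VMs would show early whether the permission setup for either device was skipped.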

allisonkarlitskaya · 2025-07-18T20:49:03Z

Facts:

- the Ubuntu image (and only the Ubuntu image) is crashing
- the crash happens very early during kernel startup before any console output appears. I've added timeout support and console logging to test.thing to confirm that: the only thing you see is systemd-boot telling us that it'll boot the image in 1s. I've also confirmed that during manual testing.
- there is definitely some random element to the crash. I've seen the exact same commit ID work (ie: fail with the expected failure that we've been seeing on Ubuntu for ages) and then not work in a separate run:
  - attempt 1 (working): https://github.com/containers/composefs-rs/actions/runs/16357795305/job/46219910738?pr=156
  - attempt 2 (failing): https://github.com/containers/composefs-rs/actions/runs/16357795305/job/46220342374?pr=156

Observations:

- I've never seen the crash locally
- the crash seems to go away (most of the time? always?) when disabling ephemeral ssh key generation. This happens regardless of if the cockpit-test key is still in the image. ie: ephemeral ssh keys with a static key in the image will crash, but not generating the keys avoids the crash. I have a feeling this is related to the smbios strings we send in for the ephemeral key. It could also be timing-related, but the time it takes to generate an ed25519 key is absolutely trivial.
- manual inspection also seems to change things. The crash is potentially timing-sensitive. See the above comment about "definitely some random element"
- on a hunch that qemu was producing bad smbios tables that caused a crash in the early kernel DMI parsing, I upgraded the qemu installed on the runners to the version from plucky, but it didn't fix the problem. If qemu is involved here and the reason that I've never seen it locally is because my local version was fixed, then the fix happened between plucky and what I'm using locally. That's a lot of "if".
- the fact that the exact image that's already failing in CI is now failing in CI for a new reason, and all of the ones that were working before are working properly is not lost on me. It's a heck of a coincidence. But I think it really is just a coincidence.
- I tried to pin the Ubuntu image back to the kernel version from plucky (we're currently on ubuntu:devel) but it didn't fix the problem
- an earlier version of this PR tried to include Ubuntu in the bls "caching" commit. That caused us to end up with two initrd files produced and both of them got included in the bootloader entry. Manual inspection showed me that under some conditions we'd end up with an initramfs that was missing erofs support, which seemed pretty weird, because other times it would work. This again seemed related to the presence or absence of smbios strings (which systemd-boot does parse, and we use for adding kernel commandline arguments, in particular console=hvc0). Again: the fact that it was the Ubuntu image that got affected here and none of the others seems extremely coincidental, and again, I do believe that it was just a coincidence.
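The "two initrd files in the bootloader entry" situation described above is easy to detect mechanically. This is a minimal sketch, assuming a Boot Loader Specification Type #1 entry file, where `initrd` is a key that may appear more than once. The function name and the sample entry are hypothetical and not taken from this repository.

```python
def initrd_values(entry_text: str) -> list[str]:
    """Return the value of every `initrd` key in a BLS Type #1 entry."""
    values = []
    for raw in entry_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # key, then the rest of the line
        if len(parts) == 2 and parts[0] == "initrd":
            values.append(parts[1].strip())
    return values


# Hypothetical entry, made up for illustration only.
SAMPLE_ENTRY = """\
title   Example
linux   /example/vmlinuz
initrd  /example/initrd-a.img
initrd  /example/initrd-b.img
options console=hvc0
"""

if __name__ == "__main__":
    found = initrd_values(SAMPLE_ENTRY)
    print(f"{len(found)} initrd line(s): {found}")
```

A check like `len(initrd_values(text)) == 1` in the image build or test would flag the duplicated-initramfs case without manual inspection.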

I feel like this is probably a bug buried deep inside of something or other that has nothing to do with us and is very likely already fixed upstream, but I'd still prefer not to just ignore it. At the very least, this is valuable information for determining minimum recommended versions for test.thing or coming up with a reliable workaround.

Next ideas:

- try sending our credentials in to the guest base64-encoded to see if that changes something. systemd supports this. Maybe it's something weird with the handling of whitespace or something.
- try to see if it's related to the version of the EFI BIOS inside of the guest. I tried upgrading qemu but I didn't try upgrading ovmf.
- maybe something completely unrelated to credentials. I feel like I've maybe gone slightly tunnel-vision on that one and it could very well be something entirely unrelated.
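For the base64 idea in the list above, here is a rough sketch of what the two encodings look like on the QEMU command line. It assumes the credential travels as an SMBIOS type 11 string, using systemd's `io.systemd.credential:` and `io.systemd.credential.binary:` prefixes. The helper name `smbios_credential` is hypothetical and this is not how test.thing necessarily builds its arguments.

```python
import base64


def smbios_credential(name: str, value: bytes, binary: bool = False) -> list[str]:
    """Build QEMU arguments that pass one systemd credential via SMBIOS type 11.

    With binary=True the value is base64-encoded and sent with the
    io.systemd.credential.binary: prefix; otherwise it is sent as text.
    """
    if binary:
        encoded = base64.b64encode(value).decode("ascii")
        payload = f"io.systemd.credential.binary:{name}={encoded}"
    else:
        payload = f"io.systemd.credential:{name}={value.decode('utf-8')}"
    # QEMU splits option values on ",", so literal commas must be doubled.
    return ["-smbios", "type=11,value=" + payload.replace(",", ",,")]


if __name__ == "__main__":
    print(smbios_credential("demo", b"hello"))
    print(smbios_credential("demo", b"hello", binary=True))
```

The base64 form contains no whitespace, commas, or newlines, which is what makes it a useful experiment for ruling out quoting or whitespace handling in the smbios string. The credential name an actual run would use (for example `ssh.authorized_keys.root`) is an assumption here.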

allisonkarlitskaya · 2025-07-19T14:13:18Z

Okay, the change in the serial console debugging got us one more message:

EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path

The next line I get after that when I run locally (ie: when it works) is

[ 0.000000] Linux version 6.15.0-4-generic (buildd@lcy02-amd64-054) (x86_64-linux-gnu-gcc-14 (Ubuntu 14.3.0-1ubuntu1) 14.3.0, GNU ld (GNU Binutils for Ubuntu) 2.44.50.20250616) #4-Ubuntu SMP PREEMPT_DYNAMIC Fri Jul 4 14:41:53 UTC 2025 (Ubuntu 6.15.0-4.4-generic 6.15.0)

allisonkarlitskaya · 2025-07-19T15:59:09Z

Okay here we go....

   | [    0.660617] LSM: initializing lsm=lockdown,capability,landlock,yama,apparmor,ima,evm
   | [    0.660617] LSM: initializing lsm=lockdown,capability,landlock,yama,apparmor,ima,evm
   | [    0.662470] landlock: Up and running.
   | [    0.662470] landlock: Up and running.
   | [    0.664448] Yama: becoming mindful.
   | [    0.664448] Yama: becoming mindful.
   | [    0.666501] AppArmor: AppArmor initialized

allisonkarlitskaya · 2025-07-21T19:45:27Z

So it seems to be getting stuck in the security/integrity/ima code at this point which does a lot of platformy kinda stuff with measurements and TPM and such... that passes a gut-check, I guess...
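To compare a failing CI console log against a working local one, it helps to reduce each log to "the last boot stage seen". The sketch below assumes the `   | ` prefix and the doubled lines seen in the captured output above. The stage list is illustrative and built only from the messages quoted in this thread, and the function names are hypothetical.

```python
import re

# (label, substring) pairs taken from messages quoted in this conversation.
STAGES = [
    ("EFI stub", "EFI stub: Loaded initrd"),
    ("kernel banner", "Linux version"),
    ("LSM init", "LSM: initializing"),
    ("AppArmor", "AppArmor: AppArmor initialized"),
]


def clean_console(log: str) -> list[str]:
    """Strip the leading '   | ' prefix and collapse adjacent duplicate lines."""
    lines = []
    for raw in log.splitlines():
        line = re.sub(r"^\s*\|\s?", "", raw).rstrip()
        if line and (not lines or lines[-1] != line):
            lines.append(line)
    return lines


def last_stage(log: str):
    """Return the label of the last known stage found in the log, or None."""
    reached = None
    for line in clean_console(log):
        for label, needle in STAGES:
            if needle in line:
                reached = label
    return reached


if __name__ == "__main__":
    sample = (
        "   | [    0.664448] Yama: becoming mindful.\n"
        "   | [    0.664448] Yama: becoming mindful.\n"
        "   | [    0.666501] AppArmor: AppArmor initialized\n"
    )
    print(last_stage(sample))
```

For the output quoted above, the last stage reached is the AppArmor message, which matches the observation that the hang comes right after it.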

allisonkarlitskaya force-pushed the test.thing branch 11 times, most recently from ca2a2b0 to 5bfa302 Compare July 15, 2025 21:32

allisonkarlitskaya force-pushed the test.thing branch 8 times, most recently from d7c2a5d to f0651fb Compare July 16, 2025 21:02

allisonkarlitskaya force-pushed the test.thing branch 2 times, most recently from b152cfc to 2c6238f Compare July 16, 2025 22:14

allisonkarlitskaya mentioned this pull request Jul 17, 2025

Initial cleanups for #156 #157

Draft

allisonkarlitskaya force-pushed the test.thing branch 4 times, most recently from cf3711e to 9048ebe Compare July 17, 2025 19:32

allisonkarlitskaya force-pushed the test.thing branch 2 times, most recently from 3f005fd to d59e76a Compare July 17, 2025 20:31

allisonkarlitskaya added 6 commits July 17, 2025 17:32

examples: decrease systemd-boot timeout to 1

5518bf7

Signed-off-by: Allison Karlitskaya <[email protected]>

.github/workflows: cache "patched tools"

05ee4c7

Installing the build dependencies and building the tools takes a long time. Signed-off-by: Allison Karlitskaya <[email protected]>

allisonkarlitskaya force-pushed the test.thing branch from 7929465 to 87de9f0 Compare July 17, 2025 21:34

allisonkarlitskaya force-pushed the test.thing branch 2 times, most recently from 2165ef8 to 962e7f6 Compare July 17, 2025 22:05

allisonkarlitskaya force-pushed the test.thing branch from 06fd3ea to d9c316f Compare July 17, 2025 22:56

switch serial console driver

c81f518

we want to see very very early kernel messages

allisonkarlitskaya added 2 commits July 19, 2025 11:23

moar!!!

4f4410c

sigh

87b27b1

allisonkarlitskaya added 2 commits July 19, 2025 12:01

disable apparmor

f93e23b

plucky, please

21a50e6

allisonkarlitskaya force-pushed the test.thing branch from 6473252 to 21a50e6 Compare July 19, 2025 18:05
