Port tests to test.thing #156

Draft
wants to merge 13 commits into
base: main

allisonkarlitskaya
Collaborator

...probably still broken, but let's see how it goes.

@allisonkarlitskaya allisonkarlitskaya force-pushed the test.thing branch 11 times, most recently from ca2a2b0 to 5bfa302 Compare July 15, 2025 21:32
@cgwalters
Collaborator

Just skimming, test.thing looks quite cool, but it definitely overlaps with several other efforts/projects in this space.

@allisonkarlitskaya
Collaborator Author

> Just skimming, test.thing looks quite cool, but it definitely overlaps with several other efforts/projects in this space.

This work is really being done mostly for the benefit of Cockpit. I've looked into some of the other solutions (mkosi qemu was the closest fit), but after a chat with Daan at devconf and a follow-up offline, our use case was different enough that I decided to have a try at my own thing...

The sheer amount of enablement work that I've been having to do in the distros is a bit of a hint to me that this approach is fairly new...

@allisonkarlitskaya
Collaborator Author

> Just skimming, test.thing looks quite cool, but it definitely overlaps with several other efforts/projects in this space.

Did you have a suggestion for a comparable project I should be looking into?

@allisonkarlitskaya allisonkarlitskaya force-pushed the test.thing branch 8 times, most recently from d7c2a5d to f0651fb Compare July 16, 2025 21:02
@allisonkarlitskaya
Collaborator Author

The Ubuntu thing is a bit perplexing. I can't get it to fail locally, and it was working on previous iterations of the PR (or at least failing with the error that it had before due to the mount API regressions).

I'm slowly starting to suspect that some update or other is causing some issues...

@allisonkarlitskaya allisonkarlitskaya force-pushed the test.thing branch 2 times, most recently from b152cfc to 2c6238f Compare July 16, 2025 22:14
@allisonkarlitskaya allisonkarlitskaya force-pushed the test.thing branch 4 times, most recently from cf3711e to 9048ebe Compare July 17, 2025 19:32
@allisonkarlitskaya allisonkarlitskaya force-pushed the test.thing branch 2 times, most recently from 3f005fd to d59e76a Compare July 17, 2025 20:31
Installing the build dependencies and building the tools takes a long
time.

Signed-off-by: Allison Karlitskaya <[email protected]>
I often disable the `rm -rf tmp` line to make iterating quicker (by
removing the need to pull and write previous layers and objects) but in
order for that to work you still need to delete a couple of things.  Add
a second line so that you can comment out the first one while you're
working, like:

  # rm -rf tmp
  rm -rf tmp/efi tmp/sysroot/composefs/images

Signed-off-by: Allison Karlitskaya <[email protected]>
Run the "install packages into the base image" section as the first
thing that happens, before the cfsctl binary gets copied in.

This means that we have to regenerate the initramfs, which is duplicated
work, but it enables caching of the package installation, which takes
much longer and involves downloading updated packages (which drift with
time).  This helps a lot with iterative local testing, and the bls
images are the ones that I usually use for that.

Also enable package caching for Arch and Debian by adding the
appropriate bind mounts.  Use `apt-get` instead of CLI-API-unstable
`apt`.  Use the new `kernel-install add-all` verb where applicable.

We leave Ubuntu out of this change for now: I keep ending up with two
initramfs files generated and I'm not sure how to work around the weird
use of dracut there.  We can circle back to this later.

Signed-off-by: Allison Karlitskaya <[email protected]>
This is similar logic to the previous commit doing the same for bls/,
but we make an additional change by installing the kernel up-front.
This helps caching, of course, but it also eliminates the "missing
modules" problem that required us to force the modules we needed to be
present in the initramfs.  This is going to be important when we start
using more modules like vsock support.

Signed-off-by: Allison Karlitskaya <[email protected]>
We need to avoid https://bugzilla.redhat.com/show_bug.cgi?id=2374928 but
`semanage` seems to have a bug when invoked more than once in a
container build, so we can't just add another invocation.  Use a module
instead, to work around both issues in one go.

Signed-off-by: Allison Karlitskaya <[email protected]>
We want to start running images with test.thing, so add the workarounds
for improved ssh-vsock support.  These are no longer necessary in arch
and fedora-rawhide.

Unfortunately we can't put these in common/ because it's outside of the
build context.  Using an extra build context also seems not to work
because symlinks aren't copied properly unless it's from the primary
context.

Make a small fix to a comment in examples/uki/Containerfile that should
have been cleaned up as part of a4cbd3e ("Update approach to
handling boot resources").

Signed-off-by: Allison Karlitskaya <[email protected]>
@allisonkarlitskaya allisonkarlitskaya force-pushed the test.thing branch 2 times, most recently from 2165ef8 to 962e7f6 Compare July 17, 2025 22:05
Drop our dependency on cockpit-bots (checked out from its git repository
and requiring libvirt and other heavy dependencies) and switch over to
using test.thing (vendored) via pytest.

We no longer install ssh keys into the images: test.thing generates an
ephemeral key on each run and feeds it into the guest.

Expand examples/README.md to describe how this is all intended to be
used.

Adjust our GitHub workflows appropriately.  The systemd version on the
runner isn't new enough to have systemd-ssh-proxy, so install our
polyfill.  We also need to make sure the vhost-vsock device is
accessible to the user in the same way as /dev/kvm.

Signed-off-by: Allison Karlitskaya <[email protected]>
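
The device-access requirement in the commit message above can be checked before a test run with something like the following. This is only a minimal sketch: the function name `inaccessible_devices` is hypothetical, and `/dev/vhost-vsock` is assumed to be the node that the "vhost-vsock" wording refers to. It is not taken from test.thing or from the workflow files.

```python
import os


def inaccessible_devices(paths=("/dev/kvm", "/dev/vhost-vsock")):
    """Return the device nodes the current user cannot open read-write.

    The default paths are an assumption about which nodes matter here.
    """
    return [p for p in paths if not os.access(p, os.R_OK | os.W_OK)]


if __name__ == "__main__":
    blocked = inaccessible_devices()
    if blocked:
        print("not accessible to this user:", ", ".join(blocked))
    else:
        print("all required device nodes are accessible")
```

A missing node is reported the same way as a permission problem, which is enough to decide whether a runner needs extra setup.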
@allisonkarlitskaya
Collaborator Author

allisonkarlitskaya commented Jul 18, 2025

Facts:

Observations:

  • I've never seen the crash locally
  • the crash seems to go away (most of the time? always?) when disabling ephemeral ssh key generation. This happens regardless of whether the cockpit-test key is still in the image, i.e. ephemeral ssh keys with a static key in the image will crash, but not generating the keys avoids the crash. I have a feeling this is related to the smbios strings we send in for the ephemeral key. It could also be timing-related, but the time it takes to generate an ed25519 key is absolutely trivial.
  • manual inspection also seems to change things. The crash is potentially timing-sensitive. See the above comment about "definitely some random element"
  • on a hunch that qemu was producing bad smbios tables that caused a crash in the early kernel DMI parsing, I upgraded the qemu installed on the runners to the version from plucky, but it didn't fix the problem. If qemu is involved here and the reason that I've never seen it locally is because my local version was fixed, then the fix happened between plucky and what I'm using locally. That's a lot of "if".
  • the fact that the exact image that was already failing in CI is now failing in CI for a new reason, while all of the ones that were working before are still working properly, is not lost on me. It's a heck of a coincidence. But I think it really is just a coincidence.
  • I tried to pin the Ubuntu image back to the kernel version from plucky (we're currently on ubuntu:devel) but it didn't fix the problem
  • an earlier version of this PR tried to include Ubuntu in the bls "caching" commit. That caused us to end up with two initrd files produced and both of them got included in the bootloader entry. Manual inspection showed me that under some conditions we'd end up with an initramfs that was missing erofs support, which seemed pretty weird, because other times it would work. This again seemed related to the presence or absence of smbios strings (which systemd-boot does parse, and we use for adding kernel commandline arguments, in particular console=hvc0). Again: the fact that it was the Ubuntu image that got affected here and none of the others seems extremely coincidental, and again, I do believe that it was just a coincidence.

I feel like this is probably a bug buried deep inside of something or other that has nothing to do with us and is very likely already fixed upstream, but I'd still prefer not to just ignore it. At the very least, this is valuable information for determining minimum recommended versions for test.thing or coming up with a reliable workaround.

Next ideas:

  • try sending our credentials in to the guest base64-encoded to see if that changes something. systemd supports this. Maybe it's something weird with the handling of whitespace or something.
  • try to see if it's related to the version of the EFI BIOS inside of the guest. I tried upgrading qemu but I didn't try upgrading ovmf.
  • maybe something completely unrelated to credentials. I feel like I've maybe gone slightly tunnel-vision on that one and it could very well be something entirely unrelated.
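
The base64 idea from the first bullet could look roughly like this. systemd reads credentials from SMBIOS type 11 strings and accepts a `io.systemd.credential.binary:NAME=BASE64` form alongside the plain `io.systemd.credential:NAME=VALUE` one. How test.thing actually builds its qemu command line is not shown in this thread, so the helper name `smbios_credential` and the use of the `ssh.authorized_keys.root` credential are assumptions made for illustration.

```python
import base64


def smbios_credential(name: str, value: bytes) -> str:
    """Build a qemu -smbios argument carrying a base64-encoded credential."""
    encoded = base64.b64encode(value).decode("ascii")
    # base64 output contains no commas or whitespace, so neither the qemu
    # option parser nor the guest has to deal with quoting or escaping.
    return f"type=11,value=io.systemd.credential.binary:{name}={encoded}"


if __name__ == "__main__":
    # Placeholder key material, for illustration only.
    key = b"ssh-ed25519 PLACEHOLDER test@example\n"
    print("-smbios", smbios_credential("ssh.authorized_keys.root", key))
```

If the crash still happens with this form, whitespace handling in the credential string is probably not the cause.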

we want to see very very early kernel messages
@allisonkarlitskaya
Collaborator Author

Okay, the change in the serial console debugging got us one more message:

EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path

The next line I get after that when I run locally (ie: when it works) is

[ 0.000000] Linux version 6.15.0-4-generic (buildd@lcy02-amd64-054) (x86_64-linux-gnu-gcc-14 (Ubuntu 14.3.0-1ubuntu1) 14.3.0, GNU ld (GNU Binutils for Ubuntu) 2.44.50.20250616) #4-Ubuntu SMP PREEMPT_DYNAMIC Fri Jul 4 14:41:53 UTC 2025 (Ubuntu 6.15.0-4.4-generic 6.15.0)

@allisonkarlitskaya
Collaborator Author

Okay here we go....

   | [    0.660617] LSM: initializing lsm=lockdown,capability,landlock,yama,apparmor,ima,evm
   | [    0.660617] LSM: initializing lsm=lockdown,capability,landlock,yama,apparmor,ima,evm
   | [    0.662470] landlock: Up and running.
   | [    0.662470] landlock: Up and running.
   | [    0.664448] Yama: becoming mindful.
   | [    0.664448] Yama: becoming mindful.
   | [    0.666501] AppArmor: AppArmor initialized

@allisonkarlitskaya
Collaborator Author

So it seems to be getting stuck in the security/integrity/ima code at this point which does a lot of platformy kinda stuff with measurements and TPM and such... that passes a gut-check, I guess...

2 participants