Conversation

@ArangoGutierrez (Collaborator) commented Aug 8, 2025

This pull request refactors the test setup for nvidia-container-cli end-to-end tests by introducing a new abstraction for running commands inside nested containers. The main logic for preparing and launching test containers, installing Docker and NVIDIA toolkit components, and executing test scripts has been moved from the test file to a new nestedContainerRunner type in runner.go. This change simplifies the test code and improves maintainability by encapsulating container setup and teardown logic.

The nestedContainerRunner also allows us to introduce a new E2E test for the nvidia-cdi-refresh systemd unit.

These changes make the end-to-end tests easier to read and maintain, and centralize the logic for running nested containers.

@ArangoGutierrez ArangoGutierrez self-assigned this Aug 8, 2025
@Copilot (Copilot AI) left a comment
Pull Request Overview

This PR implements a nested container runner for the E2E test suite, allowing tests to run containers inside other containers using either remote or local runners.

  • Adds a new nestedContainerRunner type that wraps existing runners and executes commands inside nested containers
  • Moves Docker and Container Toolkit installation templates from test files to runner.go for reusability
  • Refactors the nvidia-container-cli test to use the new nested container runner instead of inline container setup

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
tests/e2e/runner.go Implements the core nestedContainerRunner with Docker/CTK installation templates and container management
tests/e2e/nvidia-container-cli_test.go Refactors test to use the new nested container runner, removing duplicate container setup code


// If a container with the same name exists from a previous test run, remove it first.
// Ignore errors as container might not exist
_, _, err := runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck
Copilot AI commented Aug 8, 2025
The error check is performed but the result is not used. The //nolint:errcheck comment suggests intentional error ignoring, but then the error is checked on line 155. Either remove the error check or remove the nolint comment.

Suggested change
_, _, err := runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck
_, _, err := runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName))



output = strings.TrimSpace(output)
if output == "" {
return nil, fmt.Errorf("no toolkit libraries found") //nolint:goerr113
Copilot AI commented Aug 8, 2025
[nitpick] The //nolint:goerr113 comment suggests using wrapped errors is intentionally avoided, but using errors.New() would be more appropriate than fmt.Errorf() for static error messages without formatting.

Suggested change
return nil, fmt.Errorf("no toolkit libraries found") //nolint:goerr113
return nil, errors.New("no toolkit libraries found") //nolint:goerr113


}

// install docker in the nested container
_, _, err = runner.Run(fmt.Sprintf("docker exec -u root "+containerName+" bash -c '%s'", installDockerTemplate))
Copilot AI commented Aug 8, 2025
[nitpick] String concatenation in the middle of fmt.Sprintf is inconsistent with the formatting pattern. Consider using fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, installDockerTemplate) for better readability.

Suggested change
_, _, err = runner.Run(fmt.Sprintf("docker exec -u root "+containerName+" bash -c '%s'", installDockerTemplate))
_, _, err = runner.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, installDockerTemplate))


return nil, fmt.Errorf("failed to execute installCTK template: %w", err)
}

_, _, err = runner.Run(fmt.Sprintf("docker exec -u root "+containerName+" bash -c '%s'", toolkitInstall.String()))
Copilot AI commented Aug 8, 2025
[nitpick] Same inconsistent string concatenation pattern as line 216. Consider using fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, toolkitInstall.String()) for consistency.

Suggested change
_, _, err = runner.Run(fmt.Sprintf("docker exec -u root "+containerName+" bash -c '%s'", toolkitInstall.String()))
_, _, err = runner.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, toolkitInstall.String()))


@@ -131,6 +301,10 @@ func (r remoteRunner) Run(script string) (string, string, error) {
return stdout.String(), "", nil
}

func (r nestedContainerRunner) Run(script string) (string, string, error) {
return r.runner.Run(fmt.Sprintf("docker exec -u root "+r.containerName+" bash -c '%s'", script))
Copilot AI commented Aug 8, 2025
[nitpick] Same inconsistent string concatenation pattern. Consider using fmt.Sprintf("docker exec -u root %s bash -c '%s'", r.containerName, script) for consistency with the codebase formatting style.

Suggested change
return r.runner.Run(fmt.Sprintf("docker exec -u root "+r.containerName+" bash -c '%s'", script))
return r.runner.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", r.containerName, script))


@coveralls commented Aug 8, 2025

Pull Request Test Coverage Report for Build 17156139177

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 35.749%

Totals Coverage Status
Change from base Build 17110221655: 0.0%
Covered Lines: 4657
Relevant Lines: 13027

💛 - Coveralls

Comment on lines 32 to 38
startTestContainerTemplate = `docker run -d --name {{.ContainerName}} --privileged --runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=runtime.nvidia.com/gpu=all \
-e NVIDIA_DRIVER_CAPABILITIES=all \
{{ range $i, $a := .AdditionalArguments -}}
{{ $a }} \
{{ end -}}
ubuntu sleep infinity`
Member:

Is this not the only thing that is required for the nested container runner? The other scripts that have been moved here are test-specific. (The one exception is the additional arguments required to mount the host installed libraries into the container that is started).

Collaborator Author:

Not really.
The installDockerTemplate template is needed to install Docker inside Docker to enable nested runs.
The installCTKTemplate template, together with the installCTK argument to NewNestedContainerRunner, allows us to create nested runners with or without the toolkit preinstalled inside the nested environment.

That was my reasoning for moving all the templates to runner.go, leaving only libnvidiaContainerCliTestTemplate in nvidia-container-cli_test.go, as that one is indeed a test-specific template.

Member:

When we add logic to install podman in the container, would that also be implemented in runner.go? Also, looking forward, how do we reuse the nested runner to also run other toolkit tests?

Collaborator Author:

> When we add logic to install podman in the container, would that also be implemented in runner.go?

We could modify the newNestedRunner to select the nested runtime.

> looking forward, how do we reuse the nested runner to also run other toolkit tests?

You mean rewriting the current tests with the nestedRunner?

Member:

I don't mean that we should rewrite the current tests with the nestedRunner, but that the current tests should already be able to run against a nestedRunner.

As a question: from the perspective of the existing toolkit tests, would we expect a difference if these are running directly on an instance (i.e. the current runner) or in a containerized OS (the nested runner)? For all tests what we do is:

  • Start a test environment (an AWS instance, an outer container) with the NVIDIA Driver available
  • Install the prerequisites (currently Docker, but could be extended to Podman)

At this point we currently have a difference between the two classes of tests. For one of these we use the toolkit container to install the toolkit to the host. For the second, we copy the packages out of the container and install these on the "host" (actually the outer container). Note that for the systemd tests, this is what we would be "testing".

@ArangoGutierrez ArangoGutierrez requested a review from elezar August 11, 2025 13:31
@elezar (Member) commented Aug 12, 2025

As a general note: our motivation for adding this runner is to be able to test the generation of CDI specs by the systemd unit. Does it make sense to add that to this test as well so that the required functionality drives the types of changes we make?

@ArangoGutierrez (Collaborator Author):

> As a general note: our motivation for adding this runner is to be able to test the generation of CDI specs by the systemd unit. Does it make sense to add that to this test as well so that the required functionality drives the types of changes we make?

OK, I just wanted a small, focused PR, but I am OK with expanding the scope. I'll work on it.

@elezar (Member) commented Aug 19, 2025

As discussed offline: the intent of the "nested" Runner is to allow us to test other environments -- such as different target operating systems -- using our existing tests. With this in mind, we need to rework how we're constructing our "runner" and move it closer to being infrastructure instead of being so closely tied to each test.

Ideally the same tests should be able to be run against:

  • The version of the toolkit already installed on the system
  • The version under test in CI installed on an AWS instance
  • The version under test installed in an "outer" container with a specific distribution

Note that if the last of these can be made reliable enough, then the middle one (i.e. installing directly on an AWS instance) should not even be required.

)

const (
nvidiaCdiRefreshServiceTestTemplate = `
Member:

Question: Is there an advantage to having this as a single test script instead of separate It contexts?

Collaborator Author:

You are right. I have divided it into self-contained It contexts; now it runs as follows:

nvidia-cdi-refresh when installing nvidia-container-toolkit should load the nvidia-cdi-refresh.path and nvidia-cdi-refresh.service units [systemd-unit]
/Users/eduardoa/src/github/nvidia/nvidia-container-toolkit/tests/e2e/nvidia-cdi-refresh_test.go:105
• [60.608 seconds]
------------------------------
nvidia-cdi-refresh when installing nvidia-container-toolkit should generate the nvidia.yaml file [systemd-unit]
/Users/eduardoa/src/github/nvidia/nvidia-container-toolkit/tests/e2e/nvidia-cdi-refresh_test.go:110
• [2.388 seconds]
------------------------------
nvidia-cdi-refresh when installing nvidia-container-toolkit should refresh the nvidia.yaml file after upgrading the nvidia-container-toolkit [systemd-unit]
/Users/eduardoa/src/github/nvidia/nvidia-container-toolkit/tests/e2e/nvidia-cdi-refresh_test.go:115
• [8.038 seconds]

if command -v systemctl >/dev/null 2>&1; then
SYSTEMD_STATE=$(systemctl is-system-running 2>/dev/null || true)
case "$SYSTEMD_STATE" in
running|degraded|maintenance)
Member:

Question: What does maintenance mean in this context? SHOULD we be generating specs in this state?

Collaborator Author:

From docs:

System Maintenance Mode is a temporary status for a computer or software system where user access is restricted or blocked to allow administrators to perform essential updates, backups, or other maintenance tasks safely. During this mode, the system often displays a message informing users that it is unavailable and will be back online once the maintenance is complete.

So I think a user should not be installing things during a maintenance period.

Collaborator Author:

#1273 drops maintenance

Member:

Let's create a PR with just these changes so that we can cut an rc.4 including them.


Comment on lines +52 to +62
# Start the dummy service to force systemd to enter a degraded state
cat <<EOF > /etc/systemd/system/dummy.service
[Unit]
Description=Dummy systemd service

[Service]
Type=oneshot
ExecStart=/usr/bin/sh -c "exit 0"
EOF

# We know the dummy service will fail, so we can ignore the error
systemctl start dummy.service 2>/dev/null || true
Member:

Should this not only be a single configuration for the test case:
"The NVIDIA Container Toolkit installs and generates a CDI spec if the system is in a degraded state"?

Collaborator Author:

I have turned this script into one that just creates a degraded state, and I run it after testing on a healthy system; this way we test the two states we support: running|degraded.

Comment on lines +36 to +37
# uninstall the nvidia-container-toolkit
apt-get remove -y nvidia-container-toolkit nvidia-container-toolkit-base libnvidia-container-tools libnvidia-container1
apt-get autoremove -y
Member:

Why is this not in an "AfterEach"?

Collaborator Author:

In this script I am uninstalling the toolkit on purpose to get systemd into a degraded state, and then re-installing the toolkit to test that we support the degraded state. Given that this test is running in a nested container, the AfterAll removes the outer container.

Comment on lines +68 to +71
if ! systemctl status nvidia-cdi-refresh.path | grep "Active: active"; then
echo "nvidia-cdi-refresh.path is not Active"
exit 1
fi
Member:

Should this be an It clause?

Collaborator Author:

Agree. I am now moving this to a standalone template and calling it from its own It block.

Comment on lines +72 to +77
if ! systemctl status nvidia-cdi-refresh.service | grep "Loaded: loaded"; then
echo "nvidia-cdi-refresh.service is not loaded"
exit 1
fi
Member:

Should this be an It clause?

Collaborator Author:

Agree. I am now moving this to a standalone template and calling it from its own It block.

Comment on lines +99 to +104
# Touch the nvidia-ctk binary to change the mtime
# This will trigger the nvidia-cdi-refresh.path unit to call the
# nvidia-cdi-refresh.service unit, simulating a change(update/downgrade) in the nvidia-ctk binary.
touch $(which nvidia-ctk)
Member:

We could also install an older version of the toolkit and test upgrades then?

Collaborator Author:

Yes, you are totally right; I have a note for this. My idea is to wait for v1.18.0 and then pin that as the old version. So for the scope of this PR, touch is enough to fake/mock a new nvidia-ctk binary.
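For context, the trigger mechanism under discussion can be sketched as a systemd path unit along these lines (illustrative; see the shipped nvidia-cdi-refresh.path for the real definition, which may watch additional paths):

```ini
# Sketch of a path unit that re-runs CDI generation when nvidia-ctk changes.
[Unit]
Description=Trigger CDI spec refresh when nvidia-ctk changes

[Path]
# PathModified fires when the watched file's contents or mtime change,
# which is why a simple `touch` is enough to trigger the service.
PathModified=/usr/bin/nvidia-ctk

[Install]
WantedBy=multi-user.target
```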

Comment on lines +349 to +353
if runtime.GOOS == "darwin" {
tmpDirPath = path.Join("/tmp", uuid.NewString())
} else {
tmpDirPath = GinkgoT().TempDir()
}
Member:

This is the wrong fix for this. What this means is that we probably want to run mktemp -d below instead.

Collaborator Author:

The issue here is that on Darwin GinkgoT().TempDir() returns a path under TMPDIR:

❯ env |grep TMP
TMPDIR=/var/folders/nk/0fkkygw917n7797_jkxbdcjc0000gp/T/

resulting in mktemp -d /var/folders/nk/0fkkygw917n7797_jkxbdcjc0000gp/T/

Comment on lines +237 to +238
// Mount the /lib/modules directory as a volume to enable the nvidia-cdi-refresh service
additionalContainerArguments = append(additionalContainerArguments, "-v /lib/modules:/lib/modules")
Member:

Question: Why is this required?

Collaborator Author:

In nvidia-cdi-refresh.service we have ExecCondition=/usr/bin/grep -qE '/nvidia.ko' /lib/modules/%v/modules.dep, meaning that if that file doesn't exist, when nvidia-cdi-refresh.path calls nvidia-cdi-refresh.service, it will simply log the call and skip execution.

var _ = Describe("nvidia-cdi-refresh", Ordered, ContinueOnFailure, Label("systemd-unit"), func() {
var (
nestedContainerRunner Runner
outerContainerImage = nodeimage.DefaultBaseImage
Member:

Although I understand the intent of this, I feel as if importing the nodeimage package and all its transitive dependencies JUST for this constant is not ideal. Let's first just hardcode the value here and then as a follow-up consider alternatives.

One option would be to use dependabot, or to try to get this by checking the dockerhub registry, for example.

Collaborator Author:

Ack, will hard-code and note a TODO.

@ArangoGutierrez (Collaborator Author):

Depends on #1273

@ArangoGutierrez ArangoGutierrez force-pushed the systemd_e2e branch 2 times, most recently from 9e9def1 to 7ac71b0 on August 27, 2025 14:25
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
3 participants