Conversation

@ArangoGutierrez (Collaborator) commented Aug 8, 2025

This pull request refactors the test setup for nvidia-container-cli end-to-end tests by introducing a new abstraction for running commands inside nested containers. The main logic for preparing and launching test containers, installing Docker and NVIDIA toolkit components, and executing test scripts has been moved from the test file to a new nestedContainerRunner type in runner.go. This change simplifies the test code and improves maintainability by encapsulating container setup and teardown logic.

The nestedContainerRunner also allows us to introduce a new E2E test for the nvidia-cdi-refresh systemd unit.

These changes make the end-to-end tests easier to read and maintain, and centralize the logic for running nested containers.

@ArangoGutierrez ArangoGutierrez self-assigned this Aug 8, 2025
@Copilot (Copilot AI) left a comment
Pull Request Overview

This PR implements a nested container runner for the E2E test suite, allowing tests to run containers inside other containers using either remote or local runners.

  • Adds a new nestedContainerRunner type that wraps existing runners and executes commands inside nested containers
  • Moves Docker and Container Toolkit installation templates from test files to runner.go for reusability
  • Refactors the nvidia-container-cli test to use the new nested container runner instead of inline container setup

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
tests/e2e/runner.go Implements the core nestedContainerRunner with Docker/CTK installation templates and container management
tests/e2e/nvidia-container-cli_test.go Refactors test to use the new nested container runner, removing duplicate container setup code


// If a container with the same name exists from a previous test run, remove it first.
// Ignore errors as container might not exist
_, _, err := runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck
Copilot AI commented Aug 8, 2025
The error check is performed but the result is not used. The //nolint:errcheck comment suggests intentional error ignoring, but then the error is checked on line 155. Either remove the error check or remove the nolint comment.

Suggested change
_, _, err := runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck
_, _, err := runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName))



output = strings.TrimSpace(output)
if output == "" {
return nil, fmt.Errorf("no toolkit libraries found") //nolint:goerr113
Copilot AI commented Aug 8, 2025
[nitpick] The //nolint:goerr113 comment suggests using wrapped errors is intentionally avoided, but using errors.New() would be more appropriate than fmt.Errorf() for static error messages without formatting.

Suggested change
return nil, fmt.Errorf("no toolkit libraries found") //nolint:goerr113
return nil, errors.New("no toolkit libraries found") //nolint:goerr113


}

// install docker in the nested container
_, _, err = runner.Run(fmt.Sprintf("docker exec -u root "+containerName+" bash -c '%s'", installDockerTemplate))
Copilot AI commented Aug 8, 2025
[nitpick] String concatenation in the middle of fmt.Sprintf is inconsistent with the formatting pattern. Consider using fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, installDockerTemplate) for better readability.

Suggested change
_, _, err = runner.Run(fmt.Sprintf("docker exec -u root "+containerName+" bash -c '%s'", installDockerTemplate))
_, _, err = runner.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, installDockerTemplate))


return nil, fmt.Errorf("failed to execute installCTK template: %w", err)
}

_, _, err = runner.Run(fmt.Sprintf("docker exec -u root "+containerName+" bash -c '%s'", toolkitInstall.String()))
Copilot AI commented Aug 8, 2025
[nitpick] Same inconsistent string concatenation pattern as line 216. Consider using fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, toolkitInstall.String()) for consistency.

Suggested change
_, _, err = runner.Run(fmt.Sprintf("docker exec -u root "+containerName+" bash -c '%s'", toolkitInstall.String()))
_, _, err = runner.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, toolkitInstall.String()))


@@ -131,6 +301,10 @@ func (r remoteRunner) Run(script string) (string, string, error) {
return stdout.String(), "", nil
}

func (r nestedContainerRunner) Run(script string) (string, string, error) {
return r.runner.Run(fmt.Sprintf("docker exec -u root "+r.containerName+" bash -c '%s'", script))
Copilot AI commented Aug 8, 2025
[nitpick] Same inconsistent string concatenation pattern. Consider using fmt.Sprintf("docker exec -u root %s bash -c '%s'", r.containerName, script) for consistency with the codebase formatting style.

Suggested change
return r.runner.Run(fmt.Sprintf("docker exec -u root "+r.containerName+" bash -c '%s'", script))
return r.runner.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", r.containerName, script))


@coveralls commented Aug 8, 2025

Pull Request Test Coverage Report for Build 17156139177

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 35.749%

Totals Coverage Status
Change from base Build 17110221655: 0.0%
Covered Lines: 4657
Relevant Lines: 13027

💛 - Coveralls

Comment on lines 32 to 38
startTestContainerTemplate = `docker run -d --name {{.ContainerName}} --privileged --runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=runtime.nvidia.com/gpu=all \
-e NVIDIA_DRIVER_CAPABILITIES=all \
{{ range $i, $a := .AdditionalArguments -}}
{{ $a }} \
{{ end -}}
ubuntu sleep infinity`
Member:

Is this not the only thing that is required for the nested container runner? The other scripts that have been moved here are test-specific. (The one exception is the additional arguments required to mount the host installed libraries into the container that is started).

Collaborator Author:

Not really.
The installDockerTemplate template is needed to install Docker inside Docker to enable nested runs.
The installCTKTemplate template, together with the installCTK argument to NewNestedContainerRunner, allows us to create nested runners with or without the toolkit preinstalled inside the nested environment.

That was my reasoning for moving all the templates to runner.go, leaving only libnvidiaContainerCliTestTemplate in nvidia-container-cli_test.go, as that one is indeed a test-specific template.

Member:

When we add logic to install podman in the container, would that also be implemented in runner.go? Also, looking forward, how do we reuse the nested runner to also run other toolkit tests?

Collaborator Author:

> When we add logic to install podman in the container, would that also be implemented in runner.go?

We could modify the newNestedRunner to select the nested runtime.

> looking forward, how do we reuse the nested runner to also run other toolkit tests?

You mean rewriting the current tests with the nestedRunner?

Member:

I don't mean that we should rewrite the current tests with the nestedRunner, but that the current tests should already be able to run against a nestedRunner.

As a question: from the perspective of the existing toolkit tests, would we expect a difference if these are running directly on an instance (i.e. the current runner) or in a containerized OS (the nested runner)? For all tests what we do is:

  • Start a test environment (an AWS instance, an outer container) with the NVIDIA Driver available
  • Install the prerequisites (currently Docker, but could be extended to Podman)

At this point we currently have a difference between the two classes of tests. For one of these we use the toolkit container to install the toolkit to the host. For the second, we copy the packages out of the container and install these on the "host" (actually the outer container). Note that for the systemd tests, this is what we would be "testing".

@ArangoGutierrez ArangoGutierrez requested a review from elezar August 11, 2025 13:31
@elezar (Member) commented Aug 12, 2025

As a general note: our motivation for adding this runner is to be able to test the generation of CDI specs by the systemd unit. Does it make sense to add that to this test as well so that the required functionality drives the types of changes we make?

@ArangoGutierrez (Collaborator Author):

> As a general note: our motivation for adding this runner is to be able to test the generation of CDI specs by the systemd unit. Does it make sense to add that to this test as well so that the required functionality drives the types of changes we make?

OK, I just wanted a small, focused PR, but I am OK with expanding the scope. I'll work on it.

@elezar (Member) commented Aug 19, 2025

As discussed offline: the intent of the "nested" Runner is to allow us to test other environments -- such as different target operating systems -- using our existing tests. With this in mind, we need to rework how we're constructing our "runner" and move it closer to being infrastructure instead of being so closely tied to each test.

Ideally the same tests should be able to be run against:

  • The version of the toolkit already installed on the system
  • The version under test in CI installed on an AWS instance
  • The version under test installed in an "outer" container with a specific distribution

Note that if the last of these can be made reliable enough, then the middle one (i.e. installing directly on an AWS instance) should not even be required.

)

const (
nvidiaCdiRefreshServiceTestTemplate = `
Member:

Question: Is there an advantage to having this as a single test script instead of separate It contexts?

Collaborator Author:

You are right. I have divided it into self-contained It contexts; now it runs as follows:

nvidia-cdi-refresh when installing nvidia-container-toolkit should load the nvidia-cdi-refresh.path and nvidia-cdi-refresh.service units [systemd-unit]
/Users/eduardoa/src/github/nvidia/nvidia-container-toolkit/tests/e2e/nvidia-cdi-refresh_test.go:105
• [60.608 seconds]
------------------------------
nvidia-cdi-refresh when installing nvidia-container-toolkit should generate the nvidia.yaml file [systemd-unit]
/Users/eduardoa/src/github/nvidia/nvidia-container-toolkit/tests/e2e/nvidia-cdi-refresh_test.go:110
• [2.388 seconds]
------------------------------
nvidia-cdi-refresh when installing nvidia-container-toolkit should refresh the nvidia.yaml file after upgrading the nvidia-container-toolkit [systemd-unit]
/Users/eduardoa/src/github/nvidia/nvidia-container-toolkit/tests/e2e/nvidia-cdi-refresh_test.go:115
• [8.038 seconds]

if command -v systemctl >/dev/null 2>&1; then
SYSTEMD_STATE=$(systemctl is-system-running 2>/dev/null || true)
case "$SYSTEMD_STATE" in
running|degraded|maintenance)
Member:

Question: What does maintenance mean in this context? SHOULD we be generating specs in this state?

Collaborator Author:

From docs:

System Maintenance Mode is a temporary status for a computer or software system where user access is restricted or blocked to allow administrators to perform essential updates, backups, or other maintenance tasks safely. During this mode, the system often displays a message informing users that it is unavailable and will be back online once the maintenance is complete.

So I think a user should not be installing things during a maintenance period.

Collaborator Author:

#1273 drops maintenance

Member:

Let's create a PR with just these changes so that we can cut an rc.4 including them.


Comment on lines +52 to +62
# Start the dummy service to force systemd to enter a degraded state
cat <<EOF > /etc/systemd/system/dummy.service
[Unit]
Description=Dummy systemd service

[Service]
Type=oneshot
ExecStart=/usr/bin/sh -c "exit 0"
EOF

# We know the dummy service will fail, so we can ignore the error
systemctl start dummy.service 2>/dev/null || true
Member:

Should this not only be a single configuration for the test case:
"The NVIDIA Container Toolkit installs and generates a CDI spec if the system is in a degraded state"?

Collaborator Author:

I have turned this script into one that just creates a degraded state, and I run it after testing on a healthy system; this way we test the two states we support: running|degraded.

Comment on lines +36 to +37
# uninstall the nvidia-container-toolkit
apt-get remove -y nvidia-container-toolkit nvidia-container-toolkit-base libnvidia-container-tools libnvidia-container1
apt-get autoremove -y
Member:

Why is this not in an "AfterEach"?

Collaborator Author:

In this script I am uninstalling the toolkit on purpose to get systemd into a degraded state, and then re-installing the toolkit to test that we support the degraded state. Given that this test is running in a nested container, the AfterAll removes the outer container.

Comment on lines +68 to +71
if ! systemctl status nvidia-cdi-refresh.path | grep "Active: active"; then
echo "nvidia-cdi-refresh.path is not Active"
exit 1
fi
Member:

Should this be an It clause?

Collaborator Author:

Agree. I am now moving this to a standalone template and calling it from its own It block.

Comment on lines +72 to +77
if ! systemctl status nvidia-cdi-refresh.service | grep "Loaded: loaded"; then
echo "nvidia-cdi-refresh.service is not loaded"
exit 1
fi
Member:

Should this be an It clause?

Collaborator Author:

Agree. I am now moving this to a standalone template and calling it from its own It block.

Comment on lines +99 to +104
# Touch the nvidia-ctk binary to change the mtime
# This will trigger the nvidia-cdi-refresh.path unit to call the
# nvidia-cdi-refresh.service unit, simulating a change(update/downgrade) in the nvidia-ctk binary.
touch $(which nvidia-ctk)
Member:

We could also install an older version of the toolkit and test upgrades then?

Collaborator Author:

Yes, you are totally right; I have a note for this. My idea is to wait for v1.18.0 and then pin that as the old version. So for the scope of this PR, touch is enough to fake/mock a new nvidia-ctk binary.
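For context, the trigger mechanism under discussion can be sketched as a systemd path unit along these lines (illustrative; see the shipped nvidia-cdi-refresh.path for the real definition, which may watch additional paths):

```ini
# Sketch of a path unit that re-runs CDI generation when nvidia-ctk changes.
[Unit]
Description=Trigger CDI spec refresh when nvidia-ctk changes

[Path]
# PathModified fires when the watched file's contents or mtime change,
# which is why a simple `touch` is enough to trigger the service.
PathModified=/usr/bin/nvidia-ctk

[Install]
WantedBy=multi-user.target
```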

Comment on lines +349 to +353
if runtime.GOOS == "darwin" {
tmpDirPath = path.Join("/tmp", uuid.NewString())
} else {
tmpDirPath = GinkgoT().TempDir()
}
Member:

This is the wrong fix for this. What this means is that we probably want to run mktemp -d below instead.

Collaborator Author:

The issue here is that on Darwin GinkgoT().TempDir() returns a path under TMPDIR:

❯ env |grep TMP
TMPDIR=/var/folders/nk/0fkkygw917n7797_jkxbdcjc0000gp/T/

resulting in mktemp -d /var/folders/nk/0fkkygw917n7797_jkxbdcjc0000gp/T/

Comment on lines +237 to +238
// Mount the /lib/modules directory as a volume to enable the nvidia-cdi-refresh service
additionalContainerArguments = append(additionalContainerArguments, "-v /lib/modules:/lib/modules")
Member:

Question: Why is this required?

Collaborator Author:

In nvidia-cdi-refresh.service we have ExecCondition=/usr/bin/grep -qE '/nvidia.ko' /lib/modules/%v/modules.dep, meaning that if that file doesn't exist, when nvidia-cdi-refresh.path calls nvidia-cdi-refresh.service, it will simply log the call and skip execution.

var _ = Describe("nvidia-cdi-refresh", Ordered, ContinueOnFailure, Label("systemd-unit"), func() {
var (
nestedContainerRunner Runner
outerContainerImage = nodeimage.DefaultBaseImage
Member:

Although I understand the intent of this, I feel as if importing the nodeimage package and all its transitive dependencies JUST for this constant is not ideal. Let's first just hardcode the value here and then as a follow-up consider alternatives.

One option would be to use dependabot, or to try to get this by checking the dockerhub registry, for example.

Collaborator Author:

Ack, will hard-code and note a TODO.

@ArangoGutierrez (Collaborator Author):

Depends on #1273

@ArangoGutierrez ArangoGutierrez force-pushed the systemd_e2e branch 2 times, most recently from 9e9def1 to 7ac71b0 on August 27, 2025 14:25
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
3 participants