[no-relnote] implement a nestedContainerRunner for E2E test suite #1235
base: main
Conversation
Pull Request Overview
This PR implements a nested container runner for the E2E test suite, allowing tests to run containers inside other containers using either remote or local runners.
- Adds a new `nestedContainerRunner` type that wraps existing runners and executes commands inside nested containers
- Moves Docker and Container Toolkit installation templates from test files to runner.go for reusability
- Refactors the nvidia-container-cli test to use the new nested container runner instead of inline container setup
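For orientation, here is a minimal sketch of what such a wrapper looks like; the `Runner` interface and the `Run` implementation mirror the diffs quoted further down, while the package and import scaffolding is added here only for completeness:

```go
package e2e

import "fmt"

// Runner is the interface already used by the E2E suite: it runs a script and
// returns stdout, stderr, and an error.
type Runner interface {
	Run(script string) (string, string, error)
}

// nestedContainerRunner wraps an existing Runner (remote or local) and routes
// every command into a named nested container via `docker exec`.
type nestedContainerRunner struct {
	runner        Runner
	containerName string
}

// Run executes the script inside the nested container by delegating to the
// wrapped runner.
func (r nestedContainerRunner) Run(script string) (string, string, error) {
	return r.runner.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", r.containerName, script))
}
```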
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/e2e/runner.go | Implements the core nestedContainerRunner with Docker/CTK installation templates and container management |
| tests/e2e/nvidia-container-cli_test.go | Refactors test to use the new nested container runner, removing duplicate container setup code |
```go
// If a container with the same name exists from a previous test run, remove it first.
// Ignore errors as container might not exist
_, _, err := runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck
```
The error check is performed but the result is not used. The `//nolint:errcheck` comment suggests intentional error ignoring, but then the error is checked on line 155. Either remove the error check or remove the nolint comment.

```diff
-_, _, err := runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck
+_, _, err := runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName))
```
```go
output = strings.TrimSpace(output)
if output == "" {
	return nil, fmt.Errorf("no toolkit libraries found") //nolint:goerr113
```
[nitpick] The `//nolint:goerr113` comment suggests using wrapped errors is intentionally avoided, but using `errors.New()` would be more appropriate than `fmt.Errorf()` for static error messages without formatting.

```diff
-return nil, fmt.Errorf("no toolkit libraries found") //nolint:goerr113
+return nil, errors.New("no toolkit libraries found") //nolint:goerr113
```
```go
}

// install docker in the nested container
_, _, err = runner.Run(fmt.Sprintf("docker exec -u root "+containerName+" bash -c '%s'", installDockerTemplate))
```
[nitpick] String concatenation in the middle of fmt.Sprintf is inconsistent with the formatting pattern. Consider using `fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, installDockerTemplate)` for better readability.

```diff
-_, _, err = runner.Run(fmt.Sprintf("docker exec -u root "+containerName+" bash -c '%s'", installDockerTemplate))
+_, _, err = runner.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, installDockerTemplate))
```
```go
	return nil, fmt.Errorf("failed to execute installCTK template: %w", err)
}

_, _, err = runner.Run(fmt.Sprintf("docker exec -u root "+containerName+" bash -c '%s'", toolkitInstall.String()))
```
[nitpick] Same inconsistent string concatenation pattern as line 216. Consider using `fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, toolkitInstall.String())` for consistency.

```diff
-_, _, err = runner.Run(fmt.Sprintf("docker exec -u root "+containerName+" bash -c '%s'", toolkitInstall.String()))
+_, _, err = runner.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", containerName, toolkitInstall.String()))
```
```diff
@@ -131,6 +301,10 @@ func (r remoteRunner) Run(script string) (string, string, error) {
 	return stdout.String(), "", nil
 }

+func (r nestedContainerRunner) Run(script string) (string, string, error) {
+	return r.runner.Run(fmt.Sprintf("docker exec -u root "+r.containerName+" bash -c '%s'", script))
```
[nitpick] Same inconsistent string concatenation pattern. Consider using `fmt.Sprintf("docker exec -u root %s bash -c '%s'", r.containerName, script)` for consistency with the codebase formatting style.

```diff
-return r.runner.Run(fmt.Sprintf("docker exec -u root "+r.containerName+" bash -c '%s'", script))
+return r.runner.Run(fmt.Sprintf("docker exec -u root %s bash -c '%s'", r.containerName, script))
```
Pull Request Test Coverage Report for Build 17156139177

💛 - Coveralls
tests/e2e/runner.go
Outdated
```go
startTestContainerTemplate = `docker run -d --name {{.ContainerName}} --privileged --runtime=nvidia \
	-e NVIDIA_VISIBLE_DEVICES=runtime.nvidia.com/gpu=all \
	-e NVIDIA_DRIVER_CAPABILITIES=all \
	{{ range $i, $a := .AdditionalArguments -}}
	{{ $a }} \
	{{ end -}}
	ubuntu sleep infinity`
```
Is this not the only thing that is required for the nested container runner? The other scripts that have been moved here are test-specific. (The one exception is the additional arguments required to mount the host-installed libraries into the container that is started).
Not really.

The `installDockerTemplate` template is needed to install Docker inside Docker to enable nested runs. The `installCTKTemplate` template, together with the `installCTK` argument passed when calling `NewNestedContainerRunner`, will allow us to create `nestedRunners` with or without the toolkit preinstalled inside the nested environment.

That was my reasoning for moving all the templates to `runner.go`, leaving only `libnvidiaContainerCliTestTemplate` in the `nvidia-container-cli_test.go` file, as this is indeed a test-specific template.
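A rough sketch of the constructor flow described here, assuming `installCTK` is a boolean flag and that the container name and additional arguments are passed in explicitly; the exact signature in the PR may differ:

```go
// NewNestedContainerRunner starts a nested test container and returns a Runner
// that executes commands inside it. This is a sketch of the flow described
// above, not the exact implementation from the PR.
func NewNestedContainerRunner(runner Runner, installCTK bool, containerName string, additionalArguments []string) (Runner, error) {
	// 1. Render startTestContainerTemplate with ContainerName and
	//    AdditionalArguments, then start the privileged container on the host.
	// 2. Run installDockerTemplate inside the container to enable
	//    Docker-in-Docker for the nested runs.
	// 3. If installCTK is true, run installCTKTemplate so the nested
	//    environment has the NVIDIA Container Toolkit preinstalled.
	// 4. Wrap the original runner so subsequent commands execute inside the
	//    nested container.
	return nestedContainerRunner{runner: runner, containerName: containerName}, nil
}
```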
When we add logic to install `podman` in the container, would that also be implemented in `runner.go`? Also, looking forward, how do we reuse the nested runner to also run other toolkit tests?
> When we add logic to install podman in the container, would that also be implemented in the runner.go

We could modify the `newNestedRunner` to select the nested runtime.

> looking forward, how do we reuse the nested runner to also run other toolkit tests?

You mean rewriting the current tests with the nestedRunner?
I don't mean that we should rewrite the current tests with the `nestedRunner`, but that the current tests should be able to run against a `nestedRunner` already.

As a question: from the perspective of the existing toolkit tests, would we expect a difference if these are running directly on an instance (i.e. the current runner), or in a containerized OS (the nested runner)? For all tests what we do is:
- Start a test environment (an AWS instance, an outer container) with the NVIDIA Driver available
- Install the prerequisites (currently Docker, but could be extended to Podman)

At this point we currently have a difference between the two classes of tests. For one of these we use the toolkit container to install the toolkit to the host. For the second, we copy the packages out of the container and install these on the "host" (actually the outer container). Note that for the systemd tests, this is what we would be "testing".

As a general note: our motivation for adding this runner is to be able to test the generation of CDI specs by the systemd unit. Does it make sense to add that to this test as well so that the required functionality drives the types of changes we make?
OK, I just wanted a small, focused PR, but I am ok with expanding the scope of the PR. I'll work on it
As discussed offline: The intent of the "nested" Runner is to allow us to test other environments -- such as different target operating systems -- using our existing tests. With this in mind, we need to rework how we're constructing our "runner" and move it closer to being infrastructure instead of being so closely tied to each test. Ideally the same tests should be able to be run against:

Note that if the last of these can be made reliable enough, then the middle one (i.e. installing directly on an AWS instance) should not even be required.
7424fff to 5994198
tests/e2e/nvidia-cdi-refresh_test.go
Outdated
```go
)

const (
	nvidiaCdiRefreshServiceTestTemplate = `
```
Question: Is there an advantage to having this as a single test script instead of separate `It` contexts?
You are right, I have divided it into self-contained `It` contexts; now it runs as follows:

```
nvidia-cdi-refresh when installing nvidia-container-toolkit should load the nvidia-cdi-refresh.path and nvidia-cdi-refresh.service units [systemd-unit]
/Users/eduardoa/src/github/nvidia/nvidia-container-toolkit/tests/e2e/nvidia-cdi-refresh_test.go:105
• [60.608 seconds]
------------------------------
nvidia-cdi-refresh when installing nvidia-container-toolkit should generate the nvidia.yaml file [systemd-unit]
/Users/eduardoa/src/github/nvidia/nvidia-container-toolkit/tests/e2e/nvidia-cdi-refresh_test.go:110
• [2.388 seconds]
------------------------------
nvidia-cdi-refresh when installing nvidia-container-toolkit should refresh the nvidia.yaml file after upgrading the nvidia-container-toolkit [systemd-unit]
/Users/eduardoa/src/github/nvidia/nvidia-container-toolkit/tests/e2e/nvidia-cdi-refresh_test.go:115
• [8.038 seconds]
```
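Sketched in Ginkgo terms, and based only on the spec names above, the split could look roughly like this; the inline scripts and the CDI spec path are assumptions for illustration, not the exact test code:

```go
package e2e

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = Describe("nvidia-cdi-refresh", Ordered, Label("systemd-unit"), func() {
	var nestedContainerRunner Runner // set up in a BeforeAll in the real test

	When("installing nvidia-container-toolkit", func() {
		It("should load the nvidia-cdi-refresh.path and nvidia-cdi-refresh.service units", func() {
			_, _, err := nestedContainerRunner.Run("systemctl status nvidia-cdi-refresh.path | grep \"Active: active\"")
			Expect(err).ToNot(HaveOccurred())
		})

		It("should generate the nvidia.yaml file", func() {
			// The spec path is an assumption; adjust to wherever the service writes it.
			_, _, err := nestedContainerRunner.Run("test -f /var/run/cdi/nvidia.yaml")
			Expect(err).ToNot(HaveOccurred())
		})

		It("should refresh the nvidia.yaml file after upgrading the nvidia-container-toolkit", func() {
			// Touching the binary changes its mtime and triggers the path unit,
			// simulating an upgrade of nvidia-ctk.
			_, _, err := nestedContainerRunner.Run("touch $(which nvidia-ctk)")
			Expect(err).ToNot(HaveOccurred())
		})
	})
})
```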
5994198 to fcce5bd

fcce5bd to 90362f6
```sh
if command -v systemctl >/dev/null 2>&1; then
  SYSTEMD_STATE=$(systemctl is-system-running 2>/dev/null || true)
  case "$SYSTEMD_STATE" in
    running|degraded|maintenance)
```
Question: What does `maintenance` mean in this context? SHOULD we be generating specs in this state?
From docs:

> System Maintenance Mode is a temporary status for a computer or software system where user access is restricted or blocked to allow administrators to perform essential updates, backups, or other maintenance tasks safely. During this mode, the system often displays a message informing users that it is unavailable and will be back online once the maintenance is complete.

So I think a user should not be installing things during a `maintenance` period.
#1273 drops maintenance
Let's create a PR with just these changes so that we can cut an rc.4 including them.
```sh
# Start the dummy service to force systemd to enter a degraded state
cat <<EOF > /etc/systemd/system/dummy.service
[Unit]
Description=Dummy systemd service

[Service]
Type=oneshot
ExecStart=/usr/bin/sh -c "exit 0"
EOF

# We know the dummy service will fail, so we can ignore the error
systemctl start dummy.service 2>/dev/null || true
```
Should this not only be a single configuration for the test case: "The NVIDIA Container Toolkit installs and generates a CDI spec if the system is in a degraded state"?
I have turned this script into one that just creates a degraded state, and I run it after testing on a healthy system; this way we test the two states we support: `running|degraded`.
```sh
# uninstall the nvidia-container-toolkit
apt-get remove -y nvidia-container-toolkit nvidia-container-toolkit-base libnvidia-container-tools libnvidia-container1
apt-get autoremove -y
```
Why is this not in an "AfterEach"?
In this script, I am uninstalling the toolkit on purpose to then get `systemd` into a degraded state, and then re-install the toolkit to test that we support the degraded state. Given that this test is running in a `nestedContainer`, the `AfterAll` removes the outer container.
```sh
if ! systemctl status nvidia-cdi-refresh.path | grep "Active: active"; then
  echo "nvidia-cdi-refresh.path is not Active"
  exit 1
fi
```
Should this be an `It` clause?
Agree, I am now moving this to a standalone template and calling it from its own `It` block.
```sh
if ! systemctl status nvidia-cdi-refresh.service | grep "Loaded: loaded"; then
  echo "nvidia-cdi-refresh.service is not loaded"
  exit 1
fi
```
Should this be an `It` clause?
Agree, I am now moving this to a standalone template and calling it from its own `It` block.
```sh
# Touch the nvidia-ctk binary to change the mtime
# This will trigger the nvidia-cdi-refresh.path unit to call the
# nvidia-cdi-refresh.service unit, simulating a change (update/downgrade) in the nvidia-ctk binary.
touch $(which nvidia-ctk)
```
We could also install an older version of the toolkit and test upgrades then?
Yes, you are totally right, I have a note for this; my idea is to wait for `v1.18.0` and then pin that as the `old` version. So for the scope of this PR, `touch` is good enough to fake/mock a new `nvidia-ctk` binary.
```go
if runtime.GOOS == "darwin" {
	tmpDirPath = path.Join("/tmp", uuid.NewString())
} else {
	tmpDirPath = GinkgoT().TempDir()
}
```
This is the wrong fix for this. What this means is that we probably want to run `mktemp -d` below instead.
The issue here is that on Darwin `GinkgoT().TempDir()` returns a path under the host's `TMPDIR`:

```
❯ env | grep TMP
TMPDIR=/var/folders/nk/0fkkygw917n7797_jkxbdcjc0000gp/T/
```

resulting in `mktemp -d /var/folders/nk/0fkkygw917n7797_jkxbdcjc0000gp/T/`
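A minimal sketch of the `mktemp -d` alternative suggested above, assuming the suite's `runner` variable and the `Run` signature from this PR; creating the directory through the runner keeps the path valid on the machine where the scripts actually execute:

```go
// Create the temporary directory on the target via the runner instead of
// deriving it from the host's TMPDIR with GinkgoT().TempDir().
out, _, err := runner.Run("mktemp -d")
Expect(err).ToNot(HaveOccurred())
// Strip the trailing newline from the command output.
tmpDirPath := strings.TrimSpace(out)
```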
```go
// Mount the /lib/modules directory as a volume to enable the nvidia-cdi-refresh service
additionalContainerArguments = append(additionalContainerArguments, "-v /lib/modules:/lib/modules")
```
Question: Why is this required?
In `nvidia-cdi-refresh.service`, we have `ExecCondition=/usr/bin/grep -qE '/nvidia.ko' /lib/modules/%v/modules.dep`, meaning that if that file doesn't exist, when the `nvidia-cdi-refresh.path` calls the `nvidia-cdi-refresh.service`, it will simply log the call and skip execution.
tests/e2e/nvidia-cdi-refresh_test.go
Outdated
```go
var _ = Describe("nvidia-cdi-refresh", Ordered, ContinueOnFailure, Label("systemd-unit"), func() {
	var (
		nestedContainerRunner Runner
		outerContainerImage   = nodeimage.DefaultBaseImage
```
Although I understand the intent of this, I feel as if importing the `nodeimage` package and all its transitive dependencies JUST for this constant is not ideal. Let's first just hardcode the value here and then as a follow-up consider alternatives.

One option would be to use dependabot, or to try to get this by checking the dockerhub registry, for example.
ack, will hard code and note a TODO
90362f6 to 8052564
Depends on #1273
9e9def1 to 7ac71b0
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
7ac71b0 to 44fd1f4
This pull request refactors the test setup for the `nvidia-container-cli` end-to-end tests by introducing a new abstraction for running commands inside nested containers. The main logic for preparing and launching test containers, installing Docker and NVIDIA toolkit components, and executing test scripts has been moved from the test file to a new `nestedContainerRunner` type in `runner.go`. This change simplifies the test code and improves maintainability by encapsulating container setup and teardown logic.

The `nestedContainerRunner` also allows us to introduce a new E2E test for the `nvidia-cdi-refresh` systemd unit.

These changes make the end-to-end tests easier to read and maintain, and centralize the logic for running nested containers.
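As an illustration only, reusing the hypothetical constructor signature sketched earlier on this page (the container name here is made up), a test could obtain and use a nested runner roughly like this:

```go
// Wrap the suite's existing runner so every subsequent Run call executes
// inside the nested container; passing true preinstalls the toolkit there.
nested, err := NewNestedContainerRunner(runner, true, "e2e-nested-container", nil)
Expect(err).ToNot(HaveOccurred())

// Commands now run inside the nested container via `docker exec`.
out, _, err := nested.Run("nvidia-ctk --version")
Expect(err).ToNot(HaveOccurred())
GinkgoWriter.Println(out)
```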