[no-relnote] Add E2E for libnvidia-container #1118
Pull Request Test Coverage Report for Build 16348776804 (Coveralls)
Tests pass, PR ready for review @elezar
docker run -d --name test-nvidia-container-cli \
  --privileged \
  --runtime=nvidia \
It is not scalable to have to mount everything into this container. Note that when we still had some simple integration tests in the toolkit, we used:
testing::docker::dind::setup() {
Can we adapt this instead?
Sure, will work on that. We need to make this test more robust and scalable.
tests/e2e/installer.go
Outdated
# Create a temporary directory
TEMP_DIR="/tmp/ctk_e2e.$(date +%s)_$RANDOM"
mkdir -p "$TEMP_DIR"
: ${IMAGE:={{.Image}}}
Nit: Why did we swap the ordering of these?
As a general note on the scripts: why are we using envvars at all instead of just using {{.Image}} everywhere? Is there a case where IMAGE is already set to something else?
switched back
var (
	runner      Runner
	testScript  = "/tmp/libnvidia-container-cli.sh"
	dockerImage = "ghcr.io/nvidia/container-toolkit:5e8c1411-ubuntu20.04"
Why are we hardcoding this?
Also, does this break once we switch to distroless?
The hardcoded image was a mistake; I introduced it as an experiment while iterating.
On distroless, it should work because we set --entrypoint /libnvidia-container-cli.sh, so as long as the distroless image supports /usr/bin/env bash scripts, this test should work.
I have now adjusted this for the distroless image.
imageName = getRequiredEnvvar[string]("E2E_IMAGE_NAME")
imageTag = getRequiredEnvvar[string]("E2E_IMAGE_TAG")
Is there a reason we removed the conditional?
Yes. Regardless of whether we install the toolkit on the host, I want to be able to read these two variables.
@@ -28,11 +28,20 @@ var dockerInstallTemplate = `
#! /usr/bin/env bash
set -xe
: ${IMAGE:={{.Image}}}
# if the TEMP_DIR is already set, use it
if [ -f /tmp/ctk_e2e_temp_dir.txt ]; then
Why are we managing this state at the shell-script level? Should we not create the temp dir once in the Go code and then use it here in our template?
// script are therefore a good indicator of whether the NVIDIA Container
// Toolkit is functioning correctly inside the container.
This is not what we're testing. We're testing the nvidia-container-cli specifically.
apt-get update -y && apt-get install -y curl gnupg2
WORKDIR="$(mktemp -d)"
ROOTFS="${WORKDIR}/rootfs"
Why do we need two directories? What about:
- ROOTFS="${WORKDIR}/rootfs"
+ ROOTFS="$(mktemp -d)/rootfs"
agreed, added
var _ = Describe("nvidia-container-cli", Ordered, ContinueOnFailure, func() {
	var (
		runner     Runner
		testScript = "/tmp/libnvidia-container-cli.sh"
Are we guaranteed a single script across all tests?
testScript    = "/tmp/libnvidia-container-cli.sh"
dockerImage   = "ghcr.io/nvidia/container-toolkit:5e8c1411-ubuntu20.04"
containerName = "nvidia-cli-e2e"
dockerRunCmd  string
Is a variable at this scope required?
createScriptCmd := fmt.Sprintf(
	"cat > %s <<'EOF'\n%s\nEOF\nchmod +x %s",
	testScript, libnvidiaContainerCliTestTemplate, testScript,
)

_, _, err := runner.Run(createScriptCmd)
Expect(err).ToNot(HaveOccurred())
Should this not be in a BeforeAll?
No. As this comment made me realize (#1118 (comment) and #1118 (comment)), this test script is for this test case only. Yes, right now it is the only test under this Describe, but I prefer to keep it local to the It so we are clear to add further test cases.
// If a container with the same name exists from a previous test run, remove it first.
runner.Run(fmt.Sprintf("docker rm -f %s", containerName))

// Build the docker run command (detached mode) from the template so it
// stays readable while still resulting in a single-line invocation.
dockerRunCmd = fmt.Sprintf(dockerRunCmdTemplate, containerName, testScript, dockerImage)

// Launch the container in detached mode.
_, _, err = runner.Run(dockerRunCmd)
Expect(err).ToNot(HaveOccurred())
Should this also be BeforeAll?
same reason as in #1118 (comment)
hostOutput, _, err := runner.Run("nvidia-smi -L")
Expect(err).ToNot(HaveOccurred())

hostOutput = strings.TrimSpace(strings.ReplaceAll(hostOutput, "\r", ""))
Why are we replacing returns with "" and not " "?
We replace \r with the empty string to obtain a whitespace-minimal representation of each line before comparison; inserting a space would alter the content rather than merely normalizing it.
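The difference between the two replacements can be seen in a small standalone sketch. The normalize helper name and the sample strings are illustrative only; the expression inside it is the one from the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize strips carriage returns and surrounding whitespace, mirroring
// the expression used in the test under review.
func normalize(s string) string {
	return strings.TrimSpace(strings.ReplaceAll(s, "\r", ""))
}

func main() {
	host := "line one\nline two\n"     // output with LF line endings
	logs := "line one\r\nline two\r\n" // same content with CRLF line endings

	// Dropping \r makes both representations identical.
	fmt.Println(normalize(host) == normalize(logs)) // true

	// Replacing \r with a space leaves a trailing space on inner lines,
	// which TrimSpace cannot remove, so the comparison fails.
	withSpace := strings.TrimSpace(strings.ReplaceAll(logs, "\r", " "))
	fmt.Println(withSpace == normalize(host)) // false
}
```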
// Poll the logs of the already running container until we observe
// the GPU list matching the host or until a 5-minute timeout elapses.
Eventually(func() string {
	logs, _, err := runner.Run(fmt.Sprintf("docker logs %s", containerName))
If we're just looking for the last log line, does docker logs --tail 1 work?
I was not aware of the flag, thanks!
Pull Request Overview
Adds an end-to-end test for the nvidia-container-cli to catch regressions in the libnvidia-container library.
- Introduces a Ginkgo-driven E2E test that mounts a minimal Ubuntu rootfs and runs nvidia-smi inside a container.
- Updates the CTK installer script to persist and reuse a temp directory across runs.
- Simplifies environment variable loading and adds a focus filter in the Makefile for selective test runs.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
tests/e2e/nvidia-container-cli_test.go | Adds a Ginkgo test that builds a minimal rootfs, configures it with nvidia-container-cli, and verifies GPU visibility. |
tests/e2e/installer.go | Modifies the CTK installation template to read/write a persistent temp directory file. |
tests/e2e/e2e_test.go | Removes conditional guards around image name/tag env vars so they’re always loaded. |
tests/e2e/Makefile | Introduces GINKGO_FOCUS for targeted test execution and updates the ginkgo invocation. |
Comments suppressed due to low confidence (2)
tests/e2e/Makefile:23
- [nitpick] The comment says all tests run when GINKGO_FOCUS is unset, but the default is nvidia-container-cli; update the comment to match or adjust the default to actually run all tests.
# If GINKGO_FOCUS is not set, run all tests
tests/e2e/nvidia-container-cli_test.go:37
- The apt-get update command doesn’t accept -y and will error; remove the -y flag so it runs successfully, e.g., apt-get update && apt-get install -y curl gnupg2.
apt-get update -y && apt-get install -y curl gnupg2
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
This patch adds an E2E test for the nvidia-container-cli that will allow us to catch regressions in libnvidia-container.