[no-relnote] Add E2E for libnvidia-container #1118

Open
wants to merge 1 commit into
base: main

Collaborator

@ArangoGutierrez ArangoGutierrez commented May 30, 2025

This patch adds an E2E test for the nvidia-container-cli that will allow us to catch regressions in libnvidia-container.

@ArangoGutierrez ArangoGutierrez requested review from elezar and Copilot May 30, 2025 14:35
@ArangoGutierrez ArangoGutierrez self-assigned this May 30, 2025
Copilot

This comment was marked as outdated.

coveralls commented May 30, 2025

Pull Request Test Coverage Report for Build 16348776804

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 35.012%

Totals Coverage Status
Change from base Build 16219798607: 0.0%
Covered Lines: 4442
Relevant Lines: 12687

💛 - Coveralls

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 67cc2ec to 903737e Compare May 30, 2025 15:32
@ArangoGutierrez ArangoGutierrez requested a review from elezar May 30, 2025 15:32
@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch 5 times, most recently from d0a338e to 8c42c14 Compare June 3, 2025 18:51
@ArangoGutierrez
Collaborator Author

Tests pass, PR ready for review @elezar

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 8c42c14 to d905a49 Compare June 4, 2025 08:13
Comment on lines 31 to 33
docker run -d --name test-nvidia-container-cli \
--privileged \
--runtime=nvidia \
Member

It is not scalable to have to MOUNT everything into this container. Note that when we still had some simple integration tests in the toolkit we used

testing::docker::dind::setup() {

Can we rather adapt this?

Collaborator Author

Sure, will work on that. We need to make this test more robust and scalable.

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from d905a49 to 9674787 Compare June 5, 2025 16:18
@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 9674787 to 1a72738 Compare July 16, 2025 14:03
@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 1a72738 to f84c038 Compare July 16, 2025 15:27
# Create a temporary directory
TEMP_DIR="/tmp/ctk_e2e.$(date +%s)_$RANDOM"
mkdir -p "$TEMP_DIR"
: ${IMAGE:={{.Image}}}
Member

Nit: Why did we swap the ordering of these?

Member

As a general note on the scripts: why are we using envvars at all instead of just using {{.Image}} everywhere? Is there a case where IMAGE is already set to something else?

Collaborator Author

switched back

var (
runner Runner
testScript = "/tmp/libnvidia-container-cli.sh"
dockerImage = "ghcr.io/nvidia/container-toolkit:5e8c1411-ubuntu20.04"
Member

Why are we hardcoding this?

Member

Also, does this break once we switch to distroless?

Collaborator Author

The hardcoded image was a mistake; it was an experimental feature introduced by me to iterate.

On distroless, it should work because we set --entrypoint /libnvidia-container-cli.sh, so as long as the distroless image supports /usr/bin/env bash scripts, this test should work.

Collaborator Author

I have now edited to adjust for the distroless image

Comment on lines +65 to +66
imageName = getRequiredEnvvar[string]("E2E_IMAGE_NAME")
imageTag = getRequiredEnvvar[string]("E2E_IMAGE_TAG")
Member

Is there a reason we removed the conditional?

Collaborator Author

Yes. Regardless of whether we install the toolkit on the host, I want to be able to read these two variables.

@@ -28,11 +28,20 @@ var dockerInstallTemplate = `
#! /usr/bin/env bash
set -xe
: ${IMAGE:={{.Image}}}
# if the TEMP_DIR is already set, use it
if [ -f /tmp/ctk_e2e_temp_dir.txt ]; then
Member

Why are we managing this state at a shell-script level? Should we not create the temp dir once in the go code and then use it here in our template?

Comment on lines 32 to 33
// script are therefore a good indicator of whether the NVIDIA Container
// Toolkit is functioning correctly inside the container.
Member

This is not what we're testing. We're testing the nvidia-container-cli specifically.

apt-get update -y && apt-get install -y curl gnupg2
WORKDIR="$(mktemp -d)"
ROOTFS="${WORKDIR}/rootfs"
Member

Why do we need two directories? What about:

Suggested change
ROOTFS="${WORKDIR}/rootfs"
ROOTFS="$(mktemp -d)/rootfs"

Collaborator Author

agreed, added

var _ = Describe("nvidia-container-cli", Ordered, ContinueOnFailure, func() {
var (
runner Runner
testScript = "/tmp/libnvidia-container-cli.sh"
Member

Are we guaranteed a single script across all tests?

testScript = "/tmp/libnvidia-container-cli.sh"
dockerImage = "ghcr.io/nvidia/container-toolkit:5e8c1411-ubuntu20.04"
containerName = "nvidia-cli-e2e"
dockerRunCmd string
Member

Is a variable at this scope required?

Comment on lines 125 to 134
createScriptCmd := fmt.Sprintf(
"cat > %s <<'EOF'\n%s\nEOF\nchmod +x %s",
testScript, libnvidiaContainerCliTestTemplate, testScript,
)

_, _, err := runner.Run(createScriptCmd)
Expect(err).ToNot(HaveOccurred())
Member

Should this not be in a BeforeAll?

Collaborator Author

@ArangoGutierrez ArangoGutierrez Jul 17, 2025

No. As this comment made me realize (#1118 (comment) and #1118 (comment)), this test script is for this test case only. Yes, right now it is the only test under this Describe, but I prefer to keep it local to the It so that we are free to add further test cases.

Comment on lines 133 to 146
// If a container with the same name exists from a previous test run, remove it first.
runner.Run(fmt.Sprintf("docker rm -f %s", containerName))

// Build the docker run command (detached mode) from the template so it
// stays readable while still resulting in a single-line invocation.
dockerRunCmd = fmt.Sprintf(dockerRunCmdTemplate, containerName, testScript, dockerImage)

// Launch the container in detached mode.
_, _, err = runner.Run(dockerRunCmd)
Expect(err).ToNot(HaveOccurred())
Member

Should this also be BeforeAll?

Collaborator Author

same reason as in #1118 (comment)

hostOutput, _, err := runner.Run("nvidia-smi -L")
Expect(err).ToNot(HaveOccurred())

hostOutput = strings.TrimSpace(strings.ReplaceAll(hostOutput, "\r", ""))
Member

Why are we replacing returns with "" and not " "?

Collaborator Author

We replace \r with the empty string to obtain a whitespace-minimal representation of each line before comparison; inserting a space would alter the content rather than merely normalizing it.
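
The difference between the two replacements can be seen in a small standalone sketch. `normalizeOutput` is a hypothetical name for the expression used in the test, and the GPU strings are made-up sample data rather than real `nvidia-smi -L` output.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeOutput drops carriage returns and trims surrounding whitespace,
// so "\r\n"-terminated output compares equal to "\n"-terminated output.
func normalizeOutput(s string) string {
	return strings.TrimSpace(strings.ReplaceAll(s, "\r", ""))
}

func main() {
	// Made-up sample data: the same list with different line endings.
	host := "GPU 0: Example GPU\r\nGPU 1: Example GPU\r\n"
	container := "GPU 0: Example GPU\nGPU 1: Example GPU\n"

	// Removing "\r" makes both captures identical.
	fmt.Println(normalizeOutput(host) == normalizeOutput(container)) // prints true

	// Replacing "\r" with a space leaves a stray space before each "\n",
	// so interior lines would no longer match.
	fmt.Printf("%q\n", strings.ReplaceAll("GPU 0\r\nGPU 1\r\n", "\r", " "))
	// prints "GPU 0 \nGPU 1 \n"
}
```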

// Poll the logs of the already running container until we observe
// the GPU list matching the host or until a 5-minute timeout elapses.
Eventually(func() string {
logs, _, err := runner.Run(fmt.Sprintf("docker logs %s", containerName))
Member

@elezar elezar Jul 17, 2025

If we're just looking for the last log line, does docker logs --tail 1 work?

Collaborator Author

I was not aware of the flag, thanks!
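
A rough sketch of what the suggestion amounts to, assuming only the final log line matters for the comparison. `tailLogsCmd` and `lastLine` are hypothetical helpers written for this note, not code from the PR; `docker logs --tail N` limits the output to the last N lines, and `lastLine` shows the equivalent operation on logs that were already fetched in full.

```go
package main

import (
	"fmt"
	"strings"
)

// tailLogsCmd builds a docker logs invocation that returns only the last
// n lines of the container's output.
func tailLogsCmd(container string, n int) string {
	return fmt.Sprintf("docker logs --tail %d %s", n, container)
}

// lastLine returns the final non-empty line of already-fetched logs,
// ignoring trailing newlines and carriage returns.
func lastLine(logs string) string {
	trimmed := strings.TrimRight(logs, "\r\n")
	if i := strings.LastIndex(trimmed, "\n"); i >= 0 {
		return trimmed[i+1:]
	}
	return trimmed
}

func main() {
	// The container name matches the one used in the test under review.
	fmt.Println(tailLogsCmd("nvidia-cli-e2e", 1))

	// Made-up log content for illustration.
	fmt.Println(lastLine("installing packages\nGPU 0: Example GPU\n"))
}
```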

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from f84c038 to ecf9d64 Compare July 17, 2025 09:08
@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch 2 times, most recently from 0181dde to 3cf4410 Compare July 17, 2025 09:14
@ArangoGutierrez ArangoGutierrez requested a review from Copilot July 17, 2025 09:14
@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from 3cf4410 to bf65552 Compare July 17, 2025 09:15
@ArangoGutierrez ArangoGutierrez requested a review from Copilot July 17, 2025 09:16
@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from bf65552 to 72738db Compare July 17, 2025 09:20
@ArangoGutierrez ArangoGutierrez requested a review from Copilot July 17, 2025 09:21
@Copilot Copilot AI left a comment

Pull Request Overview

Adds an end-to-end test for the nvidia-container-cli to catch regressions in the libnvidia-container library.

  • Introduces a Ginkgo-driven E2E test that mounts a minimal Ubuntu rootfs and runs nvidia-smi inside a container.
  • Updates the CTK installer script to persist and reuse a temp directory across runs.
  • Simplifies environment variable loading and adds a focus filter in the Makefile for selective test runs.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/e2e/nvidia-container-cli_test.go | Adds a Ginkgo test that builds a minimal rootfs, configures it with nvidia-container-cli, and verifies GPU visibility. |
| tests/e2e/installer.go | Modifies the CTK installation template to read/write a persistent temp directory file. |
| tests/e2e/e2e_test.go | Removes conditional guards around image name/tag env vars so they’re always loaded. |
| tests/e2e/Makefile | Introduces GINKGO_FOCUS for targeted test execution and updates the ginkgo invocation. |

Comments suppressed due to low confidence (2)

tests/e2e/Makefile:23

  • [nitpick] The comment says all tests run when GINKGO_FOCUS is unset, but the default is nvidia-container-cli; update the comment to match or adjust the default to actually run all tests.
# If GINKGO_FOCUS is not set, run all tests

tests/e2e/nvidia-container-cli_test.go:37

  • The apt-get update command doesn’t accept -y and will error; remove the -y flag so it runs successfully, e.g., apt-get update && apt-get install -y curl gnupg2.
apt-get update -y && apt-get install -y curl gnupg2

@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch 9 times, most recently from ace18c6 to f0de6c1 Compare July 17, 2025 14:43
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez ArangoGutierrez force-pushed the e2e/nvidia-container-cli branch from f0de6c1 to ec003a9 Compare July 17, 2025 15:05