Commit 7b0e2ee

[no-relnote] Add E2E documentation
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
1 parent c75e5d9 commit 7b0e2ee

1 file changed: +140 -0 lines changed

tests/e2e/README.md

+140
@@ -0,0 +1,140 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# NVIDIA Container Toolkit – End‑to‑End (E2E) Test Suite

---

## 1 Scope & Goals
This directory contains a **Ginkgo v2 / Gomega** test harness that exercises an
NVIDIA Container Toolkit (CTK) installation on a **remote GPU‑enabled host** via
SSH. The suite validates that:

1. CTK can be installed (or upgraded) without user interaction (`INSTALL_CTK=true`).
2. The specified **container image** runs successfully under `nvidia-container-runtime`.
3. Errors and diagnostics are captured for post‑mortem analysis.

The tests are intended for continuous‑integration pipelines, nightly
compatibility runs, and pre‑release validation of new CTK builds.

---

## 2 Execution model
* The framework **does not** spin up a Kubernetes cluster; it drives a single
  host reachable over SSH.
* All commands run in a Ginkgo‑managed context (`ctx`) so they abort cleanly on
  timeout or Ctrl‑C.
* Environment discovery happens once, in `TestMain` → `getTestEnv()`; parameters
  are therefore immutable for the duration of the run.
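
The context-bound execution described above can be illustrated with a minimal, standard-library-only sketch. It is not the suite's actual code: `runWithDeadline` and `abortsOnDeadline` are hypothetical names, the command runs locally rather than over SSH, and a POSIX `sleep` binary is assumed to be on `PATH`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// runWithDeadline (hypothetical helper) runs a local command that is killed
// as soon as ctx is cancelled or its deadline expires.
func runWithDeadline(ctx context.Context, name string, args ...string) (string, error) {
	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	// Prefer the context error so callers can tell a timeout or cancellation
	// apart from an ordinary non-zero exit status.
	if ctxErr := ctx.Err(); ctxErr != nil {
		return string(out), fmt.Errorf("command aborted: %w", ctxErr)
	}
	return string(out), err
}

// abortsOnDeadline reports whether a long-running command is cut short by a
// 100 ms deadline (assumes a POSIX `sleep` binary on PATH).
func abortsOnDeadline() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	_, err := runWithDeadline(ctx, "sleep", "5")
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("aborted on deadline:", abortsOnDeadline())
}
```

In a Ginkgo spec the `ctx` would come from the framework instead of `context.WithTimeout`, so an interrupt or a spec timeout would cancel the command in the same way.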

---

## 3 Prerequisites

| Item | Version / requirement |
|------|-----------------------|
| **Go toolchain** | ≥ 1.22 (for building Ginkgo helper binaries) |
| **GPU‑enabled Linux host** | Running a supported NVIDIA driver; reachable via SSH |
| **SSH connectivity** | Public‑key authentication *without* pass‑phrase for unattended CI |
| **Local OS** | Linux/macOS; POSIX shell required by the Makefile |

---

## 4 Environment variables

| Variable | Required | Example | Description |
|----------|----------|---------|-------------|
| `INSTALL_CTK` | | `true` | When `true` the test installs CTK on the remote host before running the image. When `false` it assumes CTK is already present. |
| `TOOLKIT_IMAGE` | | `nvcr.io/nvidia/cuda:12.4.0-runtime-ubi9` | Image that will be pulled & executed. |
| `SSH_KEY` | | `/home/ci/.ssh/id_rsa` | Private key used for authentication. |
| `SSH_USER` | | `ubuntu` | Username on the remote host. |
| `REMOTE_HOST` | | `gpurunner01.corp.local` | Hostname or IP address of the target node. |
| `REMOTE_PORT` | | `22` | SSH port of the target node. |

> All variables are validated at start‑up; the suite aborts early with a clear
> message if any are missing or ill‑formed.
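
As an illustration of that start-up validation, the sketch below collects every problem into one error message. It is an assumption-laden example, not the real `getTestEnv()`: the `testEnv` struct, the `parseTestEnv` function, and the choice to treat all six variables as mandatory are hypothetical, inferred only from the note above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// testEnv is a hypothetical container for the parsed variables.
type testEnv struct {
	InstallCTK                   bool
	Image, SSHKey, SSHUser, Host string
	Port                         int
}

// parseTestEnv reads the variables through getenv (os.Getenv in real use) and
// reports every missing or malformed value in a single error.
func parseTestEnv(getenv func(string) string) (testEnv, error) {
	var env testEnv
	var problems []string

	// String-valued variables are only checked for presence.
	for _, v := range []struct {
		name string
		dst  *string
	}{
		{"TOOLKIT_IMAGE", &env.Image},
		{"SSH_KEY", &env.SSHKey},
		{"SSH_USER", &env.SSHUser},
		{"REMOTE_HOST", &env.Host},
	} {
		if *v.dst = getenv(v.name); *v.dst == "" {
			problems = append(problems, v.name+" is not set")
		}
	}

	install, err := strconv.ParseBool(getenv("INSTALL_CTK"))
	if err != nil {
		problems = append(problems, "INSTALL_CTK must be a boolean")
	}
	env.InstallCTK = install

	port, err := strconv.Atoi(getenv("REMOTE_PORT"))
	if err != nil || port < 1 || port > 65535 {
		problems = append(problems, "REMOTE_PORT must be an integer in 1-65535")
	}
	env.Port = port

	if len(problems) > 0 {
		return testEnv{}, fmt.Errorf("invalid test environment: %s", strings.Join(problems, "; "))
	}
	return env, nil
}

func main() {
	// Sample values taken from the table above; a real run would pass os.Getenv.
	sample := map[string]string{
		"INSTALL_CTK":   "true",
		"TOOLKIT_IMAGE": "nvcr.io/nvidia/cuda:12.4.0-runtime-ubi9",
		"SSH_KEY":       "/home/ci/.ssh/id_rsa",
		"SSH_USER":      "ubuntu",
		"REMOTE_HOST":   "gpurunner01.corp.local",
		"REMOTE_PORT":   "22",
	}
	env, err := parseTestEnv(func(k string) string { return sample[k] })
	fmt.Printf("%+v %v\n", env, err)
}
```

Passing the lookup function as a parameter keeps the parser testable without mutating the process environment.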

---

## 5 Build helper binaries

Install the latest Ginkgo CLI locally so that the Makefile can invoke it:

```bash
make ginkgo   # installs ./bin/ginkgo
```

The Makefile entry mirrors the pattern used in other NVIDIA E2E suites:

```make
bin/ginkgo:
	GOBIN=$(CURDIR)/bin go install github.com/onsi/ginkgo/v2/ginkgo@latest
```

---

## 6 Running the suite

### 6.1 Basic invocation
```bash
INSTALL_CTK=true \
TOOLKIT_IMAGE=nvcr.io/nvidia/cuda:12.4.0-runtime-ubi9 \
SSH_KEY=$HOME/.ssh/id_rsa \
SSH_USER=ubuntu \
REMOTE_HOST=10.0.0.15 \
REMOTE_PORT=22 \
make test-e2e
```
This installs CTK on the remote host (if requested), pulls the image there, and
executes a minimal CUDA‑based workload.

---

## 7 Internal test flow

| Phase | Key function(s) | Notes |
|-------|-----------------|-------|
| **Init** | `TestMain` → `getTestEnv` | Collects env vars, initializes `ctx`. |
| **Connection check** | `BeforeSuite` (not shown) | Verifies SSH reachability using `ssh -o BatchMode=yes`. |
| **Optional CTK install** | `installCTK == true` path | Runs the distro‑specific install script on the remote host. |
| **Runtime validation** | Leaf `It` blocks | Pulls `TOOLKIT_IMAGE`, runs `nvidia-smi` inside the container, asserts exit code `0`. |
| **Failure diagnostics** | `AfterEach` | Copies `/var/log/nvidia-container-runtime.log` and `dmesg` output to `${LOG_ARTIFACTS_DIR}` via `scp`. |
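
To make the connection-check phase concrete, here is a hedged sketch of how the `ssh -o BatchMode=yes` invocation could be assembled from the environment variables. The `sshArgs` function is hypothetical and may differ from what the suite really does; only standard OpenSSH client options (`-i`, `-p`, `-o BatchMode=yes`) are used.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sshArgs (hypothetical helper) builds the argument vector for the OpenSSH
// client. BatchMode=yes disables interactive prompts, so a missing or
// rejected key fails immediately instead of hanging a CI job.
func sshArgs(key, user, host string, port int, remoteCmd string) []string {
	return []string{
		"-i", key, // private key file (SSH_KEY)
		"-p", strconv.Itoa(port), // remote port (REMOTE_PORT)
		"-o", "BatchMode=yes",
		user + "@" + host, // SSH_USER and REMOTE_HOST
		remoteCmd,
	}
}

func main() {
	args := sshArgs("/home/ci/.ssh/id_rsa", "ubuntu", "10.0.0.15", 22, "true")
	fmt.Println("ssh " + strings.Join(args, " "))
	// prints: ssh -i /home/ci/.ssh/id_rsa -p 22 -o BatchMode=yes ubuntu@10.0.0.15 true
}
```

Running the remote command `true` is enough for a reachability probe: the exit status reflects only whether the connection and authentication succeeded.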

---

## 8 Extending the suite

1. Create a new `_test.go` file under `tests/e2e`.
2. Use the Ginkgo DSL (`Describe`, `When`, `It` …). Each leaf node receives a
   `context.Context` so you can run remote commands with deadline control.
3. Helper utilities such as `runSSH`, `withSudo`, and `collectLogs` are already
   available from the shared test harness (see `ssh_helpers.go`).
4. Keep tests **idempotent** and clean any artefacts you create on the host.

---

## 9 Common issues & fixes

| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| `Permission denied (publickey)` | Wrong `SSH_KEY` or `SSH_USER` | Check variables; ensure key is readable by the CI user. |
| `docker: Error response from daemon: could not select device driver` | CTK not installed or wrong runtime class | Verify `INSTALL_CTK=true` or confirm CTK installation on the host. |
| Test hangs at image pull | No outbound internet on remote host | Pre‑load the image or use a local registry mirror. |

## 10 License
Distributed under the terms of the **Apache License 2.0** (see header).
