## Problem
When using envbox with GPU passthrough in Kubernetes (`runtimeClassName: nvidia` plus `nvidia.com/gpu` resource limits), setting `CODER_ADD_GPU=true` and `CODER_USR_LIB_DIR=/var/coder/usr/lib` correctly passes the `/dev/nvidia*` device nodes through to the inner container, but the automatic library detection in `usrLibGPUs()` does not mount the required NVIDIA libraries into the inner container.
As a result, `nvidia-smi` inside the inner container fails with:

```
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
```

The outer container's `nvidia-smi` works fine.
## Workaround
Manually specifying the library mounts via `CODER_MOUNTS` resolves the issue:

```yaml
- name: CODER_MOUNTS
  value: "/var/coder/usr/lib/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1:ro,/var/coder/usr/lib/libnvidia-ptxjitcompiler.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1:ro,/var/coder/usr/lib/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro"
```
With both `CODER_ADD_GPU=true` (for device passthrough) and `CODER_MOUNTS` (for libraries), GPU passthrough works end-to-end without manually recreating the inner container.
## Environment
- envbox version: 0.6.5
- Kubernetes: `runtimeClassName: nvidia` with `nvidia.com/gpu: "1"` resource limits
- GPU: Tesla T4
- Host library path: `/usr/lib64`, mounted into the outer container at `/var/coder/usr/lib`
- Inner image tested: `nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2`
## Pod Spec (relevant sections)
```yaml
spec:
  runtimeClassName: nvidia
  containers:
    - image: ghcr.io/coder/envbox:0.6.5
      env:
        - name: CODER_INNER_IMAGE
          value: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
        - name: CODER_INNER_USERNAME
          value: root
        - name: CODER_ADD_GPU
          value: "true"
        - name: CODER_USR_LIB_DIR
          value: /var/coder/usr/lib
        - name: CODER_MOUNTS
          value: "/var/coder/usr/lib/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1:ro,/var/coder/usr/lib/libnvidia-ptxjitcompiler.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1:ro,/var/coder/usr/lib/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro"
      resources:
        limits:
          nvidia.com/gpu: "1"
      securityContext:
        privileged: true
      volumeMounts:
        - mountPath: /var/coder/usr/lib
          name: usr-lib
  volumes:
    - hostPath:
        path: /usr/lib64
        type: Directory
      name: usr-lib
```
## Expected Behavior
When `CODER_ADD_GPU=true` and `CODER_USR_LIB_DIR` are set, `usrLibGPUs()` should automatically detect and mount the NVIDIA libraries from the specified directory into the inner container, without requiring a manual `CODER_MOUNTS`.
## Possible Cause
The `usrLibGPUs()` function walks the `CODER_USR_LIB_DIR` directory looking for files matching `(?i)(libgl(e|sx|\.)|nvidia|vulkan|cuda)` with `.so` extensions. When the host path is `/usr/lib64` (common on RHEL/Amazon Linux), the library layout or symlink structure may differ from `/usr/lib/x86_64-linux-gnu` (Debian/Ubuntu), which is the path used in all the integration tests. The symlink resolution in `recursiveSymlinks()` or the path remapping logic in the GPU bind-mount code may not handle this correctly.
Additionally, on the Kubernetes `runtimeClassName: nvidia` path (versus Docker `--runtime=nvidia --gpus=all`), the NVIDIA device plugin may inject libraries differently than the NVIDIA container runtime does when invoked via Docker directly.
Created on behalf of @uzair-coder07