## Problem
When using envbox with GPU passthrough in Kubernetes (`runtimeClassName: nvidia` plus `nvidia.com/gpu` resource limits), setting `CODER_ADD_GPU=true` and `CODER_USR_LIB_DIR=/var/coder/usr/lib` correctly passes the `/dev/nvidia*` device nodes through to the inner container, but the automatic library detection in `usrLibGPUs()` does not mount the required NVIDIA libraries into the inner container.
As a result, `nvidia-smi` inside the inner container fails with:

```
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
```

The outer container's `nvidia-smi` works fine.
## Workaround
Manually specifying the library mounts via `CODER_MOUNTS` resolves the issue:

```yaml
- name: CODER_MOUNTS
  value: "/var/coder/usr/lib/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1:ro,/var/coder/usr/lib/libnvidia-ptxjitcompiler.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1:ro,/var/coder/usr/lib/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro"
```
With both `CODER_ADD_GPU=true` (for device passthrough) and `CODER_MOUNTS` (for libraries), GPU passthrough works end-to-end without manually recreating the inner container.
## Environment
- envbox version: 0.6.5
- Kubernetes: `runtimeClassName: nvidia` with `nvidia.com/gpu: "1"` resource limits
- GPU: Tesla T4
- Host library path: `/usr/lib64`, mounted into the outer container at `/var/coder/usr/lib`
- Inner image tested: `nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2`
## Pod Spec (relevant sections)
```yaml
spec:
  runtimeClassName: nvidia
  containers:
    - image: ghcr.io/coder/envbox:0.6.5
      env:
        - name: CODER_INNER_IMAGE
          value: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
        - name: CODER_INNER_USERNAME
          value: root
        - name: CODER_ADD_GPU
          value: "true"
        - name: CODER_USR_LIB_DIR
          value: /var/coder/usr/lib
        - name: CODER_MOUNTS
          value: "/var/coder/usr/lib/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1:ro,/var/coder/usr/lib/libnvidia-ptxjitcompiler.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1:ro,/var/coder/usr/lib/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro"
      resources:
        limits:
          nvidia.com/gpu: "1"
      securityContext:
        privileged: true
      volumeMounts:
        - mountPath: /var/coder/usr/lib
          name: usr-lib
  volumes:
    - hostPath:
        path: /usr/lib64
        type: Directory
      name: usr-lib
```
## Expected Behavior
When `CODER_ADD_GPU=true` and `CODER_USR_LIB_DIR` are set, `usrLibGPUs()` should automatically detect and mount the NVIDIA libraries from the specified directory into the inner container, without requiring a manual `CODER_MOUNTS`.
## Possible Cause
The `usrLibGPUs()` function walks the `CODER_USR_LIB_DIR` directory looking for files matching `(?i)(libgl(e|sx|\.)|nvidia|vulkan|cuda)` with `.so` extensions. When the host path is `/usr/lib64` (common on RHEL/Amazon Linux), the library layout or symlink structure may differ from `/usr/lib/x86_64-linux-gnu` (Debian/Ubuntu), which is the path used in all the integration tests. The symlink resolution in `recursiveSymlinks()` or the path remapping logic in the GPU bind-mount code may not handle this correctly.
Additionally, on the Kubernetes `runtimeClassName: nvidia` path (versus Docker `--runtime=nvidia --gpus=all`), the NVIDIA device plugin may inject libraries differently than the NVIDIA container runtime does when invoked via Docker directly.
Created on behalf of @uzair-coder07