
nvidia-container-toolkit: Some cuda libraries do not work without extra LD_LIBRARY_PATH #366109

@sliedes

Describe the bug

I'm running eScriptorium in a Docker container with GPU access, set up following:

https://gitlab.com/scripta/escriptorium/-/wikis/docker-install

GPU training fails out of the box, and the log suggests that libcuda.so cannot be loaded:

celery-gpu-1           | GPU available: True (cuda), used: True
celery-gpu-1           | TPU available: False, using: 0 TPU cores
celery-gpu-1           | IPU available: False, using: 0 IPUs
celery-gpu-1           | HPU available: False, using: 0 HPUs
celery-gpu-1           | `Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
celery-gpu-1           | You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
celery-gpu-1           | [2024-12-18 09:09:03,469: INFO/ForkPoolWorker-1] Creating new model [1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do0.1,2 Lbx200 Do] with 77 outputs
celery-gpu-1           | [2024-12-18 09:09:03,680: INFO/ForkPoolWorker-1] Adding 1 dummy labels to validation set codec.
celery-gpu-1           | [2024-12-18 09:09:03,686: INFO/ForkPoolWorker-1] Setting seg_type to baselines.
celery-gpu-1           | LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
celery-gpu-1           | Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
celery-gpu-1           | [2024-12-18 09:09:17,657: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:221 exited with 'signal 6 (SIGABRT)'
celery-gpu-1           | [2024-12-18 09:09:17,669: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 6 (SIGABRT) Job: 0.')
celery-gpu-1           | Traceback (most recent call last):
celery-gpu-1           |   File "/usr/local/lib/python3.8/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
celery-gpu-1           |     raise WorkerLostError(
celery-gpu-1           | billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 6 (SIGABRT) Job: 0.
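
To check whether the failure is in the dynamic loader rather than in cuDNN or the training code, both library names can be probed directly from Python inside the container. This is only a minimal sketch: `can_load` and `probe_libcuda` are hypothetical helper names, and it assumes that `ctypes.CDLL` resolves names through roughly the same `dlopen` search path that cuDNN uses, which may not hold if cuDNN carries its own RUNPATH.

```python
import ctypes


def can_load(name: str) -> bool:
    """Return True if the dynamic loader can resolve and load `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when dlopen() cannot find or load the library.
        return False
    return True


def probe_libcuda() -> dict:
    """Try the unversioned name and the versioned soname separately."""
    return {name: can_load(name) for name in ("libcuda.so", "libcuda.so.1")}


if __name__ == "__main__":
    for name, ok in probe_libcuda().items():
        print(f"{name}: {'loadable' if ok else 'NOT loadable'}")
```

If the analysis in this report is right, `libcuda.so.1` should load in the affected container while `libcuda.so` should not, unless `LD_LIBRARY_PATH` is set.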

Adding LD_LIBRARY_PATH=/usr/local/nvidia/lib64 to the environment fixes this issue.
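
For anyone who launches the worker from a wrapper script, the workaround can be sketched as below. With glibc, the loader reads `LD_LIBRARY_PATH` when a process starts, so the variable has to be in place before the worker process is created; changing `os.environ` inside an already running interpreter is not expected to help. The helper name `env_with_library_path` is hypothetical, and the child command is just a stand-in for the real worker.

```python
import os
import subprocess
import sys


def env_with_library_path(env: dict, directory: str) -> dict:
    """Return a copy of `env` with `directory` prepended to LD_LIBRARY_PATH."""
    new_env = dict(env)
    existing = new_env.get("LD_LIBRARY_PATH", "")
    # Drop empty entries so no stray ':' ends up in the result.
    parts = [p for p in existing.split(":") if p]
    if directory not in parts:
        parts.insert(0, directory)
    new_env["LD_LIBRARY_PATH"] = ":".join(parts)
    return new_env


if __name__ == "__main__":
    env = env_with_library_path(os.environ, "/usr/local/nvidia/lib64")
    # Stand-in child process: it only prints the variable it was started with.
    subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ['LD_LIBRARY_PATH'])"],
        env=env,
        check=True,
    )
```

The function copies the mapping instead of mutating it, so the caller's own environment stays untouched.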

I believe this happens because the /nix/store path in the ld.so search path contains only libcuda.so.1, while /usr/local/nvidia/lib64 also contains the unversioned libcuda.so:

# ls -l /usr/local/nvidia/lib64/libcuda.so*
lrwxrwxrwx 1 root root       12 Jan  1  1970 /usr/local/nvidia/lib64/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       17 Jan  1  1970 /usr/local/nvidia/lib64/libcuda.so.1 -> libcuda.so.565.77
-r-xr-xr-x 1 root root 49572768 Jan  1  1970 /usr/local/nvidia/lib64/libcuda.so.565.77
# cat /etc/ld.so.conf.d/nvcr-3734471176.conf
/nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib
# ls -l /nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/libcuda.so*
lrwxrwxrwx 1 root root       17 Dec 18 09:34 /nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/libcuda.so.1 -> libcuda.so.565.77
-r-xr-xr-x 1 root root 49572768 Jan  1  1970 /nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/libcuda.so.565.77

... while cuDNN tries to load the unversioned libcuda.so.
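
The same comparison can be scripted, which may help when checking other driver versions or containers. This is a rough sketch under the assumption that the only thing that matters is whether an entry named exactly `libcuda.so` exists in each directory; `libcuda_names` and `has_unversioned_libcuda` are hypothetical helper names, and the directories to inspect are passed on the command line.

```python
import os
import sys


def libcuda_names(directory: str) -> list:
    """List entries in `directory` whose name starts with 'libcuda.so'."""
    try:
        entries = os.listdir(directory)
    except OSError:
        # Missing or unreadable directory: report nothing found.
        return []
    return sorted(e for e in entries if e.startswith("libcuda.so"))


def has_unversioned_libcuda(directory: str) -> bool:
    """True if the directory has an entry named exactly 'libcuda.so'."""
    return "libcuda.so" in libcuda_names(directory)


if __name__ == "__main__":
    for directory in sys.argv[1:]:
        status = "has" if has_unversioned_libcuda(directory) else "lacks"
        print(f"{directory}: {status} libcuda.so; found {libcuda_names(directory)}")
```

Run it with the two directories from the listing above as arguments to compare them side by side.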

Metadata

  • system: "x86_64-linux"
  • host os: Linux 6.6.64, NixOS, 25.05 (Warbler), 25.05.20241213.3566ab7
  • multi-user?: yes
  • sandbox: yes
  • version: nix-env (Nix) 2.24.10
  • channels(root): "nixos"
  • nixpkgs: /nix/store/22r7q7s9552gn1vpjigkbhfgcvhsrz68-source

Notify maintainers

@SomeoneSerge @ereslibre

Relevant tracking bug: #290609


Note for maintainers: Please tag this issue in your PR.


Add a 👍 reaction to issues you find important.
