Describe the bug
I'm running this in a Docker image with GPU support:
https://gitlab.com/scripta/escriptorium/-/wikis/docker-install
GPU training fails out of the box, suggesting that libcuda.so cannot be loaded:
celery-gpu-1 | GPU available: True (cuda), used: True
celery-gpu-1 | TPU available: False, using: 0 TPU cores
celery-gpu-1 | IPU available: False, using: 0 IPUs
celery-gpu-1 | HPU available: False, using: 0 HPUs
celery-gpu-1 | `Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
celery-gpu-1 | You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
celery-gpu-1 | [2024-12-18 09:09:03,469: INFO/ForkPoolWorker-1] Creating new model [1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do0.1,2 Lbx200 Do] with 77 outputs
celery-gpu-1 | [2024-12-18 09:09:03,680: INFO/ForkPoolWorker-1] Adding 1 dummy labels to validation set codec.
celery-gpu-1 | [2024-12-18 09:09:03,686: INFO/ForkPoolWorker-1] Setting seg_type to baselines.
celery-gpu-1 | LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
celery-gpu-1 | Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
celery-gpu-1 | [2024-12-18 09:09:17,657: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:221 exited with 'signal 6 (SIGABRT)'
celery-gpu-1 | [2024-12-18 09:09:17,669: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 6 (SIGABRT) Job: 0.')
celery-gpu-1 | Traceback (most recent call last):
celery-gpu-1 | File "/usr/local/lib/python3.8/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
celery-gpu-1 | raise WorkerLostError(
celery-gpu-1 | billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 6 (SIGABRT) Job: 0.
Adding LD_LIBRARY_PATH=/usr/local/nvidia/lib64 to the environment fixes this issue.
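For anyone hitting the same problem, the workaround amounts to something like the following inside the affected container. How the variable actually reaches the celery-gpu service (compose file, env file, Dockerfile) depends on your setup, so treat this as a sketch:
# make the dynamic loader search the directory that contains the unversioned libcuda.so
export LD_LIBRARY_PATH=/usr/local/nvidia/lib64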
I believe this happens because the /nix/store path that is on the ld.so search path only contains libcuda.so.1, while /usr/local/nvidia also contains the unversioned libcuda.so:
# ls -l /usr/local/nvidia/lib64/libcuda.so*
lrwxrwxrwx 1 root root 12 Jan 1 1970 /usr/local/nvidia/lib64/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 17 Jan 1 1970 /usr/local/nvidia/lib64/libcuda.so.1 -> libcuda.so.565.77
-r-xr-xr-x 1 root root 49572768 Jan 1 1970 /usr/local/nvidia/lib64/libcuda.so.565.77
# cat /etc/ld.so.conf.d/nvcr-3734471176.conf
/nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib
# ls -l /nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/libcuda.so*
lrwxrwxrwx 1 root root 17 Dec 18 09:34 /nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/libcuda.so.1 -> libcuda.so.565.77
-r-xr-xr-x 1 root root 49572768 Jan 1 1970 /nix/store/mvl6kwi86n35pqf601raka1ncp3zkdgy-nvidia-x11-565.77-6.6.64/lib/libcuda.so.565.77
... while cuDNN wants the unversioned libcuda.so.
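This is easy to verify from a shell in the container without starting a training run (assuming python3 is available there; ctypes.CDLL performs the same dlopen() lookup that cuDNN does):
# the loader cache lists only the versioned libcuda.so.1
ldconfig -p | grep libcuda
# reproduce the failing dlopen("libcuda.so") directly
python3 -c 'import ctypes; ctypes.CDLL("libcuda.so")'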
Metadata
- system: "x86_64-linux"
- host os: Linux 6.6.64, NixOS, 25.05 (Warbler), 25.05.20241213.3566ab7
- multi-user?: yes
- sandbox: yes
- version: nix-env (Nix) 2.24.10
- channels(root): "nixos"
- nixpkgs: /nix/store/22r7q7s9552gn1vpjigkbhfgcvhsrz68-source
Notify maintainers
Relevant tracking bug: #290609
Note for maintainers: Please tag this issue in your PR.
Add a 👍 reaction to issues you find important.