Describe the bug
When trying to train on 4 GPUs, accelerator.prepare fails with:
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
(the same FutureWarning is emitted once per worker process)
[rank0]:[W1209 15:48:50.005698502 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[rank0]:[E1209 15:51:50.982689656 ProcessGroupNCCL.cpp:542] [Rank 0] Collective WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=71996928, NumelOut=71996928, Timeout(ms)=600000) raised the following async exception: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2027 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cdea62ab446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7cde5b629f80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7cde5b62a1cc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7cde5b62a3e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7cde5b631b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7cde5b63361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x145c0 (0x7cdea64125c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #7: <unknown function> + 0x94ac3 (0x7cdeab3b6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7cdeab447a04 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[E1209 15:58:48.760213719 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600017 milliseconds before timing out.
[rank3]:[E1209 15:58:48.760490965 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]: Traceback (most recent call last):
[rank3]: File "/app/src/XXX/mvc.py", line 23, in <module>
[rank3]: transformer = accelerator.prepare(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1339, in prepare
[rank3]: result = tuple(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank3]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank3]: return self.prepare_model(obj, device_placement=device_placement)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1469, in prepare_model
[rank3]: model = torch.nn.parallel.DistributedDataParallel(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank3]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank3]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank3]: RuntimeError: DDP expects same model across all ranks, but Rank 3 has 444 params, while rank 0 has inconsistent 0 params.
[rank1]:[E1209 15:58:48.773811212 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600030 milliseconds before timing out.
[rank1]:[E1209 15:58:48.774063321 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1209 15:58:48.832451137 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
[rank2]:[E1209 15:58:48.832673572 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]: Traceback (most recent call last):
[rank2]: File "/app/src/XXXX/mvc.py", line 23, in <module>
[rank2]: transformer = accelerator.prepare(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1339, in prepare
[rank2]: result = tuple(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank2]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank2]: return self.prepare_model(obj, device_placement=device_placement)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1469, in prepare_model
[rank2]: model = torch.nn.parallel.DistributedDataParallel(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank2]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank2]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank2]: RuntimeError: DDP expects same model across all ranks, but Rank 2 has 444 params, while rank 0 has inconsistent 0 params.
[rank3]:[E1209 15:58:48.851631946 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1209 15:58:48.851654029 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1209 15:58:48.851658366 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1209 15:58:48.852380321 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600017 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7024572c2446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x70240c62a772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x70240c631bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70240c63361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x70245741d5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x70245c3cfac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x70245c460a04 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600017 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7024572c2446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x70240c62a772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x70240c631bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70240c63361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x70245741d5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x70245c3cfac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x70245c460a04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7024572c2446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x70240c2a071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x70245741d5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x70245c3cfac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x70245c460a04 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]: Traceback (most recent call last):
[rank1]: File "/app/src/XXX/mvc.py", line 23, in <module>
[rank1]: transformer = accelerator.prepare(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1339, in prepare
[rank1]: result = tuple(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank1]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank1]: return self.prepare_model(obj, device_placement=device_placement)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1469, in prepare_model
[rank1]: model = torch.nn.parallel.DistributedDataParallel(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank1]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank1]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank1]: RuntimeError: DDP expects same model across all ranks, but Rank 1 has 444 params, while rank 0 has inconsistent 0 params.
[rank1]:[E1209 15:58:48.961469497 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1209 15:58:48.961491710 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1209 15:58:48.961496097 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1209 15:58:48.962231472 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600030 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c9a88be4446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7c9a3de2a772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c9a3de31bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c9a3de3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7c9a88d3f5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7c9a8dceeac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7c9a8dd7fa04 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600030 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c9a88be4446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7c9a3de2a772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c9a3de31bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c9a3de3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7c9a88d3f5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7c9a8dceeac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7c9a8dd7fa04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c9a88be4446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7c9a3daa071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7c9a88d3f5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7c9a8dceeac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7c9a8dd7fa04 in /lib/x86_64-linux-gnu/libc.so.6)
W1209 15:58:49.098000 2511 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2642 closing signal SIGTERM
W1209 15:58:49.098000 2511 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2643 closing signal SIGTERM
E1209 15:58:49.276000 2511 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 3 (pid: 2644) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
mvc.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-12-09_15:58:49
host : ffb6a0a40226
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 2644)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2644
=====================================================
Reproduction
Run the following script
from diffusers import StableAudioDiTModel
from accelerate import Accelerator
from accelerate.logging import get_logger  # unused in this minimal repro
from accelerate.utils import ProjectConfiguration, set_seed  # set_seed unused in this minimal repro

output_dir = '/tmp'
logging_dir = '/tmp'

# Load the Stable Audio DiT transformer from the Hub
transformer = StableAudioDiTModel.from_pretrained(
    "stabilityai/stable-audio-open-1.0",
    subfolder="transformer",
    use_safetensors=True,
)

accelerator_project_config = ProjectConfiguration(
    project_dir=output_dir, logging_dir=logging_dir
)

accelerator = Accelerator(
    gradient_accumulation_steps=1,
    mixed_precision="no",
    log_with=None,
    project_config=accelerator_project_config,
)

# Fails here when launched on 4 GPUs (DDP param-count mismatch on ranks 1-3, NCCL error/timeout on rank 0)
transformer = accelerator.prepare(transformer)
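The exact launch command isn't included above; judging from the traceback (accelerate's multi_gpu_launcher), it is assumed to be along the lines of:

accelerate launch --multi_gpu --num_processes 4 mvc.py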
Logs
No response
System Info
- NVIDIA Driver Version: 550.120
- CUDA Version: 12.4

Installed NVIDIA/CUDA packages:
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
Output of diffusers-cli env:
- 🤗 Diffusers version: 0.31.0
- Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.35
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.26.5
- Transformers version: 4.47.0
- Accelerate version: 1.2.0
- PEFT version: not installed
- Bitsandbytes version: not installed
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA RTX 6000 Ada Generation, 49140 MiB (x4)
- Using GPU in script?: Yes (4x NVIDIA RTX 6000 Ada Generation)
- Using distributed or parallel set-up in script?: Yes, multi-GPU DDP via accelerate launch (4 processes)
Who can help?
No response
Things I've tried
- Using diffusers from main
- Using transformers from main
- Using a simpler accelerate yaml
- Making sure it works on single-GPU training
- Loading an SD3 transformer to see if it's specific to the Stable Audio DiT - it fails in a similar fashion (see below)
- Disabling P2P with NCCL_P2P_DISABLE=1 - in this case it works, although I feel it's much slower (command sketch after this list)
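For reference, a sketch of the P2P-disabled invocation that does run; the exact flags are an assumption, mirroring the launch command given in the Reproduction section:

NCCL_P2P_DISABLE=1 accelerate launch --multi_gpu --num_processes 4 mvc.py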