Describe the bug
When trying to train on 4 GPUs, accelerator.prepare fails with:
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
(the same FutureWarning is emitted once per worker process)
[rank0]:[W1209 15:48:50.005698502 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[rank0]:[E1209 15:51:50.982689656 ProcessGroupNCCL.cpp:542] [Rank 0] Collective WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=71996928, NumelOut=71996928, Timeout(ms)=600000) raised the following async exception: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2027 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cdea62ab446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7cde5b629f80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7cde5b62a1cc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7cde5b62a3e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7cde5b631b5a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7cde5b63361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x145c0 (0x7cdea64125c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #7: <unknown function> + 0x94ac3 (0x7cdeab3b6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7cdeab447a04 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[E1209 15:58:48.760213719 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600017 milliseconds before timing out.
[rank3]:[E1209 15:58:48.760490965 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]: Traceback (most recent call last):
[rank3]: File "/app/src/XXX/mvc.py", line 23, in <module>
[rank3]: transformer = accelerator.prepare(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1339, in prepare
[rank3]: result = tuple(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank3]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank3]: return self.prepare_model(obj, device_placement=device_placement)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1469, in prepare_model
[rank3]: model = torch.nn.parallel.DistributedDataParallel(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank3]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank3]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank3]: RuntimeError: DDP expects same model across all ranks, but Rank 3 has 444 params, while rank 0 has inconsistent 0 params.
[rank1]:[E1209 15:58:48.773811212 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600030 milliseconds before timing out.
[rank1]:[E1209 15:58:48.774063321 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1209 15:58:48.832451137 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
[rank2]:[E1209 15:58:48.832673572 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]: Traceback (most recent call last):
[rank2]: File "/app/src/XXXX/mvc.py", line 23, in <module>
[rank2]: transformer = accelerator.prepare(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1339, in prepare
[rank2]: result = tuple(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank2]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank2]: return self.prepare_model(obj, device_placement=device_placement)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1469, in prepare_model
[rank2]: model = torch.nn.parallel.DistributedDataParallel(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank2]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank2]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank2]: RuntimeError: DDP expects same model across all ranks, but Rank 2 has 444 params, while rank 0 has inconsistent 0 params.
[rank3]:[E1209 15:58:48.851631946 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1209 15:58:48.851654029 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1209 15:58:48.851658366 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1209 15:58:48.852380321 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600017 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7024572c2446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x70240c62a772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x70240c631bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70240c63361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x70245741d5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x70245c3cfac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x70245c460a04 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600017 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7024572c2446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x70240c62a772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x70240c631bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70240c63361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x70245741d5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x70245c3cfac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x70245c460a04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7024572c2446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x70240c2a071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x70245741d5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x70245c3cfac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x70245c460a04 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]: Traceback (most recent call last):
[rank1]: File "/app/src/XXX/mvc.py", line 23, in <module>
[rank1]: transformer = accelerator.prepare(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1339, in prepare
[rank1]: result = tuple(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank1]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank1]: return self.prepare_model(obj, device_placement=device_placement)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1469, in prepare_model
[rank1]: model = torch.nn.parallel.DistributedDataParallel(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank1]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
[rank1]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank1]: RuntimeError: DDP expects same model across all ranks, but Rank 1 has 444 params, while rank 0 has inconsistent 0 params.
[rank1]:[E1209 15:58:48.961469497 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1209 15:58:48.961491710 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1209 15:58:48.961496097 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1209 15:58:48.962231472 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600030 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c9a88be4446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7c9a3de2a772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c9a3de31bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c9a3de3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7c9a88d3f5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7c9a8dceeac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7c9a8dd7fa04 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600030 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c9a88be4446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7c9a3de2a772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c9a3de31bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c9a3de3361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7c9a88d3f5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7c9a8dceeac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7c9a8dd7fa04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c9a88be4446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7c9a3daa071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7c9a88d3f5c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7c9a8dceeac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7c9a8dd7fa04 in /lib/x86_64-linux-gnu/libc.so.6)
W1209 15:58:49.098000 2511 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2642 closing signal SIGTERM
W1209 15:58:49.098000 2511 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2643 closing signal SIGTERM
E1209 15:58:49.276000 2511 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 3 (pid: 2644) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
mvc.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-12-09_15:58:49
host : ffb6a0a40226
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 2644)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2644
=====================================================
Reproduction
Run the following script
from diffusers import StableAudioDiTModel
from accelerate import Accelerator
from accelerate.logging import get_logger  # unused in this minimal repro
from accelerate.utils import ProjectConfiguration, set_seed  # set_seed unused in this minimal repro

output_dir = '/tmp'
logging_dir = '/tmp'

# Load the Stable Audio DiT transformer from the Hub
transformer = StableAudioDiTModel.from_pretrained(
    "stabilityai/stable-audio-open-1.0",
    subfolder="transformer",
    use_safetensors=True,
)

accelerator_project_config = ProjectConfiguration(
    project_dir=output_dir, logging_dir=logging_dir
)

accelerator = Accelerator(
    gradient_accumulation_steps=1,
    mixed_precision="no",
    log_with=None,
    project_config=accelerator_project_config,
)

# Fails here when launched on 4 GPUs (DDP param-count mismatch on ranks 1-3, NCCL error/timeout on rank 0)
transformer = accelerator.prepare(transformer)
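The exact launch command isn't included above; judging from the traceback (accelerate's multi_gpu_launcher), it is assumed to be along the lines of:

accelerate launch --multi_gpu --num_processes 4 mvc.py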
Logs
No response
System Info
- NVIDIA Driver Version: 550.120
- CUDA Version: 12.4

Installed NVIDIA/CUDA packages:
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
Output of diffusers-cli env:
- 🤗 Diffusers version: 0.31.0
- Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.35
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.26.5
- Transformers version: 4.47.0
- Accelerate version: 1.2.0
- PEFT version: not installed
- Bitsandbytes version: not installed
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA RTX 6000 Ada Generation, 49140 MiB (x4)
- Using GPU in script?: Yes (4x NVIDIA RTX 6000 Ada Generation)
- Using distributed or parallel set-up in script?: Yes, multi-GPU DDP via accelerate launch (4 processes)
Who can help?
No response
Things I've tried
- Using diffusers from main
- Using transformers from main
- Using a simpler accelerate yaml
- Making sure it works on single-GPU training
- Loading an SD3 transformer to see if it's specific to the Stable Audio DiT - it fails in a similar fashion (see below)
- Disabling P2P with NCCL_P2P_DISABLE=1 - in this case it works, although I feel it's much slower (command sketch after this list)
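For reference, a sketch of the P2P-disabled invocation that does run; the exact flags are an assumption, mirroring the launch command given in the Reproduction section:

NCCL_P2P_DISABLE=1 accelerate launch --multi_gpu --num_processes 4 mvc.py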