We are trying to train an SD3.5-Large DreamBooth model using the train_dreambooth_sd3.py script.
We are using an Azure server with an A100 GPU (80GB VRAM).
⚠️ We are running out of memory at step 0.
❕It does work without --train_text_encoder. It seems there might be a memory leak or an issue with training the text encoder in the current script/model.
❓Does it make sense that the model uses over 80GB of VRAM?
❓Do you have any recommendations for decreasing VRAM usage, other than:
- 8-bit Adam
- Mixed precision (fp16)
- xformers (which doesn't work with SD3.5)
SD3.5-Medium works on the same machine with the same parameters using 26 GB (80 GB with --train_text_encoder!).
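On the first question: yes, it can make sense. A rough back-of-envelope (our own estimate, not from the script: assuming the SD3.5-Large transformer has roughly 8B parameters, trainable weights and gradients are held in fp32, and 8-bit Adam keeps two 1-byte states per parameter) puts the transformer alone close to the 75.60 GiB the OOM report shows as allocated by PyTorch:

```python
# Back-of-envelope VRAM estimate for full fine-tuning of the transformer.
# Assumptions (ours): ~8B params, fp32 trainable weights + fp32 grads,
# 8-bit Adam = two 1-byte states per parameter. Activations, the VAE and
# the text encoders come on top of this.
params = 8e9

bytes_total = params * (4 + 4 + 2)  # weights + grads + optimizer states
print(f"~{bytes_total / 1024**3:.1f} GiB")  # ~74.5 GiB
```

Training the text encoders on top of that adds their weights, gradients and optimizer states as well, so blowing past 80 GB would be plausible rather than surprising.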
🔨 What we tried:
- Running at lower resolutions (down to 10x10).
- Increasing gradient accumulation steps.
- Debugging the Python file without Accelerate, which still crashed at the optimizer.step() line (see the diagnostic sketch after this list).
- Removing the T5 text encoder (the largest component, ~10 GB) from the script altogether.
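For the no-Accelerate debugging run, a small check like the following, placed just before optimizer.step(), can show how much is already allocated before the optimizer states are created. This is only a sketch with hypothetical model variable names; the torch calls themselves are standard:

```python
import torch

def memory_report(tag: str, **models: torch.nn.Module) -> None:
    # Count trainable parameters per component and dump live CUDA memory.
    for name, model in models.items():
        n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
        print(f"[{tag}] {name}: {n_trainable / 1e9:.2f}B trainable params")
    print(f"[{tag}] allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB, "
          f"reserved: {torch.cuda.memory_reserved() / 1024**3:.1f} GiB")

# Hypothetical call site, just before optimizer.step():
# memory_report("pre-step", transformer=transformer, text_encoder_one=text_encoder_one)
```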
🧪 These are our parameters:
!accelerate launch train_dreambooth_sd3.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-3.5-large" \
  --output_dir="sd_outputs" \
  --instance_data_dir="ogo" \
  --instance_prompt="the face of ogo person" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 \
  --gradient_checkpointing \
  --checkpointing_steps=200 \
  --learning_rate=2e-6 \
  --text_encoder_lr=1e-6 \
  --train_text_encoder \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=800 \
  --seed="0" \
  --use_8bit_adam \
  --mixed_precision="fp16"
👨🏻‍💻 Stacktrace:
2024-12-02 12:36:35.615846: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1733142995.629356 226993 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1733142995.633681 226993 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
12/02/2024 12:36:39 - INFO - __main__ - Distributed environment: DistributedType.NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: no
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'base_shift', 'max_image_seq_len', 'max_shift', 'base_image_seq_len', 'invert_sigmas', 'use_dynamic_shifting'} was not found in config. Values will be initialized to default values.
Downloading shards: 100%|███████████████████████| 2/2 [00:00<00:00, 3450.68it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:03<00:00, 1.73s/it]
Fetching 2 files: 100%|█████████████████████████| 2/2 [00:00<00:00, 7476.48it/s]
{'dual_attention_layers'} was not found in config. Values will be initialized to default values.
12/02/2024 12:37:04 - INFO - __main__ - ***** Running training *****
12/02/2024 12:37:04 - INFO - __main__ - Num examples = 1
12/02/2024 12:37:04 - INFO - __main__ - Num batches each epoch = 1
12/02/2024 12:37:04 - INFO - __main__ - Num Epochs = 800
12/02/2024 12:37:04 - INFO - __main__ - Instantaneous batch size per device = 1
12/02/2024 12:37:04 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 2
12/02/2024 12:37:04 - INFO - __main__ - Gradient Accumulation steps = 2
12/02/2024 12:37:04 - INFO - __main__ - Total optimization steps = 800
Steps: 0%| | 0/800 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/azureuser/Picturethis/Dima/train_dreambooth_sd3.py", line 1811, in
main(args)
File "/home/azureuser/Picturethis/Dima/train_dreambooth_sd3.py", line 1666, in main
optimizer.step()
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/accelerate/optimizer.py", line 171, in step
self.optimizer.step(closure)
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
return func.__get__(opt, opt.__class__)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/torch/optim/optimizer.py", line 487, in wrapper
out = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/bitsandbytes/optim/optimizer.py", line 288, in step
self.init_state(group, p, gindex, pindex)
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/bitsandbytes/optim/optimizer.py", line 474, in init_state
state["state2"] = self.get_state_buffer(p, dtype=torch.uint8)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/bitsandbytes/optim/optimizer.py", line 328, in get_state_buffer
return torch.zeros_like(p, dtype=dtype, device=p.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 10.62 MiB is free. Process 68964 has 530.00 MiB memory in use. Including non-PyTorch memory, this process has 78.45 GiB memory in use. Of the allocated memory 75.60 GiB is allocated by PyTorch, and 2.35 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Steps: 0%| | 0/800 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/home/azureuser/mambaforge/envs/picturevenv/bin/accelerate", line 8, in
sys.exit(main())
^^^^^^
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1168, in launch_command
simple_launcher(args)
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 763, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/azureuser/mambaforge/envs/picturevenv/bin/python3.11', 'train_dreambooth_sd3.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-3.5-large', '--output_dir=sd_outputs', '--instance_data_dir=ogo', '--instance_prompt=the face of ogo person', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=2', '--gradient_checkpointing', '--checkpointing_steps=200', '--learning_rate=2e-6', '--text_encoder_lr=1e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=800', '--seed=0', '--use_8bit_adam']' returned non-zero exit status 1.
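One small thing worth trying, since the OOM message itself points at the 2.35 GiB that is reserved but unallocated: enabling expandable segments in the CUDA caching allocator. This only mitigates fragmentation and will not win back tens of GiB, but it is cheap to test. The environment variable has to be set before the first CUDA allocation, e.g. at the very top of train_dreambooth_sd3.py:

```python
import os

# Must run before torch creates the CUDA context (i.e. before any tensor
# is moved to the GPU), otherwise the setting is ignored.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
```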