Is Parallel Processing Possible to Avoid Memory Overflow in ControlNet SD-XL? #7743
Replies: 4 comments 2 replies
-
Hello!
-
Hi Tolga,

Thank you for your prompt response. I attempted your suggestion by removing I also explored the strategies outlined on the Distributed inference page. For instance, I tried implementing It appears I'm missing a critical setup step somewhere: the example in the guide splits the work between two GPU processes, but in my case there is only one process.

Any insights or further guidance would be greatly appreciated. (I'm still unsure how to leverage all 48GB of combined GPU memory, rather than being limited to a single 24GB.)

Thanks very much.
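For context on the guide mentioned above: accelerate's `PartialState.split_between_processes` hands each launched process an even, contiguous slice of the inputs, so with only one process the entire prompt list lands on a single GPU. Below is a pure-Python sketch of that chunking idea — an illustration of the splitting behavior, not accelerate's actual implementation:

```python
def split_between_processes(inputs, num_processes, process_index):
    """Illustrative sketch of the even-chunking idea behind accelerate's
    PartialState.split_between_processes: each process receives a
    contiguous slice of `inputs`. (Not the library's real code; options
    such as apply_padding are omitted.)"""
    per_proc, remainder = divmod(len(inputs), num_processes)
    # Earlier processes absorb the remainder, one extra item each.
    start = process_index * per_proc + min(process_index, remainder)
    end = start + per_proc + (1 if process_index < remainder else 0)
    return inputs[start:end]

prompts = ["a cat", "a dog", "a boat"]
# One process (the situation in this thread): the "split" is a no-op.
print(split_between_processes(prompts, 1, 0))  # ['a cat', 'a dog', 'a boat']
# Two processes: each one (and hence each GPU) gets a share of the work.
print(split_between_processes(prompts, 2, 0))  # ['a cat', 'a dog']
print(split_between_processes(prompts, 2, 1))  # ['a boat']
```

The key point is that the guide's splitting only helps when the script is started with one process per GPU (e.g. `accelerate launch --num_processes 2 script.py`); run as a plain single process, nothing is distributed.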
-
cc @sayakpaul
-
If you want to keep the model-level components of your pipeline distributed across multiple devices, then you can use the
However, you will be bottlenecked by the speed there. If you are still running into memory issues, I suggest doing the following (ordered).
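A minimal sketch of what a two-GPU setup with diffusers' pipeline-level `device_map="balanced"` plus some common memory savers could look like. Assumptions: a recent diffusers release that accepts `device_map="balanced"` in `from_pretrained`, and illustrative model ids; note the cpu-offload calls are alternatives to the device map, not additions to it:

```python
def load_pipeline_across_two_gpus():
    """Hedged sketch: spread the SDXL ControlNet img2img pipeline's
    components across the available GPUs and cut peak VRAM.
    Imports are local so the sketch parses without diffusers installed."""
    import torch
    from diffusers import ControlNetModel, StableDiffusionXLControlNetImg2ImgPipeline

    controlnet = ControlNetModel.from_pretrained(
        "diffusers/controlnet-canny-sdxl-1.0",  # illustrative checkpoint
        torch_dtype=torch.float16,
    )
    pipe = StableDiffusionXLControlNetImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        controlnet=controlnet,
        torch_dtype=torch.float16,
        variant="fp16",
        device_map="balanced",  # place components across cuda:0 / cuda:1
    )
    # If memory is still tight, trade speed for VRAM *instead of* the
    # device map (cpu offload manages device placement itself):
    # pipe.enable_model_cpu_offload()       # move modules to GPU only when used
    # pipe.enable_sequential_cpu_offload()  # slowest, lowest peak memory
    pipe.enable_vae_slicing()  # decode batched images one at a time
    pipe.enable_vae_tiling()   # decode large images tile by tile
    return pipe
```

Note that a balanced device map only shards the model weights; each denoising step still runs sequentially, which is the speed bottleneck mentioned above.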
-
Hello,
I am using StableDiffusionXLControlNetImg2ImgPipeline from the diffusers library. My system is equipped with two GPUs, each with 24GB of memory. Despite this, I've noticed that only one GPU is actively being used during processing. The total available GPU memory is thus incorrectly perceived as 24GB, whereas it should be 48GB when considering both GPUs.
This limitation in GPU utilization is causing CUDA out-of-memory errors as the program exhausts available memory on the single active GPU. I am seeking advice on how to effectively modify my code to configure and leverage a multi-GPU setup, allowing me to fully utilize the combined resources of both GPUs. Below is a snippet of my code:
Thank you so very much!