Description
While exploring the optimizations listed in the documentation, I found that I cannot free GPU memory after using torch.compile
on a StableDiffusionXLPipeline UNet.
import gc

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
'stabilityai/stable-diffusion-xl-base-1.0',
torch_dtype=torch.float16,
variant="fp16",
use_safetensors=True
).to('cuda')
# Compile UNet
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
generator = torch.Generator(device="cuda").manual_seed(42)
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt=prompt, num_inference_steps=20, generator=generator).images[0]
del pipe
gc.collect()
torch._dynamo.reset()
torch.cuda.empty_cache()
torch.cuda.synchronize()
# GPU memory is still in use here; without compiling the UNet, the same cleanup frees it.
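
For reference, this is roughly how I look at the remaining usage (a minimal sketch; the helper name and output format are mine, torch.cuda.memory_allocated and torch.cuda.memory_reserved are the standard PyTorch calls):

def report_cuda_memory(tag: str) -> None:
    # memory_allocated: bytes currently held by live tensors
    # memory_reserved: bytes kept by the CUDA caching allocator (close to what nvidia-smi shows)
    allocated_gib = torch.cuda.memory_allocated() / 1024**3
    reserved_gib = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated={allocated_gib:.2f} GiB | reserved={reserved_gib:.2f} GiB")

# Called right after the cleanup above: without torch.compile both values drop
# back near zero, with torch.compile a large chunk stays reserved.
report_cuda_memory("after cleanup")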
Being able to free the GPU memory is sometimes useful, especially when you want to load and compile another pipeline checkpoint to run another large batch of generations.
I put together a code reproduction in Colab for testing.
Am I missing something? Or could this be a memory leak on the compilation backend side, in which case it might be better to raise this with the PyTorch team?
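
In case I am missing a step, the most thorough teardown I could think of is sketched below; the ordering and the use of the wrapper's _orig_mod attribute are my own guesses about what might keep references alive, not documented recommendations:

# Drop the reference to the compiled wrapper first, so only the original
# module (reachable through _orig_mod on the OptimizedModule returned by
# torch.compile) remains, then release everything.
pipe.unet = pipe.unet._orig_mod
del pipe
gc.collect()

# Clear dynamo caches and compiled artifacts, then release the
# CUDA caching allocator's blocks.
torch._dynamo.reset()
gc.collect()
torch.cuda.synchronize()
torch.cuda.empty_cache()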
System Info
python: 3.10.12
diffusers: 0.30.3
torch: 2.4.1+cu121
Running on Google Colab?: Yes