-
Sometimes you can work around memory issues by thinking a little differently. If you want to generate multiple images with SDXL base + refiner but don't have the memory for both models at once, don't alternate base -> refiner for every image. Instead, run all the base passes first (base, base, base, base), then run the refiner over the results, so you're not swapping the models in and out with each image.
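A minimal sketch of that pattern with `diffusers` (the model IDs, the 0.8 denoising split, and the prompt batch are just illustrative choices):

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

prompts = ["an astronaut riding a horse"] * 4  # example batch

# Stage 1: run the base model for every prompt, keeping only the latents.
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
latents = [
    base(prompt=p, denoising_end=0.8, output_type="latent").images
    for p in prompts
]
del base
torch.cuda.empty_cache()  # free VRAM before the refiner is loaded

# Stage 2: load the refiner once and refine every latent.
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")
images = [
    refiner(prompt=p, image=lat, denoising_start=0.8).images[0]
    for p, lat in zip(prompts, latents)
]
```

This way each model is loaded exactly once per batch instead of once per image.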
-
This paper, Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models (https://arxiv.org/abs/2312.09608), conducts a thorough empirical study of the UNet encoder features in diffusion models and introduces an encoder propagation scheme to efficiently accelerate diffusion sampling.
-
This fp8 stuff looks like it can seriously reduce the VRAM requirement: AUTOMATIC1111/stable-diffusion-webui#14031 ... without compromising quality. I wonder whether that would work on SVD/SVD-XT too. Currently, these models need at least 12 GB of VRAM to run; otherwise they spill into shared memory, with extreme slowdowns as a result, basically making them unusable locally on consumer computers.
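As a rough illustration of the idea (this is not what the linked PR actually does, just a naive sketch): PyTorch >= 2.1 exposes `torch.float8_e4m3fn`, so one can store `nn.Linear` weights in fp8 and upcast per layer at compute time. `FP8Linear` and `convert_linears_to_fp8` are hypothetical helper names, and the cast here has no per-tensor scaling:

```python
import torch

class FP8Linear(torch.nn.Module):
    """Keeps the weight in fp8 (half the VRAM of fp16) and upcasts at compute time."""

    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        # Naive cast, no per-tensor scaling -- fine as a sketch, lossy in practice.
        self.register_buffer("weight_fp8", linear.weight.detach().to(torch.float8_e4m3fn))
        bias = linear.bias.detach().to(torch.float16) if linear.bias is not None else None
        self.register_buffer("bias_fp16", bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only this layer's fp16 weight copy exists at any given moment.
        w = self.weight_fp8.to(torch.float16)
        return torch.nn.functional.linear(x.to(torch.float16), w, self.bias_fp16)

def convert_linears_to_fp8(module: torch.nn.Module) -> None:
    """Recursively swap every nn.Linear in a model (e.g. a UNet) for FP8Linear."""
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Linear):
            setattr(module, name, FP8Linear(child))
        else:
            convert_linears_to_fp8(child)
```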
-
Honestly, CoreML has this one solved for the most part. I love how the inference latency drops from 50 seconds to 3 seconds. We need a generalised process that can produce a CoreML-style compiled model for CPU, CUDA, ROCm, XPU, and MPS.
-
Related threads:
-
We recently published Accelerating Generative AI Part III: Diffusion, Fast, which shows how to significantly speed up diffusion model inference.

We showed this on an 80GB A100. The techniques presented in the post are largely applicable to relatively modern GPU cards, the 4090 for example. This is because things like `scaled_dot_product_attention()` (SDPA) and `torch.compile()` only tend to show their benefits on relatively modern GPUs. Both of them are crucial for squeezing the maximum performance out of these models.

But how can we obtain similar speedups on more consumer-friendly cards such as the 3060, T4, V100, etc.? What are the challenges, bottlenecks, and limitations in doing so? This thread is for discussing exactly that.
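For reference, a minimal sketch of applying both optimizations to an SDXL pipeline (SDPA is already the default attention backend in PyTorch 2.x / recent `diffusers`, so only `torch.compile()` needs explicit code; the compile flags and prompt are illustrative):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# The UNet dominates the runtime, so compile it; channels_last tends to help
# convolution-heavy models on modern GPUs.
pipe.unet.to(memory_format=torch.channels_last)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# The first call is slow because of compilation; subsequent calls get the speedup.
image = pipe("an astronaut riding a horse", num_inference_steps=30).images[0]
```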
**Important**

First of all, it looks like we're quite bottlenecked by what `torch.compile()` and SDPA have to offer for these more consumer-friendly cards. This is evident from this study: the relative speedup from SDPA and `torch.compile()` tends to diminish as we move to these cards. So, I think we need to target this problem from an architectural point of view.

Here are some points that might be worth trying out. We're trying to optimize along both axes, memory and speed, because memory on more consumer-friendly cards usually doesn't exceed 16 GB.
`diffusers` allows reducing memory usage with "offloading". More details are here. Offloading comes at the cost of increased inference latency because of the device-placement overhead.
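A minimal example of model offloading in `diffusers` (requires `accelerate`; the model ID and prompt are illustrative):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# Move each component (text encoders, UNet, VAE) to the GPU only while it is
# being used; do not call pipe.to("cuda") when using offloading.
pipe.enable_model_cpu_offload()

# Or, for an even smaller VRAM footprint at a much larger latency cost:
# pipe.enable_sequential_cpu_offload()

image = pipe("an astronaut riding a horse").images[0]
```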
**Warning**

Below, I would like to present some alternative approaches that don't use the original SDXL model, because its memory requirements are a limiting factor and it also adds to the inference latency.
Below are some techniques that are hardware-agnostic and target inference latency while also reducing memory in some cases. They should complement what's discussed in Accelerating Generative AI Part III: Diffusion, Fast.