-
Sometimes you can work around memory issues by thinking a little differently. If you want to generate multiple images with SDXL base + refiner but don't have the memory for both models at once, don't alternate base -> refiner for every image. Instead, run all the base passes first (base, base, base, base), then run the refiner over the results, so you're not swapping the models in and out with each image.
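A minimal sketch of that pattern with `diffusers` (the model IDs, the 0.8 denoising split, and the prompt batch are just illustrative choices):

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

prompts = ["an astronaut riding a horse"] * 4  # example batch

# Stage 1: run the base model for every prompt, keeping only the latents.
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
latents = [
    base(prompt=p, denoising_end=0.8, output_type="latent").images
    for p in prompts
]
del base
torch.cuda.empty_cache()  # free VRAM before the refiner is loaded

# Stage 2: load the refiner once and refine every latent.
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")
images = [
    refiner(prompt=p, image=lat, denoising_start=0.8).images[0]
    for p, lat in zip(prompts, latents)
]
```

This way each model is loaded exactly once per batch instead of once per image.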
-
This paper, Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models (https://arxiv.org/abs/2312.09608), conducts a thorough empirical study of the UNet encoder features in diffusion models and introduces an encoder propagation scheme to efficiently accelerate diffusion sampling.
-
This fp8 stuff looks like it can seriously reduce the VRAM requirement: AUTOMATIC1111/stable-diffusion-webui#14031 ... without compromising quality. I wonder whether that would work on SVD/SVD-XT too. Currently, these models need at least 12 GB of VRAM to run; otherwise they spill into shared memory, with extreme slowdowns as a result, basically making them unusable locally on consumer computers.
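As a rough illustration of the idea (this is not what the linked PR actually does, just a naive sketch): PyTorch >= 2.1 exposes `torch.float8_e4m3fn`, so one can store `nn.Linear` weights in fp8 and upcast per layer at compute time. `FP8Linear` and `convert_linears_to_fp8` are hypothetical helper names, and the cast here has no per-tensor scaling:

```python
import torch

class FP8Linear(torch.nn.Module):
    """Keeps the weight in fp8 (half the VRAM of fp16) and upcasts at compute time."""

    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        # Naive cast, no per-tensor scaling -- fine as a sketch, lossy in practice.
        self.register_buffer("weight_fp8", linear.weight.detach().to(torch.float8_e4m3fn))
        bias = linear.bias.detach().to(torch.float16) if linear.bias is not None else None
        self.register_buffer("bias_fp16", bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only this layer's fp16 weight copy exists at any given moment.
        w = self.weight_fp8.to(torch.float16)
        return torch.nn.functional.linear(x.to(torch.float16), w, self.bias_fp16)

def convert_linears_to_fp8(module: torch.nn.Module) -> None:
    """Recursively swap every nn.Linear in a model (e.g. a UNet) for FP8Linear."""
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Linear):
            setattr(module, name, FP8Linear(child))
        else:
            convert_linears_to_fp8(child)
```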
-
Honestly, CoreML has this one solved for the most part. I love how the inference latency drops from 50 seconds to 3 seconds. We need a generalised process that can produce a CoreML-style compiled model for CPU, CUDA, ROCm, XPU, and MPS.
-
Related threads:
-
We recently published Accelerating Generative AI Part III: Diffusion, Fast, which shows how to significantly speed up diffusion model inference.

We showed this on an 80GB A100. The techniques presented in the post are largely applicable to relatively modern GPU cards, the 4090 for example. This is because things like `scaled_dot_product_attention()` (SDPA) and `torch.compile()` only tend to show their benefits on relatively modern GPUs. Both of them are crucial for squeezing the maximum performance out of these models.

But how can we obtain similar speedups on more consumer-friendly cards such as the 3060, T4, V100, etc.? What are the challenges, bottlenecks, and limitations in doing so? This thread is for discussing exactly that.
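For reference, a minimal sketch of applying both optimizations to an SDXL pipeline (SDPA is already the default attention backend in PyTorch 2.x / recent `diffusers`, so only `torch.compile()` needs explicit code; the compile flags and prompt are illustrative):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# The UNet dominates the runtime, so compile it; channels_last tends to help
# convolution-heavy models on modern GPUs.
pipe.unet.to(memory_format=torch.channels_last)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# The first call is slow because of compilation; subsequent calls get the speedup.
image = pipe("an astronaut riding a horse", num_inference_steps=30).images[0]
```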
**Important**

First of all, it looks like we're quite bottlenecked by what `torch.compile()` and SDPA have to offer for these more consumer-friendly cards. This is evident from this study: the relative speedup from SDPA and `torch.compile()` tends to diminish as we move to these cards. So, I think we need to target this problem from an architectural point of view.

Here are some points that might be worth trying out. We're trying to optimize along both axes, memory and speed, because memory on more consumer-friendly cards usually doesn't exceed 16 GB.
`diffusers` allows reducing memory usage with "offloading". More details are here. Offloading comes at the cost of increased inference latency because of the device-placement overhead.
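A minimal example of model offloading in `diffusers` (requires `accelerate`; the model ID and prompt are illustrative):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# Move each component (text encoders, UNet, VAE) to the GPU only while it is
# being used; do not call pipe.to("cuda") when using offloading.
pipe.enable_model_cpu_offload()

# Or, for an even smaller VRAM footprint at a much larger latency cost:
# pipe.enable_sequential_cpu_offload()

image = pipe("an astronaut riding a horse").images[0]
```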
**Warning**

Below, I would like to present some alternative approaches that don't use the original SDXL model, because its memory requirements are a limiting factor and it also adds to the inference latency.
Below are some techniques that are hardware-agnostic and target inference latency while also reducing memory in some cases. They should complement what's discussed in Accelerating Generative AI Part III: Diffusion, Fast.