-
What you're asking sounds to me like the …
-
We experimented with merging methods for adapters here: https://huggingface.co/blog/peft_merging. I would suggest experimenting with the functions from peft for now and reporting any performance-related improvements you see (if any). At the moment it is not clear to me whether modifications need to be made to either diffusers or peft for this.
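For reference, a minimal sketch of what I mean by experimenting with the PEFT merging utilities. The adapter paths, adapter names, weights, and the chosen `combination_type` below are placeholders, and the small base model is only for illustration:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Tiny base model purely for illustration.
base = AutoModelForCausalLM.from_pretrained("gpt2")

# Load two previously trained LoRA adapters (paths are hypothetical).
model = PeftModel.from_pretrained(base, "path/to/lora_a", adapter_name="adapter_a")
model.load_adapter("path/to/lora_b", adapter_name="adapter_b")

# Merge the two adapters into a new adapter without touching the base weights.
# combination_type can be e.g. "linear", "cat", "ties", "dare_linear"; see the PEFT docs.
model.add_weighted_adapter(
    adapters=["adapter_a", "adapter_b"],
    weights=[0.7, 0.3],
    adapter_name="merged",
    combination_type="ties",
    density=0.5,  # used by the ties/dare methods
)
model.set_adapter("merged")
```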
-
Currently, using multiple LoRA modules at once can severely impact performance. While fusing them into the base module weights is an option, it is lossy in a way that cannot be undone short of duplicating the entire network into RAM.
I was discussing with @bghira and @sayakpaul the possibility of fusing similar modules, so that with two LoRAs A and B the computation could be

M + (AB)

rather than

M + A + B

every step, where (AB) denotes a single adapter pre-fused from A and B. You would still need to keep a copy of the original LoRA modules, but that is much more viable than copying something like a 20 GiB base model.
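For illustration, a rough numerical sketch of the idea; none of this is diffusers/peft API, and all shapes and names are made up:

```python
import torch

d, r = 64, 4
M = torch.randn(d, d)                            # frozen base weight
A1, B1 = torch.randn(r, d), torch.randn(d, r)    # LoRA "A" factors
A2, B2 = torch.randn(r, d), torch.randn(d, r)    # LoRA "B" factors
x = torch.randn(d)

# Current approach: every forward pass evaluates both adapters separately.
y_slow = M @ x + B1 @ (A1 @ x) + B2 @ (A2 @ x)

# Proposed approach: fuse the adapters with each other (not with M) once.
# Concatenating the factors keeps the result low rank (r + r), so nothing
# is lost; the original A/B factors are kept around to undo the fusion.
A_fused = torch.cat([A1, A2], dim=0)             # (2r, d)
B_fused = torch.cat([B1, B2], dim=1)             # (d, 2r)
y_fast = M @ x + B_fused @ (A_fused @ x)

torch.testing.assert_close(y_slow, y_fast)
```

The point is that the per-step cost becomes one fused module instead of N separate ones, while the base weights stay untouched and the original low-rank factors remain available for unfusing.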