
EMA training for PEFT LoRAs #9998


Description

@bghira (Contributor)

Is your feature request related to a problem? Please describe.

EMAModel in Diffusers is not plumbed for interacting well with PEFT LoRAs, which leaves users to implement their own.

The idea has been thrown around that LoRA does not benefit from EMA, and that research papers had shown this. However, with my curiosity piqued, it took a bit of effort, but I managed to make it work.

Here is a pull request for SimpleTuner where I've updated my EMAModel implementation to behave more like an nn.Module, so that the EMAModel can be passed into more processes without "funny business".
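
Roughly, the idea looks like this (just a sketch, not the actual SimpleTuner code; the class name and update rule below are made up for illustration). Registering the shadow weights as buffers on an nn.Module gives you .to(), .state_dict() and device handling for free:

    import torch


    class EMAModuleWrapper(torch.nn.Module):
        """Illustrative EMA holder: shadow weights live as buffers so the EMA
        copy moves devices and serialises like any other nn.Module."""

        def __init__(self, parameters, decay: float = 0.999):
            super().__init__()
            self.decay = decay
            for i, p in enumerate(parameters):
                # buffers follow .to() moves and show up in state_dict()
                self.register_buffer(f"shadow_{i}", p.detach().clone())

        @torch.no_grad()
        def step(self, parameters):
            for i, p in enumerate(parameters):
                shadow = getattr(self, f"shadow_{i}")
                new_val = p.detach().to(device=shadow.device, dtype=shadow.dtype)
                # shadow = decay * shadow + (1 - decay) * new_val
                shadow.lerp_(new_val, 1.0 - self.decay)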

This spot in the save hooks was hardcoded to take the class name, following the Diffusers convention, but we could take a more dynamic approach, perhaps in a training_utils helper method.
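
For example, a hypothetical helper along these lines (illustrative only, not an existing Diffusers API; the function name is made up) could derive the save_lora_weights() keyword from the model class instead of hardcoding it:

    from diffusers.utils import convert_state_dict_to_diffusers
    from peft.utils import get_peft_model_state_dict


    def peft_lora_layers_kwargs(model) -> dict:
        # e.g. a UNet2DConditionModel maps to "unet_lora_layers", while
        # DiT-style models map to "transformer_lora_layers"
        prefix = "unet" if "unet" in model.__class__.__name__.lower() else "transformer"
        return {
            f"{prefix}_lora_layers": convert_state_dict_to_diffusers(
                get_peft_model_state_dict(model)
            )
        }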

Just a bit further down, at L208 in the save hooks, I did something I'm not really 100% happy with, though users were:

  • For my own trainer's convenience, I save a copy of the EMA model in a simple loadable state_dict format so that I can load it during resume (a sketch of this follows the list below).
  • Additionally, we save a 2nd copy of the EMA in the PEFT LoRA format so that it can be loaded by pipelines.
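
For the first copy, the "simple loadable state_dict format" can be as plain as a torch checkpoint of EMAModel's state_dict(); the helper and file names below are placeholders, just to sketch the shape of it:

    import os

    import torch


    def save_ema_for_resume(ema_model, output_dir: str) -> None:
        ema_dir = os.path.join(output_dir, "ema")
        os.makedirs(ema_dir, exist_ok=True)
        # EMAModel exposes state_dict(), so a plain torch file is enough for resume
        torch.save(ema_model.state_dict(), os.path.join(ema_dir, "ema_model.pt"))


    def load_ema_for_resume(ema_model, resume_dir: str) -> None:
        state = torch.load(
            os.path.join(resume_dir, "ema", "ema_model.pt"), map_location="cpu"
        )
        ema_model.load_state_dict(state)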

The tricky part is the 2nd copy of the EMA model that gets saved in the standard LoRA format:

        if self.args.use_ema:
            # we'll temporarily overwrite the LoRA parameters with the EMA parameters so we can save them.
            logger.info("Saving EMA model to disk.")
            trainable_parameters = [
                p
                for p in self._primary_model().parameters()
                if p.requires_grad
            ]
            self.ema_model.store(trainable_parameters)
            self.ema_model.copy_to(trainable_parameters)
            if self.transformer is not None:
                self.pipeline_class.save_lora_weights(
                    os.path.join(output_dir, "ema"),
                    transformer_lora_layers=convert_state_dict_to_diffusers(
                        get_peft_model_state_dict(self._primary_model())
                    ),
                )
            elif self.unet is not None:
                self.pipeline_class.save_lora_weights(
                    os.path.join(output_dir, "ema"),
                    unet_lora_layers=convert_state_dict_to_diffusers(
                        get_peft_model_state_dict(self._primary_model())
                    ),
                )
            self.ema_model.restore(trainable_parameters)

This could probably be done more nicely with a trainable_parameters() method on the model classes where appropriate.
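
For example, a small mixin along these lines (purely illustrative, not an existing Diffusers API) would let the save hook ask the model for its own trainable parameters:

    class TrainableParametersMixin:
        """Hypothetical mixin, meant to be combined with an nn.Module subclass."""

        def trainable_parameters(self):
            # only the LoRA/adapter weights require grad during PEFT training
            return [p for p in self.parameters() if p.requires_grad]

The save hook above would then shrink to ema_model.store(model.trainable_parameters()) and so on.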

I guess the state-dict-conversion decorations are required for now, but it would be ideal if this could be simplified so that newcomers do not have to dig into and understand so many moving pieces.

For quantised training, we have to quantise the EMA model in the same way the trained model was quantised.
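
As an illustration only, assuming optimum-quanto is the quantisation backend (swap in whichever backend the run actually uses; the function name is made up), that might look roughly like:

    from optimum.quanto import freeze, qint8, quantize


    def quantise_ema_like_base(ema_module) -> None:
        # apply the same weight quantisation to the EMA copy that the trained
        # model received, so swapping EMA weights in at inference is consistent
        quantize(ema_module, weights=qint8)
        freeze(ema_module)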

The validations were a bit of a pain, but I wanted to make it possible to load and unload the EMA weights repeatedly during the run, so that each prompt can be validated against both the checkpoint and the EMA weights. Here is my method for enabling (and, just below, disabling) the EMA model at inference time.
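
A simplified sketch of that pattern (not the exact method referenced above) using EMAModel's store()/copy_to()/restore():

    def enable_ema_for_inference(model, ema_model):
        trainable = [p for p in model.parameters() if p.requires_grad]
        ema_model.store(trainable)    # stash the current trained LoRA weights
        ema_model.copy_to(trainable)  # overwrite them with the EMA shadow weights


    def disable_ema_after_inference(model, ema_model):
        trainable = [p for p in model.parameters() if p.requires_grad]
        ema_model.restore(trainable)  # put the trained LoRA weights back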

However, the effect is really nice; here you see the starting SD 3.5M on the left, the trained LoRA in the centre, and EMA on the right.

[five comparison images omitted: base SD 3.5 Medium (left), trained LoRA (centre), EMA (right)]

These samples are from 60,000 steps of training a rank-128 PEFT LoRA on all of the attention layers of SD 3.5 Medium, using ~120,000 high-quality photos.

While it's not a cure-all for training problems, the EMA model has outperformed the trained checkpoint throughout the entire duration of training.

It would be a good idea to someday consider including EMA for LoRA, along with related improvements for saving/loading EMA weights on adapters, so that users can get better results from the training examples. I don't think the validation changes are strictly needed, but they can be done in a non-intrusive way, more nicely than I have done here.

Activity

bghira (Contributor, Author) commented on Nov 22, 2024

cc @linoytsaban @sayakpaul for your interest perhaps

sayakpaul (Member) commented on Nov 23, 2024

Thanks for the interesting thread.

I think for now we can refer users to SimpleTuner for this. Also, it's perhaps subjective, but I don't necessarily find the EMA results to be better than the ones without.

bghira (Contributor, Author) commented on Nov 23, 2024

Yeah, the centre's outputs are actually entirely incoherent; I don't know why that would be preferred.

double8fun commented on Jul 8, 2025

The idea has been thrown around that LoRA did not benefit from EMA, and research papers had shown this

Hi, thanks for your great work. May I ask if there's a specific paper that mentions this idea? I did some searching but couldn't find one.

bghira (Contributor, Author) commented on Jul 8, 2025

I think the most extensive exploration into EMA is the post-hoc EMA work from Tero Karras et al. at NVIDIA.

double8fun commented on Jul 8, 2025

I think the most extensive exploration into EMA is the post-hoc EMA work from Tero Karras et al. at NVIDIA.

Thanks! I will check it out.
