
fix: correctly resume moe lora ckpt #1325

Open
ZhiyuLi-Nvidia wants to merge 1 commit into main from zhiyul/fix_moe_lora_resume


@ZhiyuLi-Nvidia ZhiyuLi-Nvidia commented Feb 18, 2026

Problem

Resuming training of an MoE LoRA or QLoRA model from a checkpoint produces a loss spike back to random-init levels, because the LoRA adapter weights are silently not loaded.

Root Cause

PyTorch DCP (get_model_state_dict / set_model_state_dict) cannot handle:

  • EP (Expert Parallelism): MoE expert modules use custom FQNs (e.g. gate_up_linear.weight0) that DCP's FQN resolution cannot traverse. The result is a KeyError on save and a silent skip on load, so the affected weights keep their freshly initialized values.
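
To make the "silent skip on load" failure mode concrete, here is a toy sketch in plain Python. It is not DCP code: `load_lenient`, `find_unloaded`, and the parameter names are hypothetical, and plain floats stand in for tensors. It only shows how a loader that skips unmatched names leaves a weight at its initial value, and how a simple key comparison would surface the problem.

```python
def load_lenient(model_params, saved):
    """Copy values for names present on both sides; unmatched names are skipped silently."""
    for name, value in saved.items():
        if name in model_params:
            model_params[name] = value
    return model_params


def find_unloaded(model_params, saved):
    """Return the model parameter names that the checkpoint did not cover."""
    return sorted(set(model_params) - set(saved))


# Hypothetical names: the model exposes a per-expert key that the
# checkpoint stored under a different name, so the two never match.
model = {"experts.gate_up_linear.weight0.lora_A": 0.0, "attn.q_proj.lora_A": 0.0}
saved = {"experts.gate_up_linear.lora_A": 1.5, "attn.q_proj.lora_A": 2.5}

load_lenient(model, saved)
missing = find_unloaded(model, saved)
print(missing)
```

After the load, the attention adapter holds its saved value while the expert adapter still holds its initial 0.0, and no error was raised. A check like `find_unloaded` after resume is one way to turn that silent skip into a visible failure.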

Solution

Bypass DCP entirely for PEFT models with EP or quantization. All changes are in nemo_automodel/components/checkpoint/stateful_wrappers.py:

  • Load: _set_peft_state_dict() — matches saved tensors by name, redistributes full tensors back into EP DTensor shards.
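
The load path described above can be sketched in plain Python. This is an illustration only and not the `_set_peft_state_dict()` implementation: the helper names are hypothetical, nested lists stand in for tensors, and it assumes experts are split along the first dimension into equal contiguous blocks per EP rank, which the PR text does not specify.

```python
def shard_rows(full, rank, world_size):
    """Return the contiguous block of rows (experts) owned by `rank`."""
    if len(full) % world_size != 0:
        raise ValueError("row count must divide evenly across ranks")
    per_rank = len(full) // world_size
    start = rank * per_rank
    return full[start:start + per_rank]


def set_state_by_name(local_params, saved_full, rank, world_size):
    """Fill local shards from full saved tensors, matching by exact name."""
    missing = sorted(set(local_params) - set(saved_full))
    if missing:
        # Fail loudly instead of leaving weights at their initial values.
        raise KeyError(f"checkpoint lacks: {missing}")
    for name in local_params:
        local_params[name] = shard_rows(saved_full[name], rank, world_size)
    return local_params


# Hypothetical checkpoint: one full tensor holding 4 experts, 2 values each.
saved_full = {"experts.lora_A": [[1, 1], [2, 2], [3, 3], [4, 4]]}
local = {"experts.lora_A": None}

set_state_by_name(local, saved_full, rank=1, world_size=2)
print(local["experts.lora_A"])
```

With two EP ranks, rank 1 receives the last two experts. Raising on a missing name is a design choice of this sketch: it trades leniency for a guarantee that a resume cannot quietly fall back to initial adapter weights.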

Test

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

copy-pr-bot bot commented Feb 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ZhiyuLi-Nvidia
Contributor Author

/ok to test 0751476

@thomasdhc thomasdhc added the r0.3.0 Add for cherry-pick into release branch r0.3.0 label Feb 18, 2026