fix: correctly resume moe lora ckpt by ZhiyuLi-Nvidia · Pull Request #1325 · NVIDIA-NeMo/Automodel

ZhiyuLi-Nvidia · 2026-02-18T18:03:02Z

Problem

Resuming training of MoE LoRA or QLoRA models from a checkpoint causes a loss spike (back to random-init levels) because the LoRA adapter weights are silently not loaded.

Root Cause

PyTorch DCP loading (get_model_state_dict / set_model_state_dict) cannot handle:

EP (Expert Parallelism): MoE expert modules use custom FQNs (e.g. gate_up_linear.weight0) that DCP's FQN resolution cannot traverse. The result is a KeyError on save and a silent skip on load, so the adapter weights stay at their freshly initialized values.
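The silent-skip failure mode can be illustrated without DCP. The following is a framework-free sketch, not the DCP code path: `lenient_load` is a hypothetical loader that copies only the keys it can resolve, the key names are made up to mimic an expert FQN the resolver does not produce, and plain lists stand in for tensors.

```python
def lenient_load(model_state, checkpoint, resolvable_keys):
    """Hypothetical loader: copy only keys the resolver knows about.

    Returns the list of model keys that were left untouched.
    """
    skipped = []
    for name in model_state:
        if name in resolvable_keys and name in checkpoint:
            model_state[name] = list(checkpoint[name])
        else:
            skipped.append(name)  # no error is raised: this is the silent skip
    return skipped


# Freshly initialized adapter weights (illustrative names and values).
model_state = {
    "attn.lora_A.weight": [0.0, 0.0],
    "experts.gate_up_linear.weight0": [0.0, 0.0],  # custom expert FQN
}
checkpoint = {
    "attn.lora_A.weight": [0.5, -0.2],
    "experts.gate_up_linear.weight0": [1.3, 0.7],
}
# Assumption: the resolver cannot traverse the custom expert FQN.
resolvable_keys = {"attn.lora_A.weight"}

skipped = lenient_load(model_state, checkpoint, resolvable_keys)
```

Asserting that `skipped` is empty after a resume is a cheap guard against this class of bug, since the load itself reports nothing.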

Solution

Bypass DCP entirely for PEFT models with EP or quantization. All changes in nemo_automodel/components/checkpoint/stateful_wrappers.py:

Load: _set_peft_state_dict() matches saved tensors by name and redistributes the full tensors back into EP DTensor shards.
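A minimal sketch of the load step described above, assuming experts are split evenly along dimension 0 across EP ranks. It is framework-free: nested lists stand in for tensors, and `shard_for_rank` and `set_peft_state_sketch` are hypothetical helpers, not the functions in stateful_wrappers.py.

```python
def shard_for_rank(full_tensor, rank, world_size):
    """Return the contiguous dim-0 slice owned by `rank` (even split assumed)."""
    if len(full_tensor) % world_size != 0:
        raise ValueError("expert dimension must divide evenly across EP ranks")
    per_rank = len(full_tensor) // world_size
    return full_tensor[rank * per_rank:(rank + 1) * per_rank]


def set_peft_state_sketch(local_state, saved_full, sharded_names, rank, world_size):
    """Match saved tensors by name; reshard those listed in `sharded_names`.

    Raises KeyError instead of silently skipping a missing tensor.
    """
    for name in local_state:
        if name not in saved_full:
            raise KeyError(f"checkpoint is missing {name!r}")
        full = saved_full[name]
        if name in sharded_names:
            local_state[name] = shard_for_rank(full, rank, world_size)
        else:
            local_state[name] = full  # replicated tensor: load as-is


# Four experts, two EP ranks; each row is one expert's (tiny) LoRA weight.
saved_full = {
    "experts.lora_A": [[1.0], [2.0], [3.0], [4.0]],
    "router.lora_A": [[9.0]],
}
local_state = {"experts.lora_A": [[0.0], [0.0]], "router.lora_A": [[0.0]]}
set_peft_state_sketch(
    local_state, saved_full, {"experts.lora_A"}, rank=1, world_size=2
)
```

Failing loudly on a missing name is a deliberate choice in this sketch: it turns the silent skip from the root cause into an immediate error at resume time.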

Test

The loss curve now matches after resuming: https://wandb.ai/nvidia/automodel-dev-zhiyul/workspace?nw=9bbvcvbsr3k


Additional Information

Related to # (issue)

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

copy-pr-bot · 2026-02-18T18:03:06Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ZhiyuLi-Nvidia · 2026-02-18T18:10:13Z

/ok to test 0751476

fix: correctly resume moe lora ckpt

0751476

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia requested review from HuiyingLi, adil-a, akoumpa and hemildesai as code owners February 18, 2026 18:03

copy-pr-bot bot temporarily deployed to nemo-ci February 18, 2026 18:11 Inactive

copy-pr-bot bot temporarily deployed to test February 18, 2026 18:11 Inactive

thomasdhc added the r0.3.0 Add for cherry-pick into release branch r0.3.0 label Feb 18, 2026

copy-pr-bot bot temporarily deployed to nemo-ci February 18, 2026 19:15 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci February 18, 2026 19:26 Failure

copy-pr-bot bot temporarily deployed to nemo-ci February 18, 2026 19:26 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 18, 2026 20:04 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 18, 2026 20:22 Inactive

ZhiyuLi-Nvidia wants to merge 1 commit into main from zhiyul/fix_moe_lora_resume
