[llama4][auxiliary-loss-free load balancing] update expert_bias without backward hooks #1304
Changes:
- Add a `finalize_model_grads_func` attribute in `TrainSpec`.
- Set `finalize_model_grads_func` to `update_router_expert_bias` for MoE models. `finalize_model_grads_func` is called AFTER gradient accumulation steps.
- `enable_tp2ep` is reserved for [llama4] enable expert parallel on the same device mesh as tp (tp2ep) #1269.

Reasons:
- `expert_bias` should not be updated on each microbatch during gradient accumulation.

cc @tianyu-l
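
For context, a minimal sketch of what a post-accumulation `update_router_expert_bias` hook could look like, following the auxiliary-loss-free load-balancing idea (nudge each expert's routing bias against its observed load). The buffer names `tokens_per_expert` and `expert_bias` and the `bias_update_speed` parameter are illustrative assumptions, not necessarily the identifiers used in this PR:

```python
import torch

@torch.no_grad()
def update_router_expert_bias(model: torch.nn.Module, bias_update_speed: float = 1e-3) -> None:
    """Adjust each MoE router's expert_bias once per optimizer step.

    Over-loaded experts get their bias nudged down and under-loaded
    experts get it nudged up, steering future top-k routing toward
    balance without an auxiliary loss term.
    """
    for module in model.modules():
        # Hypothetical duck-typing: treat any module carrying both
        # buffers as an MoE router.
        if hasattr(module, "expert_bias") and hasattr(module, "tokens_per_expert"):
            load = module.tokens_per_expert.float()
            error = load.mean() - load  # positive => expert is under-loaded
            module.expert_bias.add_(bias_update_speed * torch.sign(error))
            module.tokens_per_expert.zero_()  # reset counts for the next step
```

Because the hook runs once per optimizer step (after all microbatch backwards), the counts in `tokens_per_expert` aggregate over the whole gradient-accumulation window, which is the point of moving the update out of backward hooks.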