add support for simplefsdp+ep #1529
As titled, this PR adds support for SimpleFSDP + EP.
In a SimpleFSDP + EP tlparse, you can see that the token-dispatch all-to-all coexists with the replicated shared experts (`_grouped_mm`), which means we could potentially reorder them to overlap.
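For illustration only, here is a minimal, hedged sketch of that overlap (not the torchtitan/SimpleFSDP implementation; the function name, `input_splits`/`output_splits` as per-rank token counts, `shared_expert`, and `ep_group` are all made-up placeholders):

```python
import torch
import torch.distributed as dist

def dispatch_with_shared_expert_overlap(tokens, input_splits, output_splits,
                                        shared_expert, ep_group):
    """Hypothetical helper: overlap EP token dispatch with the shared experts."""
    # Kick off the token-dispatch all-to-all asynchronously.
    recv = tokens.new_empty((sum(output_splits), tokens.shape[-1]))
    work = dist.all_to_all_single(
        recv, tokens,
        output_split_sizes=output_splits,
        input_split_sizes=input_splits,
        group=ep_group,
        async_op=True,
    )
    # The shared experts operate on the replicated (pre-dispatch) tokens,
    # so they can run while the all-to-all is in flight.
    shared_out = shared_expert(tokens)
    work.wait()
    return recv, shared_out
```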
Profiler Trace & Correctness (eager-mode)
The following results were benchmarked on 8 H100 GPUs.
- Loss: the loss curves almost match between `FSDP2+EP` and `SimpleFSDP+EP`.
- Trace: the first stream is FSDP (on the dp_shard_cp dim), the second stream is all-to-all in token dispatch, and the third stream is FSDP (on the dp_shard_mod_ep dim). They land on different streams probably because their submesh names are different (a short mesh-naming sketch follows the comparisons below), but they should be, or we hope they will eventually be, on the same stream.
- Loss: the loss curves almost match between `FSDP2+TP` and `SimpleFSDP+TP`.
- Trace: the first stream is FSDP, the second stream is TP.
- Loss: the loss curves almost match between `FSDP2+TP+EP` and `SimpleFSDP+TP+EP`.
- Trace: the first stream is FSDP communication (on the dp_shard_cp dim), the second stream is TP communication, the third stream is all-to-all for token dispatch, and the fourth stream is FSDP communication (on the dp_shard_mod_ep dim).
- Loss: the loss curves almost match between `FSDP2(HSDP)+TP+EP` and `SimpleFSDP(HSDP)+TP+EP`.
- Trace: the first stream is FSDP communication (on the dp_shard_cp dim), the second stream is TP communication, the third stream is all-to-all for token dispatch, the fourth stream is FSDP communication (on the dp_shard_mod_ep dim), and the fifth stream is DDP communication.
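As an aside, here is a minimal sketch of how two shard meshes with different dim names can be constructed, mirroring the `dp_shard_cp` and `dp_shard_mod_ep` dims seen in the traces (the sizes and layout are illustrative and not torchtitan's actual mesh construction):

```python
from torch.distributed.device_mesh import init_device_mesh

# 8 GPUs, illustrative layout: dense params shard over all 8 ranks ("dp_shard_cp"),
# while MoE expert params shard over 2 ranks ("dp_shard_mod_ep") with EP degree 4.
dense_mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp_shard_cp",))
moe_mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp_shard_mod_ep", "ep"))
```

Collectives issued against differently named meshes like these are presumably why the profiler shows the two FSDP communications on separate streams.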
What is not working:
- `self.input_splits = num_tokens_per_expert.view(device_mesh.shape[0], -1).sum(dim=1)` in expert_parallel.py, after confirming with @xmfan LINK; the `_A2A` class is used as the temporary fix for the AC leak... LINK
- `all_to_all_single_autograd`'s input, which is converted using `tolist()`. For some magical reason, I found that if the input to all_to_all_single is passed as `output_split_sizes.tolist()` instead of `output_split_sizes`, there is no graph break.... After looking into the tlparse, `output_split_sizes` is still treated as a tensor, but when copying data out of `output_split_sizes`, the Triton code will do an additional `.item()`. LINK (see the sketch after this list)
- `reshard_after_forward` is not working, so the behavior of the two differs. cc @anijain2305
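For context on the second item, here is a minimal sketch of the split-size handling around `all_to_all_single_autograd` (assumed names and argument order, not the actual expert_parallel.py code):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def token_dispatch(tokens, num_tokens_per_expert, ep_group, ep_degree):
    # Per-rank send sizes derived from per-expert token counts (still GPU tensors).
    input_splits = num_tokens_per_expert.view(ep_degree, -1).sum(dim=1)
    output_splits = torch.empty_like(input_splits)
    # Exchange split sizes so every rank knows how many tokens it will receive.
    dist.all_to_all_single(output_splits, input_splits, group=ep_group)

    # The .tolist() calls force a device-to-host sync to get Python ints; this is
    # where a graph break under torch.compile would normally be expected.
    return funcol.all_to_all_single_autograd(
        tokens,
        output_splits.tolist(),  # output split sizes
        input_splits.tolist(),   # input split sizes
        ep_group,
    )
```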