add support for simplefsdp+ep #1529

Open: wants to merge 1 commit into base: main

@ruisizhang123 (Contributor) commented Aug 5, 2025

As titled, this PR adds support for SimpleFSDP + EP.

In a SimpleFSDP + EP tlparse, you can see that the all-to-all coexists with the replicated shared experts (_grouped_mm), which means we could potentially reorder them so that they overlap.

Profiler Trace & Correctness (eager-mode)

The following results were benchmarked on 8 H100 GPUs.

  1. FSDP + EP(degree=2)
  • Loss: As shown, the losses almost match between FSDP2+EP and SimpleFSDP+EP.

  • Trace: The first stream is FSDP (on the dp_shard_cp dim), the second stream is the all_to_all in token dispatch, and the third stream is FSDP (on the dp_shard_mod_ep dim). The two FSDP communications are probably on different streams because their submesh names differ; ideally they would be on the same stream.

Screenshot 2025-08-06 at 4 57 30 PM Screenshot 2025-08-06 at 4 55 47 PM
  2. FSDP + TP(degree=2)
  • Loss: As shown, the losses almost match between FSDP2+TP and SimpleFSDP+TP.

  • Trace: The first stream is FSDP, the second stream is TP.

Screenshot 2025-08-06 at 5 06 14 PM Screenshot 2025-08-06 at 5 11 04 PM
  3. FSDP + TP(degree=2) + EP(degree=2)
  • Loss: As shown, the losses almost match between FSDP2+TP+EP and SimpleFSDP+TP+EP.

  • Trace: The first stream is FSDP communication (on dp_shard_cp dim), the second stream is TP communication, the third stream is all-to-all for token dispatch, and the fourth stream is FSDP communication (on dp_shard_mod_ep dim).

Screenshot 2025-08-07 at 10 43 15 AM Screenshot 2025-08-06 at 5 11 04 PM
  4. HSDP + TP(degree=2) + EP(degree=2)
  • Loss: As shown, the losses almost match between FSDP2(HSDP)+TP+EP and SimpleFSDP(HSDP)+TP+EP.

  • Trace: The first stream is FSDP communication (on dp_shard_cp dim), the second stream is TP communication, the third stream is all-to-all for token dispatch, the fourth stream is FSDP communication (on dp_shard_mod_ep dim), and the fifth stream is DDP communication.

Screenshot 2025-08-07 at 11 26 19 AM Screenshot 2025-08-07 at 11 31 56 AM
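
One way to see why the traces show two separate FSDP communication streams is to work out the sizes of the two sharding dims. The sketch below is only an illustration and rests on an assumption that this PR does not state: that dp_shard_cp equals dp_shard * cp and that dp_shard_mod_ep equals dp_shard * cp / ep. The helper name `fsdp_mesh_sizes` is hypothetical.

```python
def fsdp_mesh_sizes(world_size, tp=1, cp=1, ep=1, dp_replicate=1):
    """Sizes of the two FSDP sharding dims (ASSUMED formulas, see lead-in)."""
    denom = tp * cp * dp_replicate
    if world_size % denom != 0:
        raise ValueError("world_size must be divisible by tp * cp * dp_replicate")
    dp_shard = world_size // denom
    if (dp_shard * cp) % ep != 0:
        raise ValueError("dp_shard * cp must be divisible by ep")
    return {
        # Assumed: dim used to shard the non-expert parameters.
        "dp_shard_cp": dp_shard * cp,
        # Assumed: dim used to shard the expert parameters inside EP modules.
        "dp_shard_mod_ep": dp_shard * cp // ep,
    }


# Two of the configurations above, on 8 GPUs.
fsdp_ep = fsdp_mesh_sizes(8, ep=2)
fsdp_tp_ep = fsdp_mesh_sizes(8, tp=2, ep=2)
print(fsdp_ep)     # {'dp_shard_cp': 8, 'dp_shard_mod_ep': 4}
print(fsdp_tp_ep)  # {'dp_shard_cp': 4, 'dp_shard_mod_ep': 2}
```

Under that assumption the two dims have different sizes, so their collectives run over different process groups, which would be consistent with seeing them on separate streams.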

What is not working:

  1. AC + EP has three graph breaks:
  • one from the buffer mutation in self.input_splits = num_tokens_per_expert.view(device_mesh.shape[0], -1).sum(dim=1) in expert_parallel.py, as confirmed with @xmfan LINK;
  • one in the _A2A class, which is the temporary fix for the AC leak... LINK
  • one on the input to all_to_all_single_autograd, which is converted using tolist(). Oddly, I found that if the input to all_to_all_single is passed as output_split_sizes.tolist() instead of output_split_sizes, there is no graph break. Looking at the tlparse, output_split_sizes is still treated as a tensor, but when copying data out of output_split_sizes, the Triton code does an additional .item(). LINK
  2. All-gather in the backward --> This also happens with FSDP + TP (without EP). After some discussion with @tianyu-l, there seems to be something wrong with checkpoint in deepseek_v3 in general.
  3. The losses are quite close but do not match perfectly. This is potentially because the checkpoint implementing reshard_after_forward is not working, so the two implementations behave differently.
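
For readers unfamiliar with the split sizes mentioned in the graph-break items, here is a minimal plain-Python sketch of what the input_splits expression computes and how it relates to the output split sizes of an all-to-all. It uses lists instead of tensors, the token counts are made up, and the helper names are hypothetical; it is not the code in expert_parallel.py.

```python
def splits_per_ep_rank(num_tokens_per_expert, ep_degree):
    """Sum per-expert token counts into one count per EP rank.

    Plain-list analogue of `x.view(ep_degree, -1).sum(dim=1)`.
    """
    if len(num_tokens_per_expert) % ep_degree != 0:
        raise ValueError("number of experts must be divisible by ep_degree")
    per_rank = len(num_tokens_per_expert) // ep_degree
    return [
        sum(num_tokens_per_expert[r * per_rank:(r + 1) * per_rank])
        for r in range(ep_degree)
    ]


def output_splits_from_inputs(input_splits_by_rank):
    """Receive sizes: out[r][s] is the number of tokens rank s sends to rank r."""
    n = len(input_splits_by_rank)
    return [[input_splits_by_rank[s][r] for s in range(n)] for r in range(n)]


# Hypothetical routing counts for 4 experts, as seen on each of 2 EP ranks.
rank0_counts = [3, 1, 0, 2]
rank1_counts = [2, 2, 4, 1]
input_splits = [splits_per_ep_rank(c, 2) for c in (rank0_counts, rank1_counts)]
output_splits = output_splits_from_inputs(input_splits)
print(input_splits)   # [[4, 2], [4, 5]]
print(output_splits)  # [[4, 4], [2, 5]]
```

Because these sizes depend on the routing decisions of each step, they are data-dependent values, which is why reading them out with tolist() or .item() interacts with graph capture.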

cc. @anijain2305

@meta-cla bot added the CLA Signed label Aug 5, 2025
@ruisizhang123 marked this pull request as draft August 5, 2025 16:35
@ruisizhang123 force-pushed the ruisi/simplefsdp_ep branch 8 times, most recently from 058cf25 to 407f5f8 August 7, 2025 18:46
@ruisizhang123 changed the title from "[WIP] add support for simplefsdp+ep" to "add support for simplefsdp+ep" Aug 7, 2025
@ruisizhang123 marked this pull request as ready for review August 7, 2025 19:05
@ruisizhang123 force-pushed the ruisi/simplefsdp_ep branch 3 times, most recently from eb5045f to 97bb837 August 14, 2025 22:16