[EP] add initial support for NVSHMEM-based all-to-all #1569

tianyu-l · 2025-08-14T05:45:26Z

As titled. This PR also refactors how grouped_mm is called, because the NVSHMEM-based all-to-all takes num_tokens_per_expert and prepares the offsets.
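To make "takes num_tokens_per_expert and prepares offsets" concrete, here is a minimal sketch of turning per-expert token counts into start and end offsets over a packed token buffer. It is plain Python for illustration only. The helper name `expert_offsets` is hypothetical and is not taken from this PR, and the assumption that a grouped matmul consumes cumulative end offsets is a common convention, not something confirmed by the diff.

```python
from itertools import accumulate


def expert_offsets(num_tokens_per_expert):
    """Hypothetical helper: map per-expert token counts to (starts, ends).

    Expert e owns rows starts[e]:ends[e] of a buffer in which tokens are
    packed contiguously, expert by expert.
    """
    # Inclusive prefix sum: ends[e] is one past the last row of expert e.
    ends = list(accumulate(num_tokens_per_expert))
    # Each start is the matching end minus that expert's own count.
    starts = [end - count for end, count in zip(ends, num_tokens_per_expert)]
    return starts, ends


counts = [3, 0, 5, 2]  # expert 1 receives no tokens in this example
starts, ends = expert_offsets(counts)
rows_of_expert_2 = list(range(starts[2], ends[2]))
```

An expert with zero tokens simply gets an empty range (its start equals its end), so downstream code does not need a special case for it.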

What works

- when num_local_experts == 1

What doesn't work and needs debugging

- when num_local_experts > 1
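As background on why the two cases differ, the following pure-Python sketch simulates a variable-size all-to-all token dispatch. It is not the PR's code and is not a diagnosis of the bug. It only illustrates, under the assumption that each rank concatenates incoming chunks in source-rank order, that tokens arrive already grouped by expert when there is one local expert per rank, but arrive interleaved when there are several. The function names `dispatch` and `group_by_expert` are hypothetical.

```python
def dispatch(tokens_per_rank, num_local_experts):
    """Simulate all-to-all dispatch.

    tokens_per_rank[src] is a list of (global_expert_id, payload) pairs,
    sorted by expert id. Expert e is assumed to live on rank
    e // num_local_experts.
    """
    ep_size = len(tokens_per_rank)
    received = [[] for _ in range(ep_size)]
    # Chunks are appended in source-rank order, as a receive buffer would be.
    for src in range(ep_size):
        for expert, payload in tokens_per_rank[src]:
            received[expert // num_local_experts].append((expert, payload))
    return received


def group_by_expert(received_on_rank):
    """Stable sort so each local expert's tokens become contiguous."""
    return sorted(received_on_rank, key=lambda pair: pair[0])


# Two ranks, two local experts per rank (four experts in total).
tokens = [
    [(0, "a"), (1, "b"), (2, "c")],  # tokens held by rank 0
    [(0, "d"), (1, "e"), (3, "f")],  # tokens held by rank 1
]
received = dispatch(tokens, num_local_experts=2)
grouped = group_by_expert(received[0])

# Two ranks, one local expert per rank: no regrouping is needed.
single = dispatch([[(0, "a"), (1, "b")], [(0, "d"), (1, "e")]], num_local_experts=1)
```

In the two-expert case, rank 0 receives experts in the order 0, 1, 0, 1, so an extra regrouping step (or offsets that account for the interleaving) is needed before a grouped matmul can treat each expert's rows as one contiguous block.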

Other TODOs

- let multiple MoE layers share the same input / output buffer
- add NVSHMEM-based ExpertTensorParallel support (currently only supports ETP=1)

kwen2501 · 2025-08-14T15:28:01Z

torchtitan/experiments/kernels/moe/dispatch.py

+        # TODO: why do we need this clone?
+        return out.clone()


Can you try removing this clone now that we have added out_buffer.detach()?

tianyu-l (Aug 15, 2025): It still errors out if this clone is removed:

RuntimeError: Output 0 of AllToAllVDev2dBackward is a view and its base or another view of its base has been modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.

xmfan · 2025-08-15T20:00:22Z

torchtitan/distributed/expert_parallel.py

+        self.output_splits = None
+
+    # performing all-to-all dispatch on the input
+    def _token_dispatch(self, mod, inputs, device_mesh):


I think this new implementation removes the need for torch._dynamo.config.capture_scalar_outputs, which avoids having to handle unbacked symints.

[EP] add initial support for NVSHMEM-based all-to-all

b44366d

tianyu-l requested review from kwen2501, sanketpurandare and ngimel August 14, 2025 05:45

tianyu-l requested review from fegin, wwwjn and wconstab as code owners August 14, 2025 05:45

meta-cla bot added the CLA Signed label Aug 14, 2025

tianyu-l requested review from xmfan and danielvegamyhre August 14, 2025 05:54

kwen2501 reviewed Aug 14, 2025


xmfan reviewed Aug 15, 2025


