
Modularize fused experts and integrate PPLX kernels #15956

Open · wants to merge 24 commits into main from modular-fused-experts
Conversation

@bnellnm (Contributor) commented Apr 2, 2025

This PR defines a set of base classes used to make MoE kernels more modular. The goal is to be able to utilize different communication mechanisms with any fused MoE kernel without needing a combinatorial number of implementations.

The fused MoE kernels are broken down into the following components:

[Router] → [Quantize-Dispatch] → [Permute-Experts-Unpermute] → [Combine]                                                                                                       

Each component is independent of the others, except for [Quantize-Dispatch] and [Combine] (see below). The components can then be mixed and matched so that DP+EP can be supported easily for multiple MoE kernel implementations.

The following main classes are defined:

  • FusedMoEQuantizeDispatchCombine - an abstract base class for quantization, dispatching, and combining. The dispatch method takes care of any needed quantization, and the combine method applies the topk weights and does the final reduction of the output.
  • FusedMoEPermuteExpertsUnpermute - an abstract base class for the main fused MoE operation. One important feature to note is that this class does not apply topk weights or reduce the final output.
  • FusedMoEModularKernel - an interface class that combines a FusedMoEQuantizeDispatchCombine and a FusedMoEPermuteExpertsUnpermute to provide the standard fused MoE kernel interface.
  • StandardDispatchCombine - a concrete class that can be used for the serial Triton, DeepGemm, and CUTLASS MoE implementations.

The DeepGemm and CUTLASS MoE implementations have been replaced with the modularized versions. There is also a modularized version of the Triton kernels, but it is not enabled by default.

[Quantize-Dispatch] and [Combine] are bundled into a single class, FusedMoEQuantizeDispatchCombine, since they may use collective communication mechanisms that must be consistent with each other. A sketch of how the pieces fit together follows.
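To make the composition concrete, here is a minimal sketch of how these classes fit together. The method names and signatures are illustrative assumptions, not the exact interfaces in this PR:

```python
# Illustrative sketch only: method names/signatures are hypothetical,
# not the exact interfaces in this PR.
from abc import ABC, abstractmethod

import torch


class FusedMoEQuantizeDispatchCombine(ABC):
    @abstractmethod
    def dispatch(self, a: torch.Tensor,
                 topk_ids: torch.Tensor) -> torch.Tensor:
        """Quantize (if needed) and send tokens to their experts' ranks."""

    @abstractmethod
    def combine(self, output: torch.Tensor, fused_expert_output: torch.Tensor,
                topk_weights: torch.Tensor) -> None:
        """Apply topk weights and do the final reduction into `output`."""


class FusedMoEPermuteExpertsUnpermute(ABC):
    @abstractmethod
    def apply(self, a: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor,
              topk_ids: torch.Tensor) -> torch.Tensor:
        """Run the expert MLPs. Does NOT apply topk weights or reduce."""


class FusedMoEModularKernel:
    """Composes a dispatch/combine object with an experts object to
    present the standard fused MoE kernel interface."""

    def __init__(self, dispatch_combine: FusedMoEQuantizeDispatchCombine,
                 fused_experts: FusedMoEPermuteExpertsUnpermute):
        self.dispatch_combine = dispatch_combine
        self.fused_experts = fused_experts

    def forward(self, a: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor,
                topk_weights: torch.Tensor,
                topk_ids: torch.Tensor) -> torch.Tensor:
        a_q = self.dispatch_combine.dispatch(a, topk_ids)
        fused_out = self.fused_experts.apply(a_q, w1, w2, topk_ids)
        output = torch.empty_like(a)
        self.dispatch_combine.combine(output, fused_out, topk_weights)
        return output
```

Under this structure, supporting a new communication backend only requires a new FusedMoEQuantizeDispatchCombine subclass; the experts implementations are reused unchanged.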

cc @ElizaWszola , @varun-sundar-rabindranath


github-actions bot commented Apr 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@bnellnm force-pushed the modular-fused-experts branch from 96b7d5b to fc3243d on April 3, 2025 19:39
bnellnm added 5 commits April 3, 2025 20:41
@bnellnm marked this pull request as ready for review April 3, 2025 23:20
@tlrmchlsmth (Collaborator) commented:

For readers: We're doing this to support the pplx-kernel integration. We can use this structure for DeepEP as well.

Right now our fused MoE is implemented very roughly as:

[Router] → [Quantize] → [Experts + topk_weight scaling + reduction]

This is a problem as the topk_weight scaling and reduction now need to happen during combine. We need to fit dispatch in there as well.
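Concretely, the topk-weighted reduction that today lives inside the experts kernel has to become the last step of [Combine]. A self-contained sketch with dummy shapes (illustrative only, not code from this PR):

```python
import torch

# Illustrative only: the weighted reduction that [Combine] must now perform,
# after tokens have been gathered back to their home rank.
# Dummy shapes: num_tokens=4, topk=2, hidden=8.
fused_out = torch.randn(4, 2, 8)   # unweighted per-expert outputs
topk_weights = torch.rand(4, 2)    # router weights for each token's experts
out = (fused_out * topk_weights.unsqueeze(-1)).sum(dim=1)  # (4, 8)
```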

> This PR defines a set of base classes used to make MoE kernels more modular. The goal is to be able to utilize different communication mechanisms with any fused MoE kernel without needing a combinatorial number of implementations.
>
> The fused MoE kernels are broken down into the following components:
>
> [Router] → [Quantize-Dispatch] → [Permute-Experts-Unpermute] → [Combine]

☝️ This is what @bnellnm and I agreed on. However, the other option we originally considered was:

[Router] → [Quantize-Dispatch-Permute] → [Experts] → [Unpermute-Combine]

Right now I am thinking that permute/unpermute will (unfortunately) depend on the implementations of both dispatch/combine and experts, so we should consider breaking that out.

@tlrmchlsmth (Collaborator) left a comment:


nice and clean

@bnellnm changed the title from Modular fused experts to Modularize fused experts and integrate pplx kernels Apr 4, 2025
@bnellnm changed the title from Modularize fused experts and integrate pplx kernels to Modularize fused experts and integrate PPLX kernels Apr 4, 2025
abcdabcd987 pushed a commit to ppl-ai/pplx-kernels that referenced this pull request Apr 4, 2025
…nt for users. (#2)

Being able to query some of the setup parameters from the AllToAll class
would make client code a bit simpler/safer, e.g. see
pplx_dispatch_combine.py from
vllm-project/vllm#15956

cc @abcdabcd987 , @tlrmchlsmth

@abcdabcd987

One thing I forgot to put in our examples is -- Please call the destructor! ata.destroy()

@bnellnm (Contributor, Author) commented Apr 4, 2025

> One thing I forgot to put in our examples is -- Please call the destructor! ata.destroy()

Do all references to an AllToAll need to be destroyed? The current plan is to have a cache (that will be in a different PR) manage all the AllToAll instances and the PplxDispatchCombine would hold on to a reference of one of the cached objects.

@abcdabcd987

> Do all references to an AllToAll need to be destroyed?

No. Call destroy() only when you are shutting down the engine (or removing the model from GPU, etc...)

You are right to cache the object. It is supposed to be reused across layers and across runs.
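A minimal sketch of that lifecycle, assuming a hypothetical AllToAllCache (the real cache is planned for a separate PR, and the AllToAll construction arguments are elided):

```python
import threading


class AllToAllCache:
    """Hypothetical cache of pplx-kernels AllToAll objects; the real cache
    is planned for a separate PR, and construction details are elided."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}  # keyed by the parameters that define an AllToAll

    def get_or_create(self, key, factory):
        # Reuse a single AllToAll across layers and across runs;
        # PplxDispatchCombine holds a reference to a cached instance.
        with self._lock:
            if key not in self._cache:
                self._cache[key] = factory()
            return self._cache[key]

    def destroy(self):
        # Call only at engine shutdown (or when removing the model from
        # GPU): each cached AllToAll gets destroy() called exactly once.
        with self._lock:
            for ata in self._cache.values():
                ata.destroy()
            self._cache.clear()
```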


mergify bot commented Apr 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Apr 9, 2025