Modularize fused experts and integrate PPLX kernels #15956
Conversation
For readers: We're doing this to support the pplx-kernel integration. We can use this structure for DeepEP as well. Right now our fused MoE is implemented as something very very roughly like:
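For concreteness, here is a minimal dense sketch of that shape (illustrative only, not the actual vLLM kernel, which uses fused Triton grouped GEMMs and a combined gate/up projection). The point to notice is that the topk-weight scaling and the final reduction happen at the very end, inside the fused op:

```python
# Illustrative only: a naive dense equivalent, not the actual vLLM kernel.
import torch

def naive_fused_moe(hidden_states, w1, w2, topk_weights, topk_ids):
    # hidden_states: [T, H], w1: [E, I, H], w2: [E, H, I]
    # topk_weights, topk_ids: [T, K]
    T, K = topk_ids.shape
    expert_out = torch.empty(T, K, hidden_states.shape[-1],
                             dtype=hidden_states.dtype,
                             device=hidden_states.device)
    for t in range(T):
        for k in range(K):
            e = topk_ids[t, k]
            h = torch.nn.functional.silu(w1[e] @ hidden_states[t])
            expert_out[t, k] = w2[e] @ h
    # topk-weight scaling and the final reduction happen here, at the end,
    # inside the fused op itself.
    return (expert_out * topk_weights.unsqueeze(-1)).sum(dim=1)
```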
This is a problem as the topk_weight scaling and reduction now need to happen during the combine step.
☝️ This is what @bnellnm and I agreed on. However, the other option we originally considered was:
Right now I am thinking that
nice and clean
…nt for users. (#2) Being able to query some of the setup parameters from the AllToAll class would make client code a bit simpler/safer, e.g. see pplx_dispatch_combine.py from vllm-project/vllm#15956. cc @abcdabcd987, @tlrmchlsmth
One thing I forgot to put in our examples: please call the destructor!
Do all references to an
No. Call the destructor explicitly when you are done. You are right to cache the object; it is supposed to be reused across layers and across runs.
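To make the guidance above concrete, here is a minimal sketch of the intended lifecycle. The wrapper class, constructor arguments, and the destroy() teardown name are assumptions for illustration, not the pplx-kernels API itself:

```python
# Hedged sketch of the lifecycle discussed above. The wrapper class and the
# exact teardown method name are assumptions; consult the pplx-kernels
# examples for the real API.
class PplxDispatchCombine:
    def __init__(self, all_to_all):
        # Cache the AllToAll handle: it is meant to be reused across layers
        # and across runs, not re-created per call.
        self.all_to_all = all_to_all

    def shutdown(self):
        # Tear the handle down explicitly at engine shutdown; do not rely on
        # garbage collection to release its resources.
        self.all_to_all.destroy()  # assumed teardown method name
```

The handle is created once, shared across layers and runs, and torn down explicitly at shutdown rather than left to garbage collection.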
This pull request has merge conflicts that must be resolved before it can be merged.
This PR defines a set of base classes used to make MoE kernels more modular. The goal is to be able to utilize different communication mechanisms with any fused MoE kernel without needing combinatoric implementations.

The fused MoE kernels are broken down into the following components: [Quantize-Dispatch], [Permute-Experts-Unpermute] and [Combine]. Each component will be independent of the others except for [Quantize-Dispatch] and [Combine] (see below). The components can then be mixed and matched so that DP+EP can be supported easily for multiple MoE kernel implementations.

The following main classes are defined:
- FusedMoEQuantizeDispatchCombine - an abstract base class for quantization, dispatching and combining. The dispatch method takes care of any needed quantization, and the combine method applies the topk weights and does the final reduction of the output.
- FusedMoEPermuteExpertsUnpermute - an abstract base class for the main fused MoE operation. One important feature to note is that this class does not apply topk weights or reduce the final output.
- FusedMoEModularKernel - an interface class that combines a FusedMoEQuantizeDispatchCombine and a FusedMoEPermuteExpertsUnpermute to provide the standard fused MoE kernel interface.
- StandardDispatchCombine - a concrete class that can be used for the serial Triton, DeepGemm and CUTLASS MoE implementations.

The DeepGemm and CUTLASS MoE functions have been replaced with the modularized versions. There is also a modularized version of the Triton kernels, but it will not be enabled by default.

The [Quantize-Dispatch] and [Combine] functionality is bundled into a single class, FusedMoEQuantizeDispatchCombine, since both could use collective communication mechanisms that need to be consistent.

cc @ElizaWszola, @varun-sundar-rabindranath
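To summarize the structure, here is a minimal sketch of how the classes described above fit together. The method names and signatures (dispatch, combine, apply, forward) are illustrative assumptions rather than the exact interfaces in this PR:

```python
# Hedged sketch of the modular structure described above; signatures are
# illustrative assumptions, not the actual interfaces from this PR.
from abc import ABC, abstractmethod
import torch


class FusedMoEQuantizeDispatchCombine(ABC):
    """Handles quantization + token dispatch, and the final weighted combine."""

    @abstractmethod
    def dispatch(self, hidden_states: torch.Tensor,
                 topk_ids: torch.Tensor) -> torch.Tensor:
        """Quantize (if needed) and route tokens to their experts."""

    @abstractmethod
    def combine(self, output: torch.Tensor, expert_out: torch.Tensor,
                topk_weights: torch.Tensor) -> None:
        """Apply topk weights and reduce per-expert outputs into `output`."""


class FusedMoEPermuteExpertsUnpermute(ABC):
    """The core fused-experts computation. Note: it does NOT apply topk
    weights or reduce the final output; that is left to the combine step."""

    @abstractmethod
    def apply(self, dispatched: torch.Tensor, w1: torch.Tensor,
              w2: torch.Tensor, topk_ids: torch.Tensor) -> torch.Tensor:
        ...


class FusedMoEModularKernel:
    """Glues a dispatch/combine object and an experts object into the
    standard fused MoE kernel interface."""

    def __init__(self, dispatch_combine: FusedMoEQuantizeDispatchCombine,
                 experts: FusedMoEPermuteExpertsUnpermute):
        self.dispatch_combine = dispatch_combine
        self.experts = experts

    def forward(self, hidden_states: torch.Tensor, w1: torch.Tensor,
                w2: torch.Tensor, topk_weights: torch.Tensor,
                topk_ids: torch.Tensor) -> torch.Tensor:
        dispatched = self.dispatch_combine.dispatch(hidden_states, topk_ids)
        expert_out = self.experts.apply(dispatched, w1, w2, topk_ids)
        output = torch.empty_like(hidden_states)
        self.dispatch_combine.combine(output, expert_out, topk_weights)
        return output
```

With this split, a pplx-kernels or DeepEP based dispatch/combine can be paired with the Triton, DeepGemm or CUTLASS expert implementations without writing a new fused kernel for each combination.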