
Expert Parallelism (EP) Support for DeepSeek V2 #12583

Open · wants to merge 14 commits into base: main
Conversation

@cakeng commented Jan 30, 2025

vLLM currently only supports TP when running MoE models.

This PR adds support for Expert Parallelism (EP) to the FusedMoE kernel and the DeepSeek V2 model, which should be extendable to V3 and other MoE models as well.

Use the --expert_parallel_size engine argument to specify the EP size the FusedMoE kernel uses.

This currently does not work with CUDA graphs, but the problem seems to be in the current attention layer (MLA?) implementation rather than in EP itself. I tried CUDA graphs + EP on the previous attention layer implementation (commit 60284b5), and it works fine.
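For illustration, here is a minimal usage sketch of the new engine argument. This is not code from the PR: it assumes the argument is also exposed as an expert_parallel_size keyword on the offline LLM API (mirroring the --expert_parallel_size flag above), and the model name and parallel sizes are illustrative.

```python
# Hedged sketch, not from this PR: assumes the new engine argument is also
# exposed as an `expert_parallel_size` keyword on the offline LLM API,
# mirroring the --expert_parallel_size flag. Model and sizes are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # any DeepSeek V2 MoE checkpoint
    tensor_parallel_size=2,                # TP for the non-MoE layers
    expert_parallel_size=2,                # new: EP size used by the FusedMoE kernel
    trust_remote_code=True,
    enforce_eager=True,                    # sidestep the CUDA graph issue noted above
)

outputs = llm.generate(
    ["Expert parallelism places different experts on different GPUs, so"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```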



👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@comaniac (Collaborator) left a comment


Left some comments, but overall LGTM.

Review comments (since resolved) were left on:

  • vllm/config.py
  • vllm/distributed/parallel_state.py
  • vllm/model_executor/layers/fused_moe/layer.py
  • vllm/model_executor/layers/fused_moe/fused_moe.py
@youkaichao (Member) commented

We need to check the user interface, but I feel right now we can just reuse the TP size for EP?

In the future, when we have DP, the EP size will automatically be DP x TP.
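To make the sizing proposal concrete, here is an illustrative sketch (not code from this PR or from vLLM) of how EP groups could be derived once DP exists, with EP size = DP x TP; the function name and grouping scheme are hypothetical.

```python
# Illustrative only: derive EP group membership under the "EP = DP x TP"
# proposal above. Names and the grouping scheme are hypothetical, not vLLM code.
def build_ep_groups(world_size: int, dp_size: int, tp_size: int) -> list:
    ep_size = dp_size * tp_size  # every rank in one DP x TP block shares an EP group
    assert world_size % ep_size == 0, "world size must be divisible by EP size"
    return [list(range(start, start + ep_size))
            for start in range(0, world_size, ep_size)]

# Example: 8 GPUs with DP=2, TP=2 -> EP groups of size 4.
print(build_ep_groups(world_size=8, dp_size=2, tp_size=2))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```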

@LucasWilkinson (Contributor) commented

"but the problem with CUDA graph seems to be on the current attention layer (MLA?) implementation."

Can you please elaborate on this a bit? MLA + CUDA graphs + TP is working fine on main as far as I am aware.

The mergify bot added the v1 label on Feb 4, 2025.
@cakeng (Author) commented Feb 4, 2025

@youkaichao The current design supports TP within an EP group (sketched below), but we can easily change that to have EP only on the MoE layers. I think we will need more discussion with others on that design decision; the current EP+TP implementation is based on a discussion with @WoosukKwon and @simon-mo.

@LucasWilkinson I just merged in the main branch, and CUDA graphs are now working with EP+TP.
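To illustrate the "TP within an EP group" layout described above: each EP rank owns a slice of the experts, and each local expert's weights are further sharded across the TP ranks inside that group. The sketch below is hypothetical; the names, shapes, and sharding axis are assumptions, not the PR's actual code.

```python
# Hypothetical sketch of "TP within an EP group" expert placement;
# not the PR's implementation. Shapes and names are illustrative.
import numpy as np

def local_expert_shards(num_experts: int, hidden: int, ffn: int,
                        ep_size: int, ep_rank: int, tp_size: int) -> dict:
    # 1) EP: each EP rank owns a contiguous slice of the experts.
    experts_per_rank = num_experts // ep_size
    local_experts = range(ep_rank * experts_per_rank,
                          (ep_rank + 1) * experts_per_rank)

    # 2) TP: each local expert's up-projection [hidden, ffn] is split
    #    column-wise, so every TP rank in the group holds one ffn // tp_size
    #    column block of every local expert (placeholder zeros here).
    ffn_per_tp = ffn // tp_size
    return {e: np.zeros((hidden, ffn_per_tp), dtype=np.float16)
            for e in local_experts}

# Example: 64 experts, EP=4, TP=2 -> 16 experts per EP rank, each stored
# as a [hidden, ffn/2] shard on every TP rank of that EP group.
shards = local_expert_shards(num_experts=64, hidden=1024, ffn=4096,
                             ep_size=4, ep_rank=0, tp_size=2)
print(len(shards), next(iter(shards.values())).shape)  # 16 (1024, 2048)
```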
