
Conversation

@xyang16 (Contributor) commented Jul 17, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

This PR adds support for running EAGLE speculative decoding on DeepSeek models. It changes the following files:

  • deepseek_eagle.py: DeepSeek EAGLE draft model definition
  • registry.py: add the model to the registry (a hypothetical sketch of such an entry is shown below)
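
For illustration only, a registry entry of this kind typically maps an architecture string to the module and class implementing the draft model; the architecture name and class name below are assumptions, not necessarily the exact identifiers this PR registers:

# Hypothetical sketch of the registry.py addition (names are assumptions).
# vLLM's model registry maps an architecture string to (module_name, class_name),
# so the draft model resolves to deepseek_eagle.py when the config requests it.
_SPECULATIVE_DECODING_MODELS = {
    # ... existing entries, e.g. the Llama EAGLE draft model ...
    "EagleDeepseekForCausalLM": ("deepseek_eagle", "EagleDeepseekForCausalLM"),
}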

Test Plan

Added test in test_spec_decode.py:

pytest -s -v tests/v1/e2e/test_spec_decode.py

vllm serve command:

export VLLM_USE_V1=1
export VLLM_MLA_DISABLE=1
vllm serve deepseek-ai/DeepSeek-R1 \
    --port 8080 \
    --tensor-parallel-size 8 \
    --max-model-len 8192 \
    --max-num-seqs 8 \
    --trust-remote-code \
    --speculative-config '{"method": "eagle", "model":"eagle618/eagle-deepseek-r1", "num_speculative_tokens": 3}'
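
Once the server is up, it can be queried through the standard OpenAI-compatible API that vllm serve exposes; a minimal client sketch (the prompt and sampling parameters are arbitrary, not part of the PR):

# Minimal client sketch against the server started above (port 8080).
# Assumes the `openai` Python package; prompt and parameters are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    prompt="Explain speculative decoding in one paragraph.",
    max_tokens=128,
    temperature=0.0,
)
print(response.choices[0].text)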

Test Result

The following unit tests passed:

tests/v1/e2e/test_spec_decode.py::test_eagle_correctness[FLASH_ATTN_VLLM_V1-deepseek_eagle]
tests/v1/e2e/test_spec_decode.py::test_eagle_correctness[TREE_ATTN-deepseek_eagle]

Serving deepseek-ai/DeepSeek-R1 and benchmarking with llmperf:

INFO 07-16 18:05:38 [metrics.py:87] SpecDecoding metrics: Draft acceptance rate: 58.4%, Mean acceptance length: 2.75, Accepted: 163 tokens, Drafted: 279 tokens, Per-position acceptance rate: 0.882, 0.559, 0.312
INFO 07-16 18:08:09 [metrics.py:87] SpecDecoding metrics: Draft acceptance rate: 63.3%, Mean acceptance length: 2.90, Accepted: 167 tokens, Drafted: 264 tokens, Per-position acceptance rate: 0.886, 0.670, 0.341
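
As a quick sanity check on these logs, the mean acceptance length is consistent with one bonus token plus the sum of the per-position acceptance rates:

# Consistency check on the logged SpecDecoding metrics:
# mean acceptance length ~= 1 (bonus token) + sum of per-position acceptance rates.
for rates in ([0.882, 0.559, 0.312], [0.886, 0.670, 0.341]):
    print(round(1 + sum(rates), 2))  # -> 2.75 and 2.9, matching the logs above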

(Optional) Documentation Update

@mergify bot added the deepseek (Related to DeepSeek models), new-model (Requests to new models), and speculative-decoding labels on Jul 17, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for Eagle speculative decoding with Deepseek models. I've found a few critical issues in the implementation that will prevent it from working correctly. The model implementation in deepseek_eagle.py incorrectly handles hidden state dimensions and is missing the lm_head layer, which will cause runtime errors. Additionally, the model registry key in registry.py seems to be incorrect, which would prevent the model from being loaded.

Comment on lines +62 to +73
self.fc = nn.Linear(
    self.config.model.hidden_size * 2,
    self.config.model.hidden_size,
    bias=False,
)

self.enorm = RMSNorm(self.config.hidden_size,
                     eps=self.config.rms_norm_eps)
self.hnorm = RMSNorm(self.config.hidden_size,
                     eps=self.config.rms_norm_eps)
self.norm = RMSNorm(self.config.hidden_size,
                    eps=self.config.rms_norm_eps)

critical

The implementation of DeepseekV2Model assumes that the draft model and the target model share the same hidden size. For instance, self.hnorm is initialized with the draft model's hidden size (self.config.hidden_size) but is applied to hidden_states from the target model.

This assumption is incorrect for the models used in testing (deepseek-r1 has a hidden size of 4096, while eagle-deepseek-r1 has 1024), and will lead to a runtime error due to shape mismatch.

To fix this, you should explicitly use the hidden sizes from both the draft and target model configurations. You can access the target model's configuration via vllm_config.model_config.

        target_config = vllm_config.model_config.hf_config
        draft_hidden_size = self.config.hidden_size
        target_hidden_size = target_config.hidden_size

        self.fc = nn.Linear(
            draft_hidden_size + target_hidden_size,
            draft_hidden_size,
            bias=False,
        )

        self.enorm = RMSNorm(draft_hidden_size,
                             eps=self.config.rms_norm_eps)
        self.hnorm = RMSNorm(target_hidden_size,
                             eps=target_config.rms_norm_eps)
        self.norm = RMSNorm(draft_hidden_size,
                            eps=self.config.rms_norm_eps)


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@Ja1Zhou commented Jul 17, 2025

Hi! I tried installing this pr from source. But got

OSError: eagle618/eagle-deepseek-r1 does not appear to have a file named configuration_deepseek.py. Checkout 'https://huggingface.co/eagle618/eagle-deepseek-r1/tree/main' for available files.

Should the auto_map field of config.json be fixed?

@xyang16 (Contributor, Author) commented Jul 17, 2025

Hi! I tried installing this pr from source. But got

OSError: eagle618/eagle-deepseek-r1 does not appear to have a file named configuration_deepseek.py. Checkout 'https://huggingface.co/eagle618/eagle-deepseek-r1/tree/main' for available files.

Should the auto_map field of config.json be fixed?

Thanks for your comment, fixed now.

@xyang16 changed the title from [v1] Support deepseek with eagle to [Model] Support deepseek with eagle on Jul 17, 2025
@Ja1Zhou commented Jul 17, 2025

Hi! I tried installing this pr from source. But got

OSError: eagle618/eagle-deepseek-r1 does not appear to have a file named configuration_deepseek.py. Checkout 'https://huggingface.co/eagle618/eagle-deepseek-r1/tree/main' for available files.

Should the auto_map field of config.json be fixed?

Thanks for your comment, fixed now.

Amazing work!

I wonder if you could share how you got eagle618/eagle-deepseek-r1? This PR could also improve DS V3, etc. Thank you!

@xyang16 force-pushed the eagle branch 4 times, most recently from 6981998 to c4cda03, on July 18, 2025 02:05
@aarnphm (Collaborator) left a comment

Do you have an EAGLE checkpoint to test this with? If you have some numbers, that would be great.

Llama 4 EAGLE landed recently, so I do think we can probably do the same with regard to tests for this (if you need some references for test cases).

@xyang16 (Contributor, Author) commented Jul 29, 2025

Do you have an EAGLE checkpoint to test this with? If you have some numbers, that would be great.

Llama 4 EAGLE landed recently, so I do think we can probably do the same with regard to tests for this (if you need some references for test cases).

Thanks for your review. I have tested with the checkpoint eagle618/eagle-deepseek-r1.

Also added a unit test case.

stacked_params_mapping = [
    # (param_name, shard_name, shard_id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
Collaborator:

Does this need to be made compatible with the fused_qkv_a_proj optimization from #21116? I have observed multiple issues with weight loading in MTP not being consistent with the DeepSeek base model weight loading. Will similar issues apply here?

Contributor Author (@xyang16):

I have updated the stacked_params_mapping. Thanks!
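
For context on what such a mapping does, a hedged sketch follows; the fused_qkv_a_proj shard names are assumptions based on the base DeepSeek model loader from #21116, not necessarily this PR's final list:

# Hedged sketch: each tuple is (fused param name, checkpoint shard name, shard index).
# The weight loader uses this to fold separate checkpoint tensors into one fused parameter.
stacked_params_mapping = [
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
    ("fused_qkv_a_proj", "q_a_proj", 0),                # assumption, mirrors base model
    ("fused_qkv_a_proj", "kv_a_proj_with_mqa", 1),      # assumption, mirrors base model
]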

hidden_states = self.fc(inputs)

# masking inputs at position=0
hidden_states[positions == 0] = 0
Collaborator:

There have been many discussions in the community about how to properly handle the rotated input slot, but this does not seem in line with the final state. If I recall correctly, there was concern that overwriting the hidden states to zero will give out-of-distribution results during attention. See the other EAGLE implementations in vLLM (such as llama_eagle.py) for reference.

Contributor Author (@xyang16), Aug 14, 2025:

Yes, there has been discussion that this will mess up the attention normalization. I have removed it. Please review. Thanks.

Contributor Author (@xyang16):

I see deepseek_mtp.py also masks the hidden states to 0: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/deepseek_mtp.py#L74

I can remove that line in deepseek_mtp.py in this PR as well, or I can leave it there. Let me know what you think.

Collaborator:

Perhaps it is best to leave it pending a complete study of the impact on AL (acceptance length) for MTP. If there isn't a GitHub issue for this task, please create one.


inputs = torch.cat(
    [self.enorm(input_embeds),
     self.hnorm(hidden_states)], dim=-1)
Collaborator:

This looks like too many norms being applied. In the Llama_Eagle reference code, the input layernorm to each layer is disabled, and IIRC there is no output layernorm. Here, there are two norms applied to the input (pre-concat and input-layernorm after concat) and two more norms applied after (post_attention_layernorm and self.norm). This does not seem correct.

Contributor Author (@xyang16), Aug 14, 2025:

I have taken a look at deepseek_mtp.py at https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/deepseek_mtp.py#L64. The only difference is the output self.norm. In our benchmarking, we found that including the output norm increases the acceptance rate.
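
For readers following the norm discussion, here is a minimal, hypothetical sketch of the combine-and-norm structure being described. It omits attention/MLA details and weight loading, uses torch.nn.RMSNorm (PyTorch >= 2.4) in place of vLLM's RMSNorm, and the class and layer names may differ from the PR:

import torch
import torch.nn as nn

class EagleCombiner(nn.Module):
    """Simplified sketch: fuse draft token embeddings with target hidden states."""

    def __init__(self, draft_hidden: int, target_hidden: int, eps: float = 1e-6):
        super().__init__()
        self.enorm = nn.RMSNorm(draft_hidden, eps=eps)    # norm on token embeddings
        self.hnorm = nn.RMSNorm(target_hidden, eps=eps)   # norm on target hidden states
        self.fc = nn.Linear(draft_hidden + target_hidden, draft_hidden, bias=False)
        self.norm = nn.RMSNorm(draft_hidden, eps=eps)     # output norm discussed above

    def forward(self, input_embeds: torch.Tensor,
                target_hidden_states: torch.Tensor) -> torch.Tensor:
        fused = torch.cat(
            [self.enorm(input_embeds), self.hnorm(target_hidden_states)], dim=-1)
        hidden = self.fc(fused)
        # ... the draft model's decoder layer(s) would run here ...
        return self.norm(hidden)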

@benchislett (Collaborator) commented:

Acceptance rate is around 60% for vanilla eagle head (and better acceptance rate for fine tuned eagle head).

What does this mean? What are the weights of a "vanilla eagle head" in this case?

@xyang16 force-pushed the eagle branch 2 times, most recently from 582dc36 to e26554a, on August 14, 2025 23:37
@xyang16 (Contributor, Author) commented Aug 14, 2025

Acceptance rate is around 60% for vanilla eagle head (and better acceptance rate for fine tuned eagle head).

What does this mean? What are the weights of a "vanilla eagle head" in this case?

@benchislett Thanks for your review! I mean that people can further fine-tune the weights and get a better acceptance rate. The "vanilla eagle head" in this case is eagle618/eagle-deepseek-r1.

@xyang16 force-pushed the eagle branch 2 times, most recently from 498955f to 7d3a40f, on August 15, 2025 03:30
@benchislett (Collaborator) left a comment

The implementation now seems more in line with the MTP implementation. There are still differences between how we handle EAGLE and MTP models (for example, whether norms are applied via input_layernorm, or whether the output is normed), and this PR blends the two by implementing an EAGLE class in a manner more consistent with MTP.

We should try to find some way to unify implementations and reconcile the differences, but this is probably not the PR to bear that burden. For now, this will suffice and can be extended if future eagle-style MTP modules are released with slight differences in implementation.

@xyang16 requested a review from aarnphm on August 18, 2025 22:06
@simon-mo (Collaborator) left a comment

Stamping given Benjamin approved.

@simon-mo enabled auto-merge (squash) on August 19, 2025 22:06
@github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Aug 19, 2025
Signed-off-by: Xin Yang <[email protected]>
auto-merge was automatically disabled August 20, 2025 05:27

Head branch was pushed to by a user without write access

@DarkLight1337 merged commit 83e69a0 into vllm-project:main on Aug 20, 2025
42 checks passed
@xyang16 deleted the eagle branch on August 20, 2025 11:04
@gyou2021 commented Sep 1, 2025

Why export VLLM_MLA_DISABLE=1? Is it OK to run DeepSeek-R1 inference with VLLM_MLA_DISABLE=0 and deepseek_eagle with VLLM_MLA_DISABLE=0?
