
Support encoder-only models without KV-Cache #21270


Merged

21 commits merged into vllm-project:main on Jul 26, 2025

Conversation

@maxdebayser (Contributor) commented Jul 20, 2025

Add support for encoder-only models such as BERT, which don't use a KV cache because their attention is non-causal. Since the KV cache spec is normally used to build the attention metadata for decoder models, this PR initializes the attention metadata builders for encoder-only models directly from the model's layers and adds a function to build the attention metadata.
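To make the non-causal point concrete, here is a minimal illustrative sketch (not the vLLM implementation) of encoder self-attention over a packed batch with no KV cache. It uses plain PyTorch SDPA in place of the FlashAttention kernels, and the function name and cu_seqlens layout are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def encoder_self_attention(
    q: torch.Tensor,           # [num_tokens, num_heads, head_dim], packed batch
    k: torch.Tensor,           # same shape as q
    v: torch.Tensor,           # same shape as q
    cu_seqlens: torch.Tensor,  # [num_seqs + 1] token offsets of each sequence
) -> torch.Tensor:
    """Non-causal self-attention over packed sequences; nothing is cached."""
    out = torch.empty_like(q)
    for i in range(cu_seqlens.numel() - 1):
        start, end = int(cu_seqlens[i]), int(cu_seqlens[i + 1])
        # Reshape to [1, num_heads, seq_len, head_dim] as expected by SDPA.
        qi = q[start:end].transpose(0, 1).unsqueeze(0)
        ki = k[start:end].transpose(0, 1).unsqueeze(0)
        vi = v[start:end].transpose(0, 1).unsqueeze(0)
        # is_causal=False: every token attends to every other token in its
        # sequence, which is exactly why a KV cache is not applicable here.
        oi = F.scaled_dot_product_attention(qi, ki, vi, is_causal=False)
        out[start:end] = oi.squeeze(0).transpose(0, 1)
    return out
```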

This PR combines elements of PRs #21088 and #19988.

Summary of changes:

Flash Attention Backend:

  • Implement encoder self-attention support without using KV cache

Scheduler:

  • Disable chunked prefill for models without KV cache

GPU Model Runner:

  • Implement encoder-only attention metadata building for self-attention
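To illustrate the GPU Model Runner item above, here is a hedged sketch of building attention metadata for an encoder-only batch purely from the scheduled sequence lengths, with no KV cache spec or block tables involved. The names `EncoderOnlyAttentionMetadata` and `build_encoder_only_attn_metadata` are hypothetical, not the actual vLLM classes.

```python
from dataclasses import dataclass

import torch


@dataclass
class EncoderOnlyAttentionMetadata:
    """Metadata for a packed batch of encoder-only (non-causal) sequences."""
    seq_lens: torch.Tensor    # [num_seqs] length of each sequence
    cu_seqlens: torch.Tensor  # [num_seqs + 1] token offsets into the packed batch
    max_seq_len: int          # longest sequence, used for kernel sizing


def build_encoder_only_attn_metadata(
        seq_lens: list[int]) -> EncoderOnlyAttentionMetadata:
    """Build attention metadata directly from the scheduled sequence lengths;
    no KV cache spec or block tables are involved."""
    seq_lens_t = torch.tensor(seq_lens, dtype=torch.int32)
    cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(seq_lens_t, dim=0)
    return EncoderOnlyAttentionMetadata(
        seq_lens=seq_lens_t,
        cu_seqlens=cu_seqlens,
        max_seq_len=max(seq_lens),
    )
```

For example, `build_encoder_only_attn_metadata([5, 3, 8])` would yield `cu_seqlens == [0, 5, 8, 16]` and `max_seq_len == 8`.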

Related to:

  • V0 deprecation: #18571
  • 2025 Q3 roadmap: #20336

This PR is co-authored with @russellb. It borrows all of the encoder-only attention code from his PR #21088 but leaves out the cross-encoder and encoder attention.

cc: @DarkLight1337


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Jul 20, 2025
@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for encoder-only models without a KV-cache. The changes are well-structured and cover the necessary modifications in the attention backend, scheduler, and model runner. I have identified areas where the implementation's strictness could limit future extensibility. Specifically, the error handling and assertions in GPUModelRunner are too restrictive and should be made more flexible to accommodate potential future model architectures.

@maxdebayser (Contributor, Author) commented Jul 21, 2025

@DarkLight1337 this PR should enable support for all BERT models except for the classifier models that require token type IDs. That can be left to a future PR, as there are several implementation alternatives. Since the KV cache is disabled in this PR, it requires far fewer changes than PR #19988.

@DarkLight1337 (Member) commented Jul 21, 2025

cc @WoosukKwon @LucasWilkinson it would be best for you two to review this to ensure that the refactoring fits your design.

mergify bot commented Jul 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 21, 2025
@mergify mergify bot removed the needs-rebase label Jul 21, 2025
@russellb russellb dismissed their stale review July 22, 2025 01:03

my comments were addressed, but it needs review from others since I'm a co-author:

cc @WoosukKwon @LucasWilkinson it would be best for you two to review this to ensure that the refactoring fits your design.

@russellb russellb added this to the v0.10.0 milestone Jul 22, 2025
@mergify mergify bot added the documentation Improvements or additions to documentation label Jul 23, 2025
@russellb russellb added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 23, 2025
Signed-off-by: Max de Bayser <[email protected]>
To fix the test I switched to the uniproc processor, but now I'm getting weird issues like:

```
torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0x7ef9f68ca600>' raised:
AttributeError: module 'torch._tensor' has no attribute 'split'
```

Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
@WoosukKwon (Collaborator) left a comment


thanks for doing it! Left some comments.

Comment on lines +113 to +117

```python
if len(kv_cache_config.kv_cache_groups) == 0:
    # Encoder models without KV cache don't support
    # chunked prefill. But do SSM models?
    logger.info("Disabling chunked prefill for model without KVCache")
    vllm_config.scheduler_config.chunked_prefill_enabled = False
```
Collaborator:

I think this is quite hacky. Can we check this in a more robust way?

@maxdebayser (Contributor, Author):

I agree. But, AFAIK, it's only after the model is loaded that we truly know if there is a KV cache or not :/

Collaborator:

SSM models are regarded as needing a KV cache in v1, so len(kv_cache_groups) > 0, and SSM models do support chunked prefill. So this branch is fine for SSMs.

@maxdebayser (Contributor, Author):

Thanks for confirming! I'll remove the comment in a follow-up PR.

```diff
@@ -1649,7 +1649,8 @@ def _set_default_args_v1(self, usage_context: UsageContext,

         if (self.max_num_seqs is None
                 and usage_context in default_max_num_seqs):
-            self.max_num_seqs = default_max_num_seqs[usage_context]
+            self.max_num_seqs = min(default_max_num_seqs[usage_context],
+                                    self.max_num_batched_tokens or sys.maxsize)
```
Collaborator:

Why do we need sys.maxsize?

@maxdebayser (Contributor, Author):

It's just because self.max_num_batched_tokens can be unset, in which case the min() falls back to default_max_num_seqs[usage_context]. It's just to avoid writing an if.
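A tiny self-contained illustration of the `or sys.maxsize` idiom (the numbers below are made up for the example, not vLLM defaults):

```python
import sys

default_max_num_seqs = 256     # assumed default for some usage context
max_num_batched_tokens = None  # unset by the user

# None falls back to sys.maxsize, so min() just returns the default.
assert min(default_max_num_seqs, max_num_batched_tokens or sys.maxsize) == 256

# When set, a smaller token budget also caps max_num_seqs.
max_num_batched_tokens = 128
assert min(default_max_num_seqs, max_num_batched_tokens or sys.maxsize) == 128
```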

Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
@maxdebayser (Contributor, Author):

@DarkLight1337, this is approved and all tests are passing.

@DarkLight1337 DarkLight1337 merged commit 1cd6eab into vllm-project:main Jul 26, 2025
70 checks passed
yma11 added a commit to yma11/vllm that referenced this pull request Jul 29, 2025
…lm-project#280)

Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
liuyumoye pushed a commit to liuyumoye/vllm that referenced this pull request Jul 31, 2025
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Labels: documentation, ready, speculative-decoding, v1