
Support encoder-only models without KV-Cache #21270


Merged

21 commits merged into vllm-project:main on Jul 26, 2025

Conversation

@maxdebayser (Contributor) commented Jul 20, 2025

Add support for encoder-only models such as BERT, which don't use a KV cache because their attention is non-causal. Since the KV cache spec is normally used to build the attention metadata for decoder models, this PR initializes the attention metadata builders for encoder-only models directly from the model's layers and adds a function to build the attention metadata.
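To make the non-causal point concrete, here is a minimal illustrative sketch (not the vLLM implementation) of encoder self-attention over a packed batch with no KV cache. It uses plain PyTorch SDPA in place of the FlashAttention kernels, and the function name and cu_seqlens layout are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def encoder_self_attention(
    q: torch.Tensor,           # [num_tokens, num_heads, head_dim], packed batch
    k: torch.Tensor,           # same shape as q
    v: torch.Tensor,           # same shape as q
    cu_seqlens: torch.Tensor,  # [num_seqs + 1] token offsets of each sequence
) -> torch.Tensor:
    """Non-causal self-attention over packed sequences; nothing is cached."""
    out = torch.empty_like(q)
    for i in range(cu_seqlens.numel() - 1):
        start, end = int(cu_seqlens[i]), int(cu_seqlens[i + 1])
        # Reshape to [1, num_heads, seq_len, head_dim] as expected by SDPA.
        qi = q[start:end].transpose(0, 1).unsqueeze(0)
        ki = k[start:end].transpose(0, 1).unsqueeze(0)
        vi = v[start:end].transpose(0, 1).unsqueeze(0)
        # is_causal=False: every token attends to every other token in its
        # sequence, which is exactly why a KV cache is not applicable here.
        oi = F.scaled_dot_product_attention(qi, ki, vi, is_causal=False)
        out[start:end] = oi.squeeze(0).transpose(0, 1)
    return out
```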

This PR combines elements of PRs #21088 and #19988.

Summary of changes:

Flash Attention Backend:

  • Implement encoder self-attention support without using KV cache

Scheduler:

  • Disable chunked prefill for models without KV cache

GPU Model Runner:

  • Implement encoder-only attention metadata building for self-attention
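To illustrate the GPU Model Runner item above, here is a hedged sketch of building attention metadata for an encoder-only batch purely from the scheduled sequence lengths, with no KV cache spec or block tables involved. The names `EncoderOnlyAttentionMetadata` and `build_encoder_only_attn_metadata` are hypothetical, not the actual vLLM classes.

```python
from dataclasses import dataclass

import torch


@dataclass
class EncoderOnlyAttentionMetadata:
    """Metadata for a packed batch of encoder-only (non-causal) sequences."""
    seq_lens: torch.Tensor    # [num_seqs] length of each sequence
    cu_seqlens: torch.Tensor  # [num_seqs + 1] token offsets into the packed batch
    max_seq_len: int          # longest sequence, used for kernel sizing


def build_encoder_only_attn_metadata(
        seq_lens: list[int]) -> EncoderOnlyAttentionMetadata:
    """Build attention metadata directly from the scheduled sequence lengths;
    no KV cache spec or block tables are involved."""
    seq_lens_t = torch.tensor(seq_lens, dtype=torch.int32)
    cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(seq_lens_t, dim=0)
    return EncoderOnlyAttentionMetadata(
        seq_lens=seq_lens_t,
        cu_seqlens=cu_seqlens,
        max_seq_len=max(seq_lens),
    )
```

For example, `build_encoder_only_attn_metadata([5, 3, 8])` would yield `cu_seqlens == [0, 5, 8, 16]` and `max_seq_len == 8`.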

Related to:

  • V0 deprecation: #18571
  • 2025 Q3 roadmap: #20336

This PR is co-authored with @russellb. It borrows all of the encoder-only attention code from his PR #21088 but leaves out the cross-encoder and encoder attention.

cc: @DarkLight1337


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Jul 20, 2025
@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for encoder-only models without a KV-cache. The changes are well-structured and cover the necessary modifications in the attention backend, scheduler, and model runner. I have identified areas where the implementation's strictness could limit future extensibility. Specifically, the error handling and assertions in GPUModelRunner are too restrictive and should be made more flexible to accommodate potential future model architectures.

@maxdebayser (Contributor, Author) commented Jul 21, 2025

@DarkLight1337 this PR should enable support for all BERT models except for the classifier models that require token type IDs. That can be left to a future PR, as there are several implementation alternatives. Since the KV cache is disabled in this PR, it requires far fewer changes than PR #19988.

@DarkLight1337 (Member) commented Jul 21, 2025

cc @WoosukKwon @LucasWilkinson it would be best for you two to review this to ensure that the refactoring fits your design.

mergify bot commented Jul 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 21, 2025
@mergify mergify bot removed the needs-rebase label Jul 21, 2025
@russellb russellb dismissed their stale review July 22, 2025 01:03

my comments were addressed, but it needs review from others since I'm a co-author:

cc @WoosukKwon @LucasWilkinson it would be best for you two to review this to ensure that the refactoring fits your design.

@russellb russellb added this to the v0.10.0 milestone Jul 22, 2025
@mergify mergify bot added the documentation Improvements or additions to documentation label Jul 23, 2025
@russellb russellb added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 23, 2025
Signed-off-by: Max de Bayser <[email protected]>
To fix the test I switched to the uniproc processor, but now I'm getting weird issues like:

```
torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0x7ef9f68ca600>' raised:
AttributeError: module 'torch._tensor' has no attribute 'split'
```

Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
@WoosukKwon (Collaborator) left a comment


thanks for doing it! Left some comments.

Comment on lines +113 to +117

```python
if len(kv_cache_config.kv_cache_groups) == 0:
    # Encoder models without KV cache don't support
    # chunked prefill. But do SSM models?
    logger.info("Disabling chunked prefill for model without KVCache")
    vllm_config.scheduler_config.chunked_prefill_enabled = False
```
Collaborator:

I think this is quite hacky. Can we check this in a more robust way?

@maxdebayser (Contributor, Author):

I agree. But, AFAIK, it's only after the model is loaded that we truly know if there is a KV cache or not :/

Collaborator:

SSM models are regarded as needing a KV cache in v1, so len(kv_cache_groups) > 0, and SSM models do support chunked prefill. So this branch is fine for SSMs.

@maxdebayser (Contributor, Author):

Thanks for confirming! I'll remove the comment in a follow-up PR.

```diff
@@ -1649,7 +1649,8 @@ def _set_default_args_v1(self, usage_context: UsageContext,

         if (self.max_num_seqs is None
                 and usage_context in default_max_num_seqs):
-            self.max_num_seqs = default_max_num_seqs[usage_context]
+            self.max_num_seqs = min(default_max_num_seqs[usage_context],
+                                    self.max_num_batched_tokens or sys.maxsize)
```
Collaborator:

Why do we need sys.maxsize?

@maxdebayser (Contributor, Author):

It's just because self.max_num_batched_tokens can be unset, in which case the min() falls back to default_max_num_seqs[usage_context]. It's just to avoid writing an if.
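A tiny self-contained illustration of the `or sys.maxsize` idiom (the numbers below are made up for the example, not vLLM defaults):

```python
import sys

default_max_num_seqs = 256     # assumed default for some usage context
max_num_batched_tokens = None  # unset by the user

# None falls back to sys.maxsize, so min() just returns the default.
assert min(default_max_num_seqs, max_num_batched_tokens or sys.maxsize) == 256

# When set, a smaller token budget also caps max_num_seqs.
max_num_batched_tokens = 128
assert min(default_max_num_seqs, max_num_batched_tokens or sys.maxsize) == 128
```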

Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
@maxdebayser (Contributor, Author):

@DarkLight1337, this is approved and all tests are passing.

@DarkLight1337 DarkLight1337 merged commit 1cd6eab into vllm-project:main Jul 26, 2025
70 checks passed
yma11 added a commit to yma11/vllm that referenced this pull request Jul 29, 2025
…lm-project#280)

Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
liuyumoye pushed a commit to liuyumoye/vllm that referenced this pull request Jul 31, 2025
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Labels: documentation, ready, speculative-decoding, v1