
Conversation

@hl475 hl475 commented Nov 12, 2025

Purpose

After #24794, encoder-only models (e.g., BERT) fail to initialize because the TRITON_ATTN backend is selected by default, but it doesn't support encoder self-attention, causing:

NotImplementedError: Encoder self-attention and encoder/decoder cross-attention are not implemented for TritonAttentionImpl

This PR implements an opt-in approach for attention-type support:

  1. Added supports_attn_type() method to AttentionBackend:
    - Default behavior: Only supports DECODER attention
    - Backends must explicitly override to support ENCODER_ONLY or other attention types
    - This makes the system safe by default - new backends won't accidentally support encoder-only models
  2. Propagated attn_type through the backend selection pipeline:
    - Added attn_type parameter to get_attn_backend() and validate_configuration()
    - Modified EncoderOnlyAttention to pass attn_type=AttentionType.ENCODER_ONLY
    - Platform classes now validate attention type compatibility during backend selection
  3. Explicitly marked the 3 backends that support encoder-only models:
    - FlexAttention: Supports DECODER + ENCODER_ONLY
    - FlashAttention: Supports DECODER + ENCODER_ONLY
    - CPU/TorchSDPA: Supports all attention types
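
The opt-in default and per-backend overrides described above can be sketched roughly as follows. This is a simplified stand-in for the vLLM classes, assuming AttentionType is still a set of string constants (the enum conversion is a planned follow-up):

```python
# Simplified sketch of the opt-in scheme; the real vLLM classes are richer.

class AttentionType:
    # Assumed string constants, mirroring the names used in this PR.
    DECODER = "decoder"
    ENCODER = "encoder"
    ENCODER_ONLY = "encoder_only"
    ENCODER_DECODER = "encoder_decoder"


class AttentionBackend:
    @classmethod
    def supports_attn_type(cls, attn_type: str) -> bool:
        # Safe default: only decoder self-attention is supported.
        return attn_type == AttentionType.DECODER


class FlashAttentionBackend(AttentionBackend):
    @classmethod
    def supports_attn_type(cls, attn_type: str) -> bool:
        # Explicit opt-in for encoder-only models such as BERT.
        return attn_type in (AttentionType.DECODER, AttentionType.ENCODER_ONLY)


class TritonAttnBackend(AttentionBackend):
    # No override: inherits the decoder-only default, so backend selection
    # skips it for encoder-only models instead of failing at runtime.
    pass
```

With this default, a newly added backend rejects encoder-only models until its author explicitly opts in.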

Test Plan

pytest -s -v tests/models/language/pooling/test_token_classification.py::test_bert_models[float-boltuix/NeuroBERT-NER]

Test Result

1 passed, 4 warnings in 18.16s

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@hl475 hl475 force-pushed the fix_encoder_only_models branch 2 times, most recently from c3c5c39 to 94f2d04 Compare November 12, 2025 09:07
@mergify mergify bot added rocm Related to AMD ROCm tpu Related to Google TPUs labels Nov 12, 2025
@hl475 hl475 changed the title [WIP] fix_encoder_only_models [CI Failure] Fix backend selection for encoder-only models Nov 12, 2025
@hl475 hl475 marked this pull request as ready for review November 12, 2025 09:17
@DarkLight1337
Member

cc @MatthewBonanni @LucasWilkinson

Contributor

@MatthewBonanni MatthewBonanni left a comment


Down the road I'd like to make AttentionType an enum, but this LGTM!

@mergify

mergify bot commented Nov 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @hl475.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 12, 2025
@MatthewBonanni
Contributor

MatthewBonanni commented Nov 12, 2025

Which backends actually do support encoder self-attention? Want to make sure this doesn't just kick over to another backend that doesn't support it and continue failing the tests. Please also make sure to run the previously-failing CI tests if they aren't triggered automatically

Copy link
Collaborator

@LucasWilkinson LucasWilkinson left a comment


Overall looks good; thanks for fixing this! please rebase

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 12, 2025
@hl475 hl475 force-pushed the fix_encoder_only_models branch from 799e129 to b6832ad Compare November 12, 2025 18:12
@hl475
Contributor Author

hl475 commented Nov 12, 2025

rebase

@mergify mergify bot removed the needs-rebase label Nov 12, 2025
@hl475
Contributor Author

hl475 commented Nov 12, 2025

Thanks @LucasWilkinson and @MatthewBonanni for reviewing!

Just rebased my PR. Could you folks please help add the ready label (I don't have permission) so I can run all the previously failing CIs, thanks!

@hl475
Contributor Author

hl475 commented Nov 12, 2025

> Which backends actually do support encoder self-attention? Want to make sure this doesn't just kick over to another backend that doesn't support it and continue failing the tests.

DONE! I will check this and maybe come up with some additional changes!

@hl475 hl475 force-pushed the fix_encoder_only_models branch from b6832ad to 028d538 Compare November 12, 2025 19:12
@mgoin mgoin added ready ONLY add when PR is ready to merge/full CI is needed embedding labels Nov 12, 2025
@mgoin
Member

mgoin commented Nov 12, 2025

I think we can probably ignore my comments for now, but we should consider them in followup. Probably Matt can tackle that if you don't have time

@hl475 hl475 force-pushed the fix_encoder_only_models branch from 028d538 to 939862f Compare November 12, 2025 21:23
@hl475
Contributor Author

hl475 commented Nov 12, 2025

> I think we can probably ignore my comments for now, but we should consider them in followup. Probably Matt can tackle that if you don't have time

Oops, sorry, just saw your comment @mgoin - but I changed the PR based on your comments (thanks)!

I am OK either way. Please let me know, and then I can start running the previously failing CIs!

Member

@mgoin mgoin left a comment


What are the attention backends that support running ENCODER and ENCODER_DECODER? I don't see them mentioned anywhere. cc @russellb @NickLucche

@hl475
Contributor Author

hl475 commented Nov 12, 2025

> What are the attention backends that support running ENCODER and ENCODER_DECODER?

Regarding this, I'm not sure I understand you correctly, but this PR focuses on fixing ENCODER_ONLY model support; I will defer this question to others. From my understanding:

  • AttentionType.ENCODER - CPU supports it; FlashAttention supports it via _forward_encoder_attention; FlexAttention does not support it
  • AttentionType.ENCODER_DECODER - none of the v1 backends currently support this?
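
The compatibility check that backend selection now performs can be pictured as a filter over candidate backends. A hypothetical sketch with stub classes (not vLLM's actual get_attn_backend / validate_configuration code):

```python
# Illustrative filter only; names and logic here are simplified assumptions.

class TritonStub:
    name = "TRITON_ATTN"

    @classmethod
    def supports_attn_type(cls, attn_type: str) -> bool:
        return attn_type == "decoder"  # opt-in default: decoder-only


class TorchSDPAStub:
    name = "TORCH_SDPA"

    @classmethod
    def supports_attn_type(cls, attn_type: str) -> bool:
        return True  # CPU/TorchSDPA opts in to all attention types


def pick_backend(candidates, attn_type: str):
    """Return the first candidate that opted in to attn_type."""
    for backend in candidates:
        if backend.supports_attn_type(attn_type):
            return backend
    raise NotImplementedError(f"no backend supports attn_type={attn_type!r}")


# For an encoder-only model, TRITON_ATTN is skipped rather than selected
# and failing later with NotImplementedError inside its impl.
chosen = pick_backend([TritonStub, TorchSDPAStub], "encoder_only")
```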

@hl475 hl475 force-pushed the fix_encoder_only_models branch 2 times, most recently from 972a6ae to fe7f580 Compare November 12, 2025 23:49
@russellb
Member

> • AttentionType.ENCODER_DECODER - none of the v1 backends currently support this?

flash_attn supports ENCODER_DECODER.

flashinfer would support it with this change: #25098

@hl475 hl475 force-pushed the fix_encoder_only_models branch from fe7f580 to 0c767bd Compare November 13, 2025 01:12
Signed-off-by: Huamin Li <[email protected]>
@hl475 hl475 force-pushed the fix_encoder_only_models branch from 3f5e3f6 to b27f2ca Compare November 13, 2025 08:53
@mgoin mgoin merged commit 07a606a into vllm-project:main Nov 13, 2025
53 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 13, 2025
@hl475
Contributor Author

hl475 commented Nov 13, 2025

In this PR, I manually triggered all 3 CI jobs that were previously failing with NotImplementedError: Encoder self-attention and encoder/decoder cross-attention are not implemented for TritonAttentionImpl:

Language Models Test (Extended Pooling) - https://buildkite.com/vllm/ci/builds/38812/steps/canvas?jid=019a7c6c-315c-418a-8273-e5b946fbac0f
Language Models Test (MTEB) - https://buildkite.com/vllm/ci/builds/38812/steps/canvas?jid=019a7c6c-315f-4f59-8fb6-bbc83f129594
Multi-Modal Models Test (Extended) 1 - https://buildkite.com/vllm/ci/builds/38812/steps/canvas?jid=019a7c6c-3166-424b-a9af-aa70e5fadf08


Labels

embedding nvidia ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm tpu Related to Google TPUs v1
