
Conversation

@hl475 hl475 commented Nov 12, 2025

Purpose

After #24794, encoder-only models (e.g., BERT) fail to initialize because the TRITON_ATTN backend is selected by default, but it doesn't support encoder self-attention, causing:

NotImplementedError: Encoder self-attention and encoder/decoder cross-attention are not implemented for TritonAttentionImpl

This PR implements an opt-in approach for attention-type support:

  1. Added supports_attn_type() method to AttentionBackend:
    - Default behavior: Only supports DECODER attention
    - Backends must explicitly override to support ENCODER_ONLY or other attention types
    - This makes the system safe by default - new backends won't accidentally support encoder-only models
  2. Propagated attn_type through the backend selection pipeline:
    - Added attn_type parameter to get_attn_backend() and validate_configuration()
    - Modified EncoderOnlyAttention to pass attn_type=AttentionType.ENCODER_ONLY
    - Platform classes now validate attention type compatibility during backend selection
  3. Explicitly marked the 3 backends that support encoder-only models:
    - FlexAttention: Supports DECODER + ENCODER_ONLY
    - FlashAttention: Supports DECODER + ENCODER_ONLY
    - CPU/TorchSDPA: Supports all attention types
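
The opt-in default and per-backend overrides described above can be sketched roughly as follows. This is a simplified stand-in for the vLLM classes, assuming AttentionType is still a set of string constants (the enum conversion is a planned follow-up):

```python
# Simplified sketch of the opt-in scheme; the real vLLM classes are richer.

class AttentionType:
    # Assumed string constants, mirroring the names used in this PR.
    DECODER = "decoder"
    ENCODER = "encoder"
    ENCODER_ONLY = "encoder_only"
    ENCODER_DECODER = "encoder_decoder"


class AttentionBackend:
    @classmethod
    def supports_attn_type(cls, attn_type: str) -> bool:
        # Safe default: only decoder self-attention is supported.
        return attn_type == AttentionType.DECODER


class FlashAttentionBackend(AttentionBackend):
    @classmethod
    def supports_attn_type(cls, attn_type: str) -> bool:
        # Explicit opt-in for encoder-only models such as BERT.
        return attn_type in (AttentionType.DECODER, AttentionType.ENCODER_ONLY)


class TritonAttnBackend(AttentionBackend):
    # No override: inherits the decoder-only default, so backend selection
    # skips it for encoder-only models instead of failing at runtime.
    pass
```

With this default, a newly added backend rejects encoder-only models until its author explicitly opts in.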

Test Plan

pytest -s -v tests/models/language/pooling/test_token_classification.py::test_bert_models[float-boltuix/NeuroBERT-NER]

Test Result

1 passed, 4 warnings in 18.16s

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@hl475 hl475 force-pushed the fix_encoder_only_models branch 2 times, most recently from c3c5c39 to 94f2d04 Compare November 12, 2025 09:07
@mergify mergify bot added rocm Related to AMD ROCm tpu Related to Google TPUs labels Nov 12, 2025
@hl475 hl475 changed the title [WIP] fix_encoder_only_models [CI Failure] Fix backend selection for encoder-only models Nov 12, 2025
@hl475 hl475 marked this pull request as ready for review November 12, 2025 09:17
@DarkLight1337
Member

cc @MatthewBonanni @LucasWilkinson

Contributor

@MatthewBonanni MatthewBonanni left a comment


Down the road I'd like to make AttentionType an enum, but this LGTM!

@mergify

mergify bot commented Nov 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @hl475.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 12, 2025
@MatthewBonanni
Contributor

MatthewBonanni commented Nov 12, 2025

Which backends actually do support encoder self-attention? Want to make sure this doesn't just kick over to another backend that doesn't support it and continue failing the tests. Please also make sure to run the previously-failing CI tests if they aren't triggered automatically

Copy link
Collaborator

@LucasWilkinson LucasWilkinson left a comment


Overall looks good; thanks for fixing this! please rebase

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 12, 2025
@hl475 hl475 force-pushed the fix_encoder_only_models branch from 799e129 to b6832ad Compare November 12, 2025 18:12
@hl475
Contributor Author

hl475 commented Nov 12, 2025

rebase

@mergify mergify bot removed the needs-rebase label Nov 12, 2025
@hl475
Contributor Author

hl475 commented Nov 12, 2025

Thanks @LucasWilkinson and @MatthewBonanni for reviewing!

Just rebased my PR. Could you folks please help add the ready label (I don't have permission) so I can run all the previously failing CIs, thanks!

@hl475
Contributor Author

hl475 commented Nov 12, 2025

> Which backends actually do support encoder self-attention? Want to make sure this doesn't just kick over to another backend that doesn't support it and continue failing the tests.

DONE! I will check this and maybe come up with some additional changes!

@hl475 hl475 force-pushed the fix_encoder_only_models branch from b6832ad to 028d538 Compare November 12, 2025 19:12
@mgoin mgoin added ready ONLY add when PR is ready to merge/full CI is needed embedding labels Nov 12, 2025
@mgoin
Member

mgoin commented Nov 12, 2025

I think we can probably ignore my comments for now, but we should consider them in followup. Probably Matt can tackle that if you don't have time

@hl475 hl475 force-pushed the fix_encoder_only_models branch from 028d538 to 939862f Compare November 12, 2025 21:23
@hl475
Contributor Author

hl475 commented Nov 12, 2025

> I think we can probably ignore my comments for now, but we should consider them in followup. Probably Matt can tackle that if you don't have time

Oops, sorry, just saw your comment @mgoin - but I changed the PR based on your comments (thanks)!

I am OK either way. Please let me know, and then I can start running the previously failing CIs!

Member

@mgoin mgoin left a comment


What are the attention backends that support running ENCODER and ENCODER_DECODER? I don't see them mentioned anywhere. cc @russellb @NickLucche

@hl475
Contributor Author

hl475 commented Nov 12, 2025

> What are the attention backends that support running ENCODER and ENCODER_DECODER?

Regarding this, I'm not sure I understand you correctly, but this PR focuses on fixing ENCODER_ONLY model support; I will defer this question to others. From my understanding:

  • AttentionType.ENCODER - CPU supports it; FlashAttention supports it via _forward_encoder_attention; FlexAttention does not support it
  • AttentionType.ENCODER_DECODER - none of the v1 backends currently support this?
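
The compatibility check that backend selection now performs can be pictured as a filter over candidate backends. A hypothetical sketch with stub classes (not vLLM's actual get_attn_backend / validate_configuration code):

```python
# Illustrative filter only; names and logic here are simplified assumptions.

class TritonStub:
    name = "TRITON_ATTN"

    @classmethod
    def supports_attn_type(cls, attn_type: str) -> bool:
        return attn_type == "decoder"  # opt-in default: decoder-only


class TorchSDPAStub:
    name = "TORCH_SDPA"

    @classmethod
    def supports_attn_type(cls, attn_type: str) -> bool:
        return True  # CPU/TorchSDPA opts in to all attention types


def pick_backend(candidates, attn_type: str):
    """Return the first candidate that opted in to attn_type."""
    for backend in candidates:
        if backend.supports_attn_type(attn_type):
            return backend
    raise NotImplementedError(f"no backend supports attn_type={attn_type!r}")


# For an encoder-only model, TRITON_ATTN is skipped rather than selected
# and failing later with NotImplementedError inside its impl.
chosen = pick_backend([TritonStub, TorchSDPAStub], "encoder_only")
```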

@hl475 hl475 force-pushed the fix_encoder_only_models branch 2 times, most recently from 972a6ae to fe7f580 Compare November 12, 2025 23:49
@russellb
Member

> • AttentionType.ENCODER_DECODER - none of the v1 backends currently support this?

flash_attn supports ENCODER_DECODER.

flashinfer would support it with this change: #25098

@hl475 hl475 force-pushed the fix_encoder_only_models branch from fe7f580 to 0c767bd Compare November 13, 2025 01:12
Signed-off-by: Huamin Li <[email protected]>
@hl475 hl475 force-pushed the fix_encoder_only_models branch from 3f5e3f6 to b27f2ca Compare November 13, 2025 08:53
@mgoin mgoin merged commit 07a606a into vllm-project:main Nov 13, 2025
53 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 13, 2025
@hl475
Contributor Author

hl475 commented Nov 13, 2025

In this PR, I manually triggered all 3 CI jobs that were previously failing with NotImplementedError: Encoder self-attention and encoder/decoder cross-attention are not implemented for TritonAttentionImpl:

Language Models Test (Extended Pooling) - https://buildkite.com/vllm/ci/builds/38812/steps/canvas?jid=019a7c6c-315c-418a-8273-e5b946fbac0f
Language Models Test (MTEB) - https://buildkite.com/vllm/ci/builds/38812/steps/canvas?jid=019a7c6c-315f-4f59-8fb6-bbc83f129594
Multi-Modal Models Test (Extended) 1 - https://buildkite.com/vllm/ci/builds/38812/steps/canvas?jid=019a7c6c-3166-424b-a9af-aa70e5fadf08


Labels

embedding nvidia ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm tpu Related to Google TPUs v1
