Add support for token_type_ids #19988

Open · maxdebayser wants to merge 32 commits into main

Conversation

maxdebayser
Contributor

@maxdebayser maxdebayser commented Jun 23, 2025

This PR is a follow-up to #16188 and #21270. It adds support for models such as cross-encoder/ms-marco-MiniLM-L-6-v2 that require token_type_ids to be passed from the tokenizer to the model.

Since passing the token_type_ids up the chain from the entrypoints to the model runner touches many parts of the system, I'm also exploring other implementation alternatives such as #20026.
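For context, here is a minimal sketch (not vLLM code) of what the tokenizer produces for such a cross-encoder pair; it assumes the Hugging Face transformers tokenizer for this model:

    # Minimal illustration: a BERT-style cross-encoder tokenizer emits one
    # segment id per token for a (query, passage) pair, and the model's
    # embedding layer consumes them alongside the token ids.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
    enc = tok("how fast is vllm?", "vLLM is a high-throughput inference engine.",
              return_tensors="pt")
    print(enc["token_type_ids"])  # 0s for the query segment, 1s for the passage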


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 and tpu (Related to Google TPUs) labels Jun 23, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @maxdebayser, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's capability by introducing robust support for encoder embedding models and the necessary token_type_ids they often require. I've adapted the attention mechanisms, KV cache management, and core model logic to seamlessly integrate these new model types, ensuring broader compatibility and efficient execution.

Highlights

  • Encoder Model Support: I've added comprehensive support for encoder embedding models, which do not utilize a causal attention mask. This expands the range of models that can be efficiently served by the system.
  • Token Type ID Handling: Full end-to-end support for token_type_ids has been implemented. This is crucial for models like cross-encoder/ms-marco-MiniLM-L-6-v2 that require these IDs to be passed from the tokenizer to the model's forward pass.
  • FlashAttention Integration: The FlashAttention backend (FlashAttentionImpl) now supports AttentionType.ENCODER_ONLY models by dynamically determining causality based on the attention type (see the sketch after this list). This allows encoder models to leverage the performance benefits of FlashAttention.
  • KV Cache Adaptations: The KV cache mechanism has been updated to support encoder-only models for prefix caching. Additionally, I've introduced logic to automatically disable features like prefix caching and chunked prefill when non-decoder (e.g., encoder) attention types are detected, as these optimizations are not applicable or can cause issues with non-causal models.
  • Model Compatibility & Refinements: I've removed the SupportsV0Only interface from BertEmbeddingModel, BertForSequenceClassification, and RobertaForSequenceClassification, signifying their full compatibility with the V1 engine. The RoBERTa embedding layer's position ID handling has also been refined to correctly manage its specific requirements.
  • Temporary CUDA Graph Limitation: For pooling operations with encoder models, CUDA graphs are temporarily disabled and eager mode is enforced due to current compatibility limitations.
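A minimal illustrative sketch of the causality point from the FlashAttention bullet (the names mirror the summary above; the actual vLLM wiring may differ):

    # Sketch only: derive the causal flag from the attention type, as described
    # in the summary. This mirrors the idea, not the real backend code.
    from enum import Enum

    class AttentionType(str, Enum):
        DECODER = "decoder"
        ENCODER_ONLY = "encoder_only"

    def is_causal(attn_type: AttentionType) -> bool:
        # Encoder-only (bidirectional) attention must not apply a causal mask.
        return attn_type != AttentionType.ENCODER_ONLY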
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR adds support for encoder embedding models and models that require token_type_ids to be passed from the tokenizer to the model. The changes include modifications to the test suite, configuration files, and model execution logic to accommodate these new model types. The code review identified some areas for improvement, including adding more descriptive error messages, making code conditional on model architecture, and simplifying logic.

pytest.param("Alibaba-NLP/gte-Qwen2-1.5B-instruct",
marks=[pytest.mark.skip_v1]),
marks=[pytest.mark.skip_v0]),
Contributor


medium

Consider removing the skip mark for v0 tests, as it seems the models are now supported in both engines.

Comment on lines 635 to +644
            0,
            torch.from_numpy(token_indices),
            out=self.input_ids_cpu[:total_num_scheduled_tokens])
        if self.input_batch.token_type_ids_cpu_tensor is not None:
            token_type_ids = torch.index_select(
                self.input_batch.token_type_ids_cpu_tensor.flatten(), 0,
                torch.from_numpy(token_indices))
            # Copy the tensors to the GPU.
            self.get_token_type_ids()[:total_num_scheduled_tokens]\
                .copy_(token_type_ids, non_blocking=True)
Contributor


medium

Consider adding a check to ensure that self.input_batch.token_type_ids_cpu_tensor is not None before attempting to copy the token type IDs to the GPU. This would prevent a potential error if the model does not use token type IDs.

        if self.input_batch.token_type_ids_cpu_tensor is not None:
            token_type_ids = torch.index_select(
                self.input_batch.token_type_ids_cpu_tensor.flatten(), 0,
                torch.from_numpy(token_indices))
            # Copy the tensors to the GPU.
            self.get_token_type_ids()[:total_num_scheduled_tokens]\
                .copy_(token_type_ids, non_blocking=True)

Comment on lines 1317 to 1327
        has_token_types = self.token_type_ids is not None
        model_kwargs = {}

        if self.is_multimodal_model and get_pp_group().is_first_rank:
            # NOTE(woosuk): To unify token ids and soft tokens (vision
            # embeddings), we always use embeddings (rather than token ids)
            # as input to the multimodal model, even when the input is text.
            input_ids = self.input_ids[:num_scheduled_tokens]
            if has_token_types:
                model_kwargs["token_type_ids"] = cast(
                    torch.Tensor, self.token_type_ids)[:num_scheduled_tokens]
Contributor


medium

Consider adding a check to ensure that has_token_types is True before attempting to access self.token_type_ids. This would prevent a potential error if the model does not use token type IDs.

            input_ids = self.input_ids[:num_scheduled_tokens]
            if has_token_types:
                model_kwargs["token_type_ids"] = cast(
                    torch.Tensor, self.token_type_ids)[:num_scheduled_tokens]
            if mm_embeds:

The @supports_torch_compile decorator was applied only to the
encoder stack, leaving the embeddings out, which resulted in
a dependency on dynamically allocated tensors.

Signed-off-by: Max de Bayser <[email protected]>

mergify bot commented Jul 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 2, 2025
@maxdebayser
Contributor Author

@DarkLight1337 , can you enable the full CI tests here to see if something is broken while we wait for reviews? Same for #20026

@DarkLight1337 DarkLight1337 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jul 8, 2025

mergify bot commented Jul 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 18, 2025
@mergify mergify bot removed the needs-rebase label Jul 20, 2025
maxdebayser added a commit to maxdebayser/vllm that referenced this pull request Jul 20, 2025
Add support for encoder models such as BERT which don't support
a KV cache due to the non-causal attention. Since the KV Cache
Spec is used to build the attention metadata for decoder models,
this PR initializes the attention metadata builders for encoder-only
models directly from the layers and adds a function to build the
attention metadata.

This PR combines elements of PRs
vllm-project#21088
and vllm-project#19988

Summary of changes:

**Flash Attention Backend:**
- Implement encoder self-attention support without using KV cache

**Scheduler:**
- Disable chunked prefill for models without KV cache

**GPU Model Runner:**
- Implement encoder-only attention metadata building for self-attention

Related to:
- V0 deprecation: vllm-project#18571
- 2025 Q3 roadmap: vllm-project#20336

Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
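The commit message above mentions encoder self-attention without a KV cache. As a rough illustration only (assuming flash-attn's varlen API, not the actual backend code from this commit), bidirectional attention over packed sequences looks roughly like this:

    # Hedged sketch: packed (varlen) encoder self-attention with causal=False,
    # so nothing is read from or written to a KV cache.
    import torch
    from flash_attn import flash_attn_varlen_func

    def encoder_self_attention(q, k, v, seq_lens, scale):
        # q, k, v: [total_tokens, num_heads, head_dim], packed across sequences.
        lens = torch.tensor(seq_lens, dtype=torch.int32, device=q.device)
        cu_seqlens = torch.nn.functional.pad(torch.cumsum(lens, 0), (1, 0)).int()
        max_len = int(lens.max())
        return flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
            max_seqlen_q=max_len, max_seqlen_k=max_len,
            softmax_scale=scale, causal=False,
        )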

mergify bot commented Jul 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 21, 2025
@maxdebayser
Contributor Author

This PR enabled encoder models without removing the KV cache. However, this led to a few complications, because I had to add the attn_type field to a bunch of places to be able to detect whether features such as chunked prefill should be turned off. Since the opening of this PR, there has been progress in other places that made it possible to remove the KV cache, such as #20811 and #21088. Therefore @russellb and I have opened a new PR that combines the encoder attention from his PR and the BERT model support (without token_type_ids) from this one: #21270

@mergify mergify bot added the documentation (Improvements or additions to documentation) label and removed the needs-rebase label Jul 25, 2025

mergify bot commented Jul 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 26, 2025
@mergify mergify bot removed the needs-rebase label Jul 28, 2025
@maxdebayser maxdebayser changed the title from Add support for encoder embedding models to Add support for token_type_ids Jul 28, 2025
Collaborator

@WoosukKwon WoosukKwon left a comment


Honestly, this is a bigger change than I expected. Can we somehow avoid adding token_type_ids all over the entire system?

@maxdebayser
Contributor Author

@WoosukKwon , I agree, and I'm exploring other options such as #20026. Another idea I had but haven't tried yet is to encode this information in the token ids. Since we use int32 for token ids and the vocabularies are much smaller than that, we could just use one of the higher bits for the token type id, which only requires 1 bit. It's hacky, but that is the least intrusive way I can think of.
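For what it's worth, the bit-packing idea above would look roughly like this (purely illustrative; the bit position and helper names are hypothetical, not vLLM code):

    import torch

    TYPE_BIT = 30  # hypothetical high bit, well above any real vocabulary size

    def pack(input_ids: torch.Tensor, token_type_ids: torch.Tensor) -> torch.Tensor:
        # Stash the 1-bit segment id in a high bit of the int32 token id.
        return input_ids | (token_type_ids.to(torch.int32) << TYPE_BIT)

    def unpack(packed: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        token_type_ids = (packed >> TYPE_BIT) & 1
        input_ids = packed & ~(1 << TYPE_BIT)
        return input_ids, token_type_ids

    ids = torch.tensor([101, 2054, 102, 2003, 102], dtype=torch.int32)
    types = torch.tensor([0, 0, 0, 1, 1], dtype=torch.int32)
    restored_ids, restored_types = unpack(pack(ids, types))
    assert torch.equal(restored_ids, ids) and torch.equal(restored_types, types)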

@DarkLight1337
Member

I think we can combine these two solutions. We can add a model_kwargs attribute instead of token_type_ids and unpack those kwargs directly into the model input (using a similar method to multi-modal inputs) without having to register any multi-modal processors. There is no need to create any buffers inside the model runner, as that creates additional complexity.
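For the record, a tiny self-contained sketch of that kwargs-unpacking approach (the names below are illustrative, not the actual runner code):

    import torch
    from torch import nn

    class DummyEncoder(nn.Module):
        # Stand-in model whose forward optionally accepts token_type_ids.
        def forward(self, input_ids, token_type_ids=None):
            return input_ids if token_type_ids is None else input_ids + token_type_ids

    def run(model, input_ids, extra_kwargs):
        # Extra per-request tensors are forwarded as keyword arguments,
        # similar in spirit to how multi-modal inputs reach the model.
        return model(input_ids, **extra_kwargs)

    out = run(DummyEncoder(), torch.tensor([1, 2, 3]),
              {"token_type_ids": torch.tensor([0, 0, 1])})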

@maxdebayser
Contributor Author

The challenge is doing it in a way that works with CUDA graphs because the input tensors must be static and there can't be conditional branches in the model code.
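For context, the usual CUDA-graph-friendly pattern is to copy per-step data into preallocated buffers of fixed shape and keep the model code branch-free; a generic sketch (not vLLM code, buffer names are illustrative):

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    MAX_TOKENS = 8192  # illustrative capture size

    # Persistent buffers: CUDA graph replay reuses the same tensor storage.
    input_ids_buf = torch.zeros(MAX_TOKENS, dtype=torch.int32, device=device)
    token_type_buf = torch.zeros(MAX_TOKENS, dtype=torch.int32, device=device)

    def stage_inputs(input_ids: torch.Tensor, token_type_ids: torch.Tensor) -> None:
        n = input_ids.numel()
        # copy_ into the static buffers instead of allocating new tensors; the
        # graphed region then always reads from the same storage each replay.
        input_ids_buf[:n].copy_(input_ids, non_blocking=True)
        token_type_buf[:n].copy_(token_type_ids, non_blocking=True)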


mergify bot commented Jul 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 29, 2025
Labels: documentation, frontend, ready, tpu, v1