
Conversation

@qthequartermasterman (Contributor) commented on Sep 26, 2025

Purpose

Fixes #25096. Follow-up to #24278. Enables prefix caching with Prompt Embeds.

Test Plan

Added new unit tests. Also tested with some local scripts: with prompt embeds enabled in both cases, I saw a dramatic speedup with prefix caching enabled vs. disabled.
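
For illustration, a minimal sketch of that kind of comparison (hypothetical, not the actual local scripts; the model name, prompt construction, and the `enable_prompt_embeds` / `enable_prefix_caching` engine args plus the `{"prompt_embeds": ...}` prompt format follow vLLM's prompt-embeds examples):

```python
# Hypothetical timing sketch, not the exact scripts used for this PR.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative model choice

tokenizer = AutoTokenizer.from_pretrained(MODEL)
embed_layer = AutoModelForCausalLM.from_pretrained(MODEL).get_input_embeddings()

# A long shared prefix with short, varying suffixes: the case where prefix
# caching should pay off.
prefix = "You are a helpful assistant. " * 100
prompts = [prefix + f"Question {i}: what is {i} + {i}?" for i in range(32)]
prompt_embeds = [
    embed_layer(torch.tensor(tokenizer(p).input_ids)).detach() for p in prompts
]


def run(enable_prefix_caching: bool) -> float:
    # In practice each configuration is better run in its own process so two
    # engines don't hold GPU memory at the same time.
    llm = LLM(
        model=MODEL,
        enable_prompt_embeds=True,
        enable_prefix_caching=enable_prefix_caching,
    )
    params = SamplingParams(max_tokens=16)
    start = time.perf_counter()
    llm.generate([{"prompt_embeds": e} for e in prompt_embeds], params)
    return time.perf_counter() - start


print(f"prefix caching disabled: {run(False):.2f}s")
print(f"prefix caching enabled:  {run(True):.2f}s")
```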

Test Result

New tests pass. Pending CI.



@gemini-code-assist bot left a comment:

Code Review

This pull request enables prefix caching for requests with prompt embeddings. The changes involve updating the block hashing mechanism to include serialized prompt embeddings. The logic is primarily implemented in vllm/v1/core/kv_cache_utils.py by modifying hash_block_tokens and its callers. The previous restriction in vllm/engine/arg_utils.py that disabled this combination has been correctly removed. Comprehensive unit tests have been added to validate the new functionality under various conditions, including with and without multimodal inputs. My main feedback is a performance consideration in the tensor serialization logic.
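
For readers skimming the diff, the shape of the change is roughly as follows (an illustrative sketch with simplified names and signatures, not the PR's actual code): the per-block hash must also cover the prompt-embedding values, because two requests with identical token IDs but different embeddings must not share cached KV blocks.

```python
# Illustrative sketch only; vLLM's real hash_block_tokens in
# vllm/v1/core/kv_cache_utils.py has a different signature.
import hashlib
from typing import Any, Optional

import torch


def tensor_bytes(t: Optional[torch.Tensor]) -> bytes:
    """Deterministically serialize a tensor for hashing.

    This is also where the performance consideration lives: copying the
    block's embeddings to CPU bytes on every hash computation is not free.
    (Assumes a NumPy-compatible dtype.)
    """
    if t is None:
        return b""
    return t.detach().cpu().contiguous().numpy().tobytes()


def hash_block(
    parent_hash: bytes,
    block_token_ids: tuple[int, ...],
    block_prompt_embeds: Optional[torch.Tensor] = None,
    extra_keys: tuple[Any, ...] = (),
) -> bytes:
    """Chain the parent block's hash with this block's tokens, embeds, and extra keys."""
    h = hashlib.sha256()
    h.update(parent_hash)
    h.update(repr(block_token_ids).encode())
    h.update(tensor_bytes(block_prompt_embeds))
    h.update(repr(extra_keys).encode())
    return h.digest()
```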

@qthequartermasterman (Contributor, Author) commented on Sep 26, 2025

@DarkLight1337 Sorry to hit you with so many PRs at once. I've been working on a few little fixes as I'm continuing to use Prompt Embeds. Finally got the time this evening to publish them.

This one will have a conflict with #25717, because they both modify the same line in the compatibility matrix. I'll update this one after the other one lands.

@qthequartermasterman (Contributor, Author) commented:

@DarkLight1337 is there anyone else who should review this one?

@DarkLight1337 (Member) left a comment:

LGTM, thanks

@DarkLight1337 enabled auto-merge (squash) on September 30, 2025 02:55
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 30, 2025
@DarkLight1337 (Member) commented:

Can you merge with main to see if the CI is fixed?

Auto-merge was automatically disabled on October 2, 2025 01:57 (head branch was pushed to by a user without write access).

@DarkLight1337 (Member) commented:

Sorry, can you merge again? There seems to be some issue with the CI not starting.

@DarkLight1337 (Member) commented:

Can you merge from main once again?

@DarkLight1337 enabled auto-merge (squash) on October 2, 2025 16:59
@DarkLight1337 (Member) commented:

PTAL at the failing test

Auto-merge was automatically disabled on October 3, 2025 15:19 (head branch was pushed to by a user without write access).

@qthequartermasterman (Contributor, Author) commented on Oct 3, 2025

@DarkLight1337 My mistake. I forgot to refactor some of the tests when extracting out that tensor data function. Sorry!

That test file is now passing locally, but I haven't run the whole suite locally. Hopefully nothing else is broken. :(

@DarkLight1337 enabled auto-merge (squash) on October 3, 2025 15:28
Auto-merge was automatically disabled on October 3, 2025 17:14 (head branch was pushed to by a user without write access).


Review thread on the new block-hashing code (diff excerpt, truncated in the original):

    # Compute the hash of the current block
    block_tokens = request.all_token_ids[start_token_idx:end_token_idx]
    block_prompt_embeds = (
Collaborator commented:

Is it possible to move this into generate_block_hash_extra_keys and pass the prompt embedding hash as an extra_key?
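
A sketch of what that suggestion might look like (hypothetical, simplified signatures; vLLM's actual generate_block_hash_extra_keys also tracks multimodal positions): compute a per-block digest of the embeddings there and pass it along via extra_keys, leaving hash_block_tokens itself unchanged.

```python
# Hypothetical sketch of the suggested refactor, not vLLM's actual code.
import hashlib
from typing import Any, Optional

import torch


def prompt_embeds_extra_key(
    prompt_embeds: Optional[torch.Tensor],
    start_token_idx: int,
    end_token_idx: int,
) -> Optional[str]:
    """Digest of this block's slice of the prompt embeddings, if any."""
    if prompt_embeds is None:
        return None
    block = prompt_embeds[start_token_idx:end_token_idx]
    return hashlib.sha256(
        block.detach().cpu().contiguous().numpy().tobytes()
    ).hexdigest()


def generate_block_hash_extra_keys(
    request: Any, start_token_idx: int, end_token_idx: int
) -> Optional[tuple[Any, ...]]:
    extra_keys: list[Any] = []
    # ... existing extra keys (mm hashes, LoRA ID, cache salt) would be collected here ...
    embeds_key = prompt_embeds_extra_key(
        getattr(request, "prompt_embeds", None), start_token_idx, end_token_idx
    )
    if embeds_key is not None:
        extra_keys.append(embeds_key)
    return tuple(extra_keys) or None
```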

@qthequartermasterman (Contributor, Author) replied:

That seems reasonable. I may have to get to it on Monday, though.

Collaborator replied:

Hi, any update on this?

@qthequartermasterman (Contributor, Author) replied:

I let this PR get too stale, and there are so many merge conflicts that it was easier to just start over with a new PR. See #27219.

Sorry to neglect this one.

@mergify bot commented on Oct 8, 2025

Documentation preview: https://vllm--25741.org.readthedocs.build/en/25741/

@mergify bot commented on Oct 8, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @qthequartermasterman.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels: documentation, needs-rebase, ready, v1


Successfully merging this pull request may close these issues: [Feature]: Prefix Caching support when Prompt Embeds is enabled.
