
Conversation

@qthequartermasterman (Contributor) commented on Oct 20, 2025

Purpose

Fixes #25096. Follow-up to #24278. Enables prefix caching with prompt embeds.

This PR supersedes #25741, which I let fall too far behind main.

Test Plan

Added new unit tests and tested with some local scripts. With prompt embeds enabled in both cases, I saw a dramatic speedup with prefix caching enabled versus disabled: 3,178 tok/s on Llama 3.2-1B on an A100 without prefix caching versus 26,164.79 tok/s on the same system with prefix caching.
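
For context, a rough sketch of the kind of local throughput check described above. This is not the actual script behind the numbers; the `enable_prompt_embeds` / `enable_prefix_caching` engine args and the `{"prompt_embeds": ...}` prompt format are assumptions that may differ across vLLM versions.

```python
# Hypothetical local benchmark sketch; flag names and prompt format assumed.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-3.2-1B"

# Build prompt embeddings offline from the HF embedding layer.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
embed_layer = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16
).get_input_embeddings()

shared_prefix = "You are a helpful assistant. " * 50  # long, shared prefix
prompts = [shared_prefix + f"Question {i}: what is {i} + {i}?" for i in range(64)]
with torch.no_grad():
    embeds = [
        embed_layer(tokenizer(p, return_tensors="pt").input_ids)[0]
        for p in prompts
    ]

llm = LLM(
    model=MODEL,
    enable_prompt_embeds=True,
    enable_prefix_caching=True,  # flip to False to compare throughput
)

start = time.perf_counter()
outputs = llm.generate(
    [{"prompt_embeds": e} for e in embeds],
    SamplingParams(max_tokens=32),
)
elapsed = time.perf_counter() - start
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tok/s")
```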

Test Result

New tests pass. Pending CI.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR enables prefix caching support for prompt embeddings in the V1 engine by incorporating prompt embed data into the block hash computation. Previously, prefix caching was disabled when prompt embeds were enabled.

  • Adds a tensor_data() utility function to extract raw tensor data for serialization and hashing (see the sketch after this list)
  • Integrates prompt embeddings into the block hash generation process
  • Removes the restriction that disabled prefix caching when prompt embeds were enabled
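
For illustration, a rough sketch of what such a helper could look like. This is an assumption for readability, not the exact code added to vllm/v1/utils.py.

```python
# Hypothetical sketch of a tensor_data()-style helper (CPU tensors assumed);
# the real implementation in vllm/v1/utils.py may differ.
import torch


def tensor_data(t: torch.Tensor) -> memoryview:
    """Expose a tensor's raw bytes as a memoryview without copying them."""
    # Reinterpret as uint8 so dtypes numpy cannot represent (e.g. bfloat16)
    # still expose their bytes; .numpy() shares memory with the CPU tensor.
    return t.contiguous().view(torch.uint8).numpy().data
```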

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Summary per file:
  • vllm/v1/utils.py — Adds the tensor_data() utility function for extracting tensor data as a memoryview
  • vllm/v1/serial_utils.py — Refactors to use the new tensor_data() function instead of inline numpy conversion
  • vllm/v1/core/kv_cache_utils.py — Adds _gen_prompt_embeds_extra_hash_keys() and integrates prompt embed hashing into block hash computation
  • vllm/engine/arg_utils.py — Removes the warning and restriction that disabled prefix caching with prompt embeds
  • tests/v1/core/test_kv_cache_utils.py — Adds comprehensive tests for prompt embed block hashing scenarios
  • docs/features/README.md — Updates the feature compatibility matrix to show that prefix caching now works with prompt embeds


mergify bot commented on Oct 20, 2025

Documentation preview: https://vllm--27219.org.readthedocs.build/en/27219/

mergify bot added the documentation and v1 labels on Oct 20, 2025
@qthequartermasterman (Contributor, Author) commented:

@heheda12345 This PR is a reimplementation of #25741 using your recommendation of extra keys instead of adding it to the main tuple to be hashed.
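
For readers following along, a simplified sketch of the extra-keys idea: the embedding bytes for each block are digested into an extra key that rides alongside the token IDs, rather than placing the tensor in the main hashed tuple. Names, signatures, and hashing details here are hypothetical; the real logic in _gen_prompt_embeds_extra_hash_keys() and the block-hash code in vllm/v1/core/kv_cache_utils.py is more involved.

```python
# Simplified illustration of per-block extra keys derived from prompt
# embeddings; function names and hashing details are hypothetical.
import hashlib
from typing import Optional

import torch

from vllm.v1.utils import tensor_data  # helper added by this PR


def prompt_embeds_extra_key(
    prompt_embeds: Optional[torch.Tensor],  # (num_prompt_tokens, hidden_size)
    start_token_idx: int,
    end_token_idx: int,
) -> Optional[bytes]:
    """Digest of the embeddings covering one KV-cache block, or None."""
    if prompt_embeds is None:
        return None
    block_embeds = prompt_embeds[start_token_idx:end_token_idx]
    # Hash the raw bytes rather than carrying the tensor itself inside the
    # tuple that gets hashed for the block.
    return hashlib.blake2b(tensor_data(block_embeds), digest_size=8).digest()


def block_hash(parent_hash: bytes, token_ids: tuple[int, ...],
               extra_keys: tuple) -> bytes:
    # Extra keys sit alongside the token IDs, so two blocks with identical
    # token IDs but different prompt embeddings hash differently.
    payload = repr((parent_hash, token_ids, extra_keys)).encode()
    return hashlib.blake2b(payload, digest_size=8).digest()
```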

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request enables prefix caching with prompt embeddings, which is a great performance enhancement. The implementation correctly incorporates prompt embeddings into the block hashes for prefix caching. The new tests cover the functionality well. I've found a small performance improvement opportunity by avoiding an unnecessary data copy during hashing, and I've provided suggestions for the implementation and the corresponding tests.
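
To make the review note concrete, a hedged before/after illustration of the copy it refers to (hypothetical code, not the actual diff):

```python
# Hashing a tensor's bytes with and without an extra copy.
import hashlib

import torch

embeds = torch.randn(16, 2048, dtype=torch.float16)

# Copying path: .tobytes() materializes a second buffer before hashing.
digest_copy = hashlib.blake2b(
    embeds.contiguous().view(torch.uint8).numpy().tobytes(), digest_size=8
).digest()

# Zero-copy path: hashlib accepts any C-contiguous buffer, so a memoryview of
# the same data (as returned by a tensor_data()-style helper) avoids the copy.
digest_view = hashlib.blake2b(
    embeds.contiguous().view(torch.uint8).numpy().data, digest_size=8
).digest()

assert digest_copy == digest_view
```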

@qthequartermasterman (Contributor, Author) commented:

@DarkLight1337 also pinging you since you had previously reviewed the prior PR.

@qthequartermasterman (Contributor, Author) commented:

@DarkLight1337 @heheda12345 could you please enable CI on this PR while it is pending review?

DarkLight1337 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Oct 22, 2025
mergify bot commented on Oct 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @qthequartermasterman.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Oct 22, 2025
mergify bot removed the needs-rebase label on Oct 22, 2025
@heheda12345 (Collaborator) left a comment

LGTM! Thank you very much.

vllm-bot merged commit ff93cc8 into vllm-project:main on Oct 23, 2025 (47 of 49 checks passed)
usberkeley pushed a commit to usberkeley/vllm that referenced this pull request Oct 23, 2025
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 23, 2025
845473182 pushed a commit to raindaywhu/vllm that referenced this pull request Oct 24, 2025

Labels

documentation, ready, v1

Development

Successfully merging this pull request may close these issues:

[Feature]: Prefix Caching support when Prompt Embeds is enabled.

4 participants