[CORE] Support Prefix Caching with Prompt Embeds #27219
Conversation
Pull Request Overview
This PR enables prefix caching support for prompt embeddings in the V1 engine by incorporating prompt embed data into the block hash computation. Previously, prefix caching was disabled when prompt embeds were enabled.
- Adds a `tensor_data()` utility function to extract raw tensor data for serialization and hashing
- Integrates prompt embeddings into the block hash generation process
- Removes the restriction that disabled prefix caching when prompt embeds were enabled
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| vllm/v1/utils.py | Adds tensor_data() utility function for extracting tensor data as memoryview |
| vllm/v1/serial_utils.py | Refactors to use new tensor_data() function instead of inline numpy conversion |
| vllm/v1/core/kv_cache_utils.py | Adds _gen_prompt_embeds_extra_hash_keys() and integrates prompt embed hashing into block hash computation |
| vllm/engine/arg_utils.py | Removes warning and restriction that disabled prefix caching with prompt embeds |
| tests/v1/core/test_kv_cache_utils.py | Adds comprehensive tests for prompt embed block hashing scenarios |
| docs/features/README.md | Updates feature compatibility matrix to show prefix caching now works with prompt embeds |
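To make the summary above concrete, here is a minimal sketch of how a `tensor_data()` helper and `_gen_prompt_embeds_extra_hash_keys()` could fit together. This is an illustrative reconstruction, not the actual diff: the function bodies, the per-block slicing, and the use of a SHA-256 digest as the extra key are assumptions.

```python
# Illustrative sketch only: the real helpers live in vllm/v1/utils.py and
# vllm/v1/core/kv_cache_utils.py and may differ in signature and detail.
import hashlib

import torch


def tensor_data(tensor: torch.Tensor) -> memoryview:
    """Expose the raw bytes of a contiguous CPU tensor without copying."""
    return tensor.contiguous().view(torch.uint8).numpy().data


def _gen_prompt_embeds_extra_hash_keys(
    prompt_embeds: torch.Tensor | None,
    start_token_idx: int,
    end_token_idx: int,
) -> list[str]:
    """Extra hash keys for one KV-cache block, derived from its slice of the embeds."""
    if prompt_embeds is None:
        return []
    block_embeds = prompt_embeds[start_token_idx:end_token_idx]
    # hashlib accepts any buffer object, so passing the memoryview avoids an
    # intermediate bytes copy of the embedding slice.
    return [hashlib.sha256(tensor_data(block_embeds)).hexdigest()]
```

Hashing only the slice of embeddings covered by each block keeps the extra key small while still distinguishing requests whose token IDs match but whose embeddings differ.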
Documentation preview: https://vllm--27219.org.readthedocs.build/en/27219/
@heheda12345 This PR is a reimplementation of #25741 using your recommendation of extra keys instead of adding it to the main tuple to be hashed.
Code Review
This pull request enables prefix caching with prompt embeddings, which is a great performance enhancement. The implementation correctly incorporates prompt embeddings into the block hashes for prefix caching. The new tests cover the functionality well. I've found a small performance improvement opportunity by avoiding an unnecessary data copy during hashing, and I've provided suggestions for the implementation and the corresponding tests.
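As an illustration of the copy the review refers to (a hedged example, not the actual suggested diff), the two forms below produce the same digest for a contiguous tensor, but only the first materializes an intermediate `bytes` object:

```python
# hashlib accepts any buffer, so a memoryview over the tensor's storage can be
# hashed directly instead of first copying it out with .tobytes().
import hashlib

import torch

embeds = torch.randn(16, 2048)

with_copy = hashlib.sha256(embeds.numpy().tobytes()).digest()               # extra copy
zero_copy = hashlib.sha256(embeds.view(torch.uint8).numpy().data).digest()  # buffer view
assert with_copy == zero_copy
```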
@DarkLight1337 also pinging you since you had previously reviewed the prior PR.
@DarkLight1337 @heheda12345 Would it be possible to enable CI on this PR, pending review?
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM! Thank you very much.
Purpose
Fixes #25096. Follow-up to #24278. Enables prefix caching with prompt embeds.
This PR supersedes #25741, which I let fall too far behind main.
Test Plan
Added new unit tests and tested with some local scripts. With prompt embeds enabled in both cases, I saw a dramatic speedup with prefix caching enabled versus disabled: 3,178 tok/s on Llama 3.2-1B on an A100 without prefix caching versus 26,164.79 tok/s on the same system with prefix caching, roughly an 8× improvement.
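For reference, this kind of local A/B comparison might look like the sketch below. It is not the script used here: the model name, shapes, and request counts are placeholders, and the engine flags and prompt key (`enable_prompt_embeds`, `enable_prefix_caching`, `"prompt_embeds"`) are assumed from vLLM's prompt-embeds support and may change across versions.

```python
# Hedged sketch of a prefix-caching A/B comparison with prompt embeds.
import time

import torch
from vllm import LLM, SamplingParams


def throughput(enable_prefix_caching: bool) -> float:
    llm = LLM(
        model="meta-llama/Llama-3.2-1B",
        enable_prompt_embeds=True,
        enable_prefix_caching=enable_prefix_caching,
    )
    # Reuse the same embeddings so later requests can hit the cached prefix.
    embeds = torch.randn(512, 2048, dtype=torch.float16)
    prompts = [{"prompt_embeds": embeds} for _ in range(64)]
    params = SamplingParams(max_tokens=64)

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / elapsed


if __name__ == "__main__":
    print(f"tok/s without prefix caching: {throughput(False):.1f}")
    print(f"tok/s with prefix caching:    {throughput(True):.1f}")
```

In practice each configuration is better run in a separate process so the two engines do not contend for GPU memory; the numbers quoted above come from the author's own scripts, not this sketch.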
Test Result
New tests pass. Pending CI.