[CORE] Support Prefix Caching with Prompt Embeds #27219
Conversation
Pull Request Overview
This PR enables prefix caching support for prompt embeddings in the V1 engine by incorporating prompt embed data into the block hash computation. Previously, prefix caching was disabled when prompt embeds were enabled.
- Adds a `tensor_data()` utility function to extract raw tensor data for serialization and hashing
- Integrates prompt embeddings into the block hash generation process
- Removes the restriction that disabled prefix caching when prompt embeds were enabled
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| vllm/v1/utils.py | Adds tensor_data() utility function for extracting tensor data as memoryview |
| vllm/v1/serial_utils.py | Refactors to use new tensor_data() function instead of inline numpy conversion |
| vllm/v1/core/kv_cache_utils.py | Adds _gen_prompt_embeds_extra_hash_keys() and integrates prompt embed hashing into block hash computation |
| vllm/engine/arg_utils.py | Removes warning and restriction that disabled prefix caching with prompt embeds |
| tests/v1/core/test_kv_cache_utils.py | Adds comprehensive tests for prompt embed block hashing scenarios |
| docs/features/README.md | Updates feature compatibility matrix to show prefix caching now works with prompt embeds |
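To make the summary above concrete, here is a minimal sketch of how a `tensor_data()` helper and `_gen_prompt_embeds_extra_hash_keys()` could fit together. This is an illustrative reconstruction, not the actual diff: the function bodies, the per-block slicing, and the use of a SHA-256 digest as the extra key are assumptions.

```python
# Illustrative sketch only: the real helpers live in vllm/v1/utils.py and
# vllm/v1/core/kv_cache_utils.py and may differ in signature and detail.
import hashlib

import torch


def tensor_data(tensor: torch.Tensor) -> memoryview:
    """Expose the raw bytes of a contiguous CPU tensor without copying."""
    return tensor.contiguous().view(torch.uint8).numpy().data


def _gen_prompt_embeds_extra_hash_keys(
    prompt_embeds: torch.Tensor | None,
    start_token_idx: int,
    end_token_idx: int,
) -> list[str]:
    """Extra hash keys for one KV-cache block, derived from its slice of the embeds."""
    if prompt_embeds is None:
        return []
    block_embeds = prompt_embeds[start_token_idx:end_token_idx]
    # hashlib accepts any buffer object, so passing the memoryview avoids an
    # intermediate bytes copy of the embedding slice.
    return [hashlib.sha256(tensor_data(block_embeds)).hexdigest()]
```

Hashing only the slice of embeddings covered by each block keeps the extra key small while still distinguishing requests whose token IDs match but whose embeddings differ.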
Documentation preview: https://vllm--27219.org.readthedocs.build/en/27219/
@heheda12345 This PR is a reimplementation of #25741 using your recommendation of extra keys instead of adding it to the main tuple to be hashed.
Code Review
This pull request enables prefix caching with prompt embeddings, which is a great performance enhancement. The implementation correctly incorporates prompt embeddings into the block hashes for prefix caching. The new tests cover the functionality well. I've found a small performance improvement opportunity by avoiding an unnecessary data copy during hashing, and I've provided suggestions for the implementation and the corresponding tests.
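As an illustration of the copy the review refers to (a hedged example, not the actual suggested diff), the two forms below produce the same digest for a contiguous tensor, but only the first materializes an intermediate `bytes` object:

```python
# hashlib accepts any buffer, so a memoryview over the tensor's storage can be
# hashed directly instead of first copying it out with .tobytes().
import hashlib

import torch

embeds = torch.randn(16, 2048)

with_copy = hashlib.sha256(embeds.numpy().tobytes()).digest()               # extra copy
zero_copy = hashlib.sha256(embeds.view(torch.uint8).numpy().data).digest()  # buffer view
assert with_copy == zero_copy
```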
@DarkLight1337 also pinging you since you had previously reviewed the prior PR.
@DarkLight1337 @heheda12345 Would it be possible to enable CI on this PR, pending review?
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM! Thank you very much.
Purpose
Fixes #25096. Follow-up to #24278. Enables prefix caching with prompt embeds.
This PR supersedes #25741, which I let fall too far behind main.
Test Plan
Added new unit tests and tested with some local scripts. With prompt embeds enabled in both cases, I saw a dramatic speedup with prefix caching enabled versus disabled: 3,178 tok/s on Llama 3.2-1B on an A100 without prefix caching versus 26,164.79 tok/s on the same system with prefix caching, roughly an 8× improvement.
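For reference, this kind of local A/B comparison might look like the sketch below. It is not the script used here: the model name, shapes, and request counts are placeholders, and the engine flags and prompt key (`enable_prompt_embeds`, `enable_prefix_caching`, `"prompt_embeds"`) are assumed from vLLM's prompt-embeds support and may change across versions.

```python
# Hedged sketch of a prefix-caching A/B comparison with prompt embeds.
import time

import torch
from vllm import LLM, SamplingParams


def throughput(enable_prefix_caching: bool) -> float:
    llm = LLM(
        model="meta-llama/Llama-3.2-1B",
        enable_prompt_embeds=True,
        enable_prefix_caching=enable_prefix_caching,
    )
    # Reuse the same embeddings so later requests can hit the cached prefix.
    embeds = torch.randn(512, 2048, dtype=torch.float16)
    prompts = [{"prompt_embeds": embeds} for _ in range(64)]
    params = SamplingParams(max_tokens=64)

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / elapsed


if __name__ == "__main__":
    print(f"tok/s without prefix caching: {throughput(False):.1f}")
    print(f"tok/s with prefix caching:    {throughput(True):.1f}")
```

In practice each configuration is better run in a separate process so the two engines do not contend for GPU memory; the numbers quoted above come from the author's own scripts, not this sketch.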
Test Result
New tests pass. Pending CI.