feat: Support Prefix Caching with Prompt Embeds #25741
Conversation
Signed-off-by: Andrew Sansom <[email protected]>
Code Review
This pull request enables prefix caching for requests with prompt embeddings. The changes involve updating the block hashing mechanism to include serialized prompt embeddings. The logic is primarily implemented in vllm/v1/core/kv_cache_utils.py by modifying hash_block_tokens and its callers. The previous restriction in vllm/engine/arg_utils.py that disabled this combination has been correctly removed. Comprehensive unit tests have been added to validate the new functionality under various conditions, including with and without multimodal inputs. My main feedback is a performance consideration in the tensor serialization logic.
@DarkLight1337 Sorry to hit you with so many PRs at once. I've been working on a few little fixes as I'm continuing to use Prompt Embeds. Finally got the time this evening to publish them. This one will have a conflict with #25717, because they both modify the same line in the compatibility matrix. I'll update this one after the other one lands.
@DarkLight1337 Is there anyone else who should review this one?
Signed-off-by: Andrew Sansom <[email protected]>
LGTM, thanks
Can you merge with main to see if the CI is fixed?
Signed-off-by: Andrew Sansom <[email protected]>
Sorry, can you merge again? There seems to be some issue with the CI not starting.
Can you merge from main once again?
PTAL at the failing test.
Signed-off-by: Andrew Sansom <[email protected]>
@DarkLight1337 My mistake. I forgot to refactor some of the tests when extracting out that tensor data function. Sorry! That test file is now passing locally, but I haven't run the whole suite locally. Hopefully nothing else is broken. :(
Signed-off-by: Andrew Sansom <[email protected]>
    # Compute the hash of the current block
    block_tokens = request.all_token_ids[start_token_idx:end_token_idx]
    block_prompt_embeds = (
Is it possible to put this into generate_block_hash_extra_keys and pass the prompt embedding hash as an extra_key?
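For illustration, a rough sketch of what that suggestion could look like (generate_block_hash_extra_keys and extra keys are named in the comment above; the helper below and its name are hypothetical, not the actual implementation):

```python
# Hypothetical helper sketching the suggestion above; not vLLM's actual code.
import hashlib
from typing import Optional

import torch


def prompt_embeds_extra_key(block_embeds: Optional[torch.Tensor]) -> Optional[str]:
    """Hash a block's prompt embeddings once so the digest can be returned as
    one of the block's extra keys (alongside e.g. multimodal or LoRA keys)."""
    if block_embeds is None:
        return None
    data = block_embeds.detach().float().cpu().contiguous().numpy().tobytes()
    return hashlib.sha256(data).hexdigest()
```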
That seems reasonable. I may have to get to it on Monday, though.
Hi, any update on this?
I let this PR get too stale, and there were so many merge conflicts that it was easier to just start over with a new PR. See here: #27219
Sorry to neglect this one.
Documentation preview: https://vllm--25741.org.readthedocs.build/en/25741/
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Fixes #25096. Follow-up to #24278. Enables prefix caching with Prompt Embeds.
Test Plan
Added new unit tests. Tested with some local scripts; I saw a DRAMATIC speed-up with prefix caching enabled vs. disabled, with prompt embeds enabled in both cases.
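For reference, a minimal sketch of the kind of local script used (assumptions: the enable_prompt_embeds engine arg and the prompt_embeds prompt format from the existing prompt-embeds support; the model name, hidden size, and dtype below are placeholders and must match your setup):

```python
# Minimal sketch of a local prefix-caching + prompt-embeds comparison script.
# Assumes enable_prompt_embeds / prompt_embeds inputs from the existing
# prompt-embeds support; hidden size and dtype must match the chosen model.
import torch

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",
    enable_prompt_embeds=True,
    enable_prefix_caching=True,  # previously rejected together with prompt embeds
)

hidden_size = 2048  # hidden size of the chosen model
shared_prefix = torch.randn(512, hidden_size, dtype=torch.bfloat16)
suffix_a = torch.randn(16, hidden_size, dtype=torch.bfloat16)
suffix_b = torch.randn(16, hidden_size, dtype=torch.bfloat16)

# Two requests sharing a long embedding prefix should now hit the prefix cache
# on the second request instead of recomputing the shared blocks.
outputs = llm.generate(
    [
        {"prompt_embeds": torch.cat([shared_prefix, suffix_a])},
        {"prompt_embeds": torch.cat([shared_prefix, suffix_b])},
    ],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```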
Test Result
New tests pass. Pending CI.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.