Export state/prefix-cache & reuse #14895
M0rpheus-0 asked this question in Q&A (unanswered, 0 replies)
Hello,
I am using vLLM in a Python script and serving my own inference endpoint through Flask. I do this because of some constraints that require custom logic during inference.
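Roughly, the setup looks like this (a minimal sketch; the model name, route, and request fields are placeholders, and my actual custom logic is omitted):

```python
from flask import Flask, jsonify, request
from vllm import LLM, SamplingParams

app = Flask(__name__)
# Placeholder model name; the real script loads my own model.
llm = LLM(model="my-org/my-model")

@app.route("/generate", methods=["POST"])
def generate():
    body = request.get_json()
    params = SamplingParams(max_tokens=body.get("max_tokens", 256))
    # Custom pre/post-processing around inference happens here.
    outputs = llm.generate([body["prompt"]], params)
    return jsonify({"text": outputs[0].outputs[0].text})

if __name__ == "__main__":
    app.run(port=8000)
```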
Having used llama.cpp in the past, I could make use of a feature like llm.save_state(), which exports the model's internal state (the prefix/KV cache) so it can be reloaded later to avoid re-ingesting the prefix.
My use case involves three large prompts, each followed by a small custom instruction at the end. I would like to keep those three prompts cached for efficiency's sake and reload them as needed to cut down the ingestion/prefill time.
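To make the pattern concrete, here is what I mean, shown with the llama-cpp-python binding (a minimal sketch; the prompt variables are placeholders and the exact prefix-reuse behaviour depends on the binding's internals):

```python
from llama_cpp import Llama

LARGE_PROMPT = "...(several thousand tokens of shared context)..."
SMALL_INSTRUCTION = "\nNow answer the following question: ..."

llm = Llama(model_path="model.gguf", n_ctx=8192)

# Ingest the large prompt once, then snapshot the internal state
# (which includes the KV cache for that prefix).
llm.create_completion(LARGE_PROMPT, max_tokens=1)
state = llm.save_state()

# Later: restore the snapshot so only the short instruction needs to
# be processed on top of the already-cached prefix.
llm.load_state(state)
out = llm.create_completion(LARGE_PROMPT + SMALL_INSTRUCTION, max_tokens=256)
```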
Does vLLM offer this functionality?
If not, is there some way I could implement it?
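For reference, the closest built-in mechanism I am aware of is vLLM's automatic prefix caching, which reuses cached KV blocks for shared prompt prefixes across requests, but only within the running engine and, as far as I can tell, without any way to export them to disk (a minimal sketch; the model name and prompts are placeholders):

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching keeps KV blocks for shared prompt prefixes in
# GPU memory and reuses them across requests within this process.
llm = LLM(model="my-org/my-model", enable_prefix_caching=True)
params = SamplingParams(max_tokens=256)

LARGE_PROMPT = "...(several thousand tokens of shared context)..."

# The first request pays the full prefill cost for the large prompt...
llm.generate([LARGE_PROMPT + "\nInstruction A"], params)
# ...later requests sharing that prefix skip recomputing it.
llm.generate([LARGE_PROMPT + "\nInstruction B"], params)
```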
Thank you all!