Export state/prefix-cache & reuse #14895
M0rpheus-0 asked this question in Q&A (unanswered, 0 replies)
Hello,
I am using vLLM in a Python script and serving my own inference endpoint through Flask. I do this because of some constraints that require custom logic during inference.
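Roughly, the setup looks like this (a minimal sketch; the model name, route, and request fields are placeholders, and my actual custom logic is omitted):

```python
from flask import Flask, jsonify, request
from vllm import LLM, SamplingParams

app = Flask(__name__)
# Placeholder model name; the real script loads my own model.
llm = LLM(model="my-org/my-model")

@app.route("/generate", methods=["POST"])
def generate():
    body = request.get_json()
    params = SamplingParams(max_tokens=body.get("max_tokens", 256))
    # Custom pre/post-processing around inference happens here.
    outputs = llm.generate([body["prompt"]], params)
    return jsonify({"text": outputs[0].outputs[0].text})

if __name__ == "__main__":
    app.run(port=8000)
```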
Having used llama.cpp in the past, I could make use of a feature like llm.save_state(), which exports the model's internal state (the prefix/KV cache) so it can be reloaded later to avoid re-ingesting the prefix.
My use case involves three large prompts, each followed by a small custom instruction at the end. I would like to keep those three prompts cached for efficiency's sake and reload them as needed to cut down the ingestion/prefill time.
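To make the pattern concrete, here is what I mean, shown with the llama-cpp-python binding (a minimal sketch; the prompt variables are placeholders and the exact prefix-reuse behaviour depends on the binding's internals):

```python
from llama_cpp import Llama

LARGE_PROMPT = "...(several thousand tokens of shared context)..."
SMALL_INSTRUCTION = "\nNow answer the following question: ..."

llm = Llama(model_path="model.gguf", n_ctx=8192)

# Ingest the large prompt once, then snapshot the internal state
# (which includes the KV cache for that prefix).
llm.create_completion(LARGE_PROMPT, max_tokens=1)
state = llm.save_state()

# Later: restore the snapshot so only the short instruction needs to
# be processed on top of the already-cached prefix.
llm.load_state(state)
out = llm.create_completion(LARGE_PROMPT + SMALL_INSTRUCTION, max_tokens=256)
```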
Does vLLM offer this functionality?
If not, is there some way I could implement it?
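For reference, the closest built-in mechanism I am aware of is vLLM's automatic prefix caching, which reuses cached KV blocks for shared prompt prefixes across requests, but only within the running engine and, as far as I can tell, without any way to export them to disk (a minimal sketch; the model name and prompts are placeholders):

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching keeps KV blocks for shared prompt prefixes in
# GPU memory and reuses them across requests within this process.
llm = LLM(model="my-org/my-model", enable_prefix_caching=True)
params = SamplingParams(max_tokens=256)

LARGE_PROMPT = "...(several thousand tokens of shared context)..."

# The first request pays the full prefill cost for the large prompt...
llm.generate([LARGE_PROMPT + "\nInstruction A"], params)
# ...later requests sharing that prefix skip recomputing it.
llm.generate([LARGE_PROMPT + "\nInstruction B"], params)
```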
Thank you all!