How to increase context length and make things work #3823
kresimirfijacko asked this question in Q&A (Unanswered)
I am struggling to understand KV cache handling and the limits on concurrent requests.
For example, I am using Qwen--Qwen1.5-72B-Chat-GPTQ-Int4 on an H100 80GB instance.
I've tried vLLM v0.3.3 and v0.4.0, and I am seeing behaviour I don't quite understand:
I didn't specify --max-num-seqs, so it should be the default of 256.
That is also something I don't understand: even with smaller 7B models, running requests are always capped at 100.
With this Qwen 72B model the number of concurrent requests is also limited to 100, and as requests are processed, KV cache usage climbs to 99%; even though there are pending requests, they are not scheduled until KV cache usage drops.
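For reference, here is a minimal sketch of the kind of launch I mean, using the Python LLM entry point. The Hugging Face model id and the explicitly spelled-out defaults are my assumptions; the real deployment may use the OpenAI-compatible server with the equivalent flags.

```python
from vllm import LLM

# Minimal sketch of the setup described above (assumptions: Hugging Face id
# for the model, single GPU, everything else left at documented defaults,
# i.e. the same as not passing --max-num-seqs at all).
llm = LLM(
    model="Qwen/Qwen1.5-72B-Chat-GPTQ-Int4",
    quantization="gptq",
    max_num_seqs=256,             # documented default, yet only ~100 requests ever run
    gpu_memory_utilization=0.90,  # documented default
)
```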
What I don't understand is how to actually increase the context length.
Example:
ValueError: The model's max seq len (16384) is larger than the maximum number of tokens that can be stored in KV cache (11648). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

I understand there is a memory limit, but I was reasoning from this formula:
2 (K,V) * precision * hidden_layers * hidden_size * seq_len * batch_size

(the model-dependent factors are fixed; only seq_len and batch_size can be varied)
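To make that concrete, here is a quick back-of-the-envelope sketch. The layer count and hidden size are assumptions based on Qwen1.5-72B's published config, the KV cache is assumed to be fp16 (GPTQ quantizes the weights, not the cache), and the 11648 figure is taken from the error above:

```python
# Rough KV-cache arithmetic for the formula above (assumed values).
bytes_per_value = 2      # fp16 KV cache
num_hidden_layers = 80   # assumed from Qwen1.5-72B config.json
hidden_size = 8192       # assumed from Qwen1.5-72B config.json

# 2 (K and V) * precision * hidden_layers * hidden_size = bytes per cached token
bytes_per_token = 2 * bytes_per_value * num_hidden_layers * hidden_size
print(bytes_per_token / 2**20)           # 2.5 MiB per token

# The error says 11648 tokens fit in the KV cache, i.e. roughly:
print(11648 * bytes_per_token / 2**30)   # ~28.4 GiB of KV cache
```

That ~28 GiB seems consistent with 0.9 * 80 GB minus roughly 40 GB of GPTQ weights and some activation headroom, which is presumably how the engine arrives at 11648.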
I was thinking I could trade seq_len against batch_size, turning (8192 * 256) into something like (16384 * 128); in other words, to get a bigger context size I would reduce the batch size (sketched below). So far my experiments haven't given any result; is this the right approach?
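Concretely, this is the kind of change I have been experimenting with (a sketch only; the values are illustrative and the model id is the assumed Hugging Face one):

```python
from vllm import LLM

# Sketch of the trade-off described above: fewer concurrent sequences plus a
# higher gpu_memory_utilization, hoping the KV cache then covers 16384 tokens.
llm = LLM(
    model="Qwen/Qwen1.5-72B-Chat-GPTQ-Int4",
    quantization="gptq",
    max_model_len=16384,           # the context length I actually want
    max_num_seqs=128,              # half the default 256
    gpu_memory_utilization=0.95,   # up from the 0.90 default
)
```

The equivalent OpenAI-server flags would be --max-model-len, --max-num-seqs and --gpu-memory-utilization.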