v1/kv_cache_utils: Respect num_gpu_blocks_override in memory check #27238
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review

This pull request correctly fixes a bug where `num_gpu_blocks_override` was ignored during memory checks, which could lead to engine initialization with insufficient KV blocks. The changes in `check_enough_kv_cache_memory` are logical: they now validate the override value and use it to cap the effective available memory. The new unit tests are comprehensive, covering both the basic case and a more complex heterogeneous-model scenario, ensuring the fix is robust.

I have one high-severity comment regarding a latent bug due to a function call with side effects within the modified code block. While it doesn't cause an issue in the current execution path, it is a potential source of future bugs and should be addressed for better maintainability and correctness.
```diff
     estimated_max_len = estimate_max_model_len(
-        vllm_config, kv_cache_spec, available_memory
+        vllm_config, kv_cache_spec, effective_available_memory
     )
```
The function `estimate_max_model_len` modifies the `vllm_config.model_config.max_model_len` attribute as a side effect of its binary-search implementation. While this does not currently cause a bug, because this code path always raises an exception, it is a latent bug that could cause issues in the future if the function is called in a context that doesn't terminate.

A function should not have hidden side effects on its arguments. It would be best to refactor `estimate_max_model_len` so it does not modify `vllm_config`, for example by restoring the original value before returning or by working on a copy.

Since the definition of `estimate_max_model_len` is not in this diff, I'm pointing this out here at the call site. A fix could look like this inside `estimate_max_model_len`:
```python
def estimate_max_model_len(...):
    original_max_len = vllm_config.model_config.max_model_len
    try:
        # ... existing logic ...
        return result
    finally:
        vllm_config.model_config.max_model_len = original_max_len
```
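To make the restore-on-exit pattern concrete, here is a minimal, self-contained sketch. `ModelConfig` and the `limit` predicate below are stand-ins for illustration, not vLLM's actual types or logic; the point is only that the binary search may mutate `max_model_len` freely while the `finally` block guarantees the caller's config is left untouched.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    max_model_len: int


def estimate_via_binary_search(model_config: ModelConfig, limit: int) -> int:
    """Toy stand-in for estimate_max_model_len: binary-searches the largest
    length that "fits" (here, simply <= limit), mutating
    model_config.max_model_len along the way but restoring it on exit."""
    original = model_config.max_model_len
    try:
        lo, hi = 1, model_config.max_model_len
        best = 0
        while lo <= hi:
            mid = (lo + hi) // 2
            model_config.max_model_len = mid  # mutation the search relies on
            if mid <= limit:  # stand-in for "fits in available memory"
                best = mid
                lo = mid + 1
            else:
                hi = mid - 1
        return best
    finally:
        # Side effect undone regardless of how we exit (return or raise).
        model_config.max_model_len = original


cfg = ModelConfig(max_model_len=4096)
print(estimate_via_binary_search(cfg, limit=1000))  # 1000
print(cfg.max_model_len)  # 4096, unchanged
```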
Force-pushed from 53e6b20 to 2aad22e
See PR body for details. Signed-off-by: khaled-wsa <[email protected]>
Hi @khaled-wsa, please see discussion under PR #26939. In addition to the
Summary

- `check_enough_kv_cache_memory` ignored `num_gpu_blocks_override`, allowing engine initialization with an insufficient number of KV blocks for `max_model_len`.
- When `num_gpu_blocks_override` is too small (e.g., 1), initialization now raises a clear error even if the raw `available_memory` is large.

Context
- Fixes: `check_enough_kv_cache_memory` didn't consider `num_gpu_blocks_override` (#27181).
- Repro: `vllm serve facebook/opt-125m --num_gpu_blocks_override=1` appears to start, but no request with length > block size can be scheduled. Expected behavior: early failure during initialization.

Technical Details
- In `check_enough_kv_cache_memory`: compute `ceil(spec.max_memory_usage_bytes / spec.page_size_bytes)` for each layer and ensure `num_gpu_blocks_override >= max(required_blocks_per_layer)`. This closes the hole for heterogeneous specs (e.g., cross-attn vs. self-attn).
- Cap `available_memory` at `sum(page_size_bytes) * num_gpu_blocks_override` to form `effective_available_memory`.
- Compare `needed_memory` to `effective_available_memory`, and pass the effective capacity to `estimate_max_model_len` for accurate guidance.

Files Changed
- Validate `num_gpu_blocks_override` and apply the memory cap; adjust the error message.
- New test: `test_check_enough_kv_cache_memory_respects_num_gpu_blocks_override`.
- New test: `test_override_must_cover_worst_layer_blocks_in_heterogeneous_model`, covering the cross-attn vs. self-attn scenario.

How To Test
- `pytest -q tests/v1/core/test_kv_cache_utils.py::test_check_enough_kv_cache_memory_respects_num_gpu_blocks_override`
- Serve with `--num_gpu_blocks_override=1` and confirm initialization fails early.

Notes

- `page_size_bytes`.
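The block accounting described under Technical Details can be sketched as follows. This is a deliberately simplified, hypothetical model: per-layer specs are flattened into plain lists of bytes rather than vLLM's actual spec objects, and the function name is invented for illustration.

```python
import math


def check_override_blocks(
    page_sizes: list[int],
    max_memory_usage: list[int],
    available_memory: int,
    num_gpu_blocks_override: int,
) -> int:
    """Sketch of the PR's logic: validate the override against the worst
    layer's block requirement, then cap the usable memory accordingly."""
    # Blocks each layer needs to hold max_model_len worth of KV cache.
    required_blocks_per_layer = [
        math.ceil(need / page)
        for need, page in zip(max_memory_usage, page_sizes)
    ]
    worst = max(required_blocks_per_layer)
    if num_gpu_blocks_override < worst:
        raise ValueError(
            f"num_gpu_blocks_override={num_gpu_blocks_override} is smaller "
            f"than the worst layer's requirement ({worst} blocks)."
        )
    # Cap available memory by what the override actually allows.
    return min(available_memory, sum(page_sizes) * num_gpu_blocks_override)


# Heterogeneous example: layer 0 needs 4 blocks, layer 1 needs 1 block.
print(check_override_blocks([1024, 2048], [4096, 1024], 10**6, 4))  # 12288
```

With `num_gpu_blocks_override=1` the same inputs raise a `ValueError`, matching the PR's goal of failing early during initialization instead of silently starting an engine that cannot schedule requests.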