
Conversation

khaled-wsa

Summary

  • Fixes bug where check_enough_kv_cache_memory ignored num_gpu_blocks_override, allowing engine initialization with an insufficient number of KV blocks for max_model_len.
  • Adds a unit test to ensure that when num_gpu_blocks_override is too small (e.g., 1), initialization raises a clear error even if raw available_memory is large.

Context

Technical Details

  • In check_enough_kv_cache_memory:
    • Validate the override against per-layer requirements: compute ceil(spec.max_memory_usage_bytes / spec.page_size_bytes) for each layer and ensure num_gpu_blocks_override >= max(required_blocks_per_layer). This closes the hole for heterogeneous specs (e.g., cross-attn vs self-attn); see the sketch after this list.
    • Cap raw available_memory by sum(page_size_bytes) * num_gpu_blocks_override to form effective_available_memory.
    • Compare needed_memory to effective_available_memory, and pass the effective capacity to estimate_max_model_len for accurate guidance.
    • Improve the error message to explicitly mention when the override constrains effective capacity.
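
A minimal sketch of the adjusted check described above, assuming each layer spec exposes max_memory_usage_bytes and page_size_bytes; the helper name and error text here are illustrative, not the exact code in the PR:

import math

def effective_capacity_sketch(kv_cache_spec: dict, available_memory: int,
                              num_gpu_blocks_override: int) -> int:
    # Worst-case number of blocks any single layer needs to cover max_model_len.
    required_blocks = max(
        math.ceil(spec.max_memory_usage_bytes / spec.page_size_bytes)
        for spec in kv_cache_spec.values())
    if num_gpu_blocks_override < required_blocks:
        raise ValueError(
            f"num_gpu_blocks_override={num_gpu_blocks_override} is below the "
            f"{required_blocks} blocks required by the largest layer.")
    # One block per layer costs sum(page_size_bytes) bytes, so the override
    # also caps how much of the raw free memory can actually be used.
    bytes_per_block = sum(spec.page_size_bytes for spec in kv_cache_spec.values())
    return min(available_memory, bytes_per_block * num_gpu_blocks_override)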

Files Changed

  • vllm/v1/core/kv_cache_utils.py
    • Enforce per-layer minimum blocks for num_gpu_blocks_override and apply memory cap. Adjust error message.
  • tests/v1/core/test_kv_cache_utils.py
    • Add test_check_enough_kv_cache_memory_respects_num_gpu_blocks_override.
    • Add test_override_must_cover_worst_layer_blocks_in_heterogeneous_model to cover cross-attn vs self-attn scenario.

How To Test

  • Unit test (CPU; see the test sketch after this list):
    • pytest -q tests/v1/core/test_kv_cache_utils.py::test_check_enough_kv_cache_memory_respects_num_gpu_blocks_override
  • Manual sanity:
    • Start the server with a small model and --num_gpu_blocks_override=1.
    • Expect initialization to fail with a ValueError that mentions the effective capacity and the override value.
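
For reference, the shape of the assertion in the new unit test is roughly as follows. This is only a sketch: the fixture names used to build the config and spec are hypothetical, and the real test in tests/v1/core/test_kv_cache_utils.py constructs them with vLLM's own helpers.

import pytest
from vllm.v1.core.kv_cache_utils import check_enough_kv_cache_memory

def test_override_too_small_raises(vllm_config_with_override, kv_cache_spec):
    # Hypothetical fixture: a config with num_gpu_blocks_override=1.
    # Even with a large raw available_memory (last argument, in bytes),
    # the pre-initialization check should fail fast.
    with pytest.raises(ValueError):
        check_enough_kv_cache_memory(vllm_config_with_override, kv_cache_spec,
                                     64 * 1024**3)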

Notes

  • The change is localized and only alters the pre-initialization capacity check; runtime behavior is unchanged.
  • Works for both uniform and hybrid KV cache specs since the per-block total uses each layer's page_size_bytes.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which covers a small but essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the v1 label Oct 21, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a bug where num_gpu_blocks_override was ignored during memory checks, which could lead to engine initialization with insufficient KV blocks. The changes in check_enough_kv_cache_memory are logical and now correctly validate the override value and use it to cap the effective available memory. The new unit tests are comprehensive, covering both the basic case and a more complex heterogeneous model scenario, ensuring the fix is robust.

I have one high-severity comment regarding a latent bug due to a function call with side effects within the modified code block. While it doesn't cause an issue in the current execution path, it's a potential source of future bugs and should be addressed for better code maintainability and correctness.

Comment on lines 699 to 705
  estimated_max_len = estimate_max_model_len(
-     vllm_config, kv_cache_spec, available_memory
+     vllm_config, kv_cache_spec, effective_available_memory
  )


Severity: high

The function estimate_max_model_len modifies the vllm_config.model_config.max_model_len attribute as a side effect of its binary search implementation. This does not currently cause a bug because this code path always raises an exception afterward, but it is a latent bug that could surface if the function is ever called on a path that does not end in an exception.

A function should not have hidden side effects on its arguments. It would be best to refactor estimate_max_model_len to not modify vllm_config, for example by restoring the original value before returning or by working on a copy.

Since the definition of estimate_max_model_len is not in this diff, I'm pointing this out here at the call site. A fix could look like this inside estimate_max_model_len:

def estimate_max_model_len(...):
    original_max_len = vllm_config.model_config.max_model_len
    try:
        # ... existing logic ...
        return result
    finally:
        vllm_config.model_config.max_model_len = original_max_len

@khaled-wsa khaled-wsa force-pushed the fix/kv-cache-check-override branch 3 times, most recently from 53e6b20 to 2aad22e on October 21, 2025 at 02:41
@elaineyz
Contributor

Hi @khaled-wsa, please see discussion under PR #26939.

In addition to the num_gpu_blocks_override param, the initialization of a null_block may also reduce the total available memory. Could you factor that part into this PR as well? It will make check_enough_kv_cache_memory more robust.
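
If the null block is modeled as permanently consuming one block from the pool, the cap sketched earlier could be adjusted along these lines. This is only a rough illustration of the suggestion, not an agreed design; how vLLM actually charges the null block may differ.

# Reserve one block for the null block before applying the override cap.
usable_blocks = max(num_gpu_blocks_override - 1, 0)
bytes_per_block = sum(spec.page_size_bytes for spec in kv_cache_spec.values())
effective_available_memory = min(available_memory, bytes_per_block * usable_blocks)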

