[BUGFIX] Adjust kv block sizes #27704
base: main
Conversation
Code Review
This pull request addresses two separate issues with KV block sizes. The first change, removing block size 16 for the FlashInfer backend, is a valid workaround for an upstream bug and is well-commented. However, the second change, which rounds up the attention block size for Mamba hybrid models to the next power of two, introduces a critical issue. The resulting block size is not guaranteed to be supported by the attention backend (e.g., FlashInfer only supports [32, 64]). This will cause the block size to be overridden to a smaller, default value, which in turn violates Mamba's memory requirements and will likely lead to runtime failures or memory corruption. I have left a critical comment explaining this flaw in detail.
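To make the concern above concrete, here is a small, self-contained sketch (not the PR's code; the numbers and the supported-size list are assumptions) of how rounding up to the next power of two can land on a block size the backend does not support, forcing a smaller fallback that then violates the Mamba requirement.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n."""
    return 1 << (n - 1).bit_length()

# Illustrative numbers: suppose the Mamba state requires an attention
# block size of at least 80 tokens to match its page size.
mamba_min_block = 80
rounded = next_power_of_two(mamba_min_block)   # 128

# Hypothetical backend capability (FlashInfer on Blackwell in this PR
# only advertises [32, 64] after dropping 16).
supported = [32, 64]

if rounded not in supported:
    # The config layer would then fall back to a smaller supported size,
    # e.g. 64, which no longer satisfies the Mamba requirement (>= 80).
    fallback = max(s for s in supported if s <= rounded)
    print(f"rounded={rounded} unsupported; fallback={fallback} < {mamba_min_block}")
```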
💡 Codex Review
Here are some automated review suggestions for this pull request.
@vadiklyutiy Could we test a Mamba model and Qwen3-Next here with mini tests to see that both paths work?
I tested Qwen3-Next. The accuracy bug with TRT-LLM Gen attn is "fixed".
pavanimajety left a comment
LGTM, please add the evals.
I'm thinking of only hardcoding one block_size. #27843
Force-pushed c397cfb to 1ca933c
…attn backend instead of rounding to next power of 2
Signed-off-by: Vadim Gimpelson <[email protected]>
"support of page size 16" seems more hardcoded than expected.
Let me think how we can better fix it.
# but on Blackwell, only support a page size of
# 16, 32, 64
return [16, 32, 64]
# TODO: 16 is temporarily removed because the TRT-LLM kernel has a bug when using 16.
If the problem only exists in trtllm, what about only removing 16 for trtllm, like:
if current_platform.is_device_capability(100):
    return [32, 64]
else:
    return [16, 32, 64]
Agreed, we should only override on Blackwell
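A sketch of what "only override on Blackwell" could look like inside the backend's capability method. The class name is hypothetical, the import path is an assumption, and `current_platform.is_device_capability(100)` is taken from the suggestion above; this is not the PR's actual code.

```python
from vllm.platforms import current_platform  # assumed import path

class FlashInferBackendSketch:
    @classmethod
    def get_supported_kernel_block_size(cls) -> list[int]:
        if current_platform.is_device_capability(100):
            # Blackwell: drop 16 until the TRT-LLM Gen page_size=16 bug
            # (flashinfer-ai/flashinfer#1993) is fixed upstream.
            return [32, 64]
        # Other architectures keep the full set.
        return [16, 32, 64]
```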
vllm/model_executor/layers/config.py (outdated)
# For now, enable it for FlashInfer only.
# Other backends need debugging.
# TODO: enable it for all backends.
if backend_cls.get_name() != "FLASHINFER":
Hope this could be done in this PR; I think this is safe to enable for all backends?
There are a lot of failures in CI. Debugging.
+1 hope this PR can enable all backends.
Force-pushed c8e821d to 44bc26b
I was wondering if there is a way to achieve a reasonable abstraction here, since with this approach it is easy to miss a particular backend's modification. The block-size capability of each backend on different hardware would be implemented by the maintainer of that backend, and then in the config we call the backend class and obtain the corresponding supported block sizes.
The attn backend has get_supported_kernel_block_size()
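A rough sketch of the abstraction being discussed, with assumed names (the base class and helper here are illustrative, not the exact vLLM API): each backend reports its own supported kernel block sizes, and the config layer only queries that method instead of hard-coding per-backend values.

```python
from math import lcm

class AttentionBackendSketch:
    """Each backend maintainer implements its own capability query."""

    @classmethod
    def get_supported_kernel_block_size(cls) -> list[int]:
        raise NotImplementedError

def resolve_block_size(backend_cls: type, requested: int | None) -> int:
    """Config-side logic: ask the backend rather than hard-coding sizes."""
    supported = backend_cls.get_supported_kernel_block_size()
    min_size = min(supported)
    if requested is None:
        return min_size
    # Align a user-provided size so it stays divisible by a supported size.
    return lcm(requested, min_size)
```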
if cache_config.block_size is None:
    new_block_size = min_size
else:
    new_block_size = lcm(cache_config.block_size, min_size)
Prefer to raise an error if the user sets a block_size that is not supported by the selected attention backend.
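A minimal sketch of the stricter behavior suggested here (function and parameter names are assumptions): fail fast when the user explicitly sets a block_size that cannot satisfy the backend's kernel block sizes, instead of silently rewriting it.

```python
def validate_block_size(user_block_size: int | None,
                        supported_kernel_sizes: list[int]) -> None:
    """Raise instead of silently adjusting an explicit user setting (sketch)."""
    if user_block_size is None:
        return  # nothing to validate; a default will be picked elsewhere
    if not any(user_block_size % k == 0 for k in supported_kernel_sizes):
        raise ValueError(
            f"block_size={user_block_size} is incompatible with the attention "
            f"backend's supported kernel block sizes {supported_kernel_sizes}; "
            f"use a multiple of one of them."
        )
```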
else:
    new_block_size = lcm(cache_config.block_size, min_size)

if cache_config.block_size is None or new_block_size != cache_config.block_size:
I think we don't need to add info-level logging if block_size is None and is initialized normally.
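For context, a small worked example of the lcm-based adjustment in the quoted diff, with assumed numbers: the KV-manager block size must stay divisible by a kernel block size the backend supports.

```python
from math import lcm

block_size = 48   # assumed user-set cache_config.block_size
min_size = 32     # assumed smallest kernel block size the backend supports

# Mirrors the quoted diff's logic.
if block_size is None:
    new_block_size = min_size
else:
    new_block_size = lcm(block_size, min_size)

print(new_block_size)  # 96: divisible by both the user's 48 and the kernel's 32
```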
).page_size_bytes
else:
    kernel_block_alignment_size = 16
if cache_config.block_size is not None:
Is this part called before or after AttentionConfig.verify_and_update_config(self)?
I think kernel_block_alignment_size should be resolved from backend_cls.get_supported_kernel_block_size.
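A hedged sketch of what the reviewer proposes, with assumed names: derive kernel_block_alignment_size from the backend's reported sizes rather than hard-coding 16, falling back to the old default only if the backend exposes nothing.

```python
def resolve_kernel_block_alignment(backend_cls: type, default: int = 16) -> int:
    """Use the smallest kernel block size the backend reports; keep the old
    hard-coded default only when the backend exposes no such method (sketch)."""
    getter = getattr(backend_cls, "get_supported_kernel_block_size", None)
    if getter is None:
        return default
    return min(getter())

# Hypothetical usage with a dummy backend class.
class DummyBackend:
    @classmethod
    def get_supported_kernel_block_size(cls) -> list[int]:
        return [32, 64]

print(resolve_kernel_block_alignment(DummyBackend))  # 32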
closed by mistake
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed abdd895 to 6e4d374
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
1. Adjust kv_manager_block_size according to the minimum supported kernel_block_sizes of the attn backend.
2. Temporarily remove 16 from supported_kernel_block_size for FlashInfer, because TRT-LLM Gen attn has a bug for page_size=16 ([BUG] TRT-LLM Gen full attn. Incorrect result for head_dim=256, flashinfer-ai/flashinfer#1993).
Details
If kv_manager_block_size is None we set the default value 16. Before I tried to remove 16 (due to point 2 above), every backend supported a kernel_block_size of 16 and everything worked fine. But with 16 (and smaller) removed, there is a failure because kv_manager_block_size % kernel_block_size should be 0. I fixed it by adjusting kv_manager_block_size with get_supported_kernel_block_size().
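To make the divisibility requirement concrete, a small illustration with assumed numbers (not the PR's code): once 16 is dropped, the old default KV-manager block size is no longer a multiple of any supported kernel block size, so it has to be adjusted.

```python
# The invariant from the description: kv_manager_block_size % kernel_block_size == 0.
supported_kernel_block_sizes = [32, 64]   # assumed: FlashInfer on Blackwell after this PR
old_default = 16                          # previous default kv_manager_block_size

print(any(old_default % k == 0 for k in supported_kernel_block_sizes))   # False

# Adjusting the default to the minimum supported size restores the invariant.
new_default = min(supported_kernel_block_sizes)                          # 32
print(any(new_default % k == 0 for k in supported_kernel_block_sizes))   # True
```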