
Conversation

@vadiklyutiy
Collaborator

@vadiklyutiy vadiklyutiy commented Oct 29, 2025

Purpose

  1. Adjust kv_manager_block_size according to the minimum kernel_block_size supported by the attn backend

  2. Temporarily remove 16 from supported_kernel_block_size for FlashInfer, because TRT-LLM Gen attn has a bug for page_size=16 ([BUG] TRT-LLM Gen full attn. Incorrect result for head_dim=256 flashinfer-ai/flashinfer#1993)

Details

If kv_manager_block_size is None, we set the default value of 16. Before I tried to remove 16 (per point 2 above), every backend supported a kernel_block_size of 16, so everything worked fine. But with 16 (and smaller sizes) removed, this fails because kv_manager_block_size % kernel_block_size must be 0. I fixed it by adjusting kv_manager_block_size with get_supported_kernel_block_size().
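
A minimal sketch of that adjustment, assuming the backend exposes get_supported_kernel_block_size() returning a plain list of ints (hypothetical helper name; the actual PR code may differ):

from math import lcm

def adjust_kv_manager_block_size(requested_block_size, supported_kernel_block_sizes):
    # kv_manager_block_size must be a multiple of the kernel block size the
    # attention backend will actually use, so align it to the smallest
    # supported kernel block size.
    min_supported = min(supported_kernel_block_sizes)
    if requested_block_size is None:
        # No explicit setting: fall back to the smallest supported size.
        return min_supported
    # Round the requested size up to a common multiple so that
    # kv_manager_block_size % kernel_block_size == 0 still holds.
    return lcm(requested_block_size, min_supported)

# Example: with 16 removed, FlashInfer reports [32, 64], so the old default of 16 breaks.
adjust_kv_manager_block_size(None, [32, 64])   # -> 32
adjust_kv_manager_block_size(48, [32, 64])     # -> 96 (divisible by 32)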

@mergify mergify bot added the v1 label Oct 29, 2025
@vadiklyutiy
Collaborator Author

CC @zhiyuan1i @heheda12345

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses two separate issues with KV block sizes. The first change, removing block size 16 for the FlashInfer backend, is a valid workaround for an upstream bug and is well-commented. However, the second change, which rounds up the attention block size for Mamba hybrid models to the next power of two, introduces a critical issue. The resulting block size is not guaranteed to be supported by the attention backend (e.g., FlashInfer only supports [32, 64]). This will cause the block size to be overridden to a smaller, default value, which in turn violates Mamba's memory requirements and will likely lead to runtime failures or memory corruption. I have left a critical comment explaining this flaw in detail.
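
To illustrate that concern with made-up numbers (a hypothetical illustration, not the PR's code): rounding to the next power of two does not guarantee the result is in the backend's supported list.

def next_power_of_2(n: int) -> int:
    # Smallest power of two that is >= n.
    return 1 << (n - 1).bit_length()

supported = [32, 64]        # e.g. what FlashInfer reports after removing 16
mamba_block_size = 96       # hypothetical block size required by the Mamba layers

rounded = next_power_of_2(mamba_block_size)   # 128
rounded in supported                          # False -> overridden to a smaller default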

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@pavanimajety
Collaborator

@vadiklyutiy Could we test a mamba model and qwen3 next here with mini tests to see that both paths work?

@vadiklyutiy
Collaborator Author

vadiklyutiy commented Oct 30, 2025

@vadiklyutiy Could we test a mamba model and qwen3 next here with mini tests to see that both paths work?

I tested Qwen3-Next. The accuracy bug with TRT-LLM Gen attn is "fixed".

Collaborator

@pavanimajety pavanimajety left a comment

LGTM, please add the evals.

@pavanimajety pavanimajety added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 30, 2025
@heheda12345
Collaborator

I'm thinking of only hardcoding one block_size. #27843

@vadiklyutiy vadiklyutiy force-pushed the vadim/adj-kv-block-sizes branch from c397cfb to 1ca933c on October 31, 2025 01:17
vadiklyutiy added a commit to CentML/vllm that referenced this pull request Oct 31, 2025
…attn backend instead of rounding to next power of 2

Signed-off-by: Vadim Gimpelson <[email protected]>
@vadiklyutiy
Collaborator Author

"support of page size 16" seems as more hardcoded than expected.
I think it is worth to additionally review the new changes (after I fix CI)

@vadiklyutiy
Collaborator Author

"support of page size 16" seems as more hardcoded than expected. I think it is worth to additionally review the new changes (after I fix CI)

Let's discuss how we can better fix it:

# but on Blackwell, only support a page size of
# 16, 32, 64
return [16, 32, 64]
# TODO: 16 is temporarily removed because the TRT-LLM kernel has a bug when using 16.
Collaborator

If the problem only exists in trtllm, what about only removing 16 for trtllm, like:

if current_platform.is_device_capability(100):
    return [32, 64]
else:
    return [16, 32, 64]

Member

Agreed, we should only override on Blackwell

# For now enable it for FlashInfer only.
# Other backends need debugging.
# TODO: enable it for all backends.
if backend_cls.get_name() != "FLASHINFER":
Contributor

Hope this could be done in this PR; I think this is safe to enable for all backends?

Collaborator Author

A lot of failures in CI. Debugging.

Collaborator

+1 hope this PR can enable all backends.

@zhiyuan1i
Contributor

I was wondering if there is a way to achieve a reasonable abstraction, since with this approach it is easy to miss a particular backend when making changes. The block size capability of each backend on different hardware is implemented by the maintainer of that backend; then in the config we call the backend class and obtain the corresponding supported block sizes.

@mergify mergify bot added rocm Related to AMD ROCm tpu Related to Google TPUs labels Nov 4, 2025
@vadiklyutiy
Collaborator Author

I was wondering if there is a way to achieve a reasonable abstraction, since with this approach it is easy to miss a particular backend when making changes. The block size capability of each backend on different hardware is implemented by the maintainer of that backend; then in the config we call the backend class and obtain the corresponding supported block sizes.

The attn backend has get_supported_kernel_block_size(), which provides the supported sizes. Here I try to use it to choose the right kv manager block size and put this adjustment call on the main path.
So, I didn't fully understand your question...

if cache_config.block_size is None:
    new_block_size = min_size
else:
    new_block_size = lcm(cache_config.block_size, min_size)
Collaborator

Prefer to raise an error if the user sets a block_size but the block_size is not supported by the attention backend it selects.
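
Something like this, reusing the names from the snippet above (just a sketch of the suggested behavior):

if cache_config.block_size is None:
    new_block_size = min_size
elif cache_config.block_size % min_size != 0:
    # The user explicitly requested a block size the selected attention
    # backend cannot serve; fail loudly instead of silently changing it.
    raise ValueError(
        f"block_size={cache_config.block_size} is not supported by the selected "
        f"attention backend; it must be a multiple of {min_size}."
    )
else:
    new_block_size = cache_config.block_size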

else:
    new_block_size = lcm(cache_config.block_size, min_size)

if cache_config.block_size is None or new_block_size != cache_config.block_size:
Collaborator

I think we don't need to add info-level logging if block_size is None and is initialized normally.
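
i.e. roughly (a sketch, assuming the names from the snippet above and an existing module-level logger):

if cache_config.block_size is None:
    # Normal initialization; no need to notify the user.
    cache_config.block_size = new_block_size
elif new_block_size != cache_config.block_size:
    logger.info(
        "Adjusting block_size from %d to %d to match the attention backend's "
        "supported kernel block sizes.",
        cache_config.block_size,
        new_block_size,
    )
    cache_config.block_size = new_block_size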

).page_size_bytes
else:
    kernel_block_alignment_size = 16
if cache_config.block_size is not None:
Collaborator

is this part called before or after AttentionConfig.verify_and_update_config(self)?

I think the kernel_block_alignment_size should be resolved from backend_cls.get_supported_kernel_block_size
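
e.g. (a sketch, assuming backend_cls is already resolved at this point and the method returns a plain list of ints):

else:
    # Use the smallest kernel block size the backend actually supports
    # instead of hardcoding 16.
    kernel_block_alignment_size = min(backend_cls.get_supported_kernel_block_size())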

@vadiklyutiy vadiklyutiy closed this Nov 5, 2025
@vadiklyutiy vadiklyutiy deleted the vadim/adj-kv-block-sizes branch November 5, 2025 00:41
@vadiklyutiy vadiklyutiy restored the vadim/adj-kv-block-sizes branch November 5, 2025 00:41
@vadiklyutiy
Collaborator Author

closed by mistake

@vadiklyutiy vadiklyutiy reopened this Nov 5, 2025
@mergify

mergify bot commented Nov 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vadiklyutiy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Vadim Gimpelson <[email protected]>
Signed-off-by: Vadim Gimpelson <[email protected]>
Signed-off-by: Vadim Gimpelson <[email protected]>
@vadiklyutiy vadiklyutiy force-pushed the vadim/adj-kv-block-sizes branch from abdd895 to 6e4d374 on November 5, 2025 21:07
@heheda12345 heheda12345 removed this from the v0.11.1 milestone Nov 9, 2025
@mergify mergify bot added the nvidia label Nov 11, 2025
@mergify

mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vadiklyutiy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 11, 2025