[BUGFIX] Adjust kv block sizes #27704
base: main
Conversation
Code Review
This pull request addresses two separate issues with KV block sizes. The first change, removing block size 16 for the FlashInfer backend, is a valid workaround for an upstream bug and is well-commented. However, the second change, which rounds up the attention block size for Mamba hybrid models to the next power of two, introduces a critical issue. The resulting block size is not guaranteed to be supported by the attention backend (e.g., FlashInfer only supports [32, 64]). This will cause the block size to be overridden to a smaller, default value, which in turn violates Mamba's memory requirements and will likely lead to runtime failures or memory corruption. I have left a critical comment explaining this flaw in detail.
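To make the concern above concrete, here is a small, self-contained sketch (not the PR's code; the numbers and the supported-size list are assumptions) of how rounding up to the next power of two can land on a block size the backend does not support, forcing a smaller fallback that then violates the Mamba requirement.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n."""
    return 1 << (n - 1).bit_length()

# Illustrative numbers: suppose the Mamba state requires an attention
# block size of at least 80 tokens to match its page size.
mamba_min_block = 80
rounded = next_power_of_two(mamba_min_block)   # 128

# Hypothetical backend capability (FlashInfer on Blackwell in this PR
# only advertises [32, 64] after dropping 16).
supported = [32, 64]

if rounded not in supported:
    # The config layer would then fall back to a smaller supported size,
    # e.g. 64, which no longer satisfies the Mamba requirement (>= 80).
    fallback = max(s for s in supported if s <= rounded)
    print(f"rounded={rounded} unsupported; fallback={fallback} < {mamba_min_block}")
```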
💡 Codex Review
Here are some automated review suggestions for this pull request.
@vadiklyutiy Could we test a Mamba model and Qwen3-Next here with mini tests to see that both paths work?
I tested Qwen3-Next. The accuracy bug with TRT-LLM Gen attn is "fixed".
pavanimajety left a comment
LGTM, please add the evals.
I'm thinking of only hardcoding one block_size. #27843
Force-pushed c397cfb to 1ca933c
…attn backend instead of rounding to next power of 2
Signed-off-by: Vadim Gimpelson <[email protected]>
"support of page size 16" seems more hardcoded than expected.
Let me think how we can better fix it.
# but on Blackwell, only support a page size of
# 16, 32, 64
return [16, 32, 64]
# TODO: 16 is temporarily removed because the TRT-LLM kernel has a bug when using 16.
If the problem only exists in trtllm, what about only removing 16 for trtllm, like:
if current_platform.is_device_capability(100):
    return [32, 64]
else:
    return [16, 32, 64]
Agreed, we should only override on Blackwell
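A sketch of what "only override on Blackwell" could look like inside the backend's capability method. The class name is hypothetical, the import path is an assumption, and `current_platform.is_device_capability(100)` is taken from the suggestion above; this is not the PR's actual code.

```python
from vllm.platforms import current_platform  # assumed import path

class FlashInferBackendSketch:
    @classmethod
    def get_supported_kernel_block_size(cls) -> list[int]:
        if current_platform.is_device_capability(100):
            # Blackwell: drop 16 until the TRT-LLM Gen page_size=16 bug
            # (flashinfer-ai/flashinfer#1993) is fixed upstream.
            return [32, 64]
        # Other architectures keep the full set.
        return [16, 32, 64]
```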
vllm/model_executor/layers/config.py (outdated)
# For now, enable it for FlashInfer only.
# Other backends need debugging.
# TODO: enable it for all backends.
if backend_cls.get_name() != "FLASHINFER":
Hope this could be done in this PR; I think this is safe to enable for all backends?
There are a lot of failures in CI. Debugging.
+1 hope this PR can enable all backends.
Force-pushed c8e821d to 44bc26b
I was wondering if there is a way to achieve a reasonable abstraction here, since with this approach it is easy to miss a particular backend's modification. The block-size capability of each backend on different hardware would be implemented by the maintainer of that backend, and then in the config we call the backend class and obtain the corresponding supported block sizes.
The attn backend has get_supported_kernel_block_size()
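A rough sketch of the abstraction being discussed, with assumed names (the base class and helper here are illustrative, not the exact vLLM API): each backend reports its own supported kernel block sizes, and the config layer only queries that method instead of hard-coding per-backend values.

```python
from math import lcm

class AttentionBackendSketch:
    """Each backend maintainer implements its own capability query."""

    @classmethod
    def get_supported_kernel_block_size(cls) -> list[int]:
        raise NotImplementedError

def resolve_block_size(backend_cls: type, requested: int | None) -> int:
    """Config-side logic: ask the backend rather than hard-coding sizes."""
    supported = backend_cls.get_supported_kernel_block_size()
    min_size = min(supported)
    if requested is None:
        return min_size
    # Align a user-provided size so it stays divisible by a supported size.
    return lcm(requested, min_size)
```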
if cache_config.block_size is None:
    new_block_size = min_size
else:
    new_block_size = lcm(cache_config.block_size, min_size)
Prefer to raise an error if the user sets a block_size that is not supported by the selected attention backend.
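A minimal sketch of the stricter behavior suggested here (function and parameter names are assumptions): fail fast when the user explicitly sets a block_size that cannot satisfy the backend's kernel block sizes, instead of silently rewriting it.

```python
def validate_block_size(user_block_size: int | None,
                        supported_kernel_sizes: list[int]) -> None:
    """Raise instead of silently adjusting an explicit user setting (sketch)."""
    if user_block_size is None:
        return  # nothing to validate; a default will be picked elsewhere
    if not any(user_block_size % k == 0 for k in supported_kernel_sizes):
        raise ValueError(
            f"block_size={user_block_size} is incompatible with the attention "
            f"backend's supported kernel block sizes {supported_kernel_sizes}; "
            f"use a multiple of one of them."
        )
```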
else:
    new_block_size = lcm(cache_config.block_size, min_size)

if cache_config.block_size is None or new_block_size != cache_config.block_size:
I think we don't need to add info-level logging if block_size is None and is initialized normally.
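For context, a small worked example of the lcm-based adjustment in the quoted diff, with assumed numbers: the KV-manager block size must stay divisible by a kernel block size the backend supports.

```python
from math import lcm

block_size = 48   # assumed user-set cache_config.block_size
min_size = 32     # assumed smallest kernel block size the backend supports

# Mirrors the quoted diff's logic.
if block_size is None:
    new_block_size = min_size
else:
    new_block_size = lcm(block_size, min_size)

print(new_block_size)  # 96: divisible by both the user's 48 and the kernel's 32
```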
).page_size_bytes
else:
    kernel_block_alignment_size = 16
if cache_config.block_size is not None:
Is this part called before or after AttentionConfig.verify_and_update_config(self)?
I think kernel_block_alignment_size should be resolved from backend_cls.get_supported_kernel_block_size.
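A hedged sketch of what the reviewer proposes, with assumed names: derive kernel_block_alignment_size from the backend's reported sizes rather than hard-coding 16, falling back to the old default only if the backend exposes nothing.

```python
def resolve_kernel_block_alignment(backend_cls: type, default: int = 16) -> int:
    """Use the smallest kernel block size the backend reports; keep the old
    hard-coded default only when the backend exposes no such method (sketch)."""
    getter = getattr(backend_cls, "get_supported_kernel_block_size", None)
    if getter is None:
        return default
    return min(getter())

# Hypothetical usage with a dummy backend class.
class DummyBackend:
    @classmethod
    def get_supported_kernel_block_size(cls) -> list[int]:
        return [32, 64]

print(resolve_kernel_block_alignment(DummyBackend))  # 32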
closed by mistake
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed abdd895 to 6e4d374
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
1. Adjust kv_manager_block_size according to the minimum supported kernel_block_sizes of the attn backend.
2. Temporarily remove 16 from supported_kernel_block_size for FlashInfer, because TRT-LLM Gen attn has a bug for page_size=16 ([BUG] TRT-LLM Gen full attn. Incorrect result for head_dim=256, flashinfer-ai/flashinfer#1993).
Details
If kv_manager_block_size is None we set the default value 16. Before I tried to remove 16 (due to point 2 above), every backend supported a kernel_block_size of 16 and everything worked fine. But with 16 (and smaller) removed, there is a failure because kv_manager_block_size % kernel_block_size should be 0. I fixed it by adjusting kv_manager_block_size with get_supported_kernel_block_size().
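To make the divisibility requirement concrete, a small illustration with assumed numbers (not the PR's code): once 16 is dropped, the old default KV-manager block size is no longer a multiple of any supported kernel block size, so it has to be adjusted.

```python
# The invariant from the description: kv_manager_block_size % kernel_block_size == 0.
supported_kernel_block_sizes = [32, 64]   # assumed: FlashInfer on Blackwell after this PR
old_default = 16                          # previous default kv_manager_block_size

print(any(old_default % k == 0 for k in supported_kernel_block_sizes))   # False

# Adjusting the default to the minimum supported size restores the invariant.
new_default = min(supported_kernel_block_sizes)                          # 32
print(any(new_default % k == 0 for k in supported_kernel_block_sizes))   # True
```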