[WIP] Tmp fix for IMA with MTP = 2 and full-cg #28315
base: main
Conversation
Temp fix for vllm-project#28207

Signed-off-by: Lucas Wilkinson <[email protected]>
Code Review
This pull request introduces a temporary fix for an issue with multi-token prediction and full CUDA graphs by adjusting CUDA graph capture sizes. The core logic change is in a new method, `adjust_cudagraph_sizes_to_be_multipe_of`, which unfortunately contains a critical bug that can lead to runtime errors and incorrect behavior. I've provided a detailed review comment with a suggested fix for this issue.
`vllm/config/vllm.py` (Outdated)
```python
def adjust_cudagraph_sizes_to_be_multipe_of(self, multiple_of: int):
    new_sizes = sorted(
        [
            round_up(size, multiple_of)
            for size in self.compilation_config.cudagraph_capture_sizes
        ]
    )
    if new_sizes[-1] > self.compilation_config.max_cudagraph_capture_size:
        new_sizes = new_sizes[:-1]
    self.compilation_config.max_cudagraph_capture_size = new_sizes[-1]
    self.compilation_config.cudagraph_capture_sizes = new_sizes
```
The current implementation of `adjust_cudagraph_sizes_to_be_multipe_of` has several critical issues that can lead to incorrect behavior or runtime errors:

- **Potential `IndexError`:** If all `cudagraph_capture_sizes`, when rounded up, exceed `max_cudagraph_capture_size`, the `new_sizes` list can become empty after the `if` condition, leading to an `IndexError` on `new_sizes[-1]`. For example, if `cudagraph_capture_sizes` is `[16]`, `max_cudagraph_capture_size` is `16`, and `multiple_of` is `20`, then `new_sizes` becomes `[20]`. The `if` condition is met, `new_sizes` is modified to `[]`, and the next line crashes (see the standalone reproduction after this list).
- **Incorrect filtering:** The logic `if new_sizes[-1] > ...: new_sizes = new_sizes[:-1]` only checks and removes the largest element. If multiple rounded-up sizes exceed `max_cudagraph_capture_size`, the smaller ones will incorrectly remain in the list.
- **Incorrect `max_cudagraph_capture_size` update:** `max_cudagraph_capture_size` can be updated to a value larger than its original value, which contradicts its purpose as a hard limit derived from the scheduler and token configurations.
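For concreteness, the `IndexError` case from the first point can be reproduced with a standalone sketch of the current logic (assuming `round_up` is the usual `((x + m - 1) // m) * m` helper):

```python
def round_up(x: int, m: int) -> int:
    return ((x + m - 1) // m) * m

# Standalone copy of the current logic in adjust_cudagraph_sizes_to_be_multipe_of
capture_sizes = [16]
max_capture_size = 16
multiple_of = 20

new_sizes = sorted(round_up(size, multiple_of) for size in capture_sizes)  # [20]
if new_sizes[-1] > max_capture_size:  # 20 > 16, so the only element is dropped
    new_sizes = new_sizes[:-1]        # now []
max_capture_size = new_sizes[-1]      # IndexError: list index out of range
```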
I suggest a more robust implementation that correctly filters the sizes and handles edge cases gracefully.
Additionally, there is a typo in the method name (`multipe_of` should be `multiple_of`). I've kept it in the suggestion to match the current code, but it should be corrected here and at the call site.
```python
def adjust_cudagraph_sizes_to_be_multipe_of(self, multiple_of: int):
    max_size = self.compilation_config.max_cudagraph_capture_size
    # Use a set to handle duplicates from rounding up
    rounded_sizes = {
        round_up(size, multiple_of)
        for size in self.compilation_config.cudagraph_capture_sizes
    }
    new_sizes = sorted([s for s in rounded_sizes if s <= max_size])
    if not new_sizes:
        # All rounded-up sizes exceeded the max size.
        # Disable cudagraphs by setting sizes to empty.
        self.compilation_config.max_cudagraph_capture_size = 0
        self.compilation_config.cudagraph_capture_sizes = []
        return
    self.compilation_config.max_cudagraph_capture_size = new_sizes[-1]
    self.compilation_config.cudagraph_capture_sizes = new_sizes
```

Signed-off-by: Lucas Wilkinson <[email protected]>
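As a quick sanity check of the suggested logic, here is a sketch that exercises the edge cases on plain values (again assuming the usual `round_up` helper; this is illustrative, not the PR's code):

```python
def round_up(x: int, m: int) -> int:
    return ((x + m - 1) // m) * m

def adjust(capture_sizes, max_size, multiple_of):
    # Mirrors the suggested fix, operating on plain values.
    rounded = {round_up(s, multiple_of) for s in capture_sizes}
    new_sizes = sorted(s for s in rounded if s <= max_size)
    if not new_sizes:
        # All rounded-up sizes exceeded the max: cudagraphs disabled.
        return [], 0
    return new_sizes, new_sizes[-1]

print(adjust([16], 16, 20))         # ([], 0) instead of an IndexError
print(adjust([8, 16, 32], 32, 16))  # ([16, 32], 32): duplicates collapsed
print(adjust([8, 16], 16, 20))      # ([], 0): no oversized sizes remain
```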
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
def compute_bs_to_padded_graph_size(self):
    # pre-compute the mapping from batch size to padded graph size
    self.bs_to_padded_graph_size = [
        0 for i in range(self.max_cudagraph_capture_size + 1)
    ]
    for end, start in zip(
        self.cudagraph_capture_sizes + [self.max_cudagraph_capture_size + 1],
        [0] + self.cudagraph_capture_sizes,
    ):
        for bs in range(start, end):
            if bs == start:
                self.bs_to_padded_graph_size[bs] = start
            else:
                self.bs_to_padded_graph_size[bs] = end
```
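For illustration, here is what the mapping produces on a small example (a standalone sketch on plain values; it assumes `cudagraph_capture_sizes` is sorted ascending, which the zip pairing requires):

```python
capture_sizes = [1, 2, 4, 8]  # assumed sorted ascending
max_capture_size = capture_sizes[-1]

bs_to_padded = [0] * (max_capture_size + 1)
for end, start in zip(capture_sizes + [max_capture_size + 1], [0] + capture_sizes):
    for bs in range(start, end):
        # A batch size that exactly matches a capture size maps to itself;
        # anything in between is padded up to the next capture size.
        bs_to_padded[bs] = start if bs == start else end

print(bs_to_padded)  # [0, 1, 2, 4, 4, 8, 8, 8, 8]
```

Each batch size is padded up to the smallest capture size that can hold it.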
Initialize padding map during config construction
The mapping from batch size to padded cudagraph size is now built only via the new `compute_bs_to_padded_graph_size()` helper, but `post_init_cudagraph_sizes()` no longer invokes it. `VllmConfig.pad_for_cudagraph()` still accesses `compilation_config.bs_to_padded_graph_size` and can be called right after `EngineArgs.create_engine_config()` (e.g., `test_mamba_cache_cg_padding`) before any `GPUModelRunner` triggers the new computation, resulting in `TypeError: 'NoneType' object is not subscriptable`. The mapping should still be populated during configuration initialization, or lazily on first use, so existing callers do not crash when they query padding before the model runner is constructed.
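One way to satisfy this, following the "lazily on first use" suggestion above, would be a guard in `pad_for_cudagraph` (a hypothetical sketch, not code from this PR; the signature is assumed):

```python
def pad_for_cudagraph(self, batch_size: int) -> int:
    # Hypothetical lazy-init guard: build the map on first use if no
    # GPUModelRunner has triggered compute_bs_to_padded_graph_size() yet.
    if self.compilation_config.bs_to_padded_graph_size is None:
        self.compilation_config.compute_bs_to_padded_graph_size()
    return self.compilation_config.bs_to_padded_graph_size[batch_size]
```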
Code Review
This pull request refactors the computation of `bs_to_padded_graph_size` and introduces logic to adjust CUDA graph capture sizes. While the intent is to fix an issue with speculative decoding, the changes introduce two critical bugs. First, the refactoring of the `bs_to_padded_graph_size` computation breaks the model initialization order, as it is now computed after `profile_run`, which depends on it. Second, the new method to adjust capture sizes contains a typo and is vulnerable to an `IndexError` if it produces an empty list of sizes. I have provided detailed comments and suggestions to fix these critical issues.
Temp fix for #28207