
Conversation

@LucasWilkinson (Collaborator) commented Nov 7, 2025

Temp fix for #28207

@gemini-code-assist bot (Contributor) left a comment
Code Review

This pull request introduces a temporary fix for an issue with multi-token prediction and full CUDA graphs by adjusting CUDA graph capture sizes. The core logic change is in a new method, adjust_cudagraph_sizes_to_be_multipe_of, which unfortunately contains a critical bug that can lead to runtime errors and incorrect behavior. I've provided a detailed review comment with a suggested fix for this issue.

Comment on lines 212 to 222
    def adjust_cudagraph_sizes_to_be_multipe_of(self, multiple_of: int):
        new_sizes = sorted(
            [
                round_up(size, multiple_of)
                for size in self.compilation_config.cudagraph_capture_sizes
            ]
        )
        if new_sizes[-1] > self.compilation_config.max_cudagraph_capture_size:
            new_sizes = new_sizes[:-1]
            self.compilation_config.max_cudagraph_capture_size = new_sizes[-1]
        self.compilation_config.cudagraph_capture_sizes = new_sizes
critical

The current implementation of adjust_cudagraph_sizes_to_be_multipe_of has several critical issues that can lead to incorrect behavior or runtime errors:

  1. Potential IndexError: If all cudagraph_capture_sizes, when rounded up, exceed max_cudagraph_capture_size, the new_sizes list can become empty after the if condition, leading to an IndexError on new_sizes[-1]. For example, if cudagraph_capture_sizes is [16], max_cudagraph_capture_size is 16, and multiple_of is 20, new_sizes becomes [20]. The if condition is met, and new_sizes is modified to [], causing a crash on the next line.

  2. Incorrect Filtering: The logic if new_sizes[-1] > ...: new_sizes = new_sizes[:-1] only checks and removes the largest element. If multiple rounded-up sizes exceed max_cudagraph_capture_size, the smaller ones will incorrectly remain in the list.

  3. Incorrect max_cudagraph_capture_size update: The max_cudagraph_capture_size can be updated to a value larger than its original value, which seems to contradict its purpose as a hard limit derived from scheduler and token configurations.

I suggest a more robust implementation that correctly filters the sizes and handles edge cases gracefully.

Additionally, there is a typo in the method name (multipe_of should be multiple_of). I've kept it in the suggestion to match the current code, but it should be corrected here and at the call site.

    def adjust_cudagraph_sizes_to_be_multipe_of(self, multiple_of: int):
        max_size = self.compilation_config.max_cudagraph_capture_size
        # Use a set to handle duplicates from rounding up
        rounded_sizes = {
            round_up(size, multiple_of)
            for size in self.compilation_config.cudagraph_capture_sizes
        }
        new_sizes = sorted([s for s in rounded_sizes if s <= max_size])

        if not new_sizes:
            # All rounded-up sizes exceeded the max size.
            # Disable cudagraphs by setting sizes to empty.
            self.compilation_config.max_cudagraph_capture_size = 0
            self.compilation_config.cudagraph_capture_sizes = []
            return

        self.compilation_config.max_cudagraph_capture_size = new_sizes[-1]
        self.compilation_config.cudagraph_capture_sizes = new_sizes
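As a sanity check, the suggested logic can be exercised in isolation. The snippet below is a hypothetical standalone sketch: `FakeCompilationConfig` and the free-standing function are stand-ins for vLLM's actual config objects, and `round_up` mirrors the helper used in the PR.

```python
from dataclasses import dataclass


def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple of `multiple`."""
    return ((x + multiple - 1) // multiple) * multiple


@dataclass
class FakeCompilationConfig:
    # Hypothetical stand-in for the CompilationConfig fields used here.
    cudagraph_capture_sizes: list[int]
    max_cudagraph_capture_size: int


def adjust_cudagraph_sizes_to_be_multiple_of(
    cfg: FakeCompilationConfig, multiple_of: int
) -> None:
    max_size = cfg.max_cudagraph_capture_size
    # A set collapses duplicates produced by rounding up.
    rounded = {round_up(s, multiple_of) for s in cfg.cudagraph_capture_sizes}
    new_sizes = sorted(s for s in rounded if s <= max_size)
    if not new_sizes:
        # Every rounded size exceeded the cap: disable cudagraphs.
        cfg.max_cudagraph_capture_size = 0
        cfg.cudagraph_capture_sizes = []
        return
    cfg.max_cudagraph_capture_size = new_sizes[-1]
    cfg.cudagraph_capture_sizes = new_sizes


# Edge case from the review: [16] rounded up to a multiple of 20 exceeds the cap.
cfg = FakeCompilationConfig([16], 16)
adjust_cudagraph_sizes_to_be_multiple_of(cfg, 20)
print(cfg.cudagraph_capture_sizes, cfg.max_cudagraph_capture_size)  # [] 0

# Normal case: duplicates collapse and all kept sizes stay within the cap.
cfg = FakeCompilationConfig([1, 2, 4, 8, 16], 16)
adjust_cudagraph_sizes_to_be_multiple_of(cfg, 4)
print(cfg.cudagraph_capture_sizes, cfg.max_cudagraph_capture_size)  # [4, 8, 16] 16
```

With the original implementation, the first case would raise an IndexError instead of cleanly disabling cudagraphs.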

Signed-off-by: Lucas Wilkinson <[email protected]>
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 888 to 891
            enable_str,
            op,
        )

    def compute_bs_to_padded_graph_size(self):
        # pre-compute the mapping from batch size to padded graph size
        self.bs_to_padded_graph_size = [
            0 for i in range(self.max_cudagraph_capture_size + 1)
        ]
        for end, start in zip(
            self.cudagraph_capture_sizes + [self.max_cudagraph_capture_size + 1],
            [0] + self.cudagraph_capture_sizes,
        ):
            for bs in range(start, end):
                if bs == start:
                    self.bs_to_padded_graph_size[bs] = start
                else:
                    self.bs_to_padded_graph_size[bs] = end


P1: Initialize padding map during config construction

The mapping from batch size to padded cudagraph size is now built only via the new compute_bs_to_padded_graph_size() helper, but post_init_cudagraph_sizes() no longer invokes it. VllmConfig.pad_for_cudagraph() still accesses compilation_config.bs_to_padded_graph_size and can be called right after EngineArgs.create_engine_config() (e.g., test_mamba_cache_cg_padding) before any GPUModelRunner triggers the new computation, resulting in TypeError: 'NoneType' object is not subscriptable. The mapping should still be populated during configuration initialization or lazily on first use so existing callers do not crash when they query padding before the model runner is constructed.
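For reference, the table that compute_bs_to_padded_graph_size() builds maps each batch size to the smallest capture size that is greater than or equal to it. The sketch below is a hypothetical free-function version of the loop quoted above (the name `build_bs_to_padded_graph_size` and standalone signature are illustrative only); any caller such as pad_for_cudagraph() needs this table populated before its first lookup, which is the ordering issue described here.

```python
def build_bs_to_padded_graph_size(
    capture_sizes: list[int], max_size: int
) -> list[int]:
    """Map each batch size 0..max_size to the smallest capture size >= it.

    `capture_sizes` must be sorted ascending with its largest entry equal
    to `max_size`, mirroring the snippet quoted above.
    """
    table = [0] * (max_size + 1)
    # Walk consecutive (start, end) windows between capture sizes; the
    # window's lower bound pads to itself, everything above pads up to `end`.
    for end, start in zip(capture_sizes + [max_size + 1], [0] + capture_sizes):
        for bs in range(start, end):
            table[bs] = start if bs == start else end
    return table


table = build_bs_to_padded_graph_size([1, 2, 4, 8], 8)
print(table)  # [0, 1, 2, 4, 4, 8, 8, 8, 8]
```

For example, a batch of 3 pads up to the captured size 4, and a batch of 5 pads up to 8.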


@gemini-code-assist bot (Contributor) left a comment
Code Review

This pull request refactors the computation of bs_to_padded_graph_size and introduces logic to adjust CUDA graph capture sizes. While the intent is to fix an issue with speculative decoding, the changes introduce two critical bugs. First, the refactoring of bs_to_padded_graph_size computation breaks the model initialization order, as it's now computed after profile_run which depends on it. Second, the new method to adjust capture sizes contains a typo and is vulnerable to an IndexError if it results in an empty list of sizes. I have provided detailed comments and suggestions to fix these critical issues.

@gemini-code-assist

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.
