Skip to content

Conversation

LucasWilkinson
Copy link
Collaborator

@LucasWilkinson LucasWilkinson commented Oct 21, 2025

Ideally we'd do a multi-node test; but add a single node test for now to make sure we atleast get some coverage

Not sure exactly which test suite to put it in; uses DeepEP so needs to be run on hopper or Blackwell

Signed-off-by: Lucas Wilkinson <[email protected]>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new test for Dual Batch Overlap (DBO) with Data Parallelism and Expert Parallelism. The test is well-structured, using a GSM8K evaluation to verify correctness on a multi-GPU single-node setup. The CI configuration is also updated to run this test on H100 GPUs. My review found one high-severity issue related to missing test dependencies in the CI configuration, which could lead to the test not running when its helper utilities are modified. Otherwise, the changes are solid and a good addition to the test suite.

Comment on lines 1141 to 1146
source_file_dependencies:
- docker/Dockerfile # To catch DeepEP updates
- vllm/model_executor/layers/fused_moe
- vllm/distributed/device_communicators
- vllm/v1/worker/
- vllm/v1/attention/backends/utils.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The source_file_dependencies list is missing dependencies on the test utility files used by tests/v1/distributed/test_dbo.py. The test imports from tests.evals.gsm8k.gsm8k_eval and tests.utils. Changes to these files could affect the test's behavior or correctness, but they won't trigger this test run. Please add them to the dependency list to ensure the test is run when its dependencies change.

  source_file_dependencies:
    - tests/evals/gsm8k/gsm8k_eval.py
    - tests/utils.py
    - docker/Dockerfile # To catch DeepEP updates
    - vllm/model_executor/layers/fused_moe
    - vllm/distributed/device_communicators
    - vllm/v1/worker/
    - vllm/v1/attention/backends/utils.py

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 1135 to 1141
- label: Distributed Tests (H100) # optional
gpu: h100
working_dir: "/vllm-workspace/"
num_gpus: 2
commands:
- pytest -v -s tests/v1/distributed/test_dbo.py
source_file_dependencies:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Mark H100 DBO step optional

The new H100 pipeline step is commented as optional but the Buildkite block doesn’t set optional: true. Without that flag Buildkite will treat the step as required, so every CI run now waits for an H100 agent even when the queue has none available. This effectively blocks the pipeline whenever H100 hardware isn’t provisioned, defeating the stated intent of having an optional dual batch overlap test.

Useful? React with 👍 / 👎.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a reasonable suggestion

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved it to B200 and H200 nightly 👍 (per suggestion from @mgoin)

Signed-off-by: Lucas Wilkinson <[email protected]>
Copy link
Contributor

@SageMoore SageMoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test @LucasWilkinson

# Note: Not using --enforce-eager to test DBO's alternate CUDA graph dispatching
"--data-parallel-size", str(DP_SIZE),
"--enable-expert-parallel",
"--enable-dbo",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to drop the decode threshold as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could; I already verified that we hit cases above and below both thresholds but probably good to fix them so if they get updated we don't suddenly start testing no-DBO

Signed-off-by: Lucas Wilkinson <[email protected]>
@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants