[CI/Testing] Add basic single node dual batch overlap test #27235

LucasWilkinson · 2025-10-21T00:59:17Z

Ideally we'd do a multi-node test, but add a single-node test for now to make sure we at least get some coverage.

Not sure exactly which test suite to put it in; it uses DeepEP, so it needs to run on Hopper or Blackwell.

Signed-off-by: Lucas Wilkinson <[email protected]>

gemini-code-assist

Code Review

This pull request introduces a new test for Dual Batch Overlap (DBO) with Data Parallelism and Expert Parallelism. The test is well-structured, using a GSM8K evaluation to verify correctness on a multi-GPU single-node setup. The CI configuration is also updated to run this test on H100 GPUs. My review found one high-severity issue related to missing test dependencies in the CI configuration, which could lead to the test not running when its helper utilities are modified. Otherwise, the changes are solid and a good addition to the test suite.

gemini-code-assist · 2025-10-21T01:01:13Z

.buildkite/test-pipeline.yaml

+  source_file_dependencies:
+    - docker/Dockerfile # To catch DeepEP updates
+    - vllm/model_executor/layers/fused_moe
+    - vllm/distributed/device_communicators
+    - vllm/v1/worker/
+    - vllm/v1/attention/backends/utils.py


The source_file_dependencies list is missing dependencies on the test utility files used by tests/v1/distributed/test_dbo.py. The test imports from tests.evals.gsm8k.gsm8k_eval and tests.utils. Changes to these files could affect the test's behavior or correctness, but they won't trigger this test run. Please add them to the dependency list to ensure the test is run when its dependencies change.

  source_file_dependencies:
    - tests/evals/gsm8k/gsm8k_eval.py
    - tests/utils.py
    - docker/Dockerfile # To catch DeepEP updates
    - vllm/model_executor/layers/fused_moe
    - vllm/distributed/device_communicators
    - vllm/v1/worker/
    - vllm/v1/attention/backends/utils.py
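To illustrate the concern above, here is a minimal sketch of how a dependency list can gate a CI step. The function `step_is_triggered` and its directory-prefix matching rule are assumptions made for illustration; the real pipeline generator may match paths differently. The changed file `vllm/v1/worker/example.py` is a hypothetical path.

```python
# Hypothetical sketch: decide whether a CI step runs for a set of changed files.
# The matching rule (exact path or directory prefix) is an assumption, not the
# behavior of the actual pipeline generator.

ORIGINAL_DEPS = [
    "docker/Dockerfile",
    "vllm/model_executor/layers/fused_moe",
    "vllm/distributed/device_communicators",
    "vllm/v1/worker/",
    "vllm/v1/attention/backends/utils.py",
]

# The two test helper files named in the review comment.
SUGGESTED_DEPS = [
    "tests/evals/gsm8k/gsm8k_eval.py",
    "tests/utils.py",
] + ORIGINAL_DEPS


def step_is_triggered(changed_files, source_file_dependencies):
    """Return True if any changed file equals or lives under a dependency path."""
    for changed in changed_files:
        for dep in source_file_dependencies:
            prefix = dep.rstrip("/") + "/"
            if changed == dep or changed.startswith(prefix):
                return True
    return False


if __name__ == "__main__":
    # A change to a test helper only triggers the step with the suggested list.
    print(step_is_triggered(["tests/utils.py"], ORIGINAL_DEPS))
    print(step_is_triggered(["tests/utils.py"], SUGGESTED_DEPS))
```

Under this assumed matching rule, a change to `tests/utils.py` alone would not run the step with the original list, which is the gap the review points out.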

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.


chatgpt-codex-connector · 2025-10-21T01:02:14Z

.buildkite/test-pipeline.yaml

+- label: Distributed Tests (H100) # optional
+  gpu: h100
+  working_dir: "/vllm-workspace/"
+  num_gpus: 2
+  commands: 
+    - pytest -v -s tests/v1/distributed/test_dbo.py
+  source_file_dependencies:


Mark H100 DBO step optional

The new H100 pipeline step is commented as optional but the Buildkite block doesn’t set optional: true. Without that flag Buildkite will treat the step as required, so every CI run now waits for an H100 agent even when the queue has none available. This effectively blocks the pipeline whenever H100 hardware isn’t provisioned, defeating the stated intent of having an optional dual batch overlap test.
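The mismatch described above (a step labeled `# optional` in a comment but lacking an `optional: true` key) could be caught with a small text-based check. This is a rough sketch only: the helper `steps_missing_optional_flag` is hypothetical, and it scans raw text rather than parsing YAML, so it assumes each step starts with a `- label:` line as in the snippet above.

```python
# Hypothetical lint: find steps whose label comment says "optional" but whose
# block never sets "optional: true". Works on raw text, not parsed YAML.

def steps_missing_optional_flag(pipeline_text):
    """Return labels of steps commented as optional but missing the flag."""
    missing = []
    current_label = None
    wants_optional = False
    has_flag = False

    def flush():
        # Record the step we just finished scanning if it lacks the flag.
        if current_label is not None and wants_optional and not has_flag:
            missing.append(current_label)

    for raw in pipeline_text.splitlines():
        line = raw.strip()
        if line.startswith("- label:"):
            flush()
            body, _, comment = line[len("- label:"):].partition("#")
            current_label = body.strip()
            wants_optional = "optional" in comment.lower()
            has_flag = False
        elif line.replace(" ", "") == "optional:true":
            has_flag = True
    flush()
    return missing


EXAMPLE = """
- label: Distributed Tests (H100) # optional
  gpu: h100
  num_gpus: 2
- label: Other Step # optional
  optional: true
- label: Required Step
"""

if __name__ == "__main__":
    print(steps_missing_optional_flag(EXAMPLE))
```

Only the first step in `EXAMPLE` is reported, since the second sets the flag and the third is not marked optional. The second and third steps are made up for the example.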


Seems like a reasonable suggestion

Moved it to B200 and H200 nightly 👍 (per suggestion from @mgoin)

Signed-off-by: Lucas Wilkinson <[email protected]>

SageMoore

Thanks for the test @LucasWilkinson

SageMoore · 2025-10-21T03:24:27Z

tests/v1/distributed/test_dbo.py

+        # Note: Not using --enforce-eager to test DBO's alternate CUDA graph dispatching
+        "--data-parallel-size", str(DP_SIZE),
+        "--enable-expert-parallel",
+        "--enable-dbo",


Do we want to drop the decode threshold as well?

We could. I already verified that we hit cases above and below both thresholds, but it's probably good to pin them so that if the defaults get updated we don't suddenly start testing without DBO.
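A minimal sketch of what pinning both thresholds might look like in the server arguments. The flag names `--dbo-decode-token-threshold` and `--dbo-prefill-token-threshold`, the helper `build_dbo_server_args`, the value of `DP_SIZE`, and the threshold values are all assumptions made for illustration and are not taken from the PR diff.

```python
# Hypothetical sketch: build server args with both DBO thresholds pinned so a
# change in defaults cannot silently disable DBO in the test.

DP_SIZE = 2  # assumption: matches num_gpus: 2 in the CI step


def build_dbo_server_args(decode_threshold, prefill_threshold):
    """Return CLI args enabling DBO with explicit token thresholds."""
    return [
        # Not using --enforce-eager, to exercise CUDA graph dispatching.
        "--data-parallel-size", str(DP_SIZE),
        "--enable-expert-parallel",
        "--enable-dbo",
        # Assumed flag names; check the engine arguments before relying on them.
        "--dbo-decode-token-threshold", str(decode_threshold),
        "--dbo-prefill-token-threshold", str(prefill_threshold),
    ]


if __name__ == "__main__":
    # Illustrative values only.
    print(build_dbo_server_args(decode_threshold=16, prefill_threshold=256))
```

Passing the thresholds explicitly keeps the test's behavior independent of whatever the defaults happen to be.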

Signed-off-by: Lucas Wilkinson <[email protected]>

dbo test

42903dc

Signed-off-by: Lucas Wilkinson <[email protected]>

mergify bot added ci/build v1 labels Oct 21, 2025

gemini-code-assist bot reviewed Oct 21, 2025


chatgpt-codex-connector bot reviewed Oct 21, 2025


make sure we surpass thresholds

cb06cb6

Signed-off-by: Lucas Wilkinson <[email protected]>

SageMoore reviewed Oct 21, 2025


review comments

074e8f0

Signed-off-by: Lucas Wilkinson <[email protected]>

LucasWilkinson added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 21, 2025

SageMoore approved these changes Oct 21, 2025

