-
-
Notifications
You must be signed in to change notification settings - Fork 10.7k
[CI/Testing] Add basic single node dual batch overlap test #27235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Lucas Wilkinson <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new test for Dual Batch Overlap (DBO) with Data Parallelism and Expert Parallelism. The test is well-structured, using a GSM8K evaluation to verify correctness on a multi-GPU single-node setup. The CI configuration is also updated to run this test on H100 GPUs. My review found one high-severity issue related to missing test dependencies in the CI configuration, which could lead to the test not running when its helper utilities are modified. Otherwise, the changes are solid and a good addition to the test suite.
.buildkite/test-pipeline.yaml
Outdated
source_file_dependencies: | ||
- docker/Dockerfile # To catch DeepEP updates | ||
- vllm/model_executor/layers/fused_moe | ||
- vllm/distributed/device_communicators | ||
- vllm/v1/worker/ | ||
- vllm/v1/attention/backends/utils.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The source_file_dependencies
list is missing dependencies on the test utility files used by tests/v1/distributed/test_dbo.py
. The test imports from tests.evals.gsm8k.gsm8k_eval
and tests.utils
. Changes to these files could affect the test's behavior or correctness, but they won't trigger this test run. Please add them to the dependency list to ensure the test is run when its dependencies change.
source_file_dependencies:
- tests/evals/gsm8k/gsm8k_eval.py
- tests/utils.py
- docker/Dockerfile # To catch DeepEP updates
- vllm/model_executor/layers/fused_moe
- vllm/distributed/device_communicators
- vllm/v1/worker/
- vllm/v1/attention/backends/utils.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
.buildkite/test-pipeline.yaml
Outdated
- label: Distributed Tests (H100) # optional | ||
gpu: h100 | ||
working_dir: "/vllm-workspace/" | ||
num_gpus: 2 | ||
commands: | ||
- pytest -v -s tests/v1/distributed/test_dbo.py | ||
source_file_dependencies: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new H100 pipeline step is commented as optional but the Buildkite block doesn’t set optional: true
. Without that flag Buildkite will treat the step as required, so every CI run now waits for an H100 agent even when the queue has none available. This effectively blocks the pipeline whenever H100 hardware isn’t provisioned, defeating the stated intent of having an optional dual batch overlap test.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a reasonable suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved it to B200 and H200 nightly 👍 (per suggestion from @mgoin)
Signed-off-by: Lucas Wilkinson <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test @LucasWilkinson
# Note: Not using --enforce-eager to test DBO's alternate CUDA graph dispatching | ||
"--data-parallel-size", str(DP_SIZE), | ||
"--enable-expert-parallel", | ||
"--enable-dbo", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to drop the decode threshold as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could; I already verified that we hit cases above and below both thresholds but probably good to fix them so if they get updated we don't suddenly start testing no-DBO
Signed-off-by: Lucas Wilkinson <[email protected]>
Ideally we'd do a multi-node test; but add a single node test for now to make sure we atleast get some coverage
Not sure exactly which test suite to put it in; uses DeepEP so needs to be run on hopper or Blackwell