fix: Add patch for placement groups in local_vllm_model. #694
Open
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
What does this PR do?
Fixes vLLM v1 engine data-parallel placement group creation on multi-node Ray clusters so that exactly `dp_size` placement groups are created instead of one per node.
Issues
Fixes https://github.com/NVIDIA-NeMo/Internal-Planning/issues/148
Usage
No API or config changes. When using `LocalVLLMModel` with data parallel (`data_parallel_size` > 1) on a multi-node Ray cluster, the patch is applied automatically before the vLLM server starts. Existing workflows continue to work; the fix only corrects placement group creation so multi-node DP no longer hits the assertion.
Additional Information
Root cause: In vLLM v1, `CoreEngineActorManager.create_dp_placement_groups` uses a nested loop over nodes and per-node DP allocation. The inner `break` when `len(placement_groups) == dp_size` only exits the inner loop; the outer loop over nodes continues and creates one placement group per node, triggering `AssertionError: Created N DP placement groups, expected M`.
Fix: A new module `nemo_gym/vllm_patches` provides `apply_vllm_dp_placement_groups_patch()`, which replaces that method with a version that also breaks out of the outer node loop once `dp_size` placement groups are created. The patch is idempotent and is applied in `LocalVLLMModelActor.__init__` before starting the vLLM server.
Testing
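The loop behavior described above can be reproduced in isolation. The following is a toy sketch only; it is not vLLM's code and not the code in this PR. `buggy_create`, `fixed_create`, `ToyManager`, and `apply_patch` are hypothetical names, and each node is modeled as a plain integer DP capacity instead of a Ray node. It shows how an inner-only `break` overshoots `dp_size`, how an outer-loop exit fixes it, and one way a patch can be made idempotent with a flag.

```python
# Toy model of the nested-loop bug; all names here are hypothetical.

def buggy_create(nodes, dp_size):
    """nodes: list of per-node DP capacities. Returns (node_idx, slot) pairs."""
    placement_groups = []
    for node_idx, capacity in enumerate(nodes):
        for slot in range(capacity):
            placement_groups.append((node_idx, slot))
            if len(placement_groups) == dp_size:
                break  # exits only the inner loop; later nodes still run
    return placement_groups


def fixed_create(nodes, dp_size):
    """Same loop, but also leaves the outer node loop once dp_size is reached."""
    placement_groups = []
    for node_idx, capacity in enumerate(nodes):
        if len(placement_groups) == dp_size:
            break  # stop visiting further nodes
        for slot in range(capacity):
            placement_groups.append((node_idx, slot))
            if len(placement_groups) == dp_size:
                break
    return placement_groups


class ToyManager:
    """Stand-in for the class whose method gets replaced."""
    create = staticmethod(buggy_create)


_PATCH_FLAG = "_toy_dp_patch_applied"


def apply_patch():
    """Replace ToyManager.create once; repeated calls are no-ops."""
    if getattr(ToyManager, _PATCH_FLAG, False):
        return False
    ToyManager.create = staticmethod(fixed_create)
    setattr(ToyManager, _PATCH_FLAG, True)
    return True


if __name__ == "__main__":
    nodes = [1, 1, 1, 1]  # four nodes, one DP slot each
    print(len(buggy_create(nodes, dp_size=2)))
    print(len(fixed_create(nodes, dp_size=2)))
    print(apply_patch(), apply_patch())
```

In this toy, the flag on the class is what makes a second `apply_patch()` call harmless; the real patch may track this differently.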