fix: Add patch for placement groups in local_vllm_model. #694

Open

ffrujeri wants to merge 2 commits into main from ffrujeri/multi-node-local-vllm

Conversation


@ffrujeri ffrujeri commented Feb 13, 2026

What does this PR do?

Fixes vLLM v1 engine data-parallel placement group creation on multi-node Ray clusters so that exactly dp_size placement groups are created instead of one per node.

Issues

Fixes https://github.com/NVIDIA-NeMo/Internal-Planning/issues/148

Usage

No API or config changes. When using LocalVLLMModel with data parallel (data_parallel_size > 1) on a multi-node Ray cluster, the patch is applied automatically before the vLLM server starts. Existing workflows continue to work; the fix only corrects placement group creation so multi-node DP no longer hits the assertion.

Additional Information

  • Root cause: In vLLM v1, CoreEngineActorManager.create_dp_placement_groups uses a nested loop over nodes and per-node DP allocation. The inner break when len(placement_groups) == dp_size only exits the inner loop; the outer loop over nodes continues and creates one placement group per node, triggering AssertionError: Created N DP placement groups, expected M.
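The control-flow bug described above can be sketched in isolation. This is an illustrative Python sketch only; the node/slot model and function names are hypothetical stand-ins, not vLLM's actual code, which also handles bundle specs, Ray scheduling, and per-node DP allocation:

```python
# Illustrative sketch of the loop-break bug (hypothetical names, not vLLM code).

def create_groups_buggy(nodes, dp_size):
    """The inner break only exits the per-node loop, so every later
    node still contributes a placement group."""
    placement_groups = []
    for node in nodes:
        for _ in range(node["slots"]):
            placement_groups.append(node["name"])
            if len(placement_groups) == dp_size:
                break  # exits the inner loop only; the outer loop continues
    return placement_groups


def create_groups_fixed(nodes, dp_size):
    """Same loops, but also exits the outer node loop once dp_size is reached."""
    placement_groups = []
    for node in nodes:
        for _ in range(node["slots"]):
            placement_groups.append(node["name"])
            if len(placement_groups) == dp_size:
                break
        if len(placement_groups) == dp_size:
            break  # also stop iterating over nodes
    return placement_groups


if __name__ == "__main__":
    nodes = [{"name": f"node{i}", "slots": 1} for i in range(4)]
    print(create_groups_buggy(nodes, dp_size=2))  # 4 groups: one per node
    print(create_groups_fixed(nodes, dp_size=2))  # exactly 2 groups
```

With 4 one-slot nodes and dp_size=2, the buggy variant returns 4 groups (one per node), which is exactly the shape of the `Created N DP placement groups, expected M` assertion failure.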

  • Fix: A new module nemo_gym/vllm_patches provides apply_vllm_dp_placement_groups_patch(), which replaces that method with a version that also breaks out of the outer node loop once dp_size placement groups are created. The patch is idempotent and is applied in LocalVLLMModelActor.__init__ before starting the vLLM server.
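The idempotency guard can be sketched generically. Everything here is an assumption-labeled stand-in: the sentinel name, the dummy target class, and the delegating body are invented for illustration; the real `apply_vllm_dp_placement_groups_patch()` substitutes a corrected copy of vLLM's `CoreEngineActorManager.create_dp_placement_groups`:

```python
# Generic sketch of an idempotent monkey-patch (illustrative only; the real
# patch lives in nemo_gym/vllm_patches and replaces vLLM's method with a
# version that also breaks out of the outer node loop).

_SENTINEL = "_dp_placement_groups_patched"


def apply_placement_groups_patch(manager_cls):
    if getattr(manager_cls, _SENTINEL, False):
        return  # already patched: applying again is a no-op
    original = manager_cls.create_dp_placement_groups

    def patched(*args, **kwargs):
        # The real patch body would be the fixed method; this sketch
        # just delegates to the original implementation.
        return original(*args, **kwargs)

    manager_cls.create_dp_placement_groups = patched
    setattr(manager_cls, _SENTINEL, True)
```

Applying the patch in `LocalVLLMModelActor.__init__` (as the PR does) ensures it runs in each process before the vLLM server starts, which matters because monkey-patches are per-process.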

  • Testing: launched a multi-node run with data_parallel_size=2 and confirmed the server came up cleanly:

ng_run "+config_paths=[responses_api_models/local_vllm_model/configs/qwen3_235b_a22b_instruct_2507.yaml]" \
    ++qwen3_235b_a22b_instruct_2507_model_server.responses_api_models.local_vllm_model.vllm_serve_kwargs.tensor_parallel_size=8 \
    ++qwen3_235b_a22b_instruct_2507_model_server.responses_api_models.local_vllm_model.vllm_serve_kwargs.pipeline_parallel_size=1 \
    ++qwen3_235b_a22b_instruct_2507_model_server.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size=2 \
    ++qwen3_235b_a22b_instruct_2507_model_server.responses_api_models.local_vllm_model.vllm_serve_kwargs.data_parallel_size_local=1 \
    ++qwen3_235b_a22b_instruct_2507_model_server.responses_api_models.local_vllm_model.vllm_serve_kwargs.model_loader_extra_config.num_threads=128 \
    ++qwen3_235b_a22b_instruct_2507_model_server.responses_api_models.local_vllm_model.vllm_serve_kwargs.max_num_seqs=100 \
    ++qwen3_235b_a22b_instruct_2507_model_server.responses_api_models.local_vllm_model.debug=true \
    ++qwen3_235b_a22b_instruct_2507_model_server.responses_api_models.local_vllm_model.vllm_serve_env_vars.VLLM_RAY_DP_PACK_STRATEGY=strict \
    ++use_absolute_ip=true
All 1 / 1 servers ready! Polling every 60s

####################################################################################################
#
# Server Instances
#
####################################################################################################

[1] qwen3_235b_a22b_instruct_2507_model_server (responses_api_models/local_vllm_model)
{
    'config_path': 'qwen3_235b_a22b_instruct_2507_model_server',
    'dir_path': (
        '/scratch/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/ffrujeri/Gym/responses_api_mo'
        'dels/local_vllm_model'
    ),
    'entrypoint': 'app.py',
    'host': '100.67.226.182',
    'name': 'local_vllm_model',
    'pid': 23911,
    'port': 12382,
    'process_name': 'qwen3_235b_a22b_instruct_2507_model_server',
    'server_type': 'responses_api_models',
    'url': 'http://100.67.226.182:12382',
}
####################################################################################################


copy-pr-bot bot commented Feb 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@ffrujeri ffrujeri changed the title Add patch for placement groups in local_vllm_model. fix: Add patch for placement groups in local_vllm_model. Feb 17, 2026
@ffrujeri ffrujeri marked this pull request as ready for review February 18, 2026 02:38
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
@ffrujeri ffrujeri force-pushed the ffrujeri/multi-node-local-vllm branch from 7d3a839 to 2971f31 Compare February 18, 2026 16:58