How to use tensor_parallel_size for vllm reference in GRPO? #2814

Open
bannima opened this issue Feb 10, 2025 · 6 comments
Labels
⚡accelerate Related to accelerate 🏋 GRPO Related to GRPO

Comments

@bannima

bannima commented Feb 10, 2025

GRPO uses vLLM to load the reference model for data sampling. The limitation is that tensor parallelism is not supported.
What if the reference model is larger than a single GPU can hold, for example a 72B model on 40 GB H800 cards?

Is there any setting to pass tensor_parallel_size to the vLLM parameters?

```python
if self.accelerator.is_main_process:
    vllm_device = self.args.vllm_device
    if vllm_device == "auto":
        vllm_device = f"cuda:{self.accelerator.num_processes}"  # take the next GPU idx
    # Check that the requested device is available
    if vllm_device.split(":")[0] == "cuda" and int(vllm_device.split(":")[1]) >= torch.cuda.device_count():
        raise ValueError(
            f"The requested device for vllm ({vllm_device}) is not available. You are likely using vLLM "
            "without restricting the number of GPUs for training. Set the `--num_processes` argument to a "
            "value lower than the number of GPUs available on your machine—typically, reducing it by one "
            f"is sufficient. In your case: `--num_processes {torch.cuda.device_count() - 1}`."
        )
    # Check that the requested device is not also used for training
    if vllm_device in {f"cuda:{idx}" for idx in range(self.accelerator.num_processes)}:
        warnings.warn(
            f"The requested device {vllm_device} is also used for training. This may lead to unexpected "
            "behavior. It is recommended to use a dedicated device for vLLM."
        )
    # vLLM is not compatible with accelerate. So we need to patch it to make sure we can (1) place the vLLM
    # model on the desired device (world_size_patch) and (2) avoid a test that is not designed for our
    # setting (profiling_patch).
    world_size_patch = patch("torch.distributed.get_world_size", return_value=1)
    profiling_patch = patch(
        "vllm.worker.worker.Worker._assert_memory_footprint_increased_during_profiling", return_value=None
    )
    with world_size_patch, profiling_patch:
        self.llm = LLM(
            model=model.name_or_path,
            device=vllm_device,
            gpu_memory_utilization=self.args.vllm_gpu_memory_utilization,
            dtype=self.args.vllm_dtype,
            # Automatic Prefix Caching caches the KV cache of existing queries, so that a new query can
            # directly reuse the KV cache if it shares the same prefix with one of the existing queries.
            # This is particularly useful here because we generate completions from the same prompts.
            enable_prefix_caching=True,
            max_model_len=self.args.vllm_max_model_len,
        )
    self.sampling_params = SamplingParams(
        temperature=args.temperature,
        max_tokens=self.max_completion_length,
    )
```
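
For what it's worth, vLLM's `LLM` constructor itself accepts a `tensor_parallel_size` argument; what is missing is a way to pass it through the trainer. Below is a minimal sketch of how the instantiation above could be extended, assuming a hypothetical `vllm_tensor_parallel_size` config field that does not exist in the released GRPOConfig:

```python
# Hypothetical sketch, continuing the snippet above. `vllm_tensor_parallel_size` is an
# assumed config field, not an option in the released GRPOConfig.
tensor_parallel_size = getattr(self.args, "vllm_tensor_parallel_size", 1)

with world_size_patch, profiling_patch:
    self.llm = LLM(
        model=model.name_or_path,
        # With tensor_parallel_size > 1, vLLM manages several GPUs itself, so pinning a
        # single `device=vllm_device` no longer applies; the GPUs reserved for generation
        # would instead have to be exposed to this process (e.g. via CUDA_VISIBLE_DEVICES).
        tensor_parallel_size=tensor_parallel_size,
        gpu_memory_utilization=self.args.vllm_gpu_memory_utilization,
        dtype=self.args.vllm_dtype,
        enable_prefix_caching=True,
        max_model_len=self.args.vllm_max_model_len,
    )
```

Note that the `world_size_patch` above forces `torch.distributed.get_world_size()` to return 1 so that vLLM can be placed on a single device; it would likely also need to be revisited before tensor parallelism could work.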
github-actions bot added the ⚡accelerate (Related to accelerate) and 🏋 GRPO (Related to GRPO) labels Feb 10, 2025
@Superskyyy
Contributor

Superskyyy commented Feb 10, 2025

For multi-node it is currently not possible, but we are working on it.

Meanwhile, if your training uses <= 8 cards, you can try to make vLLM work on a single node while reserving two cards, and set the LLM in GRPOTrainer to use tp=2 (see the sketch below). It should work.
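
A minimal standalone sketch of that setup, assuming an 8-GPU node where training is launched with `accelerate launch --num_processes 6` so that GPUs 6 and 7 stay free for generation; the model name and device indices are illustrative:

```python
import os

# Expose only the two GPUs reserved for generation; set this before vLLM initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",  # illustrative; replace with your model
    tensor_parallel_size=2,             # shard the weights across the two visible GPUs
    gpu_memory_utilization=0.9,
    dtype="bfloat16",
    enable_prefix_caching=True,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a short note on tensor parallelism."], sampling_params)
print(outputs[0].outputs[0].text)
```

Wiring this into GRPOTrainer still requires patching the trainer, since the released version pins vLLM to a single `vllm_device`.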

@whw199833

+1, I need multi-node training too. And Qwen 72B cannot be launched on a single card.

@ticosir

ticosir commented Feb 21, 2025

+1 I need multi-node training too. help~

@yuanzhoulvpi2017

I currently need to train a 32B model using GRPO. The setup includes 8 H800 GPUs.
I hope to use 6 GPUs for accelerate training and 2 GPUs for vLLM (tp=2) inference. It seems that this configuration is not supported at the moment. Hope this can be supported soon~

@luoruikun

Can't wait to see multi-node training become available!

@mengban
Contributor

mengban commented Mar 4, 2025

What? I can't believe that GRPO doesn't support multi-node training yet.
