How to use tensor_parallel_size for vllm reference in GRPO? #2814

Open
bannima opened this issue Feb 10, 2025 · 6 comments
Labels
⚡accelerate Related to accelerate 🏋 GRPO Related to GRPO

Comments

@bannima

bannima commented Feb 10, 2025

GRPO uses vLLM to load the reference model for data sampling. The limitation is that tensor parallelism is not supported.
What if the reference model is larger than a single GPU can hold, for example a 72B model on 40 GB H800 cards?

Is there any setting to pass tensor_parallel_size to the vLLM parameters?

```python
if self.accelerator.is_main_process:
    vllm_device = self.args.vllm_device
    if vllm_device == "auto":
        vllm_device = f"cuda:{self.accelerator.num_processes}"  # take the next GPU idx
    # Check that the requested device is available
    if vllm_device.split(":")[0] == "cuda" and int(vllm_device.split(":")[1]) >= torch.cuda.device_count():
        raise ValueError(
            f"The requested device for vllm ({vllm_device}) is not available. You are likely using vLLM "
            "without restricting the number of GPUs for training. Set the `--num_processes` argument to a "
            "value lower than the number of GPUs available on your machine—typically, reducing it by one "
            f"is sufficient. In your case: `--num_processes {torch.cuda.device_count() - 1}`."
        )
    # Check that the requested device is not also used for training
    if vllm_device in {f"cuda:{idx}" for idx in range(self.accelerator.num_processes)}:
        warnings.warn(
            f"The requested device {vllm_device} is also used for training. This may lead to unexpected "
            "behavior. It is recommended to use a dedicated device for vLLM."
        )
    # vLLM is not compatible with accelerate. So we need to patch it to make sure we can (1) place the vLLM
    # model on the desired device (world_size_patch) and (2) avoid a test that is not designed for our
    # setting (profiling_patch).
    world_size_patch = patch("torch.distributed.get_world_size", return_value=1)
    profiling_patch = patch(
        "vllm.worker.worker.Worker._assert_memory_footprint_increased_during_profiling", return_value=None
    )
    with world_size_patch, profiling_patch:
        self.llm = LLM(
            model=model.name_or_path,
            device=vllm_device,
            gpu_memory_utilization=self.args.vllm_gpu_memory_utilization,
            dtype=self.args.vllm_dtype,
            # Automatic Prefix Caching caches the KV cache of existing queries, so that a new query can
            # directly reuse the KV cache if it shares the same prefix with one of the existing queries.
            # This is particularly useful here because we generate completions from the same prompts.
            enable_prefix_caching=True,
            max_model_len=self.args.vllm_max_model_len,
        )
    self.sampling_params = SamplingParams(
        temperature=args.temperature,
        max_tokens=self.max_completion_length,
    )
```
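
For what it's worth, vLLM's `LLM` constructor itself accepts a `tensor_parallel_size` argument; what is missing is a way to pass it through the trainer. Below is a minimal sketch of how the instantiation above could be extended, assuming a hypothetical `vllm_tensor_parallel_size` config field that does not exist in the released GRPOConfig:

```python
# Hypothetical sketch, continuing the snippet above. `vllm_tensor_parallel_size` is an
# assumed config field, not an option in the released GRPOConfig.
tensor_parallel_size = getattr(self.args, "vllm_tensor_parallel_size", 1)

with world_size_patch, profiling_patch:
    self.llm = LLM(
        model=model.name_or_path,
        # With tensor_parallel_size > 1, vLLM manages several GPUs itself, so pinning a
        # single `device=vllm_device` no longer applies; the GPUs reserved for generation
        # would instead have to be exposed to this process (e.g. via CUDA_VISIBLE_DEVICES).
        tensor_parallel_size=tensor_parallel_size,
        gpu_memory_utilization=self.args.vllm_gpu_memory_utilization,
        dtype=self.args.vllm_dtype,
        enable_prefix_caching=True,
        max_model_len=self.args.vllm_max_model_len,
    )
```

Note that the `world_size_patch` above forces `torch.distributed.get_world_size()` to return 1 so that vLLM can be placed on a single device; it would likely also need to be revisited before tensor parallelism could work.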
github-actions bot added the ⚡accelerate (Related to accelerate) and 🏋 GRPO (Related to GRPO) labels Feb 10, 2025
@Superskyyy
Contributor

Superskyyy commented Feb 10, 2025

For multi-node it is currently not possible, but we are working on it.

Meanwhile, if your training uses <= 8 cards, you can try to make vLLM work on a single node while reserving two cards, and set the LLM in GRPOTrainer to use tp=2 (see the sketch below). It should work.
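
A minimal standalone sketch of that setup, assuming an 8-GPU node where training is launched with `accelerate launch --num_processes 6` so that GPUs 6 and 7 stay free for generation; the model name and device indices are illustrative:

```python
import os

# Expose only the two GPUs reserved for generation; set this before vLLM initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",  # illustrative; replace with your model
    tensor_parallel_size=2,             # shard the weights across the two visible GPUs
    gpu_memory_utilization=0.9,
    dtype="bfloat16",
    enable_prefix_caching=True,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a short note on tensor parallelism."], sampling_params)
print(outputs[0].outputs[0].text)
```

Wiring this into GRPOTrainer still requires patching the trainer, since the released version pins vLLM to a single `vllm_device`.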

@whw199833

+1, I need multi-node training too. And Qwen 72B cannot be launched on a single card.

@ticosir

ticosir commented Feb 21, 2025

+1 I need multi-node training too. help~

@yuanzhoulvpi2017

I currently need to train a 32B model using GRPO. The setup includes 8 H800 GPUs.
I hope to use 6 GPUs for accelerate training and 2 GPUs for vLLM (tp=2) inference. It seems that this configuration is not supported at the moment. Hope this can be supported soon~

@luoruikun

Can't wait to see multi-node training become available!

@mengban
Contributor

mengban commented Mar 4, 2025

What? I can't believe that GRPO doesn't support multi-node training yet.
