
GRPO implementation issue: VLLM usage may hurt performance. #2802

Open
5 tasks done
yynil opened this issue Feb 8, 2025 · 4 comments
Labels
🐛 bug Something isn't working 🏋 GRPO Related to GRPO

Comments

@yynil

yynil commented Feb 8, 2025

Reproduction

In the current implementation, vLLM may be used to generate samples. However, the samples should be generated by the optimized model, not by the original model.
Another problem is here:

per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)

This expression, per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1), always evaluates to just the advantages.

System Info

It's the latest TRL main branch.

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
  • Any traceback provided is complete
@github-actions github-actions bot added 🏋 GRPO Related to GRPO 🐛 bug Something isn't working labels Feb 8, 2025
@yiyepiaoling0715

The per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1) is always advantages.
=> You are wrong; that code is there for the backward gradient.
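A quick way to see why (writing detach() as the stop-gradient operator sg): the forward value is exp(0) = 1, but sg only blocks the gradient through the subtracted copy of the log-probs, so the backward pass still gets the full policy-gradient term:

$$
\nabla_\theta\Big[A \cdot e^{\log\pi_\theta - \mathrm{sg}(\log\pi_\theta)}\Big]
= A \cdot e^{0}\cdot\nabla_\theta\log\pi_\theta
= A \cdot \nabla_\theta\log\pi_\theta .
$$

So the loss value is always just the advantage A, but its gradient is the usual advantage-weighted score-function (REINFORCE-style) gradient.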

@yiyepiaoling0715

```python
if self.args.use_vllm:
    # First, have main process load weights if needed
    if self.state.global_step != self._last_loaded_step:
        with unwrap_model_for_generation(model, self.accelerator) as unwrapped_model:
            state_dict = unwrapped_model.state_dict()
        if self.accelerator.is_main_process:
            llm_model = self.llm.llm_engine.model_executor.driver_worker.model_runner.model
            llm_model.load_weights(state_dict.items())
        self._last_loaded_step = self.state.global_step
```
So the weights are up to date.
I think you should really read the code carefully before filing an issue.
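Concretely, each generation round follows roughly this ordering (a simplified sketch for illustration only; the names below are placeholders, not TRL's actual API):

```python
# Simplified sketch of one GRPO round with use_vllm=True.
# All names here are illustrative placeholders, not TRL's real API.
def grpo_round(policy_model, vllm_engine, prompts, state):
    # 1. If the policy has been updated since the last generation round,
    #    push its current weights into the vLLM engine first.
    if state.global_step != state.last_loaded_step:
        vllm_engine.load_weights(policy_model.state_dict().items())  # placeholder call
        state.last_loaded_step = state.global_step

    # 2. Sample completions with vLLM, i.e. with the *current* policy,
    #    not with the original frozen model.
    completions = vllm_engine.generate(prompts)  # placeholder call

    # 3. Compute rewards, advantages, and the GRPO loss with policy_model,
    #    then take the optimizer step (omitted here).
    return completions
```

Between syncs, vLLM only holds a copy of the weights from the most recent optimizer step, so its samples track the optimized policy.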

@Superskyyy
Contributor

No, vLLM uses the current policy, with the weights continuously updated. The implementation is correct.

@linkedlist771
Contributor

The per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1) is always advantages. => You are wrong; that code is there for the backward gradient.

I am a little bit confused: why is there a per_token_logps - per_token_logps.detach()? It is actually zero, and taking its exp makes it 1.
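The forward value is indeed 1, but the gradient is not zero, because detach() only stops gradients through the subtracted copy. A minimal, self-contained check with toy numbers (plain PyTorch, batch dimension omitted):

```python
import torch

# Toy stand-ins for per-token log-probs (requires grad) and advantages.
logps = torch.tensor([-1.2, -0.7], requires_grad=True)
advantages = torch.tensor([0.5, -0.3])

# Same pattern as the loss term under discussion.
per_token_loss = torch.exp(logps - logps.detach()) * advantages
per_token_loss.sum().backward()

print(per_token_loss)  # tensor([ 0.5000, -0.3000], ...) -> forward value equals the advantages
print(logps.grad)      # tensor([ 0.5000, -0.3000])      -> d loss / d logps equals the advantages, not zero
```

So the scalar loss value is uninformative on its own, but backward() still pushes the advantage-weighted gradient through per_token_logps into the model.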
