> In the current implementation, vLLM may be used to generate samples. However, the samples should be generated by the optimized model, not the original model.
>
> Another problem: the line `per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)` always evaluates to `advantages`.
=> You are wrong; that expression is there for the backward gradient, not for its forward value.
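To illustrate (a toy sketch with made-up tensor values, not the trainer code): `exp(x - x.detach())` is 1 in the forward pass, but its derivative with respect to `x` is `exp(0) = 1`, so the loss still backpropagates `advantages * ∇ log p`, the standard policy-gradient term.

```python
import torch

per_token_logps = torch.tensor([-1.2, -0.5], requires_grad=True)
advantages = torch.tensor([2.0, -3.0])

# Forward pass: exp(0) == 1, so the loss value is just the advantages...
per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages
print(per_token_loss)        # tensor([ 2., -3.], grad_fn=<MulBackward0>)

# ...but the backward pass is not zero: d/dx exp(x - x.detach()) = exp(0) = 1,
# so the gradient w.r.t. the log-probs is exactly the advantages.
per_token_loss.sum().backward()
print(per_token_logps.grad)  # tensor([ 2., -3.])
```

Through the chain rule, the gradient reaching the model parameters is `advantages * ∇θ log πθ`, i.e., REINFORCE; the forward value stays at 1 because it is the probability ratio of the policy with itself.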
As for vLLM sampling from the original model, the trainer reloads the weights from the optimized model before generating:

```python
if self.args.use_vllm:
    # First, have main process load weights if needed
    if self.state.global_step != self._last_loaded_step:
        with unwrap_model_for_generation(model, self.accelerator) as unwrapped_model:
            state_dict = unwrapped_model.state_dict()
        if self.accelerator.is_main_process:
            llm_model = self.llm.llm_engine.model_executor.driver_worker.model_runner.model
            llm_model.load_weights(state_dict.items())
        self._last_loaded_step = self.state.global_step
```
So the weights vLLM generates with are up to date, not the original ones.
I think you should really read the code carefully before filing the issue.
> The `per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)` always evaluates to `advantages`.
> => You are wrong; that expression is there for the backward gradient.
I am a little confused: why is there a `per_token_logps - per_token_logps.detach()`? It is actually zero, and taking its exp makes it 1.
Reproduction
In the current implementation, vLLM may be used to generate samples. However, the samples should be generated by the optimized model, not the original model.
Another problem is here:
`trl/trl/trainer/grpo_trainer.py`, line 644 in `09eefa7`:
The `per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)` always evaluates to `advantages`.
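A minimal way to observe the forward value (a sketch with random tensors; not a claim about the training behavior, which depends on the backward pass discussed above):

```python
import torch

per_token_logps = torch.randn(4, 7, requires_grad=True)  # (batch, seq_len), made-up values
advantages = torch.randn(4)

per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
# exp(0) == 1 everywhere, so the forward value equals the broadcast advantages:
print(torch.allclose(per_token_loss, advantages.unsqueeze(1).expand_as(per_token_loss)))  # True
```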
System Info
It's the latest TRL main branch.