Not sure if this is a bug per se; it's possible that I'm missing something here. If this is a known, fixable issue, it would be great to note it in the GRPOTrainer docs.
Reproduction
It seems like `grad_norm` eventually becomes unstable and spikes higher and higher after >10k steps in GRPO.
I've tried both single-GPU and multi-GPU training, as well as various values for `beta`, batch size, and learning rate. I have not tested whether this problem happens without vLLM, though.
Weirdly enough, this does not affect the reward, which keeps going up; only the `grad_norm` and `clip_ratio` metrics are affected.
I was wrong; I had `beta=0.0` in all my experiments. Setting `beta=0.001` was enough to prevent the gradient explosion. Perhaps we shouldn't suggest that option so prominently in the docs?
KL coefficient. If `0.0`, the reference model is not loaded, reducing memory usage and improving training speed, but may be numerically unstable for long training runs.
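For reference, here is a minimal sketch of what the workaround looks like, assuming TRL's `GRPOConfig` exposes `beta` directly; the other arguments shown are illustrative, not the exact values from my runs:

```python
# Hypothetical sketch: keep a small non-zero KL coefficient so the policy stays
# tethered to the reference model on long training runs. With beta=0.0 the
# reference model is not loaded at all, which saves memory but (per this issue)
# can let grad_norm blow up after many steps.
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="grpo-checkpoints",       # illustrative path
    beta=0.001,                          # non-zero KL penalty instead of 0.0
    learning_rate=1e-6,                  # illustrative value
    per_device_train_batch_size=4,       # illustrative value
)

# trainer = GRPOTrainer(model=..., reward_funcs=..., args=config, train_dataset=...)
# trainer.train()
```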
Training args (the relevant ones):
System Info
Checklist