I am training a GRPO model with DeepSpeed ZeRO-3. At the beginning of training the reward was normal and even increasing, but by the end the reward dropped to 0 and the KL divergence became extremely large. What could be the reason? Below is how the reward changed during my training.
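For context, my understanding of the GRPO objective (from the DeepSeekMath paper, as implemented in TRL) is roughly

J(theta) = E[ (1/G) * sum_i min( r_i(theta) * A_i, clip(r_i(theta), 1 - eps, 1 + eps) * A_i ) - beta * KL(pi_theta || pi_ref) ]

where r_i(theta) is the probability ratio against the old policy, A_i is the group-normalized advantage, and pi_ref is the frozen reference model. If that reading is right, the beta * KL term is what should keep the policy close to the reference, so a KL value this large suggests the policy has drifted very far from the reference by the end of training.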
train config:
# Model arguments
model_name_or_path: /home/base-model/deepseek-r1-distill-qwen-1.5b
model_revision: main
torch_dtype: bfloat16
# num_processes is one less than the total GPU count because vLLM occupies 1 GPU
num_processes: 4
# GRPO trainer config
gradient_accumulation_steps: 2
per_device_train_batch_size: 4
num_generations: 8
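For reference, TRL's GRPOConfig (which inherits from transformers TrainingArguments) also exposes settings that directly affect KL growth; I did not set these explicitly, and the values below are only illustrative placeholders, not values from my run:
# KL / stability related settings (illustrative placeholders only)
beta: 0.04            # KL penalty coefficient; weights the KL(policy || reference) term
max_grad_norm: 1.0    # gradient clipping (from TrainingArguments)
learning_rate: 1.0e-06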
train log: