Why does the final accuracy_reward become 0? #257

Open
hellen9527 opened this issue Feb 10, 2025 · 3 comments

hellen9527 commented Feb 10, 2025

I am training a GRPO model with the ZeRO-3 setup. At the beginning the reward was normal and even increasing, but by the end of training the accuracy reward dropped to 0 and the KL divergence became extremely large. What could be the reason? Below are some of the reward values during my training.
Training config:

# Model arguments
model_name_or_path: /home/base-model/deepseek-r1-distill-qwen-1.5b
model_revision: main
torch_dtype: bfloat16

# Num processes is less by 1 as vLLM is using 1 GPU
num_processes: 4

# GRPO trainer config
gradient_accumulation_steps: 2
per_device_train_batch_size: 4
num_generations: 8
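
For reference, a minimal sanity check of the batch geometry implied by this config, assuming the TRL GRPOTrainer constraint that the per-step global batch (num_processes × per_device_train_batch_size) must be divisible by num_generations; the variable names simply mirror the config above:

# Sketch: verify the GRPO batch geometry from the config above.
# Assumes TRL's GRPOTrainer requirement that the global per-step batch
# is divisible by num_generations (completions sampled per prompt).
num_processes = 4                 # training processes (one extra GPU serves vLLM)
per_device_train_batch_size = 4
num_generations = 8

global_batch = num_processes * per_device_train_batch_size  # 16 completions per step
assert global_batch % num_generations == 0
print(f"{global_batch // num_generations} prompts per step, "
      f"{num_generations} completions per prompt")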

Training log:

{'loss': 0.0, 'grad_norm': 0.9503731084601761, 'learning_rate': 5.521811154058532e-07, 'rewards/accuracy_reward': 0.275, 'rewards/format_reward': 0.01875, 'rewards/reasoning_steps_reward': 0.6427083551883698, 'rewards/cosine_scaled_reward': -0.03994722058996558, 'reward': 0.896511122584343, 'reward_std': 0.8323002576828002, 'completion_length': 849.65625, 'kl': 8.575916290283203e-05, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.4910426002806111, 'learning_rate': 6.626173384870238e-07, 'rewards/accuracy_reward': 0.1625, 'rewards/format_reward': 0.01875, 'rewards/reasoning_steps_reward': 0.6739583507180213, 'rewards/cosine_scaled_reward': -0.14418444326147437, 'reward': 0.7110238954424858, 'reward_std': 0.7402671471238136, 'completion_length': 887.85, 'kl': 0.00010432004928588867, 'epoch': 0.0}
...
{'loss': 0.0002, 'grad_norm': 0.320330599760996, 'learning_rate': 1.7669795692987302e-06, 'rewards/accuracy_reward': 0.3625, 'rewards/format_reward': 0.003125, 'rewards/reasoning_steps_reward': 0.8802083492279053, 'rewards/cosine_scaled_reward': 0.0801055665127933, 'reward': 1.3259388893842696, 'reward_std': 0.724829213321209, 'completion_length': 816.81875, 'kl': 0.00401153564453125, 'epoch': 0.01}
{'loss': 0.0002, 'grad_norm': 0.28178907825908484, 'learning_rate': 1.8774157923799008e-06, 'rewards/accuracy_reward': 0.4125, 'rewards/format_reward': 0.003125, 'rewards/reasoning_steps_reward': 0.8916666805744171, 'rewards/cosine_scaled_reward': 0.14045850289985537, 'reward': 1.4477501690387726, 'reward_std': 0.7019044987857341, 'completion_length': 811.49375, 'kl': 0.005374908447265625, 'epoch': 0.01}
...
{'loss': 0.1872, 'grad_norm': 1.0184187281488377e-05, 'learning_rate': 7.430315994594317e-11, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.6666666865348816, 'rewards/cosine_scaled_reward': -0.0005968734039925039, 'reward': 0.6660698056221008, 'reward_std': 0.0, 'completion_length': 7.0, 'kl': 4.68125, 'epoch': 1.0}
{'loss': 0.1871, 'grad_norm': 7.792120717234956e-06, 'learning_rate': 1.857580723907404e-11, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.6666666865348816, 'rewards/cosine_scaled_reward': -0.0005968734039925039, 'reward': 0.6660698056221008, 'reward_std': 0.0, 'completion_length': 7.0, 'kl': 4.678125, 'epoch': 1.0}
{'loss': 0.1848, 'grad_norm': 7.826190411683244e-06, 'learning_rate': 0.0, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.6666666865348816, 'rewards/cosine_scaled_reward': -0.0005968734039925039, 'reward': 0.6660698056221008, 'reward_std': 0.0, 'completion_length': 7.0, 'kl': 4.61875, 'epoch': 1.0}

AchoWu commented Feb 12, 2025

I observed the same phenomenon.


xjy233 commented Feb 24, 2025

Same here. After training for a while, the reward starts dropping until it reaches 0. Have you found a solution?


AchoWu commented Feb 24, 2025

If the reward drops to 0 right at the start of training, your accuracy reward is probably misconfigured; try modifying the accuracy_reward function in ./src/open_r1/rewards.py.
If the reward drops to 0 only after several hundred steps, it may be that the completions have become too long and are being truncated before the final answer is produced; try printing the outputs to check.
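
To make that second check concrete, here is a minimal diagnostic sketch (not the repo's actual implementation). It assumes the usual reward-function convention of receiving completions as lists of chat messages plus a solution column; the \boxed{...} extraction and the name debug_accuracy_reward are illustrative only, so adapt it to your own accuracy_reward in ./src/open_r1/rewards.py.

import re

def debug_accuracy_reward(completions, solution, **kwargs):
    # Sketch only: print whether each completion contains a parsable final
    # answer, so missing or malformed answers (e.g. from truncation or
    # collapsed outputs like the completion_length of 7 in the log above)
    # are easy to spot.
    rewards = []
    for completion, sol in zip(completions, solution):
        content = completion[0]["content"]
        match = re.search(r"\\boxed\{(.+?)\}", content)  # hypothetical answer format
        if match is None:
            print(f"[debug] no final answer found (len={len(content)} chars): {content[-80:]!r}")
            rewards.append(0.0)
        else:
            rewards.append(1.0 if match.group(1).strip() == str(sol).strip() else 0.0)
    return rewards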
