Why does the final accuracy_reward become 0? #257

Open
hellen9527 opened this issue Feb 10, 2025 · 3 comments

hellen9527 commented Feb 10, 2025

I am training a GRPO model with the ZeRO-3 setup. At the beginning the reward was normal and even increasing, but by the end of training the accuracy reward dropped to 0 and the KL divergence became extremely large. What could be the reason? Below are some of the reward values during my training.
Training config:

# Model arguments
model_name_or_path: /home/base-model/deepseek-r1-distill-qwen-1.5b
model_revision: main
torch_dtype: bfloat16

# Num processes is less by 1 as vLLM is using 1 GPU
num_processes: 4

# GRPO trainer config
gradient_accumulation_steps: 2
per_device_train_batch_size: 4
num_generations: 8
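
For reference, a minimal sanity check of the batch geometry implied by this config, assuming the TRL GRPOTrainer constraint that the per-step global batch (num_processes × per_device_train_batch_size) must be divisible by num_generations; the variable names simply mirror the config above:

# Sketch: verify the GRPO batch geometry from the config above.
# Assumes TRL's GRPOTrainer requirement that the global per-step batch
# is divisible by num_generations (completions sampled per prompt).
num_processes = 4                 # training processes (one extra GPU serves vLLM)
per_device_train_batch_size = 4
num_generations = 8

global_batch = num_processes * per_device_train_batch_size  # 16 completions per step
assert global_batch % num_generations == 0
print(f"{global_batch // num_generations} prompts per step, "
      f"{num_generations} completions per prompt")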

Training log:

{'loss': 0.0, 'grad_norm': 0.9503731084601761, 'learning_rate': 5.521811154058532e-07, 'rewards/accuracy_reward': 0.275, 'rewards/format_reward': 0.01875, 'rewards/reasoning_steps_reward': 0.6427083551883698, 'rewards/cosine_scaled_reward': -0.03994722058996558, 'reward': 0.896511122584343, 'reward_std': 0.8323002576828002, 'completion_length': 849.65625, 'kl': 8.575916290283203e-05, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.4910426002806111, 'learning_rate': 6.626173384870238e-07, 'rewards/accuracy_reward': 0.1625, 'rewards/format_reward': 0.01875, 'rewards/reasoning_steps_reward': 0.6739583507180213, 'rewards/cosine_scaled_reward': -0.14418444326147437, 'reward': 0.7110238954424858, 'reward_std': 0.7402671471238136, 'completion_length': 887.85, 'kl': 0.00010432004928588867, 'epoch': 0.0}
...
{'loss': 0.0002, 'grad_norm': 0.320330599760996, 'learning_rate': 1.7669795692987302e-06, 'rewards/accuracy_reward': 0.3625, 'rewards/format_reward': 0.003125, 'rewards/reasoning_steps_reward': 0.8802083492279053, 'rewards/cosine_scaled_reward': 0.0801055665127933, 'reward': 1.3259388893842696, 'reward_std': 0.724829213321209, 'completion_length': 816.81875, 'kl': 0.00401153564453125, 'epoch': 0.01}
{'loss': 0.0002, 'grad_norm': 0.28178907825908484, 'learning_rate': 1.8774157923799008e-06, 'rewards/accuracy_reward': 0.4125, 'rewards/format_reward': 0.003125, 'rewards/reasoning_steps_reward': 0.8916666805744171, 'rewards/cosine_scaled_reward': 0.14045850289985537, 'reward': 1.4477501690387726, 'reward_std': 0.7019044987857341, 'completion_length': 811.49375, 'kl': 0.005374908447265625, 'epoch': 0.01}
...
{'loss': 0.1872, 'grad_norm': 1.0184187281488377e-05, 'learning_rate': 7.430315994594317e-11, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.6666666865348816, 'rewards/cosine_scaled_reward': -0.0005968734039925039, 'reward': 0.6660698056221008, 'reward_std': 0.0, 'completion_length': 7.0, 'kl': 4.68125, 'epoch': 1.0}
{'loss': 0.1871, 'grad_norm': 7.792120717234956e-06, 'learning_rate': 1.857580723907404e-11, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.6666666865348816, 'rewards/cosine_scaled_reward': -0.0005968734039925039, 'reward': 0.6660698056221008, 'reward_std': 0.0, 'completion_length': 7.0, 'kl': 4.678125, 'epoch': 1.0}
{'loss': 0.1848, 'grad_norm': 7.826190411683244e-06, 'learning_rate': 0.0, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.6666666865348816, 'rewards/cosine_scaled_reward': -0.0005968734039925039, 'reward': 0.6660698056221008, 'reward_std': 0.0, 'completion_length': 7.0, 'kl': 4.61875, 'epoch': 1.0}

AchoWu commented Feb 12, 2025

I observed the same phenomenon.


xjy233 commented Feb 24, 2025

Same here. After training for a while, the reward starts dropping until it reaches 0. Have you found a solution?


AchoWu commented Feb 24, 2025

If the reward drops to 0 right at the start of training, your accuracy reward is probably misconfigured; try modifying the accuracy_reward function in ./src/open_r1/rewards.py.
If the reward drops to 0 only after several hundred steps, it may be that the completions have become too long and are being truncated before the final answer is produced; try printing the outputs to check.
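
To make that second check concrete, here is a minimal diagnostic sketch (not the repo's actual implementation). It assumes the usual reward-function convention of receiving completions as lists of chat messages plus a solution column; the \boxed{...} extraction and the name debug_accuracy_reward are illustrative only, so adapt it to your own accuracy_reward in ./src/open_r1/rewards.py.

import re

def debug_accuracy_reward(completions, solution, **kwargs):
    # Sketch only: print whether each completion contains a parsable final
    # answer, so missing or malformed answers (e.g. from truncation or
    # collapsed outputs like the completion_length of 7 in the log above)
    # are easy to spot.
    rewards = []
    for completion, sol in zip(completions, solution):
        content = completion[0]["content"]
        match = re.search(r"\\boxed\{(.+?)\}", content)  # hypothetical answer format
        if match is None:
            print(f"[debug] no final answer found (len={len(content)} chars): {content[-80:]!r}")
            rewards.append(0.0)
        else:
            rewards.append(1.0 if match.group(1).strip() == str(sol).strip() else 0.0)
    return rewards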
