I'm concerned that my accuracy reward shows no improvement or sign of learning after 20 steps in the RL stage (grpo.py) of my experiment. Could you please share your wandb plots, or confirm whether the reward improves at a reasonable pace on your end? Additionally, my format reward remains stuck at 0, which is very concerning, and the CoT length is decreasing. I hope it rises again with further steps and reasoning emerges, but so far all the plots suggest (if I'm reading them correctly) that this pipeline needs serious reconsideration.
I followed the same pipeline as the repo with the default parameters/hyperparameters (Qwen1.5B-Instruct, BK-17k for SFT, AI-MO for GRPO).
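For what it's worth, when the format reward is pinned at 0 it can help to run the reward function directly on a few raw rollouts. Here is a minimal sketch of such a check, assuming an R1-style format reward that matches `<think>...</think><answer>...</answer>` tags; the actual tags and regex in this repo's grpo.py may differ, so adjust accordingly:

```python
import re

# Assumed R1-style format pattern: completion must be exactly a <think> block
# followed by an <answer> block. Replace with the pattern used in grpo.py.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion matches the expected tag structure, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

# Sanity check: paste a few raw completions from your rollouts here. If even
# hand-crafted examples score 0, the regex (or the system prompt that asks the
# model to emit these tags) is the likely culprit.
if __name__ == "__main__":
    good = "<think>2 + 2 = 4</think>\n<answer>4</answer>"
    bad = "The answer is 4."
    print(format_reward(good))  # expected: 1.0
    print(format_reward(bad))   # expected: 0.0
```

If hand-written examples pass but real rollouts never do, that would point to the SFT stage (or the chat template) not teaching the model to emit the tags in the first place.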