I'm concerned that my accuracy reward shows no improvement or sign of learning after 20 steps in the RL stage (grpo.py) of my experiment. Could you please share your wandb plots, or confirm whether the reward improves at a reasonable pace on your end? Additionally, my format reward remains stuck at 0, which is very concerning, and the CoT length is decreasing. I hope it rises again with further steps and reasoning emerges, but so far all the plots suggest (if I'm reading them correctly) that this pipeline needs serious reconsideration.
I followed the same pipeline as the repo with the default parameters/hyperparameters (Qwen1.5B-Instruct, BK-17k for SFT, AI-MO for GRPO).
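For what it's worth, when the format reward is pinned at 0 it can help to run the reward function directly on a few raw rollouts. Here is a minimal sketch of such a check, assuming an R1-style format reward that matches `<think>...</think><answer>...</answer>` tags; the actual tags and regex in this repo's grpo.py may differ, so adjust accordingly:

```python
import re

# Assumed R1-style format pattern: completion must be exactly a <think> block
# followed by an <answer> block. Replace with the pattern used in grpo.py.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion matches the expected tag structure, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

# Sanity check: paste a few raw completions from your rollouts here. If even
# hand-crafted examples score 0, the regex (or the system prompt that asks the
# model to emit these tags) is the likely culprit.
if __name__ == "__main__":
    good = "<think>2 + 2 = 4</think>\n<answer>4</answer>"
    bad = "The answer is 4."
    print(format_reward(good))  # expected: 1.0
    print(format_reward(bad))   # expected: 0.0
```

If hand-written examples pass but real rollouts never do, that would point to the SFT stage (or the chat template) not teaching the model to emit the tags in the first place.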