different results of GRPO on qwen2-vl 2b #106

Open
munian08 opened this issue Feb 17, 2025 · 1 comment

@munian08

I have trained for 500 steps on 8×H20 GPUs so far; the curves are as follows:

[Image: my reward curves over 500 training steps]

format_reward keeps climbing toward 1, while accuracy_reward rises slowly and tops out at only 0.6, which does not match the official curves.
Official results:
[Image: official reward curves]

In the official results, accuracy_reward is high from the very start (beginning around 0.5 and reaching 0.9 within 100 steps), and format_reward is not a rising curve; it stays at 0 between steps 100 and 200. I am confused by this discrepancy.

The training script is as follows:
```bash
export DEBUG_MODE="true"  # enable debug to see the model's rollouts during RL
export LOG_PATH="./debug_log_2b.txt"

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    src/open_r1/grpo.py \
    --output_dir /R1-V/rl-output \
    --model_name_or_path /Qwen2-VL-2B-Instruct \
    --dataset_name /dataset/Clevr_CoGenT_TrainA_70K \
    --max_prompt_length 1024 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 2 \
    --run_name Qwen2-VL-2B-GRPO-CLEVR-70k \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 8  # number of outputs G in GRPO; reducing it speeds up training and lowers memory cost but increases variance
```
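For reference, and assuming trl's GRPOTrainer batch semantics at the time, these flags give an effective batch of per_device_train_batch_size × 8 GPUs × gradient_accumulation_steps = 1 × 8 × 2 = 16 completions per optimizer step, i.e. 2 prompts at num_generations = 8.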

@Syazvinski

Try setting the accuracy reward to 2 and the format reward to 1. Or even set the format reward to 0.5 and accuracy to 2, to teach the model that accuracy matters much more.
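A minimal sketch of what that reweighting could look like, assuming the reward functions follow trl's GRPOTrainer convention (each callable returns one float per completion). The accuracy_reward and format_reward bodies below are illustrative placeholders, not the repo's actual implementations:

```python
# Minimal sketch of reward reweighting for GRPO. The stand-in reward
# functions below are placeholders for the repo's real ones.

def weighted(reward_fn, weight):
    """Wrap a reward function so every score it returns is scaled by `weight`."""
    def wrapper(*args, **kwargs):
        return [weight * r for r in reward_fn(*args, **kwargs)]
    return wrapper

def accuracy_reward(completions, **kwargs):
    # Placeholder: the real function returns 1.0 for a correct answer, else 0.0.
    return [1.0 for _ in completions]

def format_reward(completions, **kwargs):
    # Placeholder: the real function returns 1.0 for a well-formatted output, else 0.0.
    return [1.0 for _ in completions]

# Weight accuracy 2x and format 0.5x so that correct answers dominate the
# group-relative advantage estimate.
reward_funcs = [weighted(accuracy_reward, 2.0), weighted(format_reward, 0.5)]
```

The wrapped functions would then be passed to the trainer (e.g. as GRPOTrainer's reward_funcs) in place of the unweighted originals.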
