different results of GRPO on qwen2-vl 2b #106

Open
munian08 opened this issue Feb 17, 2025 · 1 comment

@munian08

I have trained for 500 steps on 8×H20 GPUs so far; the curves are as follows:

[Image: my reward curves over 500 training steps]

format_reward keeps climbing toward 1, while accuracy_reward rises slowly and tops out at only 0.6, which does not match the official curves.
Official results:
[Image: official reward curves]

In the official results, accuracy_reward is high from the very start (beginning around 0.5 and reaching 0.9 within 100 steps), and format_reward is not a rising curve; it stays at 0 between steps 100 and 200. I am confused by this discrepancy.

The training script is as follows:
```bash
export DEBUG_MODE="true"  # enable debug to see the model's rollouts during RL
export LOG_PATH="./debug_log_2b.txt"

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    src/open_r1/grpo.py \
    --output_dir /R1-V/rl-output \
    --model_name_or_path /Qwen2-VL-2B-Instruct \
    --dataset_name /dataset/Clevr_CoGenT_TrainA_70K \
    --max_prompt_length 1024 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 2 \
    --run_name Qwen2-VL-2B-GRPO-CLEVR-70k \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 8  # number of outputs G in GRPO; reducing it speeds up training and lowers memory cost but increases variance
```
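For reference, and assuming trl's GRPOTrainer batch semantics at the time, these flags give an effective batch of per_device_train_batch_size × 8 GPUs × gradient_accumulation_steps = 1 × 8 × 2 = 16 completions per optimizer step, i.e. 2 prompts at num_generations = 8.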

@Syazvinski

Try setting the accuracy reward to 2 and the format reward to 1. Or even set the format reward to 0.5 and accuracy to 2, to teach the model that accuracy matters much more.
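A minimal sketch of what that reweighting could look like, assuming the reward functions follow trl's GRPOTrainer convention (each callable returns one float per completion). The accuracy_reward and format_reward bodies below are illustrative placeholders, not the repo's actual implementations:

```python
# Minimal sketch of reward reweighting for GRPO. The stand-in reward
# functions below are placeholders for the repo's real ones.

def weighted(reward_fn, weight):
    """Wrap a reward function so every score it returns is scaled by `weight`."""
    def wrapper(*args, **kwargs):
        return [weight * r for r in reward_fn(*args, **kwargs)]
    return wrapper

def accuracy_reward(completions, **kwargs):
    # Placeholder: the real function returns 1.0 for a correct answer, else 0.0.
    return [1.0 for _ in completions]

def format_reward(completions, **kwargs):
    # Placeholder: the real function returns 1.0 for a well-formatted output, else 0.0.
    return [1.0 for _ in completions]

# Weight accuracy 2x and format 0.5x so that correct answers dominate the
# group-relative advantage estimate.
reward_funcs = [weighted(accuracy_reward, 2.0), weighted(format_reward, 0.5)]
```

The wrapped functions would then be passed to the trainer (e.g. as GRPOTrainer's reward_funcs) in place of the unweighted originals.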
