Qwen2.5-VL-3B performs worse than Qwen2-VL-2B on the SuperCLEVR test set #95

mxqin · 2025-02-14T07:48:18Z

I compared Qwen2.5-VL-3B and Qwen2-VL-2B after R1 fine-tuning on the SuperCLEVR test set and found that Qwen2.5-VL-3B performs worse than Qwen2-VL-2B.
Qwen2-VL-2B reaches about 83.5%, while Qwen2.5-VL-3B reaches only about 78.5%.

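The experiments were trained on 2 A800 GPUs. To keep the amount of data consistent with the authors' setup, I ran 400 steps.
Below is the fine-tuning log for Qwen2-VL-2B:

Below is the fine-tuning log for Qwen2.5-VL-3B: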

Is this conclusion correct, or is there a problem with my reproduction?

Yukino256 · 2025-02-14T09:15:29Z

Why is the format reward so low? The model's instruction-following ability shouldn't be this poor 🤔

mxqin · 2025-02-14T10:17:59Z

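> Why is the format reward so low? The model's instruction-following ability shouldn't be this poor 🤔

The outputs do follow a format, and almost all of them are in the correct format.

It may be related to the randomly selected data: after retraining, the format reward is fairly high. But the result is still the same, and accuracy is lower than with Qwen2.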

LYA-Ansel · 2025-02-14T14:33:54Z

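On my side, I trained Qwen2.5-VL-3B on 6 A100 GPUs. The 300-step checkpoint reaches 89.0% accuracy on SuperCLEVR.
However, the later checkpoints (400 to 2500 steps) mostly land in the 70s to 80s, with a minimum of 54%. Training for more steps did not improve accuracy (though all of them are still better than 2.5VL-2B, 7B, and 72B).
Are you using the 37.8k-row dataset? Alternatively, try lowering save_steps to see whether you can find a checkpoint with higher accuracy.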

mxqin · 2025-02-14T14:54:56Z

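> On my side, I trained Qwen2.5-VL-3B on 6 A100 GPUs. The 300-step checkpoint reaches 89.0% accuracy on SuperCLEVR. However, the later checkpoints (400 to 2500 steps) mostly land in the 70s to 80s, with a minimum of 54%. Training for more steps did not improve accuracy (though all of them are still better than 2.5VL-2B, 7B, and 72B). Are you using the 37.8k-row dataset? Alternatively, try lowering save_steps to see whether you can find a checkpoint with higher accuracy.

Thank you very much for sharing your results. Would you mind sharing your training curves and training parameter settings?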

TobiasLee · 2025-02-16T08:40:35Z

Is it possible that in the later stages of training the model's thinking output becomes so long that generation stops before it reaches an answer? Could you check the logs? We observed that in the 37K R1 data the median length of the thinking rationale is probably around 1K tokens, so the max_new_tokens parameter in the inference script may need to be adjusted accordingly.
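One way to test this hypothesis is to count how many evaluation outputs never reach a complete answer. The snippet below is only a minimal sketch: it assumes the model is prompted to answer in a `<think>...</think><answer>...</answer>` layout (the thread does not state the exact tags), and `extract_answer` and `missing_answer_rate` are hypothetical helper names, not functions from the repository.

```python
import re

# Assumption: answers are wrapped in <answer>...</answer>; adjust the
# pattern if the evaluation prompt uses a different layout.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(output: str):
    """Return the text of the last complete <answer> block, or None."""
    matches = ANSWER_RE.findall(output)
    return matches[-1].strip() if matches else None


def missing_answer_rate(outputs):
    """Fraction of outputs that contain no complete <answer> block."""
    if not outputs:
        return 0.0
    missing = sum(1 for text in outputs if extract_answer(text) is None)
    return missing / len(outputs)


if __name__ == "__main__":
    # Illustrative strings only, not real model outputs.
    samples = [
        "<think>Three cars and one bus.</think><answer>4</answer>",
        "<think>First count the cars, then the buses, then",  # cut off
    ]
    print(f"missing answer rate: {missing_answer_rate(samples):.2f}")
```

If this rate grows for later checkpoints, truncation is a plausible cause, and rerunning the evaluation with a larger max_new_tokens would be a reasonable next step.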

munian08 · 2025-02-17T09:36:01Z

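> > Why is the format reward so low? The model's instruction-following ability shouldn't be this poor 🤔
>
> The outputs do follow a format, and almost all of them are in the correct format.
>
> It may be related to the randomly selected data: after retraining, the format reward is fairly high. But the result is still the same, and accuracy is lower than with Qwen2.

Did any parameters change for the "retraining"? The format_reward looks very different between the two training logs.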

Jia-py · 2025-02-18T08:24:39Z

I ran into the same issue. Have you found the cause?

lzk9508 · 2025-02-20T09:03:15Z

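> On my side, I trained Qwen2.5-VL-3B on 6 A100 GPUs. The 300-step checkpoint reaches 89.0% accuracy on SuperCLEVR. However, the later checkpoints (400 to 2500 steps) mostly land in the 70s to 80s, with a minimum of 54%. Training for more steps did not improve accuracy (though all of them are still better than 2.5VL-2B, 7B, and 72B). Are you using the 37.8k-row dataset? Alternatively, try lowering save_steps to see whether you can find a checkpoint with higher accuracy.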

Qwen2.5-VL-3B runs out of GPU memory for me. Is there any way around this?
Even a script like this one goes OOM:
export DEBUG_MODE="true"
export LOG_PATH="./debug_log_2b.txt"
export WANDB_MODE=offline

torchrun --nproc_per_node="4" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    src/open_r1/grpo.py \
    --output_dir "./output/" \
    --model_name_or_path "/245_disk/Qwen2.5-VL-3B-Instruct/" \
    --dataset_name "/245_disk/mb_train_sft_1121/my_local_dataset" \
    --max_prompt_length 1024 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 2 \
    --run_name Qwen2-VL-3B-GRPO-classsify \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 2

mxqin · 2025-02-20T11:12:57Z

> > On my side, I trained Qwen2.5-VL-3B on 6 A100 GPUs. The 300-step checkpoint reaches 89.0% accuracy on SuperCLEVR. However, the later checkpoints (400 to 2500 steps) mostly land in the 70s to 80s, with a minimum of 54%. Training for more steps did not improve accuracy (though all of them are still better than 2.5VL-2B, 7B, and 72B). Are you using the 37.8k-row dataset? Alternatively, try lowering save_steps to see whether you can find a checkpoint with higher accuracy.
>
> Qwen2.5-VL-3B runs out of GPU memory for me. Is there any way around this? Even a script like this one goes OOM: export DEBUG_MODE="true" export LOG_PATH="./debug_log_2b.txt" export WANDB_MODE=offline
>
> torchrun --nproc_per_node="4" --nnodes="1" --node_rank="0" --master_addr="127.0.0.1" --master_port="12345" src/open_r1/grpo.py --output_dir "./output/" --model_name_or_path "/245_disk/Qwen2.5-VL-3B-Instruct/" --dataset_name "/245_disk/mb_train_sft_1121/my_local_dataset" --max_prompt_length 1024 --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --logging_steps 1 --bf16 --report_to wandb --gradient_checkpointing false --attn_implementation flash_attention_2 --max_pixels 401408 --num_train_epochs 2 --run_name Qwen2-VL-3B-GRPO-classsify --save_steps 100 --save_only_model true --num_generations 2

Add --deepspeed local_scripts/zero3.json to the command.
