Qwen2.5-VL-3B performs worse than Qwen2-VL-2B on the SuperCLEVR test set #95
Comments
Why is the format reward this low? The model's instruction-following ability shouldn't be this poor 🤔
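A format reward of this kind is usually a simple regex check on the completion. A minimal sketch, assuming the R1-style `<think>…</think><answer>…</answer>` output template (the function name and exact pattern are illustrative, not necessarily the repo's implementation):

```python
import re

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed
    <think>...</think><answer>...</answer> template, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0
```

A consistently low format reward would then mean most completions fail this full-match check, e.g. because they are truncated before the closing tags.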
On my side, running Qwen2.5-VL-3B on 6x A100 GPUs and evaluating the 300-step checkpoint on SuperCLEVR, I get 89.0% accuracy.
Thank you very much for sharing your results. Could you share your training curves and hyperparameter settings?
Is it possible that in the later stages of training, the model's thinking output becomes too long and never reaches an answer? Could you share the logs? We observed that with the 37K R1 data samples, the median length of the thinking rationale is around 1K tokens, so the max_new_tokens parameter in the inference script may need to be adjusted accordingly.
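One way to check the logs for this failure mode is to test whether each generation contains a complete answer tag before it was cut off. A minimal sketch (the helper name is illustrative; it assumes the `<answer>…</answer>` template used above):

```python
import re

def extract_answer(generation: str):
    """Return the answer text, or None if the generation was truncated
    before a complete <answer>...</answer> tag was produced."""
    m = re.search(r"<answer>(.*?)</answer>", generation, re.DOTALL)
    return m.group(1).strip() if m else None

# If many generations return None, the thinking rationale likely exceeded
# the generation budget; raising max_new_tokens in the inference script
# (e.g. to comfortably above the ~1K-token median rationale) should help.
```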
I ran into the same issue. Did you find the cause?
Qwen2.5-VL-3B runs out of GPU memory. Is there any workaround? torchrun --nproc_per_node="4"
Add --deepspeed local_scripts/zero3.json
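For reference, a ZeRO-3 config of this kind typically looks like the following. This is an illustrative sketch of common DeepSpeed stage-3 settings, not necessarily the contents of the repo's local_scripts/zero3.json:

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

Stage 3 shards parameters, gradients, and optimizer states across GPUs, which is what reduces the per-GPU memory footprint enough to fit the 3B model.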
Comparing Qwen2.5-VL-3B and Qwen2-VL-2B fine-tuned with R1 on the SuperCLEVR test set, I found that Qwen2.5-VL-3B performs worse than Qwen2-VL-2B.
Qwen2-VL-2B reaches about 83.5%, while Qwen2.5-VL-3B only reaches about 78.5%.
The experiment was trained on 2x A800 GPUs; to keep the data volume consistent with the authors', I ran 400 steps.
Below is the fine-tuning log for Qwen2-VL-2B:
Below is the fine-tuning log for Qwen2.5-VL-3B:
Is this conclusion correct, or is there a problem with my reproduction?