I was training Qwen2-VL-2B-Instruct on 8 80G GPUs (7 for training and 1 for vLLM). The training dataset is the authors' provided GEOQA_R1V_Train_8K dataset (8,031 samples in total).

I set `per_device_train_batch_size=1`, `gradient_accumulation_steps=4`, and `num_train_epochs=1`. In my understanding, the global train batch size would be 1*4*7=28, and the total number of training steps should be 8031*1/28 ≈ 286.82. But the training log gives me a total of 2007 training steps. Is there something wrong with the Python script, or did I get it wrong?
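For reference, this is the plain arithmetic behind my estimate (just a sketch of my own calculation with illustrative variable names, not how the trainer counts steps internally):

```python
# Expected total training steps from my understanding of the config.
# All values below are taken from my setup; the formula is an assumption.
num_samples = 8031                  # GEOQA_R1V_Train_8K
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_training_gpus = 7               # 8 GPUs total, 1 reserved for vLLM
num_train_epochs = 1

global_batch_size = (per_device_train_batch_size
                     * gradient_accumulation_steps
                     * num_training_gpus)
expected_steps = num_samples * num_train_epochs / global_batch_size

print(global_batch_size)  # 28
print(expected_steps)     # ~286.82, yet the log reports 2007 steps
```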
My training script is: