The first issue is that the more GPUs I use, the slower fine-tuning becomes. With a single 4090 (24 GB) the training time is 46 hours, whereas with 7 * 4090 (24 GB) it increases to 55 hours. In both cases, GPU memory usage is around 10 GB.
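In case it helps narrow this down, here is a minimal timing sketch (plain PyTorch DDP launched with torchrun, not the repo's actual training script; the model, batch, and step count below are placeholders) that compares a step with and without gradient synchronization. 4090s have no NVLink, so DDP's all-reduce goes over PCIe, and when the per-GPU compute is small that communication can easily make 7 GPUs slower per step than one:

```python
# Hypothetical diagnostic sketch: times a forward/backward step with and without
# DDP gradient all-reduce. Model and batch are placeholders, not the real fine-tuning setup.
import contextlib
import time
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def avg_step_time(model, batch, n_steps, sync):
    """Average time of one forward/backward step, with or without gradient all-reduce."""
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_steps):
        ctx = contextlib.nullcontext() if sync else model.no_sync()
        with ctx:
            loss = model(batch).mean()          # placeholder loss
            loss.backward()
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return (time.time() - start) / n_steps

if __name__ == "__main__":
    dist.init_process_group("nccl")
    rank = dist.get_rank()                      # single node, so global rank == local rank
    torch.cuda.set_device(rank)

    # Placeholder model and batch; substitute the actual fine-tuning model and inputs.
    model = DDP(torch.nn.Linear(4096, 4096).cuda(), device_ids=[rank])
    batch = torch.randn(8, 4096, device="cuda")

    t_sync = avg_step_time(model, batch, n_steps=20, sync=True)    # with all-reduce
    t_local = avg_step_time(model, batch, n_steps=20, sync=False)  # gradients stay local
    if rank == 0:
        print(f"per-step with sync: {t_sync:.4f}s, without sync: {t_local:.4f}s")
    dist.destroy_process_group()
```

If the gap between the two timings is large, the multi-GPU run is communication-bound, and the usual knobs are fewer syncs per sample (gradient accumulation) or DDP bucket tuning.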
The second issue is that when fine-tuning the model, increasing the batch size leads to a longer training time. When using 7 * 4090, if the batch size in finetune.sh is set to:
--train_batch_size=8 \
--sample_batch_size=16 \
the training time is 55 hours. However, when the batch size is set to:
--train_batch_size=16 \
--sample_batch_size=32 \
the training time increases to 98 hours.
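For what it's worth, this looks consistent with a fixed step budget: if the script runs a fixed number of optimizer steps (rather than a fixed number of epochs), doubling train_batch_size doubles the samples processed per step, so wall-clock time grows roughly linearly with batch size, minus whatever per-step overhead stays constant; that would also explain why 55 h to 98 h is a bit under 2x. A back-of-the-envelope sketch (all numbers below are hypothetical placeholders, not measurements):

```python
# Back-of-the-envelope sketch: assumes a fixed number of optimizer steps, not a fixed
# number of epochs. Every number here is a made-up placeholder to show the pattern.
max_train_steps = 10_000       # hypothetical fixed number of optimizer steps
per_sample_s    = 1.0          # hypothetical GPU seconds of work per sample
overhead_s      = 4.0          # hypothetical per-step cost independent of batch size

for train_batch_size in (8, 16):
    step_s = train_batch_size * per_sample_s + overhead_s
    hours  = max_train_steps * step_s / 3600
    print(f"batch={train_batch_size}: ~{hours:.0f} h, "
          f"{train_batch_size * max_train_steps} samples seen in total")
```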
I’d really appreciate any insights you can share on this issue. Looking forward to your response! :)
"It seems that batch_size is linearly related to the training time. Does this mean that for each train_step, the model is trained using the amount of data specified by the batch_size?"