Some problems about training time #88

Open
wefdewfwerwefestergwet4wwttqtqtqt opened this issue Mar 16, 2025 · 1 comment

@wefdewfwerwefestergwet4wwttqtqtqt
  1. The first issue is that the more GPUs I use, the slower fine-tuning becomes. With a single 4090 (24 GB), training takes 46 hours, whereas with 7 × 4090 (24 GB) the training time increases to 55 hours. In both cases, GPU memory usage is around 10 GB.

  2. The second issue is that increasing the batch size when fine-tuning leads to a longer training time. When using 7 × 4090, if the batch size in finetune.sh is set to:

--train_batch_size=8 \
--sample_batch_size=16 \

the training time is 55 hours. However, when the batch size is set to:

--train_batch_size=16 \
--sample_batch_size=32 \

the training time increases to 98 hours.

I’d really appreciate any insights you can share on this issue. Looking forward to your response! :)
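One way to make the single-GPU and multi-GPU runs directly comparable is to log throughput in samples per second rather than projected hours; if seven GPUs report roughly the same samples/sec as one, the launcher is probably replicating the data instead of sharding it across ranks. Below is a minimal sketch of such a measurement, assuming a standard PyTorch training loop; `timed_epoch`, the MSE loss, and the argument names are placeholders, not code from this repository.

```python
import time
import torch
import torch.distributed as dist

def timed_epoch(model, loader, optimizer, device):
    """Run one epoch and report global throughput in samples/sec (sketch only)."""
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    samples, start = 0, time.time()
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)  # placeholder objective
        loss.backward()
        optimizer.step()
        samples += inputs.size(0) * world_size  # global samples processed this step
    elapsed = time.time() - start
    print(f"throughput: {samples / elapsed:.1f} samples/sec (world_size={world_size})")
```

Comparing this number between the 1-GPU and 7-GPU runs would show whether the extra cards actually add throughput or mostly communication overhead.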

@wefdewfwerwefestergwet4wwttqtqtqt
Author

It seems that batch_size is linearly related to the training time. Does this mean that for each train step, the model is trained on the number of samples specified by batch_size?

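For what it's worth, a roughly linear relationship would be expected if the run length is defined as a fixed number of optimizer steps rather than a fixed number of samples: each step then processes train_batch_size samples, so doubling the batch size roughly doubles the work per step and the wall-clock time. A back-of-the-envelope sketch, with all numbers chosen only for illustration and not taken from finetune.sh:

```python
# Hypothetical numbers chosen only to illustrate the scaling, not read from finetune.sh.
max_train_steps = 100_000   # assumed fixed step budget
time_per_sample = 0.225     # seconds of compute per sample (illustrative)

for train_batch_size in (8, 16):
    step_time = train_batch_size * time_per_sample   # each step processes one full batch
    hours = max_train_steps * step_time / 3600
    print(f"train_batch_size={train_batch_size}: ~{hours:.0f} h for a fixed {max_train_steps}-step run")
```

If the budget were expressed in samples or epochs instead, the step count would halve when the batch size doubles, and wall-clock time should stay roughly flat as long as the GPU is not already saturated.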