Some problems about training time #88

Open
wefdewfwerwefestergwet4wwttqtqtqt opened this issue Mar 16, 2025 · 1 comment

@wefdewfwerwefestergwet4wwttqtqtqt
  1. The first issue is that the more GPUs I use, the slower fine-tuning becomes. With a single 4090 (24 GB), training takes 46 hours, whereas with 7 × 4090 (24 GB) the training time increases to 55 hours. In both cases, GPU memory usage is around 10 GB.

  2. The second issue is that increasing the batch size when fine-tuning leads to a longer training time. When using 7 × 4090, if the batch size in finetune.sh is set to:

--train_batch_size=8 \
--sample_batch_size=16 \

the training time is 55 hours. However, when the batch size is set to:

--train_batch_size=16 \
--sample_batch_size=32 \

the training time increases to 98 hours.

I’d really appreciate any insights you can share on this issue. Looking forward to your response! :)
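One way to make the single-GPU and multi-GPU runs directly comparable is to log throughput in samples per second rather than projected hours; if seven GPUs report roughly the same samples/sec as one, the launcher is probably replicating the data instead of sharding it across ranks. Below is a minimal sketch of such a measurement, assuming a standard PyTorch training loop; `timed_epoch`, the MSE loss, and the argument names are placeholders, not code from this repository.

```python
import time
import torch
import torch.distributed as dist

def timed_epoch(model, loader, optimizer, device):
    """Run one epoch and report global throughput in samples/sec (sketch only)."""
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    samples, start = 0, time.time()
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)  # placeholder objective
        loss.backward()
        optimizer.step()
        samples += inputs.size(0) * world_size  # global samples processed this step
    elapsed = time.time() - start
    print(f"throughput: {samples / elapsed:.1f} samples/sec (world_size={world_size})")
```

Comparing this number between the 1-GPU and 7-GPU runs would show whether the extra cards actually add throughput or mostly communication overhead.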

@wefdewfwerwefestergwet4wwttqtqtqt
Author

It seems that batch_size is linearly related to the training time. Does this mean that for each train step, the model is trained on the number of samples specified by batch_size?

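For what it's worth, a roughly linear relationship would be expected if the run length is defined as a fixed number of optimizer steps rather than a fixed number of samples: each step then processes train_batch_size samples, so doubling the batch size roughly doubles the work per step and the wall-clock time. A back-of-the-envelope sketch, with all numbers chosen only for illustration and not taken from finetune.sh:

```python
# Hypothetical numbers chosen only to illustrate the scaling, not read from finetune.sh.
max_train_steps = 100_000   # assumed fixed step budget
time_per_sample = 0.225     # seconds of compute per sample (illustrative)

for train_batch_size in (8, 16):
    step_time = train_batch_size * time_per_sample   # each step processes one full batch
    hours = max_train_steps * step_time / 3600
    print(f"train_batch_size={train_batch_size}: ~{hours:.0f} h for a fixed {max_train_steps}-step run")
```

If the budget were expressed in samples or epochs instead, the step count would halve when the batch size doubles, and wall-clock time should stay roughly flat as long as the GPU is not already saturated.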