I have a few questions about the scripts and runtime:
What is the execution time of your experiment when using 112 A100 GPUs?
I saw the script scripts/cpt/fpt.sh, which uses 1 node with 8 GPUs. Is this also for pretraining a LLaMA-MoE model? If so, what is the execution time of that experiment?
I am wondering if there is any way to run the pretraining on 2 (or even 1) GPUs for proof-of-concept purposes. Reducing the architecture size is probably the first thing to try, but I am curious whether you have any experience with model pretraining in low-resource settings.
Thanks in advance.
Hi there, sorry for the late response. Thank you very much for your attention to our project ❤️
For LLaMA-MoE-3.5B (2/8), it takes about one week to reproduce the experiment with 112 A100 GPUs.
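For scale, that works out to roughly 112 GPUs × 24 h × 7 days ≈ 18,800 A100-hours of compute.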
The scripts/cpt/fpt.sh script is used for testing purposes and does not reproduce the results reported in the technical report.
In this case, I would recommend testing on smaller LLMs; the SmolLM series may be a good choice for your purposes. You would not have to train on 200B tokens, and the training time would be greatly reduced.
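If you go the small-model route, here is a minimal sketch of how one might instantiate a deliberately tiny LLaMA-style dense model for a proof-of-concept run. This assumes the Hugging Face `transformers` API rather than the LLaMA-MoE codebase, and all sizes below are illustrative, not validated settings:

```python
# Sketch: a deliberately tiny LLaMA-style model for a 1-2 GPU
# proof-of-concept run. Sizes are illustrative assumptions only.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,            # keep the tokenizer's vocabulary size
    hidden_size=256,             # vs. 4096 in a 7B-scale model
    intermediate_size=688,       # MLP width, scaled down accordingly
    num_hidden_layers=4,         # vs. 32
    num_attention_heads=4,       # vs. 32
    max_position_embeddings=2048,
)
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```

Once a configuration at this scale trains stably, the same pipeline can be scaled up gradually as more GPUs become available.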