I have a few questions about the scripts and runtime:
What is the execution time of your experiment when using 112 A100 GPUs?
I saw the script scripts/cpt/fpt.sh, which uses 1 node with 8 GPUs. Is this also for pretraining a LLaMA-MoE model? If so, what is the execution time of that experiment?
I am wondering if there is any way to run the pretraining on 2 (or even 1) GPUs for proof-of-concept purposes. Reducing the architecture size is probably the first thing to try, but I am curious whether you have any experience with model pretraining in low-resource settings.
Thanks in advance.
Hi there, sorry for the late response. Thank you very much for your attention to our project ❤️
For LLaMA-MoE-3.5B (2/8), it takes about one week to reproduce the experiment with 112 A100 GPUs.
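For scale, that works out to roughly 112 GPUs × 24 h × 7 days ≈ 18,800 A100-hours of compute.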
The scripts/cpt/fpt.sh script is used for testing purposes and does not reproduce the results reported in the technical report.
In this case, I would recommend testing on smaller LLMs; the SmolLM series may be a good choice for your purposes. You would not have to train on 200B tokens, and the training time would be greatly reduced.
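If you go the small-model route, here is a minimal sketch of how one might instantiate a deliberately tiny LLaMA-style dense model for a proof-of-concept run. This assumes the Hugging Face `transformers` API rather than the LLaMA-MoE codebase, and all sizes below are illustrative, not validated settings:

```python
# Sketch: a deliberately tiny LLaMA-style model for a 1-2 GPU
# proof-of-concept run. Sizes are illustrative assumptions only.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,            # keep the tokenizer's vocabulary size
    hidden_size=256,             # vs. 4096 in a 7B-scale model
    intermediate_size=688,       # MLP width, scaled down accordingly
    num_hidden_layers=4,         # vs. 32
    num_attention_heads=4,       # vs. 32
    max_position_embeddings=2048,
)
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```

Once a configuration at this scale trains stably, the same pipeline can be scaled up gradually as more GPUs become available.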