I'm always frustrated that I can't estimate how many resources a model will consume when training large language models, or tell whether my training configuration will lead to an out-of-memory error. It's equally frustrating not to know the minimum number of GPUs needed, which makes it hard to allocate resources appropriately. Running the model to answer these questions is both time-consuming and inefficient. I would also like more detailed information about the training process, such as communication data and the mapping between GPUs and the model.
To tackle these issues, I've developed the Analysis Tool for offline analysis of memory requirements and communication data during Megatron-LM GPTModel training under hybrid parallel strategies.
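To make the kind of offline estimate concrete, here is a minimal sketch. It is not the Analysis Tool's actual code. It assumes the common approximation of 12h² + 13h parameters per transformer layer plus (V + s)·h for embeddings, about 16 bytes of model state per parameter (mixed-precision Adam), and parameters split evenly across tensor-parallel and pipeline-parallel ranks. Activations, communication buffers, and embedding replication are ignored, and all function names are hypothetical.

```python
import math


def gpt_param_count(num_layers, hidden_size, vocab_size, seq_length):
    """Approximate parameter count of a GPT-style model."""
    # Attention + MLP weights and biases, plus layer norms, per layer.
    per_layer = 12 * hidden_size**2 + 13 * hidden_size
    # Token and position embeddings.
    embeddings = (vocab_size + seq_length) * hidden_size
    return num_layers * per_layer + embeddings


def model_state_bytes_per_gpu(num_params, tensor_parallel, pipeline_parallel,
                              bytes_per_param=16):
    """Weights + gradients + optimizer state held by one GPU.

    Assumption: parameters are divided evenly over TP x PP ranks and
    bytes_per_param=16 (fp16 weights and grads, fp32 Adam states).
    """
    shards = tensor_parallel * pipeline_parallel
    return num_params * bytes_per_param / shards


def min_gpus_for_model_states(num_params, gpu_mem_bytes, bytes_per_param=16):
    """Lower bound on GPU count from model states alone (no activations)."""
    return math.ceil(num_params * bytes_per_param / gpu_mem_bytes)


if __name__ == "__main__":
    # Illustrative configuration only, not taken from the issue.
    params = gpt_param_count(num_layers=32, hidden_size=4096,
                             vocab_size=50257, seq_length=2048)
    per_gpu = model_state_bytes_per_gpu(params, tensor_parallel=4,
                                        pipeline_parallel=2)
    print(f"parameters: {params:,}")
    print(f"model states per GPU: {per_gpu / 2**30:.1f} GiB")
    print("min GPUs (80 GiB each):",
          min_gpus_for_model_states(params, 80 * 2**30))
```

A real estimate would also need activation memory, which depends on micro-batch size, sequence length, and recomputation settings, so the numbers above are only a lower bound.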
What do you think of this tool?
yxyOo changed the title from "[ENHANCEMENT]" to "[ENHANCEMENT]Analysis Tool" on Sep 6, 2023