I'm trying to train a GPT-2 large (774M) model on a V100-32GB GPU. Even though this model is not big, I can't fit it onto a single GPU; training always fails with the error below (terminal output attached).
"torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 786.00 MiB (GPU 0; 31.74 GiB total capacity; 28.82 GiB already allocated; 577.12 MiB free; 29.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
Below are my training config and model-size calculation. Based on my calculation, and especially with the help of @yxyOo's tool, I figured this model should only need about 23.3 GB, well under the 32 GB of GPU memory. However, I still hit OOM errors, so I'm confused about why Megatron needs so much memory during training.
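For context, a back-of-the-envelope sketch of the static model states alone, assuming Megatron-style mixed-precision training with Adam (fp16 weights and gradients plus fp32 master weights and two fp32 optimizer moments, i.e. 16 bytes per parameter):

```python
params = 774e6  # GPT-2 large parameter count

# Mixed-precision Adam, bytes per parameter:
#   fp16 weights (2) + fp16 grads (2)
#   + fp32 master weights (4) + fp32 Adam momentum (4) + fp32 Adam variance (4)
bytes_per_param = 2 + 2 + 4 + 4 + 4  # = 16

print(f"model states: {params * bytes_per_param / 2**30:.1f} GiB")  # ~11.5 GiB
```

So roughly 11.5 GiB of the 23.3 GB estimate is model states; the remainder is activations, which is exactly the part that depends on whether flash attention is used.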
Also, I didn't use the use-flash-atten option to save memory, because I only have access to V100 GPUs, which don't support it. Is that why the actual memory usage is bigger than the theoretical estimate?
The default output of this tool assumes use-flash-atten, which does not match your usage scenario.
Please refer to the "Limitations" section of the Analysis Tool.
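To put a rough number on the mismatch: without flash attention, the attention score matrices are materialized and saved for the backward pass. A sketch for GPT-2 large (36 layers, 20 heads), assuming sequence length 1024, fp16 activations, and a hypothetical micro-batch size of 4; the exact set of saved tensors depends on the implementation, so treat this as a lower bound:

```python
layers, heads, seq = 36, 20, 1024
micro_batch = 4   # assumed micro-batch size; adjust to your config
fp16_bytes = 2

# One (heads, seq, seq) score matrix per sample per layer, kept for backward.
scores_per_layer = micro_batch * heads * seq * seq * fp16_bytes
print(f"attention scores: {layers * scores_per_layer / 2**30:.1f} GiB")  # ~5.6 GiB

# The softmax output is typically saved as well, roughly doubling this.
# Flash attention avoids materializing these tensors entirely.
```

Several extra GiB of attention activations on top of the tool's flash-attention estimate is enough to push a 23.3 GB prediction past 32 GB once allocator overhead and fragmentation are included.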