Analysis Tool #482
base: main
Conversation
This looks interesting! How accurate is it?
Really awesome!!
We randomly selected several parallel configurations and conducted "Memory Requirement" tests on the 7B llama2 model using a single H800 machine with eight cards. The results showed that the error was within 1% for all measurements. All other values output by the tool were theoretical.
+1 for Really awesome!
Thanks for the PR. It is really a nice feature. However, the README seems inconsistent with the implementation.
I left some comments below. Please take a look.
# Calculation Method Explanation
We analyze the memory requirements of the model parameters, gradients, and optimizer states and the communication behavior of different parallel dimensions based on Megatron ([1](https://arxiv.org/pdf/1909.08053.pdf), [2](https://arxiv.org/pdf/2104.04473.pdf), and [3](https://arxiv.org/pdf/2205.05198)).
To estimate the memory requirements for the activation portion, given that Megatron supports FlashAttention and Fusion computations, we have adopted a distinctive approach. This method involves collecting the memory address and size information of the corresponding operations each time the cudaMalloc and cudaFree functions are executed, and then conducting line-by-line analysis of this information to derive a computational formula. To implement this method, we used the [torch.cuda.CUDAPluggableAllocator](https://pytorch.org/docs/stable/notes/cuda.html#using-custom-memory-allocators-for-cuda) to customize the memory allocator.
Thanks for the PR. Could you point out where you used torch.cuda.CUDAPluggableAllocator to estimate the activation memory? I did not find it.
@zhipeng93 It is not here. To use it, one must write a shared library that implements the interface and set it at the beginning of the PyTorch program (using ctypes to load it).
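For context, a minimal sketch of the Python side of that setup, assuming a separately compiled shared library; the file name alloc_logger.so and the exported symbols logged_malloc/logged_free are placeholders, not the PR's actual code:

```python
# Sketch only: route PyTorch's CUDA allocations through a pluggable
# allocator whose malloc/free hooks log pointer and size information.
# The shared library "alloc_logger.so" and its exported symbols are
# assumed to be built separately in C/C++ against the CUDA runtime.
import torch

logging_allocator = torch.cuda.memory.CUDAPluggableAllocator(
    "./alloc_logger.so",  # placeholder path to the compiled allocator
    "logged_malloc",      # placeholder exported malloc symbol
    "logged_free",        # placeholder exported free symbol
)

# Must run before any CUDA memory is allocated in the process.
torch.cuda.memory.change_current_allocator(logging_allocator)

# Every allocation/free PyTorch performs now goes through the custom
# allocator, so its log can be analyzed line by line afterwards.
x = torch.empty(1024, 1024, device="cuda")
```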
tp_comm_size = tp_comm_count * s * b * h
dp_comm_count = 0 if d == 1 else 2
dp_comm_size = total_parameters_per_gpu * 4 if args.bf16 else total_parameters_per_gpu * 2
Could you explain why dp_comm_size = total_parameters_per_gpu * 4 for bf16, while * 2 for fp16 and fp32?
You can refer to this link: https://github.com/NVIDIA/Megatron-LM?tab=readme-ov-file#distributed-optimizer
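If I read that section correctly, the intuition is that bf16 runs reduce fp32 main gradients across data-parallel ranks (4 bytes per parameter), while fp16 runs reduce the 2-byte gradients directly; a tiny sketch of that byte accounting (function and variable names are mine, not the tool's):

```python
# Hedged sketch of the data-parallel gradient-communication accounting
# discussed above: 4 bytes/param for bf16 (fp32 main gradients are
# reduced), 2 bytes/param otherwise, mirroring the quoted formula.
def dp_grad_comm_bytes(params_per_gpu: int, bf16: bool) -> int:
    bytes_per_grad = 4 if bf16 else 2
    return params_per_gpu * bytes_per_grad

# Example: a 1B-parameter shard moves ~4 GB per reduction in bf16
# and ~2 GB in fp16.
print(dp_grad_comm_bytes(1_000_000_000, bf16=True) / 1e9)   # 4.0
print(dp_grad_comm_bytes(1_000_000_000, bf16=False) / 1e9)  # 2.0
```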
7 * h + 4 * h * h / t + 3 * f * h / t + 2 * f) * per_stage_layer_num
total_parameters_per_gpu_formatted = f'{int(total_parameters_per_gpu):,}'
activations = n * (10 * s * h * b +
Can you explain the formula for computing activation memory so that we can understand the intuition behind it?
The explanation for this formula can be found in the "Calculation Method Explanation" section.
Is this verification based on the code base here or that used?
Hi @yxyOo:
Thanks
Hi, @yxyOo this is a great feature! While my suggestion might seem a bit much, I believe it would be beneficial to use the default argument parser from training. This way, you could simply replace the training executable name with this tool and receive an analysis without additional effort (or even print it before training if you want). Moreover, it's particularly handy for LLaMa checkpoints, as most arguments are read directly from the checkpoints (
Here is a simple training script that computes the "theoretical" memory usage of a model: https://github.com/NVIDIA/Megatron-LM/blob/main/compute_memory_usage.py. It re-uses the existing argument parser so we can easily do precisely what you ask for. It is under active development and should get better in the coming days.
It has a nice representation!
Having gone through the patch, I believe this theoretical estimation shares a common pitfall with the Megatron memory report developed by @deepakn94 and uses the same dry-run estimation approach as the DeepSpeed/ONNX flops profiler.
The two ranks of an SXM machine (A100 compute capability) running a GPT-like model show the gap between the estimated value and the actual ones: 70 (estimated) vs. roughly 20 on rank 0 and 50 on rank 1:
Model | Precision | MBS | GBS | Nodes | GPUs/Worker | DP | PP | TP | Peak Memory Actual (GB) | Peak Memory Estimated (GB) | Avg Error (%)
---|---|---|---|---|---|---|---|---|---|---|---
GPT-like (16 layers) | bf16 | 2 | 2048 | 2 | 8 | 8 | 2 | 1 | rank#0: 23.8, rank#1: 56.3 | 70 | > 75%
Reproduce this with this estimator:
***Memory demand on each GPU in the cluster***
==============================
Amount of Parameters: 431,403,008
Parameters: 0.8GB
Gradients: 1.6GB
Optimizers(Adam) States: 0.6GB
Activations: 72.7GB
Memory Requirement: 75.7GB
==============================
GAP analysis
Liveness of tensors
An activation can be simulated with an array of liveness info:
using LivenessInfo = std::map<Key, Val>;
/*
  where
    Key : [start_step, end_step]
    Val : bytes
*/
A unary accumulation op (+=) or a binary op (+) can be defined over this liveness info.
How you define "start_step" and "end_step" depends on the compiler. It does not work if two activations are simply added together.
Hence, for non-always-live tensors (activations), a special algorithm was explored (and is to be patented) for static graphs almost two years ago.
For the imperative graph in Megatron, since PyTorch's caching memory allocator does not release memory as soon as a tensor's life ends, liveness plays a great role.
This means the peak memory observed in PyTorch will be lower than what the flops profiler, this memory estimator, and the Megatron memory reporter estimate.
I have raised a ticket for this purpose, and I hope this is useful for the community.
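To make the liveness idea concrete, here is a small sketch (not the estimator's code) that derives peak memory from per-tensor [start_step, end_step] intervals instead of assuming every activation is live at once:

```python
# Sketch: peak memory from tensor liveness intervals.
# Each tensor is (start_step, end_step, nbytes) and occupies memory
# for every step in the inclusive range [start_step, end_step].
from collections import defaultdict

def peak_memory(liveness):
    deltas = defaultdict(int)
    for start, end, nbytes in liveness:
        deltas[start] += nbytes      # tensor becomes live
        deltas[end + 1] -= nbytes    # tensor is freed after end_step
    live = peak = 0
    for step in sorted(deltas):      # sweep steps in order
        live += deltas[step]
        peak = max(peak, live)
    return peak

# Two 4 GiB activations that never overlap peak at 4 GiB, not 8 GiB.
tensors = [(0, 3, 4 * 2**30), (5, 8, 4 * 2**30)]
print(peak_memory(tensors) / 2**30)  # 4.0
```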
Ranks
PP is the outermost dimension of the GPU partition groups; DP and TP are inner dimensions. We observed memory imbalance between ranks. Experts share GPUs within a DP group, and more gather/broadcast operations are needed.
Hence you cannot simply divide the total amount of parameters needed for communication to decide which GPU goes out of memory, and when.
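As an illustration of that layout, a sketch that follows the ordering stated above (PP outermost, TP innermost; actual Megatron group construction may differ):

```python
# Sketch: enumerate communication groups assuming global_rank =
# pp_rank * (dp * tp) + dp_rank * tp + tp_rank, i.e. PP outermost,
# TP innermost. This follows the comment above, not Megatron's source.
def rank_groups(pp: int, dp: int, tp: int) -> dict:
    groups = {"pp": [], "dp": [], "tp": []}
    for d in range(dp):
        for t in range(tp):
            groups["pp"].append([p * dp * tp + d * tp + t for p in range(pp)])
    for p in range(pp):
        for t in range(tp):
            groups["dp"].append([p * dp * tp + d * tp + t for d in range(dp)])
    for p in range(pp):
        for d in range(dp):
            groups["tp"].append([p * dp * tp + d * tp + t for t in range(tp)])
    return groups

# pp=2, dp=2, tp=2: TP groups are pairs of adjacent ranks.
print(rank_groups(2, 2, 2)["tp"])  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```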
----------------------------------------------------------------------------------------------------------
GPTModel
├─TransformerLanguageModel
│ └─Embedding {space()}\t{space(pad=15)}\t{memory_mega_bytes(1.5*s*b*h)}\t
Hi, I guess many of the parameter estimates for the weights are hard-coded; why don't we make a small named function for that estimation? (as in the MS DeepSpeed flops profiler)
Thank you for pointing out that it should be the GPT model.
It is based on the code base. Before training your model, you can use this tool to determine the minimum amount of memory the model will consume.
Thank you for your suggestion, I did it this way at the time to quickly develop this tool, haha. If needed, I will consider supporting related features in the future.
if args.bf16:
loss_logits_mem = 5 * s * b * v / t if p == 1 else 0
peak_mem = max(
memory_giga_bytes(total_parameters_per_gpu * (1 + 2 + 2 / d + 2)),
Hey! Thanks for this very useful script, but I am struggling to understand where this term (total_parameters_per_gpu * (1 + 2 + 2 / d + 2))) for the peak memory comes from. Would it be possible to get more details?
activations_per_gpu + loss_logits_mem))
gradient = total_parameters
gradient_per_gpu = total_parameters_per_gpu
optimizer = total_parameters * 8
Why optimizer = total_parameters * 6? For each parameter in the model, AdamW keeps track of a first-moment vector and a second-moment vector. As a result, the CUDA memory requirement for using the AdamW optimizer is approximately 2 times the memory required for the model parameters themselves.
Thanks for the super fast reply! I meant line 228, about the factor (1 + 2 + 2 / d + 2). The first terms 1 and 2 are params in bfloat16 and gradients in fp32, but I cannot derive 2 / d + 2. Are those the optimizer states distributed in some way? Where does the second term 2 come from?
Marking as stale. No activity in 60 days.
Note that if flash attention is used, the memory cost is O(b*h*s*d), not O(b*h*s*s*d).
Marking as stale. No activity in 60 days.
Introduction
Offline analysis of memory requirements and communication information of Megatron-LM GPTModel training under hybrid parallel strategies
Features
Given the GPT model configuration and parallel training configuration, this tool will output the following:
We randomly selected some parallel configurations and used the "Memory Requirement" output of this tool as the estimated value, and the output of torch.cuda.max_memory_allocated() in Megatron-LM's report_memory after training several iterations as the actual value. The parallel configurations on the x-axis of the following figure correspond, in order, to the four model parallel configurations in the table below.
This can give users insight into whether their planned parallel configuration is trainable and whether it could potentially trigger OOM errors.
Calculation Method Explanation
We analyze the memory requirements of the model parameters, gradients, and optimizer states and the communication behavior of different parallel dimensions based on Megatron (1, 2, and 3).
To estimate the memory requirements for the activation portion, given that Megatron supports FlashAttention and Fusion computations, we have adopted a distinctive approach. This method involves collecting the memory address and size information of the corresponding operations each time the cudaMalloc and cudaFree functions are executed, and then conducting line-by-line analysis of this information to derive a computational formula. To implement this method, we used the torch.cuda.CUDAPluggableAllocator to customize the memory allocator.
We will observe the changes in torch.cuda.max_memory_allocated during the model training process, then summarize these changes in order to estimate peak memory.
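As a rough illustration of that observation step (a sketch, not the tool's code; train_one_iteration is a placeholder for the actual Megatron training step):

```python
# Sketch: watch how the allocator's high-water mark evolves across
# training iterations in order to estimate peak memory.
import torch

def profile_peak_memory(train_one_iteration, num_iterations: int = 5) -> float:
    peaks = []
    for it in range(num_iterations):
        train_one_iteration()
        # Highest amount of CUDA memory ever allocated so far, in GiB.
        peaks.append(torch.cuda.max_memory_allocated() / 2**30)
        print(f"iteration {it}: peak allocated = {peaks[-1]:.2f} GiB")
    return max(peaks)
```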
Limitations
- --bf16, --fp16, --use-flash-attn, --use-distributed-optimizer, --swiglu
- --sequence-parallel, --num_layers_per_virtual_pipeline_stage, --recompute-activations
- --use-flash-attn, --use-distributed-optimizer, --swiglu, --bf16
Usage
In the examples directory, we've provided scripts to get pretraining GPT information. Users can generate their scripts by using the following command. The function of this command is to replace "torchrun $DISTRIBUTED_ARGS pretrain_gpt.py" with "python ../get_training_info.py $DISTRIBUTED_ARGS" in "pretrain_gpt_distributed_with_mp.sh", which is your script for launching the training.
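For illustration, the substitution that command performs boils down to something like the following sketch (file names follow the example above; the generated script name is hypothetical):

```python
# Sketch: rewrite the training launch script so it calls the analysis
# tool instead of torchrun. Paths mirror the example in the text.
from pathlib import Path

launch_script = Path("pretrain_gpt_distributed_with_mp.sh")
text = launch_script.read_text()
text = text.replace(
    "torchrun $DISTRIBUTED_ARGS pretrain_gpt.py",
    "python ../get_training_info.py $DISTRIBUTED_ARGS",
)
# Hypothetical output name for the generated analysis script.
Path("get_pretrain_gpt_info.sh").write_text(text)
```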
Moreover, we've added the following training parameters:
Example of output
Assuming there are two nodes, each equipped with eight cards, and training a model according to the above configuration, the following output will be produced.
Full Model without Parallel
Full model information without parallel training enabled.
Cluster Communication Summary
Given the model and parallel configuration, the total communication count and volume for each Pipeline Parallel, Data Parallel, and Tensor Parallel dimension in a single iteration, as well as the total communication count and volume for the entire cluster in the final training iteration.
Memory demand on each GPU in the cluster
Given the model and parallel configuration, the memory requirements on each GPU in the cluster for training one iteration.
Pipeline Parallel Communication
Data Parallel Communication
Tensor Parallel Communication