
Time breakdown analysis tool #617

Closed

Conversation


@Ethan-yt Ethan-yt commented Dec 5, 2023

This pull request introduces a utility that breaks down the time spent in a single training iteration, making it possible to project theoretical iteration time for model training. It lets users derive theoretical FLOPs and tune hyper-parameters for maximum performance, eliminating the need for manual trial-and-error experimentation.

Some of the calculation methodologies employed in this tool are inspired by #482.

Please note the following limitations:

  1. It is primarily based on Llama models and has not yet been tested on other architectures.
  2. It assumes common settings such as bf16 and does not yet support gradient checkpointing and other features.

If any inconsistency arises between the provided formula and the actual training process, feel free to correct me. Thank you.

example output:

notation                                          description  value
       a Number of microbatches / gradient accumulation steps    120
       b                                     Global batchsize   1200
       s                                      Sequence length   4096
       h                                          Hidden size   8192
       i                  Intermediate size / FFN hidden size  28672
       l                                     Number of layers     80
       v                                           Vocab size  32256
      nq                      Number of query attention heads     64
     nkv                  Number of key/value attention heads      8
       d                                   Data parallel size     10
       t                                 Tensor parallel size      4
       p                               Pipeline parallel size      4

Model parameters:
               name          param
          Attention 12,079,595,520
                FFN 56,371,445,760
          Layernorm      1,318,912
Embedding & LM Head    528,482,304
              Total 68,980,842,496
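The parameter counts above can be reproduced from the notation table, assuming a Llama-style architecture with grouped-query attention and a SwiGLU FFN (my reconstruction from the output, not necessarily the tool's exact code):

```python
# Shapes from the notation table above
h, i, l, v = 8192, 28672, 80, 32256    # hidden, FFN hidden, layers, vocab
nq, nkv = 64, 8                        # query / key-value attention heads

head_dim = h // nq                     # 128
kv_dim = nkv * head_dim                # 1024 (GQA: K/V projections are narrower)

# Per-layer attention: Q and O are h x h, K and V are h x kv_dim
attn = l * (2 * h * h + 2 * h * kv_dim)

# Per-layer SwiGLU FFN: gate and up projections (h x i) plus down (i x h)
ffn = l * 3 * h * i

# RMSNorm: two norms of size h per layer, plus one final norm
norm = (2 * l + 1) * h

# Untied input embedding and LM head, each v x h
emb = 2 * v * h

total = attn + ffn + norm + emb
print(f"{attn:,}  {ffn:,}  {norm:,}  {emb:,}  {total:,}")
```

Running this reproduces every row of the table, including the 68,980,842,496 total.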

Total FLOPs per iteration
                                name                     flops
Attention Q, K, V, O Transformations   356,241,767,399,424,000
             Attention Score, Values   158,329,674,399,744,000
                                 FFN 1,662,461,581,197,312,000
                             LM Head     7,792,788,661,862,400
                               Total 2,184,825,811,658,342,400
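These figures are consistent with the standard dense-layer rule of 6 FLOPs per parameter per token (2 multiply-add FLOPs in the forward pass, 4 in the backward), plus 12*b*s^2*h per layer for the attention score and value matmuls. A sketch of that accounting, reconstructed from the table rather than taken from the tool's source:

```python
b, s = 1200, 4096                      # global batch size, sequence length
h, l, v = 8192, 80, 32256              # hidden size, layers, vocab size
attn_params = 12_079_595_520           # from the parameter table above
ffn_params = 56_371_445_760

tokens = b * s                         # tokens per iteration

# Dense projections: 2 FLOPs/MAC forward + 4 backward = 6 * params * tokens
qkvo = 6 * attn_params * tokens
ffn = 6 * ffn_params * tokens

# Attention scores (Q K^T) and value aggregation: two b*s*s*h matmuls
# per layer, again times 6 for forward + backward
score_val = 12 * b * s * s * h * l

# LM head: one b*s x h x v matmul
lm_head = 6 * tokens * v * h

total = qkvo + score_val + ffn + lm_head
print(f"{total:,}")
```

The four terms match the four rows of the FLOPs table exactly. Note the embedding lookup contributes no matmul FLOPs, so only the LM head appears.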

Communication per GPU:
parallelism  size(GB)  count  time(seconds)
         TP    12.080 80.000          5.685
         DP    15.521  1.000          0.183
         PP     8.053  4.000          1.516

Time breakdown:
              name  time(seconds)
Forward / Backward         27.310
         PP Bubble          0.683
  PP Communication          1.516
  TP Communication          5.685
  DP Communication          0.183
             Total         35.376

Estimated FLOPs per second: 386.000 TFLOPs
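The headline number appears to be per-GPU throughput: total FLOPs divided by the iteration time and the d*t*p GPU count. A sketch under that assumption:

```python
total_flops = 2_184_825_811_658_342_400   # from the FLOPs table above
d, t, p = 10, 4, 4                        # data / tensor / pipeline parallel sizes
iter_time = 35.376                        # seconds, from the time breakdown

num_gpus = d * t * p                      # 160 GPUs in total
tflops_per_gpu = total_flops / (iter_time * num_gpus) / 1e12
print(f"{tflops_per_gpu:.1f} TFLOPs")
```

This evaluates to roughly 386 TFLOPs per GPU per second, matching the estimate above.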


Ethan-yt commented Dec 7, 2023

@deepakn94 @zhipeng93


Ethan-yt commented Jan 8, 2024

ping @jaredcasper @jon-barker


github-actions bot commented Mar 8, 2024

Marking as stale. No activity in 60 days.

@github-actions github-actions bot added the stale No activity in 60 days on issue or PR label Mar 8, 2024
@Ethan-yt Ethan-yt closed this May 20, 2024