
Time breakdown analysis tool #617

Closed

Conversation


@Ethan-yt Ethan-yt commented Dec 5, 2023

This pull request introduces a utility that breaks down the time spent in a single training iteration, making it possible to project theoretical iteration time for model training. It lets users derive theoretical FLOPs and tune hyper-parameters for maximum performance, eliminating the need for manual trial-and-error experimentation.

Some of the calculation methodologies employed in this tool are inspired by #482.

Please note the following limitations:

  1. It is primarily based on Llama models and has not yet been tested on other architectures.
  2. It assumes common settings such as bf16 and does not yet support gradient checkpointing and other features.

If any inconsistency arises between the provided formula and the actual training process, feel free to correct me. Thank you.

example output:

notation                                          description  value
       a Number of microbatches / gradient accumulation steps    120
       b                                     Global batchsize   1200
       s                                      Sequence length   4096
       h                                          Hidden size   8192
       i                  Intermediate size / FFN hidden size  28672
       l                                     Number of layers     80
       v                                           Vocab size  32256
      nq                      Number of query attention heads     64
     nkv                  Number of key/value attention heads      8
       d                                   Data parallel size     10
       t                                 Tensor parallel size      4
       p                               Pipeline parallel size      4

Model parameters:
               name          param
          Attention 12,079,595,520
                FFN 56,371,445,760
          Layernorm      1,318,912
Embedding & LM Head    528,482,304
              Total 68,980,842,496
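The parameter counts above can be reproduced from the notation table, assuming a Llama-style architecture with grouped-query attention and a SwiGLU FFN (my reconstruction from the output, not necessarily the tool's exact code):

```python
# Shapes from the notation table above
h, i, l, v = 8192, 28672, 80, 32256    # hidden, FFN hidden, layers, vocab
nq, nkv = 64, 8                        # query / key-value attention heads

head_dim = h // nq                     # 128
kv_dim = nkv * head_dim                # 1024 (GQA: K/V projections are narrower)

# Per-layer attention: Q and O are h x h, K and V are h x kv_dim
attn = l * (2 * h * h + 2 * h * kv_dim)

# Per-layer SwiGLU FFN: gate and up projections (h x i) plus down (i x h)
ffn = l * 3 * h * i

# RMSNorm: two norms of size h per layer, plus one final norm
norm = (2 * l + 1) * h

# Untied input embedding and LM head, each v x h
emb = 2 * v * h

total = attn + ffn + norm + emb
print(f"{attn:,}  {ffn:,}  {norm:,}  {emb:,}  {total:,}")
```

Running this reproduces every row of the table, including the 68,980,842,496 total.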

Total FLOPs per iteration
                                name                     flops
Attention Q, K, V, O Transformations   356,241,767,399,424,000
             Attention Score, Values   158,329,674,399,744,000
                                 FFN 1,662,461,581,197,312,000
                             LM Head     7,792,788,661,862,400
                               Total 2,184,825,811,658,342,400
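These figures are consistent with the standard dense-layer rule of 6 FLOPs per parameter per token (2 multiply-add FLOPs in the forward pass, 4 in the backward), plus 12*b*s^2*h per layer for the attention score and value matmuls. A sketch of that accounting, reconstructed from the table rather than taken from the tool's source:

```python
b, s = 1200, 4096                      # global batch size, sequence length
h, l, v = 8192, 80, 32256              # hidden size, layers, vocab size
attn_params = 12_079_595_520           # from the parameter table above
ffn_params = 56_371_445_760

tokens = b * s                         # tokens per iteration

# Dense projections: 2 FLOPs/MAC forward + 4 backward = 6 * params * tokens
qkvo = 6 * attn_params * tokens
ffn = 6 * ffn_params * tokens

# Attention scores (Q K^T) and value aggregation: two b*s*s*h matmuls
# per layer, again times 6 for forward + backward
score_val = 12 * b * s * s * h * l

# LM head: one b*s x h x v matmul
lm_head = 6 * tokens * v * h

total = qkvo + score_val + ffn + lm_head
print(f"{total:,}")
```

The four terms match the four rows of the FLOPs table exactly. Note the embedding lookup contributes no matmul FLOPs, so only the LM head appears.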

Communication per GPU:
parallelism  size(GB)  count  time(seconds)
         TP    12.080 80.000          5.685
         DP    15.521  1.000          0.183
         PP     8.053  4.000          1.516

Time breakdown:
              name  time(seconds)
Forward / Backward         27.310
         PP Bubble          0.683
  PP Communication          1.516
  TP Communication          5.685
  DP Communication          0.183
             Total         35.376

Estimated FLOPs per second: 386.000 TFLOPs
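The headline number appears to be per-GPU throughput: total FLOPs divided by the iteration time and the d*t*p GPU count. A sketch under that assumption:

```python
total_flops = 2_184_825_811_658_342_400   # from the FLOPs table above
d, t, p = 10, 4, 4                        # data / tensor / pipeline parallel sizes
iter_time = 35.376                        # seconds, from the time breakdown

num_gpus = d * t * p                      # 160 GPUs in total
tflops_per_gpu = total_flops / (iter_time * num_gpus) / 1e12
print(f"{tflops_per_gpu:.1f} TFLOPs")
```

This evaluates to roughly 386 TFLOPs per GPU per second, matching the estimate above.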


Ethan-yt commented Dec 7, 2023

@deepakn94 @zhipeng93


Ethan-yt commented Jan 8, 2024

ping @jaredcasper @jon-barker


github-actions bot commented Mar 8, 2024

Marking as stale. No activity in 60 days.

@github-actions github-actions bot added the stale No activity in 60 days on issue or PR label Mar 8, 2024
@Ethan-yt Ethan-yt closed this May 20, 2024