
Questions about GA logic #157

@Ricefrog

Description

I generated a workload in which each model instance processes two microbatches per iteration:

HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 1 pp: 1 vpp: 2 ga: 2 all_gpus: 4 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0
24
grad_gather     -1      1       NONE    0       1       NONE    0       1       ALLGATHER       158859264       100
grad_param_comm -1      1       NONE    0       1       NONE    0       1       REDUCESCATTER   317718528       100
grad_param_compute      -1      1       NONE    0       34021000        NONE    0       1       NONE    0       100
layernorm       -1      1       NONE    0       1       ALLREDUCE       158859264       1       NONE    0       100
embedding_grads -1      1       NONE    0       1       ALLREDUCE       8388608 1       NONE    0       100
moe_grad_norm1  -1      1       NONE    0       1       NONE    0       1       ALLGATHER_DP_EP 0       100
moe_grad_norm2  -1      1       NONE    0       1       NONE    0       1       REDUCESCATTER_DP_EP     0       100
embedding_layer -1      799000  ALLREDUCE       8388608 1       NONE    0       17374000        NONE    0       100
attention_layer -1      1820000 ALLREDUCE       8388608 1820000 NONE    0       1820000 NONE    0       100
mlp_layer       -1      2478000 ALLREDUCE       8388608 2478000 NONE    0       2478000 NONE    0       100
attention_layer -1      1820000 ALLREDUCE       8388608 1820000 NONE    0       1820000 NONE    0       100
mlp_layer       -1      2478000 ALLREDUCE       8388608 2478000 NONE    0       2478000 NONE    0       100
embedding_layer -1      799000  ALLREDUCE       8388608 1       NONE    0       2478000 NONE    0       100
attention_layer -1      1820000 ALLREDUCE       8388608 1820000 NONE    0       1820000 NONE    0       100
mlp_layer       -1      2478000 ALLREDUCE       8388608 2478000 NONE    0       2478000 NONE    0       100
attention_layer -1      1820000 ALLREDUCE       8388608 1820000 NONE    0       1820000 NONE    0       100
mlp_layer       -1      2478000 ALLREDUCE       8388608 2478000 NONE    0       2478000 NONE    0       100
cross_entropy1  -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100
cross_entropy2  -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100
cross_entropy3  -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100
optimizer1      -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100
optimizer2      -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100
optimizer3      -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100
optimizer4      -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100

From inspecting Workload::iterate_hybrid_parallel_Transformer_fwd_in_bckwd, my current understanding of the execution order is as follows (see the sketch after this list):

  1. Forward pass for microbatch 0
  2. Forward pass for microbatch 1
  3. Compute loss (presumably for both microbatches since both forward passes are complete)
  4. Optimizer steps (shouldn't this happen after gradients are synced?)
  5. Backward pass for microbatch 1
  6. Backward pass for microbatch 0 (gradients are accumulated)
  7. Gradient synchronization across model instances
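
For concreteness, here is a minimal pseudocode sketch of that order as I read it from the code, for ga = 2. The stub functions are placeholders of my own, not the simulator's actual API:

```cpp
// Sketch of the order I believe iterate_hybrid_parallel_Transformer_fwd_in_bckwd
// follows for ga = 2 (placeholder stubs, not ASTRA-sim's real interface).
#include <cstdio>

static void forward_pass(int mb)  { std::printf("forward pass, microbatch %d\n", mb); }
static void backward_pass(int mb) { std::printf("backward pass, microbatch %d\n", mb); }
static void compute_loss()        { std::printf("compute loss\n"); }
static void optimizer_step()      { std::printf("optimizer step\n"); }
static void gradient_sync()       { std::printf("gradient sync across model instances\n"); }

int main() {
    const int ga = 2;
    for (int mb = 0; mb < ga; ++mb)
        forward_pass(mb);          // all forward passes first
    compute_loss();                // loss only after every forward pass
    optimizer_step();              // scheduled before any backward pass?
    for (int mb = ga - 1; mb >= 0; --mb)
        backward_pass(mb);         // gradients accumulated locally
    gradient_sync();               // sync across the data-parallel group
    return 0;
}
```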

However, based on my understanding of typical microbatching with gradient accumulation, I would expect the following order (again sketched after the list):

  1. Forward pass for microbatch 0
  2. Compute loss for microbatch 0
  3. Backward pass for microbatch 0
  4. Forward pass for microbatch 1 (activations for microbatch 0 can now be discarded)
  5. Compute loss for microbatch 1
  6. Backward pass for microbatch 1 (accumulates gradients)
  7. Gradient synchronization across model instances
  8. Optimizer step
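
The same kind of sketch for the schedule I would have expected. Again, this is only an illustration of the ordering, not a claim about how the implementation should literally look; the stubs are redeclared so the sketch compiles on its own:

```cpp
// Sketch of the per-microbatch schedule I would expect with gradient
// accumulation (placeholder stubs of my own, redeclared for self-containment).
#include <cstdio>

static void forward_pass(int mb)  { std::printf("forward pass, microbatch %d\n", mb); }
static void compute_loss(int mb)  { std::printf("compute loss, microbatch %d\n", mb); }
static void backward_pass(int mb) { std::printf("backward pass, microbatch %d\n", mb); }
static void gradient_sync()       { std::printf("gradient sync across model instances\n"); }
static void optimizer_step()      { std::printf("optimizer step\n"); }

int main() {
    const int ga = 2;
    for (int mb = 0; mb < ga; ++mb) {
        forward_pass(mb);    // only microbatch mb's activations are live
        compute_loss(mb);
        backward_pass(mb);   // accumulate into the gradient buffer, then free mb's activations
    }
    gradient_sync();         // one sync over the accumulated gradients
    optimizer_step();        // weight update happens last
    return 0;
}
```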

Scenario 1 (the order I see in the code): both microbatches' activations must be stored concurrently, increasing the activation memory footprint.
Scenario 2 (the order I would expect): only one microbatch's activations need to be stored at a time, reducing the activation memory footprint.

In both scenarios, the gradient memory requirement is the same.
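
To make the difference concrete (a back-of-the-envelope estimate with symbols of my own choosing, not numbers taken from the workload above): let $A$ be the activation memory of one microbatch and $|\theta|$ the size of the gradient buffer.

$$
M_{\text{act}}^{\text{(scenario 1)}} \approx ga \cdot A = 2A,
\qquad
M_{\text{act}}^{\text{(scenario 2)}} \approx A,
\qquad
M_{\text{grad}} \approx |\theta| \ \text{in both.}
$$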

  1. Am I understanding the current behavior correctly? If so, is there a reason all microbatches' forward passes are done before any backward pass?
  2. Why are the optimizer steps scheduled before the backward passes and the gradient sync? Wouldn't this lead to stale or incomplete updates?
