
Questions about GA logic #157

@Ricefrog

Description

I generated a workload in which each model instance processes two microbatches per iteration:

HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 1 pp: 1 vpp: 2 ga: 2 all_gpus: 4 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0
24
grad_gather     -1      1       NONE    0       1       NONE    0       1       ALLGATHER       158859264       100
grad_param_comm -1      1       NONE    0       1       NONE    0       1       REDUCESCATTER   317718528       100
grad_param_compute      -1      1       NONE    0       34021000        NONE    0       1       NONE    0       100
layernorm       -1      1       NONE    0       1       ALLREDUCE       158859264       1       NONE    0       100
embedding_grads -1      1       NONE    0       1       ALLREDUCE       8388608 1       NONE    0       100
moe_grad_norm1  -1      1       NONE    0       1       NONE    0       1       ALLGATHER_DP_EP 0       100
moe_grad_norm2  -1      1       NONE    0       1       NONE    0       1       REDUCESCATTER_DP_EP     0       100
embedding_layer -1      799000  ALLREDUCE       8388608 1       NONE    0       17374000        NONE    0       100
attention_layer -1      1820000 ALLREDUCE       8388608 1820000 NONE    0       1820000 NONE    0       100
mlp_layer       -1      2478000 ALLREDUCE       8388608 2478000 NONE    0       2478000 NONE    0       100
attention_layer -1      1820000 ALLREDUCE       8388608 1820000 NONE    0       1820000 NONE    0       100
mlp_layer       -1      2478000 ALLREDUCE       8388608 2478000 NONE    0       2478000 NONE    0       100
embedding_layer -1      799000  ALLREDUCE       8388608 1       NONE    0       2478000 NONE    0       100
attention_layer -1      1820000 ALLREDUCE       8388608 1820000 NONE    0       1820000 NONE    0       100
mlp_layer       -1      2478000 ALLREDUCE       8388608 2478000 NONE    0       2478000 NONE    0       100
attention_layer -1      1820000 ALLREDUCE       8388608 1820000 NONE    0       1820000 NONE    0       100
mlp_layer       -1      2478000 ALLREDUCE       8388608 2478000 NONE    0       2478000 NONE    0       100
cross_entropy1  -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100
cross_entropy2  -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100
cross_entropy3  -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100
optimizer1      -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100
optimizer2      -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100
optimizer3      -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100
optimizer4      -1      0       ALLREDUCE       16384   0       NONE    0       0       NONE    0       100

From inspecting Workload::iterate_hybrid_parallel_Transformer_fwd_in_bckwd, my current understanding of the execution order is as follows (see the sketch after this list):

  1. Forward pass for microbatch 0
  2. Forward pass for microbatch 1
  3. Compute loss (presumably for both microbatches since both forward passes are complete)
  4. Optimizer steps (shouldn't this happen after gradients are synced?)
  5. Backward pass for microbatch 1
  6. Backward pass for microbatch 0 (gradients are accumulated)
  7. Gradient synchronization across model instances
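
For concreteness, here is a minimal pseudocode sketch of that order as I read it from the code, for ga = 2. The stub functions are placeholders of my own, not the simulator's actual API:

```cpp
// Sketch of the order I believe iterate_hybrid_parallel_Transformer_fwd_in_bckwd
// follows for ga = 2 (placeholder stubs, not ASTRA-sim's real interface).
#include <cstdio>

static void forward_pass(int mb)  { std::printf("forward pass, microbatch %d\n", mb); }
static void backward_pass(int mb) { std::printf("backward pass, microbatch %d\n", mb); }
static void compute_loss()        { std::printf("compute loss\n"); }
static void optimizer_step()      { std::printf("optimizer step\n"); }
static void gradient_sync()       { std::printf("gradient sync across model instances\n"); }

int main() {
    const int ga = 2;
    for (int mb = 0; mb < ga; ++mb)
        forward_pass(mb);          // all forward passes first
    compute_loss();                // loss only after every forward pass
    optimizer_step();              // scheduled before any backward pass?
    for (int mb = ga - 1; mb >= 0; --mb)
        backward_pass(mb);         // gradients accumulated locally
    gradient_sync();               // sync across the data-parallel group
    return 0;
}
```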

However, based on my understanding of typical microbatching with gradient accumulation, I would expect the following order (again sketched after the list):

  1. Forward pass for microbatch 0
  2. Compute loss for microbatch 0
  3. Backward pass for microbatch 0
  4. Forward pass for microbatch 1 (activations for microbatch 0 can now be discarded)
  5. Compute loss for microbatch 1
  6. Backward pass for microbatch 1 (accumulates gradients)
  7. Gradient synchronization across model instances
  8. Optimizer step
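
The same kind of sketch for the schedule I would have expected. Again, this is only an illustration of the ordering, not a claim about how the implementation should literally look; the stubs are redeclared so the sketch compiles on its own:

```cpp
// Sketch of the per-microbatch schedule I would expect with gradient
// accumulation (placeholder stubs of my own, redeclared for self-containment).
#include <cstdio>

static void forward_pass(int mb)  { std::printf("forward pass, microbatch %d\n", mb); }
static void compute_loss(int mb)  { std::printf("compute loss, microbatch %d\n", mb); }
static void backward_pass(int mb) { std::printf("backward pass, microbatch %d\n", mb); }
static void gradient_sync()       { std::printf("gradient sync across model instances\n"); }
static void optimizer_step()      { std::printf("optimizer step\n"); }

int main() {
    const int ga = 2;
    for (int mb = 0; mb < ga; ++mb) {
        forward_pass(mb);    // only microbatch mb's activations are live
        compute_loss(mb);
        backward_pass(mb);   // accumulate into the gradient buffer, then free mb's activations
    }
    gradient_sync();         // one sync over the accumulated gradients
    optimizer_step();        // weight update happens last
    return 0;
}
```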

Scenario 1 (the order I see in the code): both microbatches' activations must be stored concurrently, increasing the activation memory footprint.
Scenario 2 (the order I would expect): only one microbatch's activations need to be stored at a time, reducing the activation memory footprint.

In both scenarios, the gradient memory requirement is the same.
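
To make the difference concrete (a back-of-the-envelope estimate with symbols of my own choosing, not numbers taken from the workload above): let $A$ be the activation memory of one microbatch and $|\theta|$ the size of the gradient buffer.

$$
M_{\text{act}}^{\text{(scenario 1)}} \approx ga \cdot A = 2A,
\qquad
M_{\text{act}}^{\text{(scenario 2)}} \approx A,
\qquad
M_{\text{grad}} \approx |\theta| \ \text{in both.}
$$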

  1. Am I understanding the current behavior correctly? If so, is there a reason all microbatches' forward passes are done before any backward pass?
  2. Why are the optimizer steps scheduled before the backward passes and the gradient sync? Wouldn't this lead to stale or incomplete updates?
