I generated a workload in which each model instance processes two microbatches per iteration:
HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 1 pp: 1 vpp: 2 ga: 2 all_gpus: 4 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0
24
grad_gather -1 1 NONE 0 1 NONE 0 1 ALLGATHER 158859264 100
grad_param_comm -1 1 NONE 0 1 NONE 0 1 REDUCESCATTER 317718528 100
grad_param_compute -1 1 NONE 0 34021000 NONE 0 1 NONE 0 100
layernorm -1 1 NONE 0 1 ALLREDUCE 158859264 1 NONE 0 100
embedding_grads -1 1 NONE 0 1 ALLREDUCE 8388608 1 NONE 0 100
moe_grad_norm1 -1 1 NONE 0 1 NONE 0 1 ALLGATHER_DP_EP 0 100
moe_grad_norm2 -1 1 NONE 0 1 NONE 0 1 REDUCESCATTER_DP_EP 0 100
embedding_layer -1 799000 ALLREDUCE 8388608 1 NONE 0 17374000 NONE 0 100
attention_layer -1 1820000 ALLREDUCE 8388608 1820000 NONE 0 1820000 NONE 0 100
mlp_layer -1 2478000 ALLREDUCE 8388608 2478000 NONE 0 2478000 NONE 0 100
attention_layer -1 1820000 ALLREDUCE 8388608 1820000 NONE 0 1820000 NONE 0 100
mlp_layer -1 2478000 ALLREDUCE 8388608 2478000 NONE 0 2478000 NONE 0 100
embedding_layer -1 799000 ALLREDUCE 8388608 1 NONE 0 2478000 NONE 0 100
attention_layer -1 1820000 ALLREDUCE 8388608 1820000 NONE 0 1820000 NONE 0 100
mlp_layer -1 2478000 ALLREDUCE 8388608 2478000 NONE 0 2478000 NONE 0 100
attention_layer -1 1820000 ALLREDUCE 8388608 1820000 NONE 0 1820000 NONE 0 100
mlp_layer -1 2478000 ALLREDUCE 8388608 2478000 NONE 0 2478000 NONE 0 100
cross_entropy1 -1 0 ALLREDUCE 16384 0 NONE 0 0 NONE 0 100
cross_entropy2 -1 0 ALLREDUCE 16384 0 NONE 0 0 NONE 0 100
cross_entropy3 -1 0 ALLREDUCE 16384 0 NONE 0 0 NONE 0 100
optimizer1 -1 0 ALLREDUCE 16384 0 NONE 0 0 NONE 0 100
optimizer2 -1 0 ALLREDUCE 16384 0 NONE 0 0 NONE 0 100
optimizer3 -1 0 ALLREDUCE 16384 0 NONE 0 0 NONE 0 100
optimizer4 -1 0 ALLREDUCE 16384 0 NONE 0 0 NONE 0 100
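For reference, the `ga` field in the header is what encodes the two microbatches. Below is a minimal sketch of pulling the header fields out; it assumes only that the header is a sequence of `key: value` pairs as shown above, and the reading of `ga` as the gradient-accumulation (microbatch) count is my interpretation, not something defined by the simulator:

```python
import re

# Header line copied from the workload above; the regex only extracts the visible
# `key: value` pairs and assumes nothing else about the file format.
header = ("HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 1 pp: 1 "
          "vpp: 2 ga: 2 all_gpus: 4 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0")

fields = {k: int(v) for k, v in re.findall(r"(\w+):\s*(\d+)", header)}
print(fields["ga"])  # 2 -> two microbatches (gradient-accumulation steps) per iteration
```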
From inspecting Workload::iterate_hybrid_parallel_Transformer_fwd_in_bckwd, my current understanding of the execution order is (a small runnable sketch of this schedule follows the list):
- Forward pass for microbatch 0
- Forward pass for microbatch 1
- Compute loss (presumably for both microbatches since both forward passes are complete)
- Optimizer steps (shouldn't this happen after gradients are synced?)
- Backward pass for microbatch 1
- Backward pass for microbatch 0 (gradients are accumulated)
- Gradient synchronization across model instances
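For concreteness, here is the schedule I believe I am seeing, written as a minimal runnable sketch; the function name and stage labels are illustrative and are not the simulator's actual API:

```python
# Sketch of the order I observe in iterate_hybrid_parallel_Transformer_fwd_in_bckwd;
# stage labels are illustrative, not the simulator's output.
def observed_iteration(num_microbatches: int) -> list[str]:
    trace = []
    for mb in range(num_microbatches):
        trace.append(f"forward(mb={mb})")        # all forward passes first
    trace.append("compute_loss()")               # loss after both forwards complete
    trace.append("optimizer_step()")             # before backward and gradient sync
    for mb in reversed(range(num_microbatches)):
        trace.append(f"backward(mb={mb})")       # gradients accumulate locally
    trace.append("all_reduce_gradients()")       # sync across model instances (DP)
    return trace

print("\n".join(observed_iteration(2)))
```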
However, based on my understanding of typical microbatching with gradient accumulation, I would expect the following order (sketched after this list as well):
- Forward pass for microbatch 0
- Compute loss for microbatch 0
- Backward pass for microbatch 0
- Forward pass for microbatch 1 (activations for microbatch 0 can now be discarded)
- Compute loss for microbatch 1
- Backward pass for microbatch 1 (accumulates gradients)
- Gradient synchronization across model instances
- Optimizer step
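And the interleaved order I would expect, in the same sketch form (again, names are illustrative; this is standard non-pipelined gradient accumulation, not the simulator's code):

```python
# Sketch of the conventional gradient-accumulation order I would expect;
# stage labels are illustrative.
def expected_iteration(num_microbatches: int) -> list[str]:
    trace = []
    for mb in range(num_microbatches):
        trace.append(f"forward(mb={mb})")
        trace.append(f"compute_loss(mb={mb})")
        trace.append(f"backward(mb={mb})")       # activations for mb can be freed here
    trace.append("all_reduce_gradients()")       # sync accumulated grads across model instances
    trace.append("optimizer_step()")             # only after gradients are synced
    return trace

print("\n".join(expected_iteration(2)))
```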
Scenario 1 (the current behavior, all forward passes before any backward pass): both microbatches' activations must be stored concurrently, increasing the activation memory footprint.
Scenario 2 (the interleaved order above): only one microbatch's activations need to be stored at a time, reducing the activation memory footprint.
In both scenarios the gradient memory requirement is unchanged, since gradients are accumulated into the same buffers either way.
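To put a rough number on the activation difference, with a hypothetical per-microbatch activation size (the 2 GiB figure is purely illustrative):

```python
# Back-of-the-envelope peak activation footprint for the two scenarios.
activation_bytes_per_mb = 2 * 1024**3  # hypothetical: 2 GiB of activations per microbatch
ga = 2                                 # microbatches per iteration (ga: 2 in the header)

peak_scenario_1 = ga * activation_bytes_per_mb  # all forwards first: ga microbatches live at once
peak_scenario_2 = 1 * activation_bytes_per_mb   # interleaved fwd/bwd: one microbatch live at a time

print(peak_scenario_1 / 1024**3, "GiB vs", peak_scenario_2 / 1024**3, "GiB")  # 4.0 GiB vs 2.0 GiB
```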
- Am I understanding the current behavior correctly? If so, is there a reason the forward passes of all microbatches are done before backward passes?
- Why are optimizer steps scheduled before the backward pass and gradient sync? Wouldn’t this lead to stale or incomplete updates?