SimuMax relies on three core input files: system, strategy, and model.
The strategy file defines training-side runtime choices such as TP / PP / EP, world size, batching, recompute, and VPP-related settings.
The strategy file is where SimuMax most directly mirrors Megatron runtime choices. If a real run and a strategy file disagree on PP / EP / TP, sequence parallelism, recompute, or VPP-related settings, both timing and memory can drift.
Do not start from an empty file unless you have to.
Recommended path:
- Copy the nearest existing JSON from configs/strategy.
- Keep `seq_len`, `micro_batch_size`, and `micro_batch_num` simple first.
- Make the parallel sizes legal before touching recompute or VPP.
- Only enable `interleaving_size > 1` after the non-VPP strategy is already working.
Examples:
- dense TP/PP baseline: configs/strategy/tp1_pp2_dp4_mbs1.json
- MoE EP baseline: configs/strategy/ep8_pp1_dp8_mbs1.json
If you already have a reasonable starting strategy JSON, it is often easier to:
- keep the parallel strategy fixed and search `micro_batch_size` / `micro_batch_num`
- then search a small `tp` / `pp` space around the nearest existing config (sketched below)
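A minimal sketch of that two-stage sweep in plain Python. It only builds candidate strategy dicts; feeding each candidate to SimuMax and ranking the results is left out, since the exact entry point depends on how you drive the simulator:

```python
import copy
import json

# Start from the nearest existing strategy JSON.
with open("configs/strategy/tp1_pp2_dp4_mbs1.json") as f:
    base = json.load(f)

candidates = []

# Stage 1: keep the parallel strategy fixed, sweep the batching knobs.
for mbs in (1, 2, 4):
    for mbn in (4, 8, 16):
        s = copy.deepcopy(base)
        s["micro_batch_size"] = mbs
        s["micro_batch_num"] = mbn
        candidates.append(s)

# Stage 2: a small tp/pp neighborhood around the starting config.
for tp, pp in ((1, 2), (2, 2), (2, 1)):
    s = copy.deepcopy(base)
    s["tp_size"], s["pp_size"] = tp, pp
    candidates.append(s)

# Each candidate would then be written back out and fed to SimuMax;
# drop illegal ones first (see the divisibility rules below).
print(len(candidates), "candidate strategies")
```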
Search note:
- `gmi_error` is a simple per-rank memory margin in GiB for NCCL buffers and other runtime overheads that are not modeled explicitly
- start with a conservative value such as `10` on a new machine, then tighten it only after comparing against real memory usage (see the sketch below)
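As an illustration only, here is how a GiB margin like `gmi_error` can enter a per-rank feasibility check; the exact formula SimuMax uses internally may differ:

```python
# Illustration only: how a GiB margin like gmi_error can enter a
# per-rank feasibility check. SimuMax's internal formula may differ.
def fits(predicted_peak_gib: float,
         gpu_capacity_gib: float = 80.0,
         gmi_error: float = 10.0) -> bool:
    # Reserve gmi_error GiB for NCCL buffers and other overheads
    # that the simulator does not model explicitly.
    return predicted_peak_gib + gmi_error <= gpu_capacity_gib

print(fits(65.0))  # True: 65 + 10 <= 80
print(fits(75.0))  # False: 75 + 10 > 80
```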
```json
{
  "seq_len": 4096,
  "micro_batch_size": 1,
  "micro_batch_num": 8,
  "dtype": "bf16",
  "world_size": 8,
  "tp_size": 1,
  "pp_size": 1,
  "ep_size": 1,
  "etp_size": 1,
  "enable_sequence_parallel": false,
  "interleaving_size": 1,
  "zero_state": 1,
  "enable_dropout": false,
  "use_flash_sdp": true,
  "enable_recompute": false,
  "mem_factor": 0.94
}
```
Start simple:
- dense model, no VPP, no recompute
- `world_size=8, tp=1, pp=1, ep=1, cp=1`; this means the remaining parallelism is pure data parallel
Then add complexity one step at a time:
- increase `tp_size` if a single layer is too large
- increase `pp_size` if the whole model is too large
- use `ep_size` only for MoE models
- use `interleaving_size > 1` only after ordinary PP works (illustrated below)
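As a rough illustration of that stepping order, each step below changes exactly one knob relative to the previous one (values are examples, not recommendations):

```python
# Illustrative stepping order only; each step changes one knob at a time
# and keeps everything else fixed. Values are examples, not recommendations.
steps = [
    {"tp_size": 1, "pp_size": 1, "ep_size": 1, "interleaving_size": 1},  # pure DP baseline
    {"tp_size": 2, "pp_size": 1, "ep_size": 1, "interleaving_size": 1},  # single layer too large -> TP
    {"tp_size": 2, "pp_size": 2, "ep_size": 1, "interleaving_size": 1},  # whole model too large -> PP
    {"tp_size": 2, "pp_size": 2, "ep_size": 2, "interleaving_size": 1},  # MoE model -> EP
    {"tp_size": 2, "pp_size": 2, "ep_size": 2, "interleaving_size": 2},  # only now enable VPP
]
for step in steps:
    print(step)
```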
Fields you will usually set explicitly:
- `seq_len`
- `micro_batch_size`
- `micro_batch_num`
- `world_size`
- `tp_size`
- `pp_size`
- `ep_size`
- `etp_size`
- `dtype`
Fields many users can leave at the shipped defaults at first:
- `zero_state`
- `enable_dropout`
- `mem_factor`
- most `use_fused_*` toggles
- most recompute sub-switches
The most common dense relation is:
`dp = world_size / (tp * pp * cp)`
So for a legal dense config:
`world_size` must be divisible by `tp * pp * cp`
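A worked example of the dense relation in plain Python:

```python
# Dense legality: world_size must factor cleanly into tp * pp * cp;
# whatever is left over becomes the data-parallel degree.
world_size, tp, pp, cp = 8, 2, 2, 1

assert world_size % (tp * pp * cp) == 0, "illegal dense split"
dp = world_size // (tp * pp * cp)
print(f"dp = {dp}")  # dp = 2
```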
For MoE, SimuMax also checks:
`world_size % (ep * etp * pp) == 0`
So a practical rule is:
- get a legal dense `tp/pp/cp` split first
- then add `ep`
- then check model-specific expert divisibility such as `expert_num % ep == 0` (see the sketch below)
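The full legality check then stacks the dense rule, the MoE rule, and the expert divisibility, for example (`expert_num` is an illustrative model-side value, not a strategy field):

```python
# Stacked legality checks: dense split, MoE split, expert divisibility.
# expert_num is an illustrative model-side value, not a strategy field.
world_size, tp, pp, cp = 8, 1, 1, 1
ep, etp = 8, 1
expert_num = 64

assert world_size % (tp * pp * cp) == 0, "illegal dense split"
assert world_size % (ep * etp * pp) == 0, "illegal MoE split"
assert expert_num % ep == 0, "expert_num must be divisible by ep"
print("strategy is legal")
```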
- `seq_len`: sequence length (number of tokens)
- `micro_batch_size`: micro-batch size (number of samples processed per forward pass)
- `micro_batch_num`: number of micro-batches for gradient accumulation
- `dtype`: computation data type, default is "bf16" (bfloat16)
- Whether to use fp8 mixed-precision training, default is false
- `world_size`: total number of GPUs, default is 8
- `tp_size`: Tensor Parallelism size, default is 1
- `pp_size`: Pipeline Parallelism size, vertically splits model layers, default is 1
- `ep_size`: Expert Parallelism size, used for MoE models, default is 1
- `etp_size`: Expert Tensor Parallelism size, default is 1
- Routing strategy for MoE models, default is "all2all"; "all2all-seq" is deprecated and is downgraded to "all2all" with a warning
- `enable_sequence_parallel`: whether to enable sequence parallelism, default is true; effective only when tp_size > 1
- Controls the number of layers contained in the first and last Pipeline Parallel stages, default is None
- `interleaving_size`: virtual pipeline (VPP) size; keep it at 1 for the first working strategy. When interleaving_size > 1, pp_size must also be greater than 1
- `zero_state`: ZeRO optimization stage; currently only zero0 and zero1 are supported, default is 1
- Whether to use bf16 for gradient reduction, default is false
- Whether to use weight-accumulation fusion (reduces temporary variables), default is true
- Whether to cache FP8 inputs for groupgemm, default is false
- Whether to offload groupgemm inputs to CPU, default is false
- `enable_recompute`: global switch for recompute, default is true
- `recompute_granularity`: granularity of recompute, options are "full_block" and "selective_recompute", default is None
- `recompute_layer_num`: number of layers to recompute, default is 0
- `attn_recompute`: recompute for the attention module, default is false
- Recompute for MLA's rmsnorm and q/k up-projection, default is false
- `mlp_recompute`: recompute for MLP and groupedgemm, default is false
- Recompute for rmsnorm + router + shared expert, default is false
- Whether to remove redundant forward computation for the last module in a recompute checkpoint, default is false; when recompute_granularity is "selective_recompute", setting this to true is recommended to save computation time
Megatron-LM 0.14 introduced selective recompute based on `discard_output`. Enable this mode with `megatron_recompute=true` and list the modules whose outputs are discarded in `megatron_recompute_modules`.
Example:
```json
{
  "enable_recompute": true,
  "recompute_granularity": "selective_recompute",
  "recompute_layer_num": 12,
  "megatron_recompute": true,
  "megatron_recompute_modules": ["layernorm", "mlp"]
}
```
Supported module names are `layernorm`, `mla_up_proj`, `moe_act`, `mlp`, and `moe`. `core_attn` is reserved but not supported yet. This mode is mutually exclusive with the legacy selective flags such as `attn_recompute` and `mlp_recompute`; evaluate these strategies explicitly rather than through the current search helper.
- Attention sparse ratio (0.0 means dense attention), default is 0.0
- `use_flash_sdp`: use FlashAttention acceleration
- `cross_entropy_loss_fusion`: whether to enable fused cross entropy in SimuMax, default is false
Megatron mapping:
- SimuMax strategy field: `cross_entropy_loss_fusion=true`
- common shorthand in this repo: `ce_fusion`
- common case-name suffix in retained result tables: `_cef`
For Megatron real runs, this shorthand means enabling both:
- `--cross-entropy-loss-fusion`
- `--cross-entropy-fusion-impl te`
So `ce_fusion` / `_cef` in repo materials should be read as:
- `cross_entropy_loss_fusion=True`
- the TE fused CE implementation
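For reference, the same mapping expressed as data (all names taken from the mapping above):

```python
# The ce_fusion shorthand, expressed as data (names from the mapping above).
simumax_strategy_fragment = {"cross_entropy_loss_fusion": True}
megatron_flags = ["--cross-entropy-loss-fusion", "--cross-entropy-fusion-impl", "te"]
result_table_suffix = "_cef"

print(simumax_strategy_fragment, " ".join(megatron_flags), result_table_suffix)
```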
- Various fused-kernel optimizations (the `use_fused_*` toggles)
- `enable_dropout`: whether to enable dropout regularization, default is false
- Network communication strategies for the various parallel dimensions, default is "auto" (selected automatically based on cluster scale and parallel strategy)
- `dispatch_probs`: Megatron-LM-related parameter for the MoE probs ownership path.
  - Megatron-LM 0.14 and later: use `dispatch_probs=true`
  - Megatron-LM 0.12 and earlier: use `dispatch_probs=false`
  - For intermediate or locally patched runtimes, confirm the actual MoE path before choosing the flag.
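A hedged sketch of that decision rule; version parsing is simplified and intermediate versions are deliberately left unresolved:

```python
# Choose dispatch_probs from the Megatron-LM version, per the rule above.
def choose_dispatch_probs(megatron_version: str) -> bool | None:
    major, minor = (int(x) for x in megatron_version.split(".")[:2])
    if (major, minor) >= (0, 14):
        return True   # Megatron-LM 0.14 and later
    if (major, minor) <= (0, 12):
        return False  # Megatron-LM 0.12 and earlier
    # Intermediate or patched runtimes: inspect the actual MoE path instead.
    return None

print(choose_dispatch_probs("0.14.0"))  # True
print(choose_dispatch_probs("0.12.1"))  # False
print(choose_dispatch_probs("0.13.0"))  # None -> decide manually
```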
- `mem_factor`: memory usage coefficient (0.94 means reserving a 6% margin), used to estimate reserve_memory (= max_memory / mem_factor), default is 0.94
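A worked example of that estimate (numbers illustrative):

```python
# reserve_memory = max_memory / mem_factor, per the definition above.
max_memory = 75.2          # GiB, simulated peak usage (illustrative)
mem_factor = 0.94
reserve_memory = max_memory / mem_factor
print(f"{reserve_memory:.1f} GiB")  # 80.0 GiB, i.e. a 6% margin on top
```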