Closed
Labels: bug (Something isn't working)
Description
Bug description
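Training the llama4 debug model with --compile.enable crashes with a device-side assertion (`index out of bounds: 0 <= tmp4 < 1 + ks0`) in an Inductor-generated Triton kernel, reached from the MoE experts forward. The crash does not happen in eager mode. cc @xmfan @tianyu-l
Clearing the Inductor cache did not help.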
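For readers unfamiliar with the assertion text in the crash logs, `0 <= tmp4 < 1 + ks0` reads as a bounds check on a gather index. The sketch below is a plain-Python illustration of that check, not the generated kernel. It assumes `tmp4` is the loaded index and `ks0` is a symbolic row count, with the `1 +` coming from one extra row concatenated before the gather (the kernel name `triton_poi_fused_cat_index_unsqueeze_3` in the traceback hints at this, but it is an assumption). The function name and the sample values are hypothetical.

```python
def out_of_bounds(indices, num_rows):
    """Return (position, value) pairs that violate 0 <= value < 1 + num_rows."""
    # Assumption: one padding row is appended before indexing, so the valid
    # range is [0, num_rows] inclusive, i.e. value < 1 + num_rows.
    limit = 1 + num_rows
    return [(pos, idx) for pos, idx in enumerate(indices) if not 0 <= idx < limit]


# Hypothetical indices for a 4-row input: 4 points at the padding row (valid),
# while 5 and -1 would trip the assertion.
print(out_of_bounds([0, 3, 4, 5, -1], num_rows=4))
```

Running a check like this on the index tensor in eager mode, just before the experts are called, might help tell a genuinely bad index apart from a mismatch between the compiled kernel's assumed size and the actual input.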
Repro command
rm -rf /tmp/torchinductor_${USER}; NGPU=4 CUDA_VISIBLE_DEVICES="1,2,3,4" CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --training.steps=50 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --compile.enable
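For context, the crash logs report a 2-D device mesh `['dp_shard_mod_ep', 'dp_shard_in_ep']` with shape `[1, 4]` for this command. The sketch below shows one factorization that reproduces that shape from the two degrees passed on the command line. The function name and the formula are assumptions inferred from the dimension names and the logged shape. They are not taken from torchtitan's source, and tensor parallelism is ignored.

```python
def ep_mesh_dims(dp_shard, ep, cp=1):
    """Split the data-parallel shard degree around the expert-parallel degree.

    Hypothetical helper: assumes ep = dp_shard_in_ep * cp and
    dp_shard = dp_shard_mod_ep * dp_shard_in_ep.
    """
    if ep % cp != 0 or (dp_shard * cp) % ep != 0:
        raise ValueError("degrees do not factor evenly")
    return {
        "dp_shard_mod_ep": dp_shard * cp // ep,
        "dp_shard_in_ep": ep // cp,
    }


# Degrees from the repro command: data_parallel_shard_degree=4, expert_parallel_degree=4.
print(ep_mesh_dims(dp_shard=4, ep=4))
```

Under these assumptions all four ranks sit in the expert-parallel dimension, so the 16 experts would be split four per rank.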
Debug model config
I updated llama4 debug model to use same configs as 17bx16e model, but with only 2 layers:
llama4_configs = {
    "debugmodel": TransformerModelArgs(
        dim=5120,
        n_layers=2,
        n_heads=40,
        n_kv_heads=8,
        ffn_dim_multiplier=1.2,
        multiple_of=2048,
        rope_theta=500000,
        max_seq_len=10485760,
        moe_args=MoEArgs(num_experts=16),
        interleave_moe_layer_step=1,
    ),
}
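As a sanity check on this config, the parameter counts reported in the crash logs (dense 2,194,826,240, sparse 4,278,353,920, active 2,698,306,560) can be reproduced with some arithmetic. The sketch below is not taken from the issue. It assumes Llama-style hidden-dimension sizing (two thirds of `4 * dim`, scaled by `ffn_dim_multiplier`, divided by `top_k + num_shared_experts`, then rounded up to `multiple_of`), three weight matrices per expert, untied embedding and output matrices, and that "sparse" counts routed experts, the shared expert, and the router. `vocab_size`, `top_k`, and `num_shared_experts` are read from the logged `TransformerModelArgs`.

```python
def round_up(value, multiple):
    """Round value up to the next multiple."""
    return value + (-value % multiple)


dim, n_layers, n_heads, n_kv_heads = 5120, 2, 40, 8
vocab_size, multiple_of, ffn_dim_multiplier = 202048, 2048, 1.2
num_experts, num_shared_experts, top_k = 16, 1, 1

# Assumed expert hidden size (see the lead-in for the sizing rule).
hidden = int(ffn_dim_multiplier * int(2 * (4 * dim) / 3))
hidden = round_up(int(hidden / (top_k + num_shared_experts)), multiple_of)

head_dim = dim // n_heads
attention = 2 * dim * (n_heads * head_dim) + 2 * dim * (n_kv_heads * head_dim)  # wq, wo, wk, wv
norms_per_layer = 2 * dim  # attention norm and FFN norm
expert = 3 * dim * hidden  # three projections per expert
router = dim * num_experts

dense = 2 * vocab_size * dim + n_layers * (attention + norms_per_layer) + dim
sparse = n_layers * ((num_experts + num_shared_experts) * expert + router)
active = dense + n_layers * ((top_k + num_shared_experts) * expert + router)

print(f"dense {dense:,}, sparse {sparse:,}, active {active:,}, total {dense + sparse:,}")
```

The totals agree with the logged values, which suggests the assumed sizing matches what the model builder does for this config.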
Crash logs
rm -rf /tmp/torchinductor_danvm; CUDA_VISIBLE_DEVICES="4,5,6,7" TORCHTITAN_ROOT=/home/danvm/torchtitan NGPU=4 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable" ./llama4.sh ^C
(torchtitan) [[email protected] ~/torchtitan (main)]$ rm -rf /tmp/torchinductor_danvm; NGPU=4 CUDA_VISIBLE_DEVICES="1,2,3,4" CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --training.steps=50 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --compile.enable
+ NGPU=4
+ export LOG_RANK=0
+ LOG_RANK=0
+ CONFIG_FILE=./torchtitan/experiments/llama4/train_configs/debug_model.toml
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ PYTORCH_ALLOC_CONF=expandable_segments:True
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/experiments/llama4/train_configs/debug_model.toml --training.steps=50 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --compile.enable
W0825 21:33:51.703000 2232894 site-packages/torch/distributed/run.py:803]
W0825 21:33:51.703000 2232894 site-packages/torch/distributed/run.py:803] *****************************************
W0825 21:33:51.703000 2232894 site-packages/torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0825 21:33:51.703000 2232894 site-packages/torch/distributed/run.py:803] *****************************************
[rank0]:[titan] 2025-08-25 21:33:56,477 - root - INFO - Starting job: Llama 4 debug training
[rank0]:NCCL version 2.27.5+cuda12.9
[rank0]:[titan] 2025-08-25 21:33:57,773 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:[titan] 2025-08-25 21:33:57,775 - root - INFO - Building 2-D device mesh with ['dp_shard_mod_ep', 'dp_shard_in_ep'], [1, 4]
[rank0]:[titan] 2025-08-25 21:33:57,783 - root - INFO - [GC] Initial GC collection 0.00 seconds
[rank0]:[titan] 2025-08-25 21:34:01,238 - root - INFO - Loading tokenizer from tokenizer.json
[rank0]:[titan] 2025-08-25 21:34:01,239 - root - INFO - Preparing c4_test dataset from tests/assets/c4_test
[rank0]:[titan] 2025-08-25 21:34:01,320 - root - INFO - Building llama4 debugmodel with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=5120, n_layers=2, n_heads=40, n_kv_heads=8, vocab_size=202048, multiple_of=2048, ffn_dim_multiplier=1.2, norm_eps=1e-05, rope_theta=500000, max_seq_len=2048, depth_init=True, use_flex_attn=False, attn_mask_type='causal', every_n_layers_nope=None, fixed_attn_block_size=8192, moe_args=MoEArgs(num_experts=16, num_shared_experts=1, score_func='sigmoid', route_norm=False, route_scale=1.0, score_before_experts=True, top_k=1, use_grouped_mm=True, load_balance_coeff=0.001), auto_scale_hidden_dim=True, interleave_moe_layer_step=1)
[rank0]:[titan] 2025-08-25 21:34:01,340 - root - INFO - CUDA capacity: NVIDIA H100 with 95.00GiB memory
[rank0]:[titan] 2025-08-25 21:34:01,396 - root - INFO - Total parameter count: dense 2,194,826,240, sparse 4,278,353,920, active 2,698,306,560
[rank0]:[titan] 2025-08-25 21:34:01,396 - root - INFO - Model llama4 debugmodel size: 6,473,180,160 total parameters
[rank0]:[titan] 2025-08-25 21:34:01,396 - root - INFO - Compiling the loss function with torch.compile
[rank0]:[titan] 2025-08-25 21:34:01,415 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-08-25 21:34:01,417 - root - INFO - Compiling each TransformerBlock with torch.compile
[rank0]:[titan] 2025-08-25 21:34:01,432 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-08-25 21:34:02,083 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-08-25 21:34:02,083 - root - INFO - CUDA memory usage for model: 6.04GiB(6.36%)
[rank0]:[titan] 2025-08-25 21:34:02,084 - root - INFO - Mixed precision training is handled by fully_shard
[rank0]:[titan] 2025-08-25 21:34:02,084 - root - INFO - Trainer is initialized with local batch size 8, global batch size 32, gradient accumulation steps 1, sequence length 2048, total steps 50 (warmup 2)
[rank0]:[titan] 2025-08-25 21:34:02,084 - root - INFO - Training starts at step 1
[rank0]:/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/lowering.py:1937: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank0]: warnings.warn(
[rank0]:[rank0]:W0825 21:34:07.079000 2233551 site-packages/torch/_inductor/utils.py:2237] [2/0] DeviceCopy in input program
[rank0]:[rank0]:W0825 21:34:07.080000 2233551 site-packages/torch/_inductor/utils.py:2237] [2/0] DeviceCopy in input program
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24230,0,0], thread: [0,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:[... the same assertion repeats for threads [1,0,0] through [31,0,0] of block [24230,0,0] ...]
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [0,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:[... the same assertion repeats for threads [1,0,0] through [31,0,0] of block [24127,0,0] ...]
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24464,0,0], thread: [64,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:[... the same assertion repeats for threads [65,0,0] through [95,0,0] of block [24464,0,0] ...]
[rank0]:[rank0]: Traceback (most recent call last):
[rank0]:[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]:[rank0]: File "/home/danvm/torchtitan/torchtitan/train.py", line 656, in <module>
[rank0]:[rank0]: trainer.train()
[rank0]:[rank0]: ~~~~~~~~~~~~~^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
[rank0]:[rank0]: return f(*args, **kwargs)
[rank0]:[rank0]: File "/home/danvm/torchtitan/torchtitan/train.py", line 585, in train
[rank0]:[rank0]: self.train_step(data_iterator)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/torchtitan/torchtitan/train.py", line 485, in train_step
[rank0]:[rank0]: loss = self.forward_backward_step(input_dict, labels)
[rank0]:[rank0]: File "/home/danvm/torchtitan/torchtitan/train.py", line 461, in forward_backward_step
[rank0]:[rank0]: pred = model_parts[0](inputs)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
[rank0]:[rank0]: return inner()
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1829, in inner
[rank0]:[rank0]: result = forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/danvm/torchtitan/torchtitan/experiments/llama4/model/model.py", line 477, in forward
[rank0]:[rank0]: h = layer(h, self.freqs_cis)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py", line 413, in __call__
[rank0]:[rank0]: return super().__call__(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
[rank0]:[rank0]: return inner()
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1829, in inner
[rank0]:[rank0]: result = forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper
[rank0]:[rank0]: return fn(*args, **kwargs)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:[rank0]: return forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/danvm/torchtitan/torchtitan/experiments/llama4/model/model.py", line 352, in forward
[rank0]:[rank0]: out = h + self.moe(self.ffn_norm(h))
[rank0]:[rank0]: ~~~~~~~~^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:[rank0]: return forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/danvm/torchtitan/torchtitan/models/moe.py", line 418, in forward
[rank0]:[rank0]: routed_output = self.experts(routed_input, num_tokens_per_expert)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
[rank0]:[rank0]: return inner()
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1829, in inner
[rank0]:[rank0]: result = forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/danvm/torchtitan/torchtitan/models/moe.py", line 142, in forward
[rank0]:[rank0]: def forward(
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
[rank0]:[rank0]: return fn(*args, **kwargs)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/aot_autograd.py", line 1130, in forward
[rank0]:[rank0]: return compiled_fn(full_args)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 339, in runtime_wrapper
[rank0]:[rank0]: all_outs = call_func_at_runtime_with_args(
[rank0]:[rank0]: compiled_fn, args_, disable_amp=disable_amp, steal_args=True
[rank0]:[rank0]: )
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
[rank0]:[rank0]: out = normalize_as_list(f(args))
[rank0]:[rank0]: ~^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/utils.py", line 103, in g
[rank0]:[rank0]: return f(*args)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/autograd/function.py", line 581, in apply
[rank0]:[rank0]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank0]:[rank0]: ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2118, in forward
[rank0]:[rank0]: fw_outs = call_func_at_runtime_with_args(
[rank0]:[rank0]: CompiledFunction.compiled_fw,
[rank0]:[rank0]: args,
[rank0]:[rank0]: disable_amp=disable_amp,
[rank0]:[rank0]: )
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
[rank0]:[rank0]: out = normalize_as_list(f(args))
[rank0]:[rank0]: ~^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
[rank0]:[rank0]: return compiled_fn(runtime_args)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 690, in inner_fn
[rank0]:[rank0]: unwrapped_outs = compiled_fn(unwrapped_args)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 724, in inner_fn
[rank0]:[rank0]: outs = compiled_fn(args)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/output_code.py", line 588, in __call__
[rank0]:[rank0]: return self.current_callable(inputs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/utils.py", line 2897, in run
[rank0]:[rank0]: out = model(new_inputs)
[rank0]:[rank0]: File "/tmp/torchinductor_danvm/yn/cynnrkvurrqyedjoka7duy2omp3nftnnkw757tgim7j4neom3ahs.py", line 513, in call
[rank0]:[rank0]: triton_poi_fused_cat_index_unsqueeze_3.run(buf2, primals_6, buf6, s77, triton_poi_fused_cat_index_unsqueeze_3_xnumel, stream=stream0)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1202, in run
[rank0]:[rank0]: self.autotune_to_one_config(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 987, in autotune_to_one_config
[rank0]:[rank0]: timings = self.benchmark_all_configs(*args, **kwargs)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 962, in benchmark_all_configs
[rank0]:[rank0]: launcher: self.bench(launcher, *args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 830, in bench
[rank0]:[rank0]: return benchmarker.benchmark_gpu(kernel_call, rep=40)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper
[rank0]:[rank0]: return fn(self, *args, **kwargs)
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/runtime/benchmarking.py", line 251, in benchmark_gpu
[rank0]:[rank0]: torch.cuda.synchronize()
[rank0]:[rank0]: ~~~~~~~~~~~~~~~~~~~~~~^^
[rank0]:[rank0]: File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/cuda/__init__.py", line 1083, in synchronize
[rank0]:[rank0]: return torch._C._cuda_synchronize()
[rank0]:[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~^^
[rank0]:[rank0]: torch.AcceleratorError: CUDA error: device-side assert triggered
[rank0]:[rank0]: Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
[rank0]:[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]:[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]:[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]:
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [32,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [33,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [34,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [35,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [36,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [37,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [38,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [39,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [40,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [41,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [42,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [43,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [44,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [45,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [46,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [47,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [48,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [49,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [50,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [51,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [52,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [53,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [54,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [55,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [56,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [57,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [58,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [59,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [60,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [61,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [62,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [63,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:
[rank0]:[2025-08-25 21:34:20] devgpu007:2233551:2248260 [0] misc/strongstream.cc:403 NCCL WARN Cuda failure 'device-side assert triggered'
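The assertion lines above all report the same condition: an index (`tmp4` in the generated kernel) fell outside `[0, 1 + ks0)`. As a rough way to narrow this down, the same bounds check can be run on the host against the index values that feed the failing op. The snippet below is only a sketch in plain Python: the helper name `out_of_bounds_positions` and the sample values are hypothetical, and treating `ks0` as a runtime size passed into the kernel is an assumption based on the assertion text.

```python
def out_of_bounds_positions(indices, size):
    """Return (position, value) pairs for every index outside [0, size)."""
    return [(pos, idx) for pos, idx in enumerate(indices) if not 0 <= idx < size]


# Hypothetical values, not taken from the failing run.
# The assertion bound is 1 + ks0, so with ks0 = 4 the valid indices are 0..4.
ks0 = 4
indices = [0, 3, 4, 5, -1]

bad = out_of_bounds_positions(indices, 1 + ks0)
print(bad)  # [(3, 5), (4, -1)]
```

Running a check like this in eager mode on the real index tensor (after converting it to a list) would show whether the indices are already invalid before compilation, or whether the bound used by the compiled kernel differs from the eager one.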
Versions
- torchtitan main branch with fresh git pull
- torch latest CUDA 12.8 nightly: 2.9.0.dev20250825+cu128