Llama4 DP=4, EP=4 with torch.compile crashes with inductor codegen triton kernel error #1640

@danielvegamyhre

Bug description

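Training the llama4 debug model with DP=4 and EP=4 crashes when torch.compile is enabled: an inductor-generated triton kernel fails an "index out of bounds" assertion on the first training step. The crash does not happen in eager mode. cc @xmfan @tianyu-l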

I tried clearing the inductor cache, but it didn't help.
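
For reference, the failing check in the crash logs has the form 0 <= tmp4 < 1 + ks0: an index computed inside the generated kernel must fall in a half-open range whose upper bound depends on a runtime value. The sketch below only illustrates that condition in plain Python. The helper name and the sample values are hypothetical, and reading ks0 as a runtime size argument of the kernel is an assumption, not something the logs confirm.

```python
def out_of_bounds_positions(indices, size):
    """Return the positions whose index falls outside the range [0, size)."""
    return [pos for pos, idx in enumerate(indices) if not 0 <= idx < size]


# Hypothetical values, chosen only to show which indices would trip the check.
ks0 = 7
indices = [0, 3, 7, 8, -1]

# Same condition as the assertion: 0 <= index < 1 + ks0.
bad_positions = out_of_bounds_positions(indices, 1 + ks0)
print(bad_positions)
```

Here the last two entries (8 and -1) violate the bound, so their positions are reported. In the real kernel, any thread whose index fails the same condition triggers the device-side assertion.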

Repro command

  • rm -rf /tmp/torchinductor_${USER}; NGPU=4 CUDA_VISIBLE_DEVICES="1,2,3,4" CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --training.steps=50 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --compile.enable
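
With these flags, the log reports a 2-D device mesh ['dp_shard_mod_ep', 'dp_shard_in_ep'] with sizes [1, 4]. The sketch below shows one way those sizes could follow from the two degrees on the command line. It assumes no context or tensor parallelism, and the function name and the split rule are illustrative assumptions rather than torchtitan's actual code.

```python
def ep_mesh_dims(dp_shard: int, ep: int) -> dict:
    """Split the data-parallel shard dimension around expert parallelism.

    Assumption: with only FSDP and EP enabled, the EP ranks are carved out of
    the dp_shard ranks, so dp_shard must be divisible by ep.
    """
    if dp_shard % ep != 0:
        raise ValueError(f"dp_shard={dp_shard} is not divisible by ep={ep}")
    return {"dp_shard_mod_ep": dp_shard // ep, "dp_shard_in_ep": ep}


# Degrees from the repro command: data_parallel_shard_degree=4, expert_parallel_degree=4.
mesh = ep_mesh_dims(dp_shard=4, ep=4)
print(mesh)
```

Under that assumption, the repro degrees give the same [1, 4] sizes that appear in the log.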

Debug model config

I updated the llama4 debug model to use the same config as the 17bx16e model, but with only 2 layers:

llama4_configs = {
    "debugmodel": TransformerModelArgs(
        dim=5120,
        n_layers=2,
        n_heads=40,
        n_kv_heads=8,
        ffn_dim_multiplier=1.2,
        multiple_of=2048,
        rope_theta=500000,
        max_seq_len=10485760,
        moe_args=MoEArgs(num_experts=16),
        interleave_moe_layer_step=1,
    ),
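
As a sanity check on this config, the parameter counts printed in the crash logs (dense 2,194,826,240, sparse 4,278,353,920, active 2,698,306,560) can be reproduced with a few lines of arithmetic. The sketch below takes vocab_size, top_k and num_shared_experts from the "Building llama4 debugmodel" log line. The hidden-dimension formula (Llama-style sizing, divided by top_k + num_shared_experts when auto_scale_hidden_dim=True), the untied embeddings, and the grouping into "dense" and "sparse" are assumptions that happen to match the logged numbers.

```python
# Values copied from the debug config and the "Building llama4 debugmodel" log line.
DIM, N_LAYERS, N_HEADS, N_KV_HEADS = 5120, 2, 40, 8
VOCAB_SIZE = 202048
FFN_DIM_MULTIPLIER, MULTIPLE_OF = 1.2, 2048
NUM_EXPERTS, NUM_SHARED_EXPERTS, TOP_K = 16, 1, 1


def moe_hidden_dim() -> int:
    # Assumed Llama-style sizing: 2/3 of 4*dim, scaled by the multiplier.
    hidden = int(FFN_DIM_MULTIPLIER * int(2 * (4 * DIM) / 3))
    # Assumed effect of auto_scale_hidden_dim=True.
    hidden = int(hidden / (TOP_K + NUM_SHARED_EXPERTS))
    # Round up to the next multiple of MULTIPLE_OF.
    return hidden + (-hidden % MULTIPLE_OF)


def router_params() -> int:
    return DIM * NUM_EXPERTS


def expert_params(count: int) -> int:
    # Three weight matrices (w1, w2, w3) per expert.
    return count * 3 * DIM * moe_hidden_dim()


def dense_params() -> int:
    kv_dim = N_KV_HEADS * (DIM // N_HEADS)
    attention = 2 * DIM * DIM + 2 * DIM * kv_dim  # wq, wo, wk, wv
    per_layer = attention + 2 * DIM  # plus two norm weight vectors
    embeddings = 2 * VOCAB_SIZE * DIM  # input and output embeddings, assumed untied
    return embeddings + N_LAYERS * per_layer + DIM  # plus the final norm


def sparse_params() -> int:
    per_layer = expert_params(NUM_EXPERTS + NUM_SHARED_EXPERTS) + router_params()
    return N_LAYERS * per_layer


def active_params() -> int:
    per_layer = expert_params(TOP_K + NUM_SHARED_EXPERTS) + router_params()
    return dense_params() + N_LAYERS * per_layer


print(moe_hidden_dim(), dense_params(), sparse_params(), active_params())
```

With these assumptions the expert hidden dimension comes out to 8192, and the dense, sparse and active counts equal the values in the "Total parameter count" log line, which suggests the debug config was picked up as intended.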

Crash logs

(torchtitan) [[email protected] ~/torchtitan (main)]$ rm -rf /tmp/torchinductor_danvm; NGPU=4 CUDA_VISIBLE_DEVICES="1,2,3,4" CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --training.steps=50 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --compile.enable
+ NGPU=4
+ export LOG_RANK=0
+ LOG_RANK=0
+ CONFIG_FILE=./torchtitan/experiments/llama4/train_configs/debug_model.toml
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ PYTORCH_ALLOC_CONF=expandable_segments:True
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/experiments/llama4/train_configs/debug_model.toml --training.steps=50 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --compile.enable
W0825 21:33:51.703000 2232894 site-packages/torch/distributed/run.py:803] 
W0825 21:33:51.703000 2232894 site-packages/torch/distributed/run.py:803] *****************************************
W0825 21:33:51.703000 2232894 site-packages/torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0825 21:33:51.703000 2232894 site-packages/torch/distributed/run.py:803] *****************************************
[rank0]:[titan] 2025-08-25 21:33:56,477 - root - INFO - Starting job: Llama 4 debug training
[rank0]:NCCL version 2.27.5+cuda12.9
[rank0]:[titan] 2025-08-25 21:33:57,773 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:[titan] 2025-08-25 21:33:57,775 - root - INFO - Building 2-D device mesh with ['dp_shard_mod_ep', 'dp_shard_in_ep'], [1, 4]
[rank0]:[titan] 2025-08-25 21:33:57,783 - root - INFO - [GC] Initial GC collection 0.00 seconds
[rank0]:[titan] 2025-08-25 21:34:01,238 - root - INFO - Loading tokenizer from tokenizer.json
[rank0]:[titan] 2025-08-25 21:34:01,239 - root - INFO - Preparing c4_test dataset from tests/assets/c4_test
[rank0]:[titan] 2025-08-25 21:34:01,320 - root - INFO - Building llama4 debugmodel with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=5120, n_layers=2, n_heads=40, n_kv_heads=8, vocab_size=202048, multiple_of=2048, ffn_dim_multiplier=1.2, norm_eps=1e-05, rope_theta=500000, max_seq_len=2048, depth_init=True, use_flex_attn=False, attn_mask_type='causal', every_n_layers_nope=None, fixed_attn_block_size=8192, moe_args=MoEArgs(num_experts=16, num_shared_experts=1, score_func='sigmoid', route_norm=False, route_scale=1.0, score_before_experts=True, top_k=1, use_grouped_mm=True, load_balance_coeff=0.001), auto_scale_hidden_dim=True, interleave_moe_layer_step=1)
[rank0]:[titan] 2025-08-25 21:34:01,340 - root - INFO - CUDA capacity: NVIDIA H100 with 95.00GiB memory
[rank0]:[titan] 2025-08-25 21:34:01,396 - root - INFO - Total parameter count: dense 2,194,826,240, sparse 4,278,353,920, active 2,698,306,560
[rank0]:[titan] 2025-08-25 21:34:01,396 - root - INFO - Model llama4 debugmodel size: 6,473,180,160 total parameters
[rank0]:[titan] 2025-08-25 21:34:01,396 - root - INFO - Compiling the loss function with torch.compile
[rank0]:[titan] 2025-08-25 21:34:01,415 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-08-25 21:34:01,417 - root - INFO - Compiling each TransformerBlock with torch.compile
[rank0]:[titan] 2025-08-25 21:34:01,432 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-08-25 21:34:02,083 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-08-25 21:34:02,083 - root - INFO - CUDA memory usage for model: 6.04GiB(6.36%)
[rank0]:[titan] 2025-08-25 21:34:02,084 - root - INFO - Mixed precision training is handled by fully_shard
[rank0]:[titan] 2025-08-25 21:34:02,084 - root - INFO - Trainer is initialized with local batch size 8, global batch size 32, gradient accumulation steps 1, sequence length 2048, total steps 50 (warmup 2)
[rank0]:[titan] 2025-08-25 21:34:02,084 - root - INFO - Training starts at step 1
[rank0]:/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/lowering.py:1937: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank0]:  warnings.warn(
[rank0]:[rank0]:W0825 21:34:07.079000 2233551 site-packages/torch/_inductor/utils.py:2237] [2/0] DeviceCopy in input program
[rank0]:[rank0]:W0825 21:34:07.080000 2233551 site-packages/torch/_inductor/utils.py:2237] [2/0] DeviceCopy in input program
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24230,0,0], thread: [0,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[same assertion repeated for threads [1,0,0] through [31,0,0] of block [24230,0,0]]
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [0,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[same assertion repeated for threads [1,0,0] through [31,0,0] of block [24127,0,0]]
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24464,0,0], thread: [64,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[same assertion repeated for threads [65,0,0] through [95,0,0] of block [24464,0,0]]
[rank0]:[rank0]: Traceback (most recent call last):
[rank0]:[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:[rank0]:   File "/home/danvm/torchtitan/torchtitan/train.py", line 656, in <module>
[rank0]:[rank0]:     trainer.train()
[rank0]:[rank0]:     ~~~~~~~~~~~~~^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
[rank0]:[rank0]:     return f(*args, **kwargs)
[rank0]:[rank0]:   File "/home/danvm/torchtitan/torchtitan/train.py", line 585, in train
[rank0]:[rank0]:     self.train_step(data_iterator)
[rank0]:[rank0]:     ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/torchtitan/torchtitan/train.py", line 485, in train_step
[rank0]:[rank0]:     loss = self.forward_backward_step(input_dict, labels)
[rank0]:[rank0]:   File "/home/danvm/torchtitan/torchtitan/train.py", line 461, in forward_backward_step
[rank0]:[rank0]:     pred = model_parts[0](inputs)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:[rank0]:            ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
[rank0]:[rank0]:     return inner()
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1829, in inner
[rank0]:[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:[rank0]:   File "/home/danvm/torchtitan/torchtitan/experiments/llama4/model/model.py", line 477, in forward
[rank0]:[rank0]:     h = layer(h, self.freqs_cis)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py", line 413, in __call__
[rank0]:[rank0]:     return super().__call__(*args, **kwargs)
[rank0]:[rank0]:            ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:[rank0]:            ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
[rank0]:[rank0]:     return inner()
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1829, in inner
[rank0]:[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper
[rank0]:[rank0]:     return fn(*args, **kwargs)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:[rank0]:            ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:[rank0]:     return forward_call(*args, **kwargs)
[rank0]:[rank0]:   File "/home/danvm/torchtitan/torchtitan/experiments/llama4/model/model.py", line 352, in forward
[rank0]:[rank0]:     out = h + self.moe(self.ffn_norm(h))
[rank0]:[rank0]:               ~~~~~~~~^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:[rank0]:            ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:[rank0]:     return forward_call(*args, **kwargs)
[rank0]:[rank0]:   File "/home/danvm/torchtitan/torchtitan/models/moe.py", line 418, in forward
[rank0]:[rank0]:     routed_output = self.experts(routed_input, num_tokens_per_expert)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:[rank0]:            ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
[rank0]:[rank0]:     return inner()
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1829, in inner
[rank0]:[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:[rank0]:   File "/home/danvm/torchtitan/torchtitan/models/moe.py", line 142, in forward
[rank0]:[rank0]:     def forward(
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
[rank0]:[rank0]:     return fn(*args, **kwargs)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/aot_autograd.py", line 1130, in forward
[rank0]:[rank0]:     return compiled_fn(full_args)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 339, in runtime_wrapper
[rank0]:[rank0]:     all_outs = call_func_at_runtime_with_args(
[rank0]:[rank0]:         compiled_fn, args_, disable_amp=disable_amp, steal_args=True
[rank0]:[rank0]:     )
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
[rank0]:[rank0]:     out = normalize_as_list(f(args))
[rank0]:[rank0]:                             ~^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/utils.py", line 103, in g
[rank0]:[rank0]:     return f(*args)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/autograd/function.py", line 581, in apply
[rank0]:[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:[rank0]:            ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2118, in forward
[rank0]:[rank0]:     fw_outs = call_func_at_runtime_with_args(
[rank0]:[rank0]:         CompiledFunction.compiled_fw,
[rank0]:[rank0]:         args,
[rank0]:[rank0]:         disable_amp=disable_amp,
[rank0]:[rank0]:     )
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
[rank0]:[rank0]:     out = normalize_as_list(f(args))
[rank0]:[rank0]:                             ~^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
[rank0]:[rank0]:     return compiled_fn(runtime_args)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 690, in inner_fn
[rank0]:[rank0]:     unwrapped_outs = compiled_fn(unwrapped_args)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 724, in inner_fn
[rank0]:[rank0]:     outs = compiled_fn(args)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/output_code.py", line 588, in __call__
[rank0]:[rank0]:     return self.current_callable(inputs)
[rank0]:[rank0]:            ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/utils.py", line 2897, in run
[rank0]:[rank0]:     out = model(new_inputs)
[rank0]:[rank0]:   File "/tmp/torchinductor_danvm/yn/cynnrkvurrqyedjoka7duy2omp3nftnnkw757tgim7j4neom3ahs.py", line 513, in call
[rank0]:[rank0]:     triton_poi_fused_cat_index_unsqueeze_3.run(buf2, primals_6, buf6, s77, triton_poi_fused_cat_index_unsqueeze_3_xnumel, stream=stream0)
[rank0]:[rank0]:     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1202, in run
[rank0]:[rank0]:     self.autotune_to_one_config(*args, **kwargs)
[rank0]:[rank0]:     ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 987, in autotune_to_one_config
[rank0]:[rank0]:     timings = self.benchmark_all_configs(*args, **kwargs)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 962, in benchmark_all_configs
[rank0]:[rank0]:     launcher: self.bench(launcher, *args, **kwargs)
[rank0]:[rank0]:               ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 830, in bench
[rank0]:[rank0]:     return benchmarker.benchmark_gpu(kernel_call, rep=40)
[rank0]:[rank0]:            ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper
[rank0]:[rank0]:     return fn(self, *args, **kwargs)
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_inductor/runtime/benchmarking.py", line 251, in benchmark_gpu
[rank0]:[rank0]:     torch.cuda.synchronize()
[rank0]:[rank0]:     ~~~~~~~~~~~~~~~~~~~~~~^^
[rank0]:[rank0]:   File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/cuda/__init__.py", line 1083, in synchronize
[rank0]:[rank0]:     return torch._C._cuda_synchronize()
[rank0]:[rank0]:            ~~~~~~~~~~~~~~~~~~~~~~~~~~^^
[rank0]:[rank0]: torch.AcceleratorError: CUDA error: device-side assert triggered
[rank0]:[rank0]: Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
[rank0]:[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]:[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]:[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]:
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [32,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [33,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [34,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [35,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [36,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [37,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [38,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [39,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [40,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [41,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [42,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [43,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [44,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [45,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [46,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [47,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [48,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [49,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [50,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [51,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [52,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [53,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [54,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [55,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [56,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [57,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [58,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [59,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [60,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [61,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [62,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:/tmp/torchinductor_danvm/k4/ck46uf2tjbfns332dtzd3q3m2mgw2hoguozdpq5bvqyjhntqiu23.py:30: unknown: block: [24127,0,0], thread: [63,0,0] Assertion `index out of bounds: 0 <= tmp4 < 1 + ks0` failed.
[rank0]:
[rank0]:[2025-08-25 21:34:20] devgpu007:2233551:2248260 [0] misc/strongstream.cc:403 NCCL WARN Cuda failure 'device-side assert triggered'
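
For triage, the 32 assertion lines above differ only in the thread index, so it can help to collapse them into one record per kernel file, block, and assertion. The snippet below is a minimal sketch of that, assuming the log format is exactly what is pasted above; `summarize_asserts` is a hypothetical helper written for this issue, not part of torch or torchtitan.

```python
import re
from collections import defaultdict

# Matches device-side assert lines in the format pasted in this issue, e.g.
# /tmp/.../kernel.py:30: unknown: block: [24127,0,0], thread: [32,0,0] Assertion `...` failed.
ASSERT_RE = re.compile(
    r"(?P<file>/\S+\.py):(?P<line>\d+): .*?"
    r"block: \[(?P<block>[\d,]+)\], thread: \[(?P<thread>[\d,]+)\] "
    r"Assertion `(?P<expr>[^`]+)` failed\."
)


def summarize_asserts(log_text):
    """Group assert lines by (file, line, block, assertion) and report the thread.x range."""
    groups = defaultdict(list)
    for m in ASSERT_RE.finditer(log_text):
        key = (m["file"], int(m["line"]), m["block"], m["expr"])
        # Only the x component of the thread index varies in the log above.
        groups[key].append(int(m["thread"].split(",")[0]))
    return [
        {
            "file": file,
            "line": line,
            "block": block,
            "assertion": expr,
            "thread_x_min": min(threads),
            "thread_x_max": max(threads),
            "count": len(threads),
        }
        for (file, line, block, expr), threads in groups.items()
    ]
```

Running this over the pasted log should produce a single record for block `[24127,0,0]`, which makes it easier to see that every failure comes from the same bounds check (`0 <= tmp4 < 1 + ks0`) at line 30 of one generated kernel file.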

Versions

  • torchtitan main branch with fresh git pull
  • torch latest CUDA 12.8 nightly: 2.9.0.dev20250825+cu128
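
The crash log notes that CUDA kernel errors may be reported asynchronously and suggests `CUDA_LAUNCH_BLOCKING=1`. A possible debug rerun is sketched below. It reuses the repro command from this issue and additionally sets `TORCH_LOGS=output_code` so Inductor prints the generated code; that extra setting is an assumption about what would be useful here, not something tried in this report. `debug_rerun` is a hypothetical helper name.

```python
import os


def debug_rerun(base_env=None):
    """Build the environment and command for a synchronous, more verbose rerun."""
    env = dict(os.environ if base_env is None else base_env)
    # Suggested by the error message: make kernel launches synchronous so the
    # Python stack trace points at the launch that actually failed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Assumption: also dump the Inductor-generated code for inspection.
    env["TORCH_LOGS"] = "output_code"
    # Same settings as the repro command in this issue.
    env["NGPU"] = "4"
    env["CONFIG_FILE"] = "./torchtitan/experiments/llama4/train_configs/debug_model.toml"
    cmd = [
        "./run_train.sh",
        "--training.steps=50",
        "--parallelism.data_parallel_shard_degree=4",
        "--parallelism.expert_parallel_degree=4",
        "--compile.enable",
    ]
    return env, cmd
```

From the torchtitan checkout, the result could be passed to `subprocess.run(cmd, env=env)`. Expect a noticeably slower run, since blocking launches serialize GPU work.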
