Skip to content

Error while full finetuning Llama 4 Scout #2885

@agunapal

Description

@agunapal

I am running this with nightlies

Logs

tune run --nproc_per_node 8 full_finetune_distributed --config recipes/configs/llama4/scout_17B_16E_full.yaml batch_size=4 epochs=10
Running with torchrun...
W0716 20:03:05.675000 63909 site-packages/torch/distributed/run.py:774] 
W0716 20:03:05.675000 63909 site-packages/torch/distributed/run.py:774] *****************************************
W0716 20:03:05.675000 63909 site-packages/torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0716 20:03:05.675000 63909 site-packages/torch/distributed/run.py:774] *****************************************
INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 4
batch_size_val: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: '00050'
  model_type: LLAMA4
  output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
data_parallel_replicate_dim: 1
data_parallel_shard_dim: -1
dataset:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
dataset_val:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 10
fsdp_cpu_offload: true
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torch.nn.CrossEntropyLoss
  ignore_index: -100
  reduction: mean
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full/logs
model:
  _component_: torchtune.models.llama4.llama4_scout_17b_16e
optimizer:
  _component_: torch.optim.AdamW
  fused: false
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
resume_from_checkpoint: false
run_val_every_n_steps: 60
seed: null
shuffle: true
tensor_parallel_dim: 2
tensor_parallel_plan:
  _component_: torchtune.models.llama4.decoder_only_tp_plan
tokenizer:
  _component_: torchtune.models.llama4.llama4_transform
  max_num_tiles: 16
  max_seq_len: null
  path: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct/tokenizer.model

INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 4
batch_size_val: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: '00050'
  model_type: LLAMA4
  output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
data_parallel_replicate_dim: 1
data_parallel_shard_dim: -1
dataset:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
dataset_val:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 10
fsdp_cpu_offload: true
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torch.nn.CrossEntropyLoss
  ignore_index: -100
  reduction: mean
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full/logs
model:
  _component_: torchtune.models.llama4.llama4_scout_17b_16e
optimizer:
  _component_: torch.optim.AdamW
  fused: false
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
resume_from_checkpoint: false
run_val_every_n_steps: 60
seed: null
shuffle: true
tensor_parallel_dim: 2
tensor_parallel_plan:
  _component_: torchtune.models.llama4.decoder_only_tp_plan
tokenizer:
  _component_: torchtune.models.llama4.llama4_transform
  max_num_tiles: 16
  max_seq_len: null
  path: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct/tokenizer.model

INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 4
batch_size_val: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: '00050'
  model_type: LLAMA4
  output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
data_parallel_replicate_dim: 1
data_parallel_shard_dim: -1
dataset:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
dataset_val:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 10
fsdp_cpu_offload: true
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torch.nn.CrossEntropyLoss
  ignore_index: -100
  reduction: mean
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full/logs
model:
  _component_: torchtune.models.llama4.llama4_scout_17b_16e
optimizer:
  _component_: torch.optim.AdamW
  fused: false
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
resume_from_checkpoint: false
run_val_every_n_steps: 60
seed: null
shuffle: true
tensor_parallel_dim: 2
tensor_parallel_plan:
  _component_: torchtune.models.llama4.decoder_only_tp_plan
tokenizer:
  _component_: torchtune.models.llama4.llama4_transform
  max_num_tiles: 16
  max_seq_len: null
  path: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct/tokenizer.model

INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 4
batch_size_val: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: '00050'
  model_type: LLAMA4
  output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
data_parallel_replicate_dim: 1
data_parallel_shard_dim: -1
dataset:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
dataset_val:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 10
fsdp_cpu_offload: true
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torch.nn.CrossEntropyLoss
  ignore_index: -100
  reduction: mean
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full/logs
model:
  _component_: torchtune.models.llama4.llama4_scout_17b_16e
optimizer:
  _component_: torch.optim.AdamW
  fused: false
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
resume_from_checkpoint: false
run_val_every_n_steps: 60
seed: null
shuffle: true
tensor_parallel_dim: 2
tensor_parallel_plan:
  _component_: torchtune.models.llama4.decoder_only_tp_plan
tokenizer:
  _component_: torchtune.models.llama4.llama4_transform
  max_num_tiles: 16
  max_seq_len: null
  path: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct/tokenizer.model

INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 4
batch_size_val: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: '00050'
  model_type: LLAMA4
  output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
data_parallel_replicate_dim: 1
data_parallel_shard_dim: -1
dataset:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
dataset_val:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 10
fsdp_cpu_offload: true
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torch.nn.CrossEntropyLoss
  ignore_index: -100
  reduction: mean
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full/logs
model:
  _component_: torchtune.models.llama4.llama4_scout_17b_16e
optimizer:
  _component_: torch.optim.AdamW
  fused: false
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
resume_from_checkpoint: false
run_val_every_n_steps: 60
seed: null
shuffle: true
tensor_parallel_dim: 2
tensor_parallel_plan:
  _component_: torchtune.models.llama4.decoder_only_tp_plan
tokenizer:
  _component_: torchtune.models.llama4.llama4_transform
  max_num_tiles: 16
  max_seq_len: null
  path: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct/tokenizer.model

INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 4
batch_size_val: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: '00050'
  model_type: LLAMA4
  output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
data_parallel_replicate_dim: 1
data_parallel_shard_dim: -1
dataset:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
dataset_val:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 10
fsdp_cpu_offload: true
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torch.nn.CrossEntropyLoss
  ignore_index: -100
  reduction: mean
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full/logs
model:
  _component_: torchtune.models.llama4.llama4_scout_17b_16e
optimizer:
  _component_: torch.optim.AdamW
  fused: false
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
resume_from_checkpoint: false
run_val_every_n_steps: 60
seed: null
shuffle: true
tensor_parallel_dim: 2
tensor_parallel_plan:
  _component_: torchtune.models.llama4.decoder_only_tp_plan
tokenizer:
  _component_: torchtune.models.llama4.llama4_transform
  max_num_tiles: 16
  max_seq_len: null
  path: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct/tokenizer.model

INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 4
batch_size_val: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: '00050'
  model_type: LLAMA4
  output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
data_parallel_replicate_dim: 1
data_parallel_shard_dim: -1
dataset:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
dataset_val:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 10
fsdp_cpu_offload: true
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torch.nn.CrossEntropyLoss
  ignore_index: -100
  reduction: mean
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full/logs
model:
  _component_: torchtune.models.llama4.llama4_scout_17b_16e
optimizer:
  _component_: torch.optim.AdamW
  fused: false
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
resume_from_checkpoint: false
run_val_every_n_steps: 60
seed: null
shuffle: true
tensor_parallel_dim: 2
tensor_parallel_plan:
  _component_: torchtune.models.llama4.decoder_only_tp_plan
tokenizer:
  _component_: torchtune.models.llama4.llama4_transform
  max_num_tiles: 16
  max_seq_len: null
  path: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct/tokenizer.model

INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 4
batch_size_val: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: '00050'
  model_type: LLAMA4
  output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
data_parallel_replicate_dim: 1
data_parallel_shard_dim: -1
dataset:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
dataset_val:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 10
fsdp_cpu_offload: true
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torch.nn.CrossEntropyLoss
  ignore_index: -100
  reduction: mean
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full/logs
model:
  _component_: torchtune.models.llama4.llama4_scout_17b_16e
optimizer:
  _component_: torch.optim.AdamW
  fused: false
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
resume_from_checkpoint: false
run_val_every_n_steps: 60
seed: null
shuffle: true
tensor_parallel_dim: 2
tensor_parallel_plan:
  _component_: torchtune.models.llama4.decoder_only_tp_plan
tokenizer:
  _component_: torchtune.models.llama4.llama4_transform
  max_num_tiles: 16
  max_seq_len: null
  path: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct/tokenizer.model

[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank [Gloo] Rank 62 is connected to  is connected to 77 peer ranks.  peer ranks. Expected number of connected peer ranks is : Expected number of connected peer ranks is : 77

[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank [Gloo] Rank 3 is connected to 3 peer ranks. 2Expected number of connected peer ranks is :  is connected to 33
 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank [Gloo] Rank [Gloo] Rank 100 is connected to  is connected to  is connected to 111 peer ranks.  peer ranks.  peer ranks. Expected number of connected peer ranks is : Expected number of connected peer ranks is : Expected number of connected peer ranks is : 111


[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank [Gloo] Rank 1 is connected to 3 peer ranks. 2Expected number of connected peer ranks is :  is connected to 33
 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank [Gloo] Rank 3 is connected to 32 peer ranks.  is connected to Expected number of connected peer ranks is : 33 peer ranks. 
Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank [Gloo] Rank 3 is connected to 23 is connected to  peer ranks. 3Expected number of connected peer ranks is :  peer ranks. 3Expected number of connected peer ranks is : 
3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO:torchtune.utils._logging:Hint: enable_activation_checkpointing is True, but enable_activation_offloading isn't. Enabling activation offloading should reduce memory further.
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
Writing logs to /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full/logs/log_1752696194.txt
INFO:torchtune.utils._logging:Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
INFO:torchtune.utils._logging:Instantiating model and loading checkpoint took 410.16 secs
INFO:torchtune.utils._logging:Memory stats after model init:
	GPU peak memory active: 12.54 GiB
	GPU peak memory alloc: 12.54 GiB
	GPU peak memory reserved: 12.55 GiB
INFO:torchtune.utils._logging:Optimizer is initialized.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}

Error



1|59|Loss: 0.8372806906700134: 100%|██████████████████████████████████████████████████████████| 59/59 [36:29<00:00, 33.91s/it]INFO:torchtune.utils._logging:Saving checkpoint. This may take some time. Retrieving full model state dict...
INFO:torchtune.utils._logging:Getting full model state dict took 70.13 secs
INFO:torchtune.utils._logging:Getting optimizer state dict...
INFO:torchtune.utils._logging:Getting optimizer state dict took 291.04 secs
[rank7]: Traceback (most recent call last):
[rank7]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank7]:     sys.exit(recipe_main())
[rank7]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank7]:     sys.exit(recipe_main(conf))
[rank7]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank7]:     recipe.train()
[rank7]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank7]:     self._checkpoint_client.save_checkpoint(
[rank7]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank7]:     self._save_checkpoint_sync(
[rank7]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank7]:     torch.distributed.barrier()
[rank7]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank7]:     return func(*args, **kwargs)
[rank7]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank7]:     work.wait()
[rank7]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:30610
[rank1]: Traceback (most recent call last):
[rank1]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank1]:     sys.exit(recipe_main())
[rank1]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank1]:     sys.exit(recipe_main(conf))
[rank1]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank1]:     recipe.train()
[rank1]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank1]:     self._checkpoint_client.save_checkpoint(
[rank1]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank1]:     self._save_checkpoint_sync(
[rank1]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank1]:     torch.distributed.barrier()
[rank1]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank1]:     work.wait()
[rank1]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:30610
[rank4]: Traceback (most recent call last):
[rank4]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank4]:     sys.exit(recipe_main())
[rank4]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank4]:     sys.exit(recipe_main(conf))
[rank4]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank4]:     recipe.train()
[rank4]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank4]:     self._checkpoint_client.save_checkpoint(
[rank4]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank4]:     self._save_checkpoint_sync(
[rank4]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank4]:     torch.distributed.barrier()
[rank4]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank4]:     return func(*args, **kwargs)
[rank4]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank4]:     work.wait()
[rank4]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:30610
[rank2]: Traceback (most recent call last):
[rank2]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank2]:     sys.exit(recipe_main())
[rank2]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank2]:     sys.exit(recipe_main(conf))
[rank2]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank2]:     recipe.train()
[rank2]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank2]:     self._checkpoint_client.save_checkpoint(
[rank2]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank2]:     self._save_checkpoint_sync(
[rank2]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank2]:     torch.distributed.barrier()
[rank2]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank2]:     work.wait()
[rank2]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:42181
[rank6]: Traceback (most recent call last):
[rank6]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank6]:     sys.exit(recipe_main())
[rank6]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank6]:     sys.exit(recipe_main(conf))
[rank6]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank6]:     recipe.train()
[rank6]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank6]:     self._checkpoint_client.save_checkpoint(
[rank6]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank6]:     self._save_checkpoint_sync(
[rank6]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank6]:     torch.distributed.barrier()
[rank6]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank6]:     return func(*args, **kwargs)
[rank6]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank6]:     work.wait()
[rank6]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:30610
[rank4]:[W716 20:53:12.395308535 ProcessGroupNCCL.cpp:1566] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank2]:[W716 20:53:12.423445497 ProcessGroupNCCL.cpp:1566] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank6]:[W716 20:53:14.958723379 ProcessGroupNCCL.cpp:1566] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank1]:[W716 20:53:14.022187294 ProcessGroupNCCL.cpp:1566] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank3]: Traceback (most recent call last):
[rank3]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank3]:     sys.exit(recipe_main())
[rank3]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank3]:     sys.exit(recipe_main(conf))
[rank3]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank3]:     recipe.train()
[rank3]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank3]:     self._checkpoint_client.save_checkpoint(
[rank3]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank3]:     self._save_checkpoint_sync(
[rank3]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank3]:     torch.distributed.barrier()
[rank3]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank3]:     work.wait()
[rank3]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:36068
[rank5]: Traceback (most recent call last):
[rank5]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank5]:     sys.exit(recipe_main())
[rank5]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank5]:     sys.exit(recipe_main(conf))
[rank5]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank5]:     recipe.train()
[rank5]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank5]:     self._checkpoint_client.save_checkpoint(
[rank5]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank5]:     self._save_checkpoint_sync(
[rank5]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank5]:     torch.distributed.barrier()
[rank5]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank5]:     return func(*args, **kwargs)
[rank5]:   File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank5]:     work.wait()
[rank5]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:61150
W0716 20:53:26.982000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64189 closing signal SIGTERM
W0716 20:53:26.983000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64190 closing signal SIGTERM
W0716 20:53:26.983000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64191 closing signal SIGTERM
W0716 20:53:26.984000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64192 closing signal SIGTERM
W0716 20:53:26.985000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64193 closing signal SIGTERM
W0716 20:53:26.988000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64194 closing signal SIGTERM
W0716 20:53:26.989000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64195 closing signal SIGTERM
W0716 20:53:56.990000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 64191 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0716 20:54:02.560000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 64192 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0716 20:54:14.628000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -9) local_rank: 0 (pid: 64188) of binary: /opt/conda/envs/torchtune/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/envs/torchtune/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 52, in main
    parser.run(args)
  File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 46, in run
    args.func(args)
  File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
    self._run_distributed(args, is_builtin=is_builtin)
  File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
    run(args)
  File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions