Open
Description
I am running this with nightlies
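Since nightly builds change daily, it helps to pin down the exact package versions behind a report like this one. The snippet below is a minimal sketch using only the standard library. The package list (`torch`, `torchtune`, `torchao`) is an assumption about what is relevant here, not something stated in the report, and `package_versions` is a hypothetical helper name.

```python
from importlib import metadata


def package_versions(names=("torch", "torchtune", "torchao")):
    """Return {distribution name: installed version, or None if not installed}.

    The default names are an assumed list of relevant packages.
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


for name, version in package_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Run it inside the same environment that launches `tune run` and paste the output next to the logs.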
Logs
tune run --nproc_per_node 8 full_finetune_distributed --config recipes/configs/llama4/scout_17B_16E_full.yaml batch_size=4 epochs=10
Running with torchrun...
W0716 20:03:05.675000 63909 site-packages/torch/distributed/run.py:774]
W0716 20:03:05.675000 63909 site-packages/torch/distributed/run.py:774] *****************************************
W0716 20:03:05.675000 63909 site-packages/torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0716 20:03:05.675000 63909 site-packages/torch/distributed/run.py:774] *****************************************
INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:
batch_size: 4
batch_size_val: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: '00050'
  model_type: LLAMA4
  output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
data_parallel_replicate_dim: 1
data_parallel_shard_dim: -1
dataset:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
dataset_val:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: openai
  data_files: /mnt/disks/data/torchtune/data/train_top_1000_ankith.json
  packed: false
  source: json
  split: train[:95%]
  train_on_input: false
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 10
fsdp_cpu_offload: true
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torch.nn.CrossEntropyLoss
  ignore_index: -100
  reduction: mean
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full/logs
model:
  _component_: torchtune.models.llama4.llama4_scout_17b_16e
optimizer:
  _component_: torch.optim.AdamW
  fused: false
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
resume_from_checkpoint: false
run_val_every_n_steps: 60
seed: null
shuffle: true
tensor_parallel_dim: 2
tensor_parallel_plan:
  _component_: torchtune.models.llama4.decoder_only_tp_plan
tokenizer:
  _component_: torchtune.models.llama4.llama4_transform
  max_num_tiles: 16
  max_seq_len: null
  path: /mnt/disks/data/hf_weights/Llama-4-Scout-17B-16E-Instruct/tokenizer.model
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO:torchtune.utils._logging:Hint: enable_activation_checkpointing is True, but enable_activation_offloading isn't. Enabling activation offloading should reduce memory further.
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
INFO:torchtune.utils._logging:Set intra op parallelism no. of threads to 26
Writing logs to /mnt/disks/data/logs/torchtune/llama4_17Bx16E/full/logs/log_1752696194.txt
INFO:torchtune.utils._logging:Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
INFO:torchtune.utils._logging:Instantiating model and loading checkpoint took 410.16 secs
INFO:torchtune.utils._logging:Memory stats after model init:
GPU peak memory active: 12.54 GiB
GPU peak memory alloc: 12.54 GiB
GPU peak memory reserved: 12.55 GiB
INFO:torchtune.utils._logging:Optimizer is initialized.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
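The Gloo peer counts in the log above (7, 3 and 1) line up with the parallel layout implied by the resolved config: 8 processes, `tensor_parallel_dim: 2`, `data_parallel_replicate_dim: 1`, and `data_parallel_shard_dim: -1`. The following is a small arithmetic sketch, assuming that `-1` means "use all remaining ranks" and that each Gloo line reports group size minus one. The helper `infer_mesh` and the group labels are illustrative names, not torchtune APIs.

```python
def infer_mesh(world_size: int, tp: int, dp_replicate: int = 1, dp_shard: int = -1) -> dict:
    """Resolve parallel dimensions for a given world size.

    Assumption: dp_shard == -1 means "infer from the remaining ranks".
    """
    if dp_shard == -1:
        denom = tp * dp_replicate
        if world_size % denom != 0:
            raise ValueError(f"world_size={world_size} is not divisible by {denom}")
        dp_shard = world_size // denom
    if tp * dp_replicate * dp_shard != world_size:
        raise ValueError("parallel dims do not multiply to world_size")
    return {"dp_replicate": dp_replicate, "dp_shard": dp_shard, "tp": tp}


# Values taken from the command line and resolved config in this report.
mesh = infer_mesh(world_size=8, tp=2, dp_replicate=1, dp_shard=-1)

# Each rank in a group of size n is connected to n - 1 peers.
group_sizes = {"world": 8, "dp_shard": mesh["dp_shard"], "tp": mesh["tp"]}
peers = {name: size - 1 for name, size in group_sizes.items()}

print(mesh)   # {'dp_replicate': 1, 'dp_shard': 4, 'tp': 2}
print(peers)  # {'world': 7, 'dp_shard': 3, 'tp': 1}
```

Under these assumptions the process groups were set up as expected, with 4 shard ranks times 2 tensor-parallel ranks, and the connection error appears only later, at the checkpoint barrier.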
Error
1|59|Loss: 0.8372806906700134: 100%|██████████████████████████████████████████████████████████| 59/59 [36:29<00:00, 33.91s/it]
INFO:torchtune.utils._logging:Saving checkpoint. This may take some time. Retrieving full model state dict...
INFO:torchtune.utils._logging:Getting full model state dict took 70.13 secs
INFO:torchtune.utils._logging:Getting optimizer state dict...
INFO:torchtune.utils._logging:Getting optimizer state dict took 291.04 secs
[rank7]: Traceback (most recent call last):
[rank7]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank7]: sys.exit(recipe_main())
[rank7]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank7]: sys.exit(recipe_main(conf))
[rank7]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank7]: recipe.train()
[rank7]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank7]: self._checkpoint_client.save_checkpoint(
[rank7]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank7]: self._save_checkpoint_sync(
[rank7]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank7]: torch.distributed.barrier()
[rank7]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank7]: return func(*args, **kwargs)
[rank7]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank7]: work.wait()
[rank7]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:30610
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank1]: sys.exit(recipe_main())
[rank1]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank1]: sys.exit(recipe_main(conf))
[rank1]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank1]: recipe.train()
[rank1]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank1]: self._checkpoint_client.save_checkpoint(
[rank1]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank1]: self._save_checkpoint_sync(
[rank1]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank1]: torch.distributed.barrier()
[rank1]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank1]: work.wait()
[rank1]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:30610
[rank4]: Traceback (most recent call last):
[rank4]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank4]: sys.exit(recipe_main())
[rank4]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank4]: sys.exit(recipe_main(conf))
[rank4]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank4]: recipe.train()
[rank4]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank4]: self._checkpoint_client.save_checkpoint(
[rank4]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank4]: self._save_checkpoint_sync(
[rank4]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank4]: torch.distributed.barrier()
[rank4]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank4]: return func(*args, **kwargs)
[rank4]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank4]: work.wait()
[rank4]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:30610
[rank2]: Traceback (most recent call last):
[rank2]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank2]: sys.exit(recipe_main())
[rank2]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank2]: sys.exit(recipe_main(conf))
[rank2]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank2]: recipe.train()
[rank2]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank2]: self._checkpoint_client.save_checkpoint(
[rank2]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank2]: self._save_checkpoint_sync(
[rank2]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank2]: torch.distributed.barrier()
[rank2]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank2]: work.wait()
[rank2]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:42181
[rank6]: Traceback (most recent call last):
[rank6]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank6]: sys.exit(recipe_main())
[rank6]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank6]: sys.exit(recipe_main(conf))
[rank6]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank6]: recipe.train()
[rank6]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank6]: self._checkpoint_client.save_checkpoint(
[rank6]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank6]: self._save_checkpoint_sync(
[rank6]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank6]: torch.distributed.barrier()
[rank6]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank6]: return func(*args, **kwargs)
[rank6]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank6]: work.wait()
[rank6]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:30610
[rank4]:[W716 20:53:12.395308535 ProcessGroupNCCL.cpp:1566] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank2]:[W716 20:53:12.423445497 ProcessGroupNCCL.cpp:1566] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank6]:[W716 20:53:14.958723379 ProcessGroupNCCL.cpp:1566] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank1]:[W716 20:53:14.022187294 ProcessGroupNCCL.cpp:1566] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank3]: Traceback (most recent call last):
[rank3]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank3]: sys.exit(recipe_main())
[rank3]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank3]: sys.exit(recipe_main(conf))
[rank3]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank3]: recipe.train()
[rank3]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank3]: self._checkpoint_client.save_checkpoint(
[rank3]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank3]: self._save_checkpoint_sync(
[rank3]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank3]: torch.distributed.barrier()
[rank3]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank3]: work.wait()
[rank3]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:36068
[rank5]: Traceback (most recent call last):
[rank5]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1117, in <module>
[rank5]: sys.exit(recipe_main())
[rank5]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank5]: sys.exit(recipe_main(conf))
[rank5]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1112, in recipe_main
[rank5]: recipe.train()
[rank5]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1075, in train
[rank5]: self._checkpoint_client.save_checkpoint(
[rank5]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 393, in save_checkpoint
[rank5]: self._save_checkpoint_sync(
[rank5]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/training/checkpointing/_checkpoint_client.py", line 355, in _save_checkpoint_sync
[rank5]: torch.distributed.barrier()
[rank5]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank5]: return func(*args, **kwargs)
[rank5]: File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4820, in barrier
[rank5]: work.wait()
[rank5]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [10.138.0.25]:61150
W0716 20:53:26.982000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64189 closing signal SIGTERM
W0716 20:53:26.983000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64190 closing signal SIGTERM
W0716 20:53:26.983000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64191 closing signal SIGTERM
W0716 20:53:26.984000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64192 closing signal SIGTERM
W0716 20:53:26.985000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64193 closing signal SIGTERM
W0716 20:53:26.988000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64194 closing signal SIGTERM
W0716 20:53:26.989000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 64195 closing signal SIGTERM
W0716 20:53:56.990000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 64191 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0716 20:54:02.560000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 64192 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0716 20:54:14.628000 63909 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -9) local_rank: 0 (pid: 64188) of binary: /opt/conda/envs/torchtune/bin/python3.10
Traceback (most recent call last):
File "/opt/conda/envs/torchtune/bin/tune", line 8, in <module>
sys.exit(main())
File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 52, in main
parser.run(args)
File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 46, in run
args.func(args)
File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
self._run_distributed(args, is_builtin=is_builtin)
File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
run(args)
File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/envs/torchtune/lib/python3.10/site-packages/recipes/full_finetune_distributed.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------