Description
It takes ~6 minutes to save a checkpoint in non-async mode. Is this expected?
Sync mode:
[rank0]:[titan] 2025-06-15 21:31:48,968 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250615-2131
[rank0]:[titan] 2025-06-15 21:31:48,969 - root - INFO - CUDA capacity: NVIDIA H100 80GB HBM3 with 79.10GiB memory
[rank0]:[titan] 2025-06-15 21:31:49,083 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:[titan] 2025-06-15 21:31:49,084 - root - INFO - Applied full activation checkpointing to the model
[rank0]:[titan] 2025-06-15 21:31:49,164 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-06-15 21:31:49,505 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-06-15 21:31:49,505 - root - INFO - CUDA memory usage for model: 3.95GiB(4.99%)
[rank0]:[titan] 2025-06-15 21:31:49,535 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint
[rank0]:[titan] 2025-06-15 21:31:49,535 - root - INFO - Trainer is initialized with local batch size 1, global batch size 64, gradient accumulation steps 8, sequence length 8192, total steps 1000 (warmup 40).
[rank0]:[titan] 2025-06-15 21:31:49,535 - root - INFO - Loading the checkpoint from assets/models/dcp/llama3.1-8B.
[rank0]:[titan] 2025-06-15 21:32:02,935 - root - INFO - [GC] GC collection for checkpoint loading. 0.01 seconds.
[rank0]:[titan] 2025-06-15 21:32:02,935 - root - INFO - Finished loading the checkpoint in 13.40 seconds.
[rank0]:[titan] 2025-06-15 21:32:02,935 - root - INFO - Training starts at step 1.
[rank0]:[titan] 2025-06-15 21:32:15,816 - root - INFO - step: 1 loss: 2.4292 memory: 29.18GiB(36.90%) tps: 2,452 tflops: 141.98 mfu: 14.36%
[rank0]:[titan] 2025-06-15 21:32:15,816 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-15 21:38:31,430 - root - INFO - [GC] GC collection invoked by checkpointer. 0.04 seconds.
[rank0]:[titan] 2025-06-15 21:38:31,431 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 375.61 seconds.
[rank0]:[titan] 2025-06-15 21:38:31,431 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-06-15 21:40:09,439 - root - INFO - step: 10 loss: 2.3602 memory: 36.65GiB(46.33%) tps: 1,245 tflops: 72.12 mfu: 7.29%
Async mode:
[rank0]:[titan] 2025-06-15 21:44:35,889 - root - INFO - step: 1 loss: 2.4292 memory: 29.18GiB(36.90%) tps: 2,327 tflops: 134.74 mfu: 13.62%
[rank0]:[titan] 2025-06-15 21:44:35,890 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-15 21:44:35,898 - root - INFO - [GC] GC collection invoked by checkpointer. 0.01 seconds.
[rank0]:[titan] 2025-06-15 21:44:47,661 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-15 21:44:47,672 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 11.78 seconds.
[rank0]:[titan] 2025-06-15 21:44:47,672 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:/home/ubuntu/code/thirdparty/torchtitan/.venv/lib/python3.13/site-packages/torch/distributed/checkpoint/filesystem.py:111: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
[rank0]: if tensor.storage().size() != tensor.numel():
[rank0]:[titan] 2025-06-15 21:46:26,319 - root - INFO - step: 10 loss: 2.3601 memory: 36.64GiB(46.33%) tps: 5,341 tflops: 309.34 mfu: 31.28%
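For reference, the "(or staging if async is enabled)" wording in the log suggests the 11.78 s in async mode only covers staging, while the 375.61 s in sync mode covers the full write to disk, so the two numbers are not directly comparable. Below is a minimal single-rank sketch of my own (not torchtitan's checkpointer) contrasting the blocking and async paths of torch.distributed.checkpoint in recent PyTorch; the checkpoint_id paths and tensor sizes are placeholders.

```python
# Sketch: blocking dcp.save vs dcp.async_save (staging returns quickly,
# the actual write continues in the background).
import os
import time

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp


def main():
    # Single-process gloo group so DCP has a CPU-capable process group.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    # Stand-in for model/optimizer state; a real 8B run saves tens of GiB.
    state_dict = {"weights": torch.randn(1024, 1024)}

    # Blocking path: returns only after the shards are written to disk.
    t0 = time.perf_counter()
    dcp.save(state_dict, checkpoint_id="ckpt_sync")
    print(f"dcp.save blocked for {time.perf_counter() - t0:.2f}s")

    # Async path: blocks only while the state dict is staged; the write
    # happens in the background and the returned future resolves later.
    t0 = time.perf_counter()
    future = dcp.async_save(state_dict, checkpoint_id="ckpt_async")
    print(f"dcp.async_save returned after {time.perf_counter() - t0:.2f}s (staging only)")
    future.result()  # wait for the background write before the next save

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```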
Reproduction: check out #1300 and run
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" uv run ./run_train.sh \
--model.tokenizer_path assets/tokenizer/Meta-Llama-3.1-8B-tokenizer.model \
--training.max_seq_len 131072 \
--checkpoint.initial_load_path "assets/models/dcp/llama3.1-8B" \
--profiling.no_enable_profiling \
--checkpoint.enable_checkpoint \
--checkpoint.async_mode async \
--activation_checkpoint.mode full \
--training.global_batch_size 64 \
--lr_scheduler.warmup_steps 40 \
--optimizer.lr 1e-5
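To judge whether ~6 minutes is expected, it helps to relate the checkpoint's on-disk size to the logged save time. A rough sketch, assuming the checkpoint lands in a step-numbered folder under ./outputs/checkpoint (the folder name here is a guess, adjust to whatever the run actually produces):

```python
# Back-of-envelope write throughput for the blocking save.
from pathlib import Path

checkpoint_dir = Path("./outputs/checkpoint/step-1")  # hypothetical step folder
save_seconds = 375.61  # from the sync-mode log above

total_bytes = sum(f.stat().st_size for f in checkpoint_dir.rglob("*") if f.is_file())
print(f"checkpoint size: {total_bytes / 2**30:.1f} GiB")
print(f"effective write throughput: {total_bytes / 2**20 / save_seconds:.0f} MiB/s")
```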