
Slow checkpoint saving time (6 mins to save an 8B model checkpoint in sync mode) #1301


Description

@vwxyzjn

It takes ~6 minutes to save a checkpoint in non-async (sync) mode. Is this expected?
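For reference, torchtitan's checkpointer sits on top of torch.distributed.checkpoint (DCP). Below is a minimal single-process sketch of what the two modes do, not the actual CheckpointManager code: the Linear layer is just a stand-in for the FSDP-wrapped 8B model, and a real multi-rank run also needs a gloo-enabled process group for async_save.

import torch
import torch.distributed.checkpoint as dcp

# Toy stand-ins for the FSDP-wrapped llama3 8B model and its optimizer.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
state_dict = {"model": model.state_dict(), "optim": optimizer.state_dict()}

# Sync mode: the call blocks until every rank has serialized and written its
# shards, which is the ~375 s pause between step 1 and "Finished saving"
# in the sync log below.
dcp.save(state_dict, checkpoint_id="./outputs/checkpoint/step-1")

# Async mode: tensors are first staged (copied out of GPU memory) in the
# foreground, then written by a background thread; only the ~12 s staging
# blocks the training loop.
future = dcp.async_save(state_dict, checkpoint_id="./outputs/checkpoint/step-2")
# ... training continues ...
future.result()  # wait for the background write before the next save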

Sync mode:

[rank0]:[titan] 2025-06-15 21:31:48,968 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250615-2131
[rank0]:[titan] 2025-06-15 21:31:48,969 - root - INFO - CUDA capacity: NVIDIA H100 80GB HBM3 with 79.10GiB memory
[rank0]:[titan] 2025-06-15 21:31:49,083 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:[titan] 2025-06-15 21:31:49,084 - root - INFO - Applied full activation checkpointing to the model
[rank0]:[titan] 2025-06-15 21:31:49,164 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-06-15 21:31:49,505 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-06-15 21:31:49,505 - root - INFO - CUDA memory usage for model: 3.95GiB(4.99%)
[rank0]:[titan] 2025-06-15 21:31:49,535 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint
[rank0]:[titan] 2025-06-15 21:31:49,535 - root - INFO - Trainer is initialized with local batch size 1, global batch size 64, gradient accumulation steps 8, sequence length 8192, total steps 1000 (warmup 40).
[rank0]:[titan] 2025-06-15 21:31:49,535 - root - INFO - Loading the checkpoint from assets/models/dcp/llama3.1-8B.
[rank0]:[titan] 2025-06-15 21:32:02,935 - root - INFO - [GC] GC collection for checkpoint loading. 0.01 seconds.
[rank0]:[titan] 2025-06-15 21:32:02,935 - root - INFO - Finished loading the checkpoint in 13.40 seconds.
[rank0]:[titan] 2025-06-15 21:32:02,935 - root - INFO - Training starts at step 1.
[rank0]:[titan] 2025-06-15 21:32:15,816 - root - INFO - step:  1  loss:  2.4292  memory: 29.18GiB(36.90%)  tps: 2,452  tflops: 141.98  mfu: 14.36%
[rank0]:[titan] 2025-06-15 21:32:15,816 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-15 21:38:31,430 - root - INFO - [GC] GC collection invoked by checkpointer. 0.04 seconds.
[rank0]:[titan] 2025-06-15 21:38:31,431 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 375.61 seconds.
[rank0]:[titan] 2025-06-15 21:38:31,431 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-06-15 21:40:09,439 - root - INFO - step: 10  loss:  2.3602  memory: 36.65GiB(46.33%)  tps: 1,245  tflops: 72.12  mfu: 7.29%

Async mode:

[rank0]:[titan] 2025-06-15 21:44:35,889 - root - INFO - step:  1  loss:  2.4292  memory: 29.18GiB(36.90%)  tps: 2,327  tflops: 134.74  mfu: 13.62%
[rank0]:[titan] 2025-06-15 21:44:35,890 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-15 21:44:35,898 - root - INFO - [GC] GC collection invoked by checkpointer. 0.01 seconds.
[rank0]:[titan] 2025-06-15 21:44:47,661 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-15 21:44:47,672 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 11.78 seconds.
[rank0]:[titan] 2025-06-15 21:44:47,672 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:/home/ubuntu/code/thirdparty/torchtitan/.venv/lib/python3.13/site-packages/torch/distributed/checkpoint/filesystem.py:111: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
[rank0]:  if tensor.storage().size() != tensor.numel():
[rank0]:[titan] 2025-06-15 21:46:26,319 - root - INFO - step: 10  loss:  2.3601  memory: 36.64GiB(46.33%)  tps: 5,341  tflops: 309.34  mfu: 31.28%

Reproduction: check out #1300 and run

CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" uv run ./run_train.sh \
  --model.tokenizer_path assets/tokenizer/Meta-Llama-3.1-8B-tokenizer.model \
  --training.max_seq_len 131072 \
  --checkpoint.initial_load_path "assets/models/dcp/llama3.1-8B" \
  --profiling.no_enable_profiling \
  --checkpoint.enable_checkpoint \
  --checkpoint.async_mode async \
  --activation_checkpoint.mode full \
  --training.global_batch_size 64 \
  --lr_scheduler.warmup_steps 40 \
  --optimizer.lr 1e-5
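
Equivalently, the checkpoint-related overrides can live in the TOML config instead of on the command line. Assuming the layout of llama3_8b.toml, the relevant sections would look roughly like the sketch below (key names are taken from the CLI flags above; defaults for any omitted keys are an assumption):

[checkpoint]
enable_checkpoint = true
# "async" for the async run above; omit / leave at the default for the sync run
async_mode = "async"
initial_load_path = "assets/models/dcp/llama3.1-8B"

[activation_checkpoint]
mode = "full"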
