Load/Saving Checkpoint Fails using DeepSpeed - GRPO #2787

Open
zaddy6 opened this issue Feb 6, 2025 · 0 comments
Labels
🐛 bug (Something isn't working) · 🚀 deepspeed (Related to deepspeed)

Comments

zaddy6 commented Feb 6, 2025

Reproduction

When training with DeepSpeed (ZeRO-3), checkpoint saving fails:

accelerate launch --num_processes 7 --config_file configs/zero3.yaml src/train_zerox.py \
    --output_dir outputs/Llama-3.1-8B-Instruct-zerox \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --max_prompt_length 512 \
    --max_completion_length 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 3e-6 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.1 \
    --warmup_ratio 0.1 \
    --logging_steps 1 \
    --num_generations 2 \
    --save_steps 2 \
    --max_steps 1000 \
    --torch_dtype bfloat16 \
    --use_vllm \
    --vllm_gpu_memory_utilization 0.7 \
    --bf16

outputs:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/simple_grpo/src/train_zero.py", line 275, in <module>
[rank1]:     main(training_args, model_args)
[rank1]:   File "/workspace/simple_grpo/src/train_zero.py", line 268, in main
[rank1]:     trainer.train()
[rank1]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2185, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
[rank1]:     self._maybe_log_save_evaluate(
[rank1]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3035, in _maybe_log_save_evaluate
[rank1]:     self._save_checkpoint(model, trial)
[rank1]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3160, in _save_checkpoint
[rank1]:     shutil.rmtree(checkpoint_dir)
[rank1]:   File "/opt/conda/lib/python3.11/shutil.py", line 752, in rmtree
[rank1]:     _rmtree_safe_fd(fd, path, onerror)
[rank1]:   File "/opt/conda/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
[rank1]:     onerror(os.unlink, fullname, sys.exc_info())
[rank1]:   File "/opt/conda/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
[rank1]:     os.unlink(entry.name, dir_fd=topfd)
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_2.pth'
[rank4]: Traceback (most recent call last):
[rank4]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3157, in _save_checkpoint
[rank4]:     os.renames(output_dir, checkpoint_dir)
[rank4]:   File "<frozen os>", line 272, in renames
[rank4]: FileExistsError: [Errno 17] File exists: 'outputs/Llama-3.1-8B-Instruct-zerox/tmp-checkpoint-g_fa26gf' -> 'outputs/Llama-3.1-8B-Instruct-zerox/checkpoint-2'

[rank4]: During handling of the above exception, another exception occurred:

[rank4]: Traceback (most recent call last):
[rank4]:   File "/workspace/simple_grpo/src/train_zero.py", line 275, in <module>
[rank4]:     main(training_args, model_args)
[rank4]:   File "/workspace/simple_grpo/src/train_zero.py", line 268, in main
[rank4]:     trainer.train()
[rank4]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2185, in train
[rank4]:     return inner_training_loop(
[rank4]:            ^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
[rank4]:     self._maybe_log_save_evaluate(
[rank4]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3035, in _maybe_log_save_evaluate
[rank4]:     self._save_checkpoint(model, trial)
[rank4]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3160, in _save_checkpoint
[rank4]:     shutil.rmtree(checkpoint_dir)
[rank4]:   File "/opt/conda/lib/python3.11/shutil.py", line 752, in rmtree
[rank4]:     _rmtree_safe_fd(fd, path, onerror)
[rank4]:   File "/opt/conda/lib/python3.11/shutil.py", line 683, in _rmtree_safe_fd
[rank4]:     onerror(os.rmdir, fullname, sys.exc_info())
[rank4]:   File "/opt/conda/lib/python3.11/shutil.py", line 681, in _rmtree_safe_fd
[rank4]:     os.rmdir(entry.name, dir_fd=topfd)
[rank4]: FileNotFoundError: [Errno 2] No such file or directory: 'global_step2'
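
The training script itself is not included above. For reference, here is a minimal sketch of what it could look like, assuming the standard TRL GRPOTrainer setup; the dataset and reward function below are placeholders rather than the ones actually used:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer, ModelConfig, TrlParser


def reward_len(completions, **kwargs):
    # Placeholder reward: prefer completions close to 100 characters.
    return [-abs(100 - len(completion)) for completion in completions]


def main(training_args, model_args):
    # Placeholder dataset; the dataset used in the original script is not shown in the report.
    dataset = load_dataset("trl-lib/tldr", split="train")
    trainer = GRPOTrainer(
        model=model_args.model_name_or_path,
        reward_funcs=reward_len,
        args=training_args,
        train_dataset=dataset,
    )
    trainer.train()


if __name__ == "__main__":
    # GRPOConfig and ModelConfig pick up the CLI flags passed to accelerate launch above.
    parser = TrlParser((GRPOConfig, ModelConfig))
    training_args, model_args = parser.parse_args_and_config()
    main(training_args, model_args)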

System Info

  • Platform: Linux-5.15.0-130-generic-x86_64-with-glibc2.35
  • Python version: 3.11.9
  • PyTorch version: 2.5.1
  • CUDA device(s): 8× NVIDIA H100 80GB HBM3
  • Transformers version: 4.48.2
  • Accelerate version: 1.3.0
  • Accelerate config: not found
  • Datasets version: 3.2.0
  • HF Hub version: 0.28.1
  • TRL version: 0.15.0.dev0
  • bitsandbytes version: not installed
  • DeepSpeed version: not installed
  • Diffusers version: not installed
  • Liger-Kernel version: 0.5.2
  • LLM-Blender version: not installed
  • OpenAI version: 1.61.1
  • PEFT version: 0.14.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
  • Any traceback provided is complete
github-actions bot added the 🐛 bug and 🚀 deepspeed labels on Feb 6, 2025
zaddy6 changed the title from "Load/Savings Checkpoint Fails using DeepSpeed" to "Load/Savings Checkpoint Fails using DeepSpeed - GRPO" on Feb 6, 2025