[rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/simple_grpo/src/train_zero.py", line 275, in <module>
[rank1]: main(training_args, model_args)
[rank1]: File "/workspace/simple_grpo/src/train_zero.py", line 268, in main
[rank1]: trainer.train()
[rank1]: File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2185, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
[rank1]: self._maybe_log_save_evaluate(
[rank1]: File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3035, in _maybe_log_save_evaluate
[rank1]: self._save_checkpoint(model, trial)
[rank1]: File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3160, in _save_checkpoint
[rank1]: shutil.rmtree(checkpoint_dir)
[rank1]: File "/opt/conda/lib/python3.11/shutil.py", line 752, in rmtree
[rank1]: _rmtree_safe_fd(fd, path, onerror)
[rank1]: File "/opt/conda/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
[rank1]: onerror(os.unlink, fullname, sys.exc_info())
[rank1]: File "/opt/conda/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
[rank1]: os.unlink(entry.name, dir_fd=topfd)
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_2.pth'
[rank4]: Traceback (most recent call last):
[rank4]: File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3157, in _save_checkpoint
[rank4]: os.renames(output_dir, checkpoint_dir)
[rank4]: File "<frozen os>", line 272, in renames
[rank4]: FileExistsError: [Errno 17] File exists: 'outputs/Llama-3.1-8B-Instruct-zerox/tmp-checkpoint-g_fa26gf' -> 'outputs/Llama-3.1-8B-Instruct-zerox/checkpoint-2'
[rank4]: During handling of the above exception, another exception occurred:
[rank4]: Traceback (most recent call last):
[rank4]: File "/workspace/simple_grpo/src/train_zero.py", line 275, in <module>
[rank4]: main(training_args, model_args)
[rank4]: File "/workspace/simple_grpo/src/train_zero.py", line 268, in main
[rank4]: trainer.train()
[rank4]: File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2185, in train
[rank4]: return inner_training_loop(
[rank4]: ^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
[rank4]: self._maybe_log_save_evaluate(
[rank4]: File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3035, in _maybe_log_save_evaluate
[rank4]: self._save_checkpoint(model, trial)
[rank4]: File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3160, in _save_checkpoint
[rank4]: shutil.rmtree(checkpoint_dir)
[rank4]: File "/opt/conda/lib/python3.11/shutil.py", line 752, in rmtree
[rank4]: _rmtree_safe_fd(fd, path, onerror)
[rank4]: File "/opt/conda/lib/python3.11/shutil.py", line 683, in _rmtree_safe_fd
[rank4]: onerror(os.rmdir, fullname, sys.exc_info())
[rank4]: File "/opt/conda/lib/python3.11/shutil.py", line 681, in _rmtree_safe_fd
[rank4]: os.rmdir(entry.name, dir_fd=topfd)
[rank4]: FileNotFoundError: [Errno 2] No such file or directory: 'global_step2'
Reproduction
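When training with DeepSpeed, checkpoint saving fails. In the traceback, rank 4 hits a FileExistsError while renaming the temporary checkpoint directory to checkpoint-2, and ranks 1 and 4 then hit FileNotFoundError inside shutil.rmtree(checkpoint_dir). This suggests that several ranks are renaming and deleting the same directory at the same time.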
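To illustrate the suspected race, here is a minimal sketch of a single-writer finalize step. It is not the transformers implementation and not a confirmed fix. The function name `finalize_checkpoint` and its `is_main_process` parameter are hypothetical, chosen only for this example. The idea is that every rank writes its own files into the staging directory, but only one process removes a stale target and performs the rename.

```python
import os
import shutil
import tempfile


def finalize_checkpoint(staging_dir: str, final_dir: str, is_main_process: bool) -> bool:
    """Move staging_dir to final_dir from a single process only.

    Hypothetical helper for illustration. Returns True if this call
    performed the move, False if it was skipped.
    """
    if not is_main_process:
        # Other ranks never rename or delete the shared directory.
        return False
    if os.path.isdir(final_dir):
        # Remove a stale target. ignore_errors=True tolerates entries
        # that disappear while the tree is being walked.
        shutil.rmtree(final_dir, ignore_errors=True)
    # Assumes both paths are on the same filesystem.
    os.replace(staging_dir, final_dir)
    return True
```

In a real distributed run the other ranks would typically wait at a barrier (for example `torch.distributed.barrier()`) before and after this step, so that all per-rank files such as rng_state_2.pth are written before the rename and nobody reads the checkpoint until it is in place.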
Outputs: the rank 1 and rank 4 tracebacks shown above.
System Info
Checklist