why won't auto resubmit work? #3351
-
After the job finishes, does it terminate completely, or is there any process still running in the background? And did you observe checkpoints being saved?
-
Yes, the checkpoint is saved. Also, the job terminates completely, and the Slurm log file says, "Job cancelled because time value exceeded".
-
@vr25 Would you mind adding the important snippets of your code here, so that it also helps others in the future?
-
@Borda Both the code and the Slurm script are linked in the reported bug description. Please let me know if you cannot access them. Thanks!
-
@vr25 I know this is not related to the issue reported, but looking at your script here I noticed that your main script is not guarded by
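Assuming the guard meant here is the usual Python entry-point idiom, `if __name__ == "__main__":`, a minimal sketch looks like this. `main` and its return value are hypothetical placeholders, not the code from the linked script.

```python
def main():
    # Hypothetical placeholder: build the model and the trainer here,
    # then call trainer.fit(model).
    return "training finished"


if __name__ == "__main__":
    # Runs only when this file is executed directly, not when it is
    # imported (for example by worker processes that re-import the module).
    print(main())
```

With the guard in place, importing the module defines `main` without starting a training run.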
-
@awaelchli Yes, you're right, I just got the code from here. I'm sorry, but which last line is commented out? In the code or in the Slurm script?
-
Here:
trainer = pl.Trainer(max_epochs=1000, gpus=1) #4, num_nodes=4, distributed_backend='ddp',)
Everything after the # is commented out. The ddp settings should be included, I'm pretty sure.
-
@vr25 In your sbatch options, you have this:
#SBATCH --time=00:00:02
Seems to me like you're asking for 2 seconds rather than 2 minutes, which would explain why it's not working. There's no time for the code to kick in at this speed 😛 You should try a time limit of 2 minutes instead.
Edit: typo
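To make the difference concrete: per the sbatch documentation, `--time` accepts `minutes`, `minutes:seconds`, `hours:minutes:seconds`, `days-hours`, `days-hours:minutes` and `days-hours:minutes:seconds`. The helper below is a rough sketch of that parsing under those assumptions; `slurm_time_to_seconds` is a hypothetical name and not part of Slurm or Lightning.

```python
def slurm_time_to_seconds(spec: str) -> int:
    """Convert an sbatch --time string into a number of seconds."""
    days = 0
    if "-" in spec:
        # days-hours[:minutes[:seconds]]
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        if len(parts) > 3:
            raise ValueError(f"unrecognised time spec: {spec!r}")
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) == 1:        # minutes
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:      # minutes:seconds
            hours, minutes, seconds = 0, parts[0], parts[1]
        elif len(parts) == 3:      # hours:minutes:seconds
            hours, minutes, seconds = parts
        else:
            raise ValueError(f"unrecognised time spec: {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(slurm_time_to_seconds("00:00:02"))  # the value from the job script
print(slurm_time_to_seconds("00:02:00"))  # two minutes in the same format
```

Under that reading, a 2-minute limit would be written as `#SBATCH --time=00:02:00`.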
-
Oh!! Right, thanks for pointing it out.
-
@nathanpainchaud you have good eyes!!
-
If the hint by @nathanpainchaud did not work, let us know and we can reopen.
-
@awaelchli (just tagged based on the previous issue here)
🐛 Bug
Auto resubmit doesn't seem to work.
To Reproduce
Steps to reproduce the behavior:
sbatch sample_05_gpu.sh
Code sample
Please take a look at my code and job script
Expected behavior
Should checkpoint and restart from the last checkpoint every 2 minutes.
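For context, the Lightning docs describe auto resubmit as signal-driven: the job script asks Slurm to send a signal shortly before the wall time ends (the documented example is `#SBATCH --signal=SIGUSR1@90`), and on that signal the trainer saves a checkpoint and requeues the job. The sketch below illustrates that pattern with the standard library only. It is not Lightning's implementation, and `make_requeue_handler`, `install`, `save_checkpoint` and `requeue` are hypothetical names.

```python
import os
import signal
import subprocess


def make_requeue_handler(save_checkpoint, requeue=None):
    """Build a signal handler that checkpoints, then asks Slurm to requeue."""

    def default_requeue(job_id):
        # `scontrol requeue <job_id>` puts the job back in the queue.
        subprocess.call(["scontrol", "requeue", job_id])

    requeue = requeue or default_requeue

    def handler(signum, frame):
        save_checkpoint()
        job_id = os.environ.get("SLURM_JOB_ID")
        if job_id is not None:
            requeue(job_id)

    return handler


def install(save_checkpoint):
    # Register the handler for the signal requested via #SBATCH --signal.
    signal.signal(signal.SIGUSR1, make_requeue_handler(save_checkpoint))
```

The point of the sketch is that the signal has to arrive while the process is still running, so the wall time must be long enough for training to start and for the handler to finish.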
Environment
torch = 1.3.1
pytorch_lightning = 0.9.0