Make batch.id robust to warning messages from sbatch #314
You might want to create an issue for this that references this pull request. At least I tend to miss or forget about PR-only issues over time, and I know other repos like an issue with details where discussions can take place.
Now, I had a look at Line 44 in 7763ed8
That captures both stdout and stderr. It could be that it would be more sane if those two are captured separately, e.g. something like
$ sbatch --time=00:01:00 --mem=128G --wrap="hostname" > stdout.log 2> stderr.log
What do
$ cat stdout.log
$ cat stderr.log
output? With Slurm, you should see "Submitted batch job ..." in |
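The same separation can be sketched from within R. This is only an illustration using base R's `system2()`, which accepts file names for its `stdout` and `stderr` arguments; the helper name `run_split()` is hypothetical and is not part of batchtools or future.batchtools. A small `sh` command stands in for `sbatch` so the sketch runs without a Slurm cluster.

```r
# Hypothetical helper (not part of batchtools): run a command and return
# its stdout and stderr as separate character vectors.
run_split <- function(command, args = character()) {
  out_file <- tempfile(fileext = ".out")
  err_file <- tempfile(fileext = ".err")
  on.exit(unlink(c(out_file, err_file)), add = TRUE)

  # A character string for stdout/stderr redirects that stream to a file
  status <- system2(command, args, stdout = out_file, stderr = err_file)

  list(
    status = status,
    stdout = readLines(out_file, warn = FALSE),
    stderr = readLines(err_file, warn = FALSE)
  )
}

# Stand-in for sbatch: one line on stderr, one line on stdout
fake_sbatch <- "echo notice >&2; echo Submitted batch job 12345"
res <- run_split("sh", c("-c", shQuote(fake_sbatch)))

print(res$stdout)
print(res$stderr)
```

Whether a site-specific message ends up on stdout or stderr depends on how the cluster emits it, so separating the streams may or may not be enough on its own.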
Nice!
It looks like you can do a cleaner fix than what I came up with. |
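I've been prototyping with a more flexible runOSCommand() in my future.batchtools package. It has new arguments stdout and stderr with default stdout = TRUE and stderr = TRUE (backward compatible). The special stderr = NA will capture stderr separately from stdout.
@bwcompton, although it's future.batchtools and not batchtools, could you please give it a spin? If it works, then I can propose this newer runOSCommand() version to batchtools, plus adjustments to makeClusterFunctionSlurm(), which I also patch in future.batchtools.
To try it out, install it as:
remotes::install_github("futureverse/future.batchtools", ref="develop")
and then try it as:
library(future)
plan(future.batchtools::batchtools_slurm)
f <- future({ Sys.info()[["nodename"]] })
v <- value(f)
print(v)
See https://future.batchtools.futureverse.org/reference/batchtools_slurm.html for how to control sbatch resource specifications. |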
Thanks! I tried your code snippet, and it can't find slurm_script. Am I missing something?
Brad
> library(future)
> plan(future.batchtools::batchtools_slurm)
> f <- future({ Sys.info()[["nodename"]] })
> v <- value(f)
Error: Future (<unnamed-1>) of class BatchtoolsSlurmFuture expired, which indicates that it crashed or was killed.
Post-mortem details:
Future state: ‘running’
Batchtools status: ‘defined’, ‘expired’, ‘submitted’
Slurm job ID: [n=1] ‘43049392’
Slurm 'squeue' job status: <empty>
Slurm 'sacct' job status: 43049392|FAILED|1:0
The last few lines of the logged output:
Session information:
- timestamp: 2025-09-12 14:36:54+0000
- hostname: cpu016
- Rscript path:
/var/spool/slurm/slurmd/job43049392/slurm_script: line 20: Rscript: command not found
- Rscript version:
/var/spool/slurm/slurmd/job43049392/slurm_script: line 21: Rscript: command not found
- Rscript library paths:
Rscript -e 'batchtools::doJobCollection()' ...
- job name: 'jobb9686511f15322fe9d3568b52c61e703'
- job log file: '/work/pi_cschweik_umass_edu/marsh_mapping/salt-marsh-mapping/.future/20250912_143653-MdNjCh/batchtools_1109039380/logs/jobb9686511f15322fe9d3568b52c61e703.log'
- job uri: '/work/pi_cschwe
In addition: Warning messages:
1: batchtools::waitForJobs(..., timeout = 2592000) returned FALSE
2: In delete.BatchtoolsFuture(future) :
Will not remove batchtools registry, because the status of the batchtools was ‘error’, ‘defined’, ‘expired’, ‘submitted’ and future backend argument 'delete' is ‘on-success’: ‘/work/pi_cschweik_umass_edu/marsh_mapping/salt-marsh-mapping/.future/20250912_143653-MdNjCh/batchtools_1109039380’
|
R is not available by default in your jobs. Do you load an environment module to get access to R? If so, specify that I'm in the
This is illustrated also in https://future.batchtools.futureverse.org/reference/batchtools_slurm.html If you use other techniques to make R available in a job script, please let me know |
That said, the job submission itself actually worked! It's just that R didn't start, which means the patch works |
Great news that the patch works. Here's what I've got in my template,
|
Unfortunately not possible today; you'd have to create your own custom template file. But, I've created futureverse/future.batchtools#99 to add support for this too. Stay tuned. |
Okay, I'll look forward to future.batchtools in the future. Do you have what you need from me to address the original issue in this PR? |
Yes, I'd like to have a success story over at future.batchtools first, ideally some mileage from other users, and have my patch "ripe" enough, before I "bug" the batchtools maintainers here. So, I'll ping you again over at futureverse/future.batchtools#99 for you to test. Thanks. |
Deal! Thanks so much for your help with this. |
I ran into a crazy bug today: getJobStatus gave me batch.id = "that". It turns out that when I requested a large amount of memory, sbatch returned this um, helpful message:
clusterFunctionsSlurm was pulling the 4th word of the first line, which should have been the Slurm jobid, but instead was "that". It wanted, of course, the last line.
This really isn't a bug in batchtools, as the sysops inserted an informational message in a crazy place. But I suspect if the smart, on the ball people at the UMass Unity cluster are doing this, others probably are too. It'd be nice for batchtools to be robust to such shenanigans. Alternatively, I suppose it could throw an error if batch.id is non-numeric and print the message from sbatch.
My suggested change looks for a line beginning with "Submitted batch job" and pulls the 4th word as the batch.id.
I've tested this change against the following:
as well as against real-life submitJobs calls, both with and without the informational message.
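
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch in this PR: `parse_batch_id()` is a hypothetical name, the notice text in `sample_output` is invented for the example (the real message from the cluster is not reproduced here), and the all-digits check assumes a plain numeric Slurm job ID.

```r
# Hypothetical helper: extract the Slurm job ID from sbatch output,
# ignoring any informational lines around the submission line.
parse_batch_id <- function(output) {
  lines <- trimws(output)
  hit <- grep("^Submitted batch job", lines, value = TRUE)
  if (length(hit) == 0L) {
    stop("No 'Submitted batch job' line in sbatch output:\n",
         paste(output, collapse = "\n"))
  }
  # The 4th whitespace-separated word of that line is the job ID
  id <- strsplit(hit[1L], "[[:space:]]+")[[1L]][4L]
  # Assumption: a valid ID is all digits; otherwise fail loudly and
  # show what sbatch actually printed
  if (is.na(id) || !grepl("^[0-9]+$", id)) {
    stop("Non-numeric batch.id in sbatch output:\n",
         paste(output, collapse = "\n"))
  }
  id
}

# Invented stand-in for a site notice printed before the submission line
sample_output <- c(
  "Note: please ensure that your job needs this much memory",
  "Submitted batch job 12345"
)

# Taking the 4th word of the first line picks up a word from the notice
naive_id <- strsplit(sample_output[1L], " ")[[1L]][4L]
print(naive_id)

# Matching on the submission line finds the real ID
batch_id <- parse_batch_id(sample_output)
print(batch_id)
```

Failing with the full sbatch output when no numeric ID is found covers the alternative mentioned above, so a misparsed ID surfaces at submission time rather than later in getJobStatus.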