bwcompton

I ran into a crazy bug today: getJobStatus gave me batch.id = "that". It turns out that when I requested a large amount of memory, sbatch returned this, um, helpful message:

sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory 
because of overhead. Check https://docs.unity.rc.umass.edu/nodes for an appropriate limit.
Submitted batch job 38139957

clusterFunctionsSlurm was pulling the 4th word of the first line, which should have been the Slurm jobid, but instead was "that". It wanted, of course, the last line.

This really isn't a bug in batchtools, as the sysops inserted an informational message in a crazy place. But I suspect if the smart, on the ball people at the UMass Unity cluster are doing this, others probably are too. It'd be nice for batchtools to be robust to such shenanigans. Alternatively, I suppose it could throw an error if batch.id is non-numeric and print the message from sbatch.

My suggested change looks for a line beginning with "Submitted batch job" and pulls the 4th word as the batch.id.

I've tested this change against the following:

output <- 'Submitted batch job 12345678'
output <- 'This is a crazy informational message\nSubmitted batch job 98765432'
output <- 'This is crazy\nand uncalled for\nSubmitted batch job 5555555\nand even more stuff'

as well as against real-life submitJobs calls, both with and without the informational message.
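
For illustration, the approach described above could be sketched in R roughly as follows. This is a minimal sketch, not the actual patch: the helper name parse_sbatch_job_id is hypothetical and is not part of batchtools. It also shows the alternative mentioned earlier, raising an error that includes the sbatch output when no job line is found.

```r
# Hypothetical helper (not a batchtools function): extract the Slurm job id
# from captured sbatch output, ignoring any informational lines around it.
parse_sbatch_job_id <- function(output) {
  # Output may arrive as one string with embedded newlines
  # or as a character vector with one element per line.
  lines <- unlist(strsplit(output, "\n", fixed = TRUE))

  # Keep only lines that start with the expected sbatch confirmation.
  hit <- grep("^Submitted batch job [0-9]+", lines, value = TRUE)
  if (length(hit) == 0L) {
    stop("No 'Submitted batch job' line in sbatch output:\n",
         paste(lines, collapse = "\n"))
  }

  # The job id is the 4th whitespace-separated word of that line.
  strsplit(hit[1L], "[[:space:]]+")[[1L]][4L]
}

parse_sbatch_job_id('This is crazy\nand uncalled for\nSubmitted batch job 5555555\nand even more stuff')
```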

@HenrikBengtsson

HenrikBengtsson commented Sep 5, 2025

You might want to create an issue for this that references this pull request. At least I tend to miss or forget about PR-only issues over time, and I know other repos prefer an issue with details where discussions can take place.

Now, I had a look at runOSCommand(), which is what captures the output per

res = suppressWarnings(system2(command = sys.cmd, args = sys.args, stdin = stdin, stdout = TRUE, stderr = TRUE, wait = TRUE))

That captures both stdout and stderr. It could be that it would be more sane if those two are captured separately, e.g. something like stdout = TRUE and stderr = "error.log", where the expected output should go to stdout and info messages to stderr. To test if that would have helped you, if you do

$ sbatch --time=00:01:00 --mem=128G --wrap="hostname" > stdout.log 2> stderr.log

what does

$ cat stdout.log
$ cat stderr.log

output? With Slurm, you should see "Submitted batch job ..." in stdout.log. Now, my hope is that "sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory because of overhead. Check https://docs.unity.rc.umass.edu/nodes for an appropriate limit." ends up in stderr.log for you.
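
As a rough illustration of the separate-capture idea, the sketch below calls system2() with stdout = TRUE and stderr pointed at a temporary file. It is only a sketch under assumptions: a POSIX sh on the PATH stands in for sbatch, the messages are made up, and this is not how runOSCommand() is currently written.

```r
# Stand-in for sbatch (assumes a POSIX 'sh' is available): the note goes to
# stderr and the job confirmation goes to stdout.
script <- "echo sbatch: INFO: site-specific note >&2; echo Submitted batch job 12345678"

# Send stderr to a temporary file so it cannot mix with the captured stdout.
stderr_file <- tempfile(fileext = ".log")
out <- system2("sh", args = c("-c", shQuote(script)),
               stdout = TRUE, stderr = stderr_file)

# Read the informational messages back separately, then clean up.
err <- readLines(stderr_file)
unlink(stderr_file)

print(out)  # captured stdout only
print(err)  # captured stderr only
```

With the streams split like this, the existing "4th word of the first line" logic would only ever see the stdout lines.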

@bwcompton
Author

Nice!

bcompton_umass_edu@login1:~$ sbatch --time=00:01:00 --mem=128G --wrap="hostname" > stdout.log 2> stderr.log
bcompton_umass_edu@login1:~$ cat stdout.log
Submitted batch job 42933105
bcompton_umass_edu@login1:~$ cat stderr.log
sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory because of overhead. Check https://docs.unity.rc.umass.edu/nodes for an appropriate limit.
bcompton_umass_edu@login1:~$

It looks like you can do a cleaner fix than what I came up with.

@HenrikBengtsson

I've been prototyping a more flexible runOSCommand() in my future.batchtools package. It has new arguments stdout and stderr, defaulting to stdout = TRUE and stderr = TRUE (backward compatible). The special value stderr = NA will capture stderr separately from stdout.

@bwcompton, although it's future.batchtools and not batchtools, could you please give it a spin? If it works, then I can propose this newer runOSCommand() version to batchtools, plus adjustments to makeClusterFunctionsSlurm(), which I also patch in future.batchtools.

To try it out, install it as:

remotes::install_github("futureverse/future.batchtools", ref="develop")

and then try it as:

library(future)
plan(future.batchtools::batchtools_slurm)
f <- future({  Sys.info()[["nodename"]] })
v <- value(f)
print(v)

See https://future.batchtools.futureverse.org/reference/batchtools_slurm.html for how to control sbatch resource specifications.

@bwcompton
Author

bwcompton commented Sep 12, 2025 via email

@HenrikBengtsson

Rscript: command not found

R is not available by default in your jobs. Do you load an environment module to get access to R? If so, specify that in the resources argument, e.g.

plan(future.batchtools::batchtools_slurm, resources = list(modules = "r"))

This is illustrated also in https://future.batchtools.futureverse.org/reference/batchtools_slurm.html

If you use other techniques to make R available in a job script, please let me know

@HenrikBengtsson

That said, the job submission itself actually worked! It's just that R didn't start, which means the patch works

@bwcompton
Author

Great news that the patch works.

Here's what I've got in my template, slurm.tmpl. I'm not sure how to squeeze this into the resources option--this is something I got help with from a sysadmin. It works great with batchtools.

## Call batchtools inside container
module load apptainer/latest
export APPTAINER_BINDPATH="/run/munge,/var/run/munge,/etc/slurm,/var/spool/slurm/slurmd/conf-cache/slurm.conf,$APPTAINER_BINDPATH"

apptainer exec /modules/admin-resources/ood-dev/unity-r_4.4.0.sif Rscript --no-restore --quiet --no-save -e 'batchtools::doJobCollection("<%= uri %>")'

@HenrikBengtsson

> I'm not sure how to squeeze this into the resources option

Unfortunately not possible today; you'd have to create your own custom template file. But, I've created futureverse/future.batchtools#99 to add support for this too. Stay tuned.

@bwcompton
Author

Okay, I'll look forward to future.batchtools in the future.

Do you have what you need from me to address the original issue in this PR?

@HenrikBengtsson

> Do you have what you need from me to address the original issue in this PR?

Yes, I'd like to have a success story over at future.batchtools first, ideally some mileage from other users, and have my patch "ripe" enough, before I "bug" the batchtools maintainers here. So, I'll ping you again over at futureverse/future.batchtools#99 for you to test. Thanks.

@bwcompton
Author

Deal! Thanks so much for your help with this.
