[Req] LSF scheduler support #441
Comments
Hi there, adding a new scheduler to TorchX is quite straightforward. Here are the basic steps:
You can check out the AWS Batch and Slurm scheduler implementations for reference. No need to do anything special for
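Following the suggestion above, a rough sketch of the shape such a scheduler could take. The class, method names, and `bsub` flag mapping below are assumptions modeled on the AWS Batch/Slurm pattern, not the actual `torchx.schedulers` interface, which should be consulted before writing a PR:

```python
# Hypothetical sketch of an LSF scheduler for TorchX. The Role/AppDef
# stand-ins below are simplified; the real types live in torchx.specs.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Role:
    """Minimal stand-in for a TorchX role: one entrypoint plus its args."""
    name: str
    entrypoint: str
    args: List[str] = field(default_factory=list)
    num_replicas: int = 1


@dataclass
class AppDef:
    """Minimal stand-in for a TorchX application definition."""
    name: str
    roles: List[Role]


class LSFScheduler:
    """Translates an AppDef into `bsub` invocations (illustrative only)."""

    def _bsub_args(self, role: Role, cfg: Dict[str, str]) -> List[str]:
        # Map generic run config onto LSF flags; the "queue" key and the
        # flag choices here are assumptions for illustration.
        args = ["bsub", "-J", role.name, "-n", str(role.num_replicas)]
        if "queue" in cfg:
            args += ["-q", cfg["queue"]]
        return args + [role.entrypoint] + role.args

    def submit_dryrun(self, app: AppDef, cfg: Dict[str, str]) -> List[str]:
        # One bsub command per role; a real scheduler would also handle
        # job arrays, environment variables, and container images.
        return [" ".join(self._bsub_args(r, cfg)) for r in app.roles]


app = AppDef("bert", [Role("trainer", "python", ["train.py"], num_replicas=4)])
print(LSFScheduler().submit_dryrun(app, {"queue": "normal"})[0])
# → bsub -J trainer -n 4 -q normal python train.py
```

A real implementation would also register itself so the CLI can discover it by name, the way the Slurm and AWS Batch schedulers do.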
@ckddls1321 Where are you running LSF? Are you trying to use this for work purposes?
@ckddls1321 How are you packaging code for running on NFS? Is it using Podman/Singularity, or just a shared NFS mount like with Slurm? A friend from Oak Ridge National Laboratory pointed me to https://code.ornl.gov/olcf-analytics/summit/distributed-deep-learning-examples/-/tree/master/examples/pytorch/BERT which is an example of how to run BERT on the Summit supercomputer via NFS. Summit supports using Podman, so it maps well to our Docker usage: https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=containers-lsf-podman
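For context, composing an LSF submission by hand that wraps the trainer in a Podman container might look like the sketch below. The queue name, image tag, and mount paths are all assumptions for illustration, not TorchX output:

```shell
# Illustrative only: an LSF bsub command wrapping a Podman container,
# in the spirit of the IBM LSF-Podman docs linked above.
BSUB_CMD='bsub -q normal -n 4 -o bert.%J.log podman run --rm -v /nfs/code:/workspace my-bert-image:latest python /workspace/train.py'
echo "$BSUB_CMD"
```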
I am considering using TorchX for work purposes and for my personal research interests.
Hi,
Landed as part of 6360df3
Description
LSF scheduler support
Does the TorchX team have plans to support the LSF scheduler?
Or is there a guide for extending TorchX? If so, I would be happy to make a PR.
Motivation/Background
Thanks for the TorchX utilities. We can target various schedulers by configuring .torchxconfig.
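For context, a `.torchxconfig` can pin per-scheduler defaults so runs don't need to repeat flags. A sketch along these lines, where the section and key names are assumptions to be checked against the TorchX docs for your scheduler:

```ini
; Hypothetical .torchxconfig: default scheduler plus per-scheduler options.
[cli:run]
scheduler = slurm

[slurm]
partition = gpu
```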
Detailed Proposal
It would be great to support the LSF scheduler as well.