[Req] LSF scheduler support #441

Open

ckddls1321 opened this issue Mar 29, 2022 · 6 comments
Labels: enhancement (New feature or request) · module: runner (issues related to the torchx.runner and torchx.scheduler modules) · scheduler-request (New scheduler requests)

Comments

ckddls1321 commented Mar 29, 2022

Description

LSF scheduler support
Does the TorchX team have plans to support the LSF scheduler?
If not, is there a guide for writing scheduler extensions? I would be happy to make a PR.

Motivation/Background

Thanks for the TorchX utilities. We can target various schedulers just by configuring .torchxconfig.

Detailed Proposal

It would be great if TorchX supported the LSF scheduler.

@d4l3k d4l3k added enhancement New feature or request module: runner issues related to the torchx.runner and torchx.scheduler modules labels Mar 30, 2022
kiukchung (Contributor) commented Mar 30, 2022

Hi there, adding a new scheduler to TorchX is quite straightforward. Here are the basic steps:

  1. Subclass the torchx.schedulers.Scheduler interface. There are a few methods you need to implement - the docs on the interface describe what each API should do and what assumptions it makes (see the sketch right after this list).
  2. (Optional) Register the new scheduler implementation in the list of default schedulers. You only need to do this if you want everyone else to have access to the LSF scheduler. Otherwise you can register it only for yourself via python entrypoints as described here (sketched a bit further below).
  3. The unit tests for each file/function you add should go in the **/test directory as {file_name}_test.py; our CI picks up all *_test.py files automatically from **/test directories.
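
For step 1, a minimal sketch of what the subclass could look like, assuming the Scheduler interface in torchx.schedulers.api at the time of writing; LsfScheduler, the "queue" option, and the bsub/bjobs/bkill mapping are illustrative placeholders, so verify the abstract method names and signatures against your TorchX version:

# lsf_scheduler.py -- a minimal sketch, not a working implementation
from typing import Optional

from torchx.schedulers.api import DescribeAppResponse, Scheduler
from torchx.specs import AppDef, AppDryRunInfo, runopts


class LsfScheduler(Scheduler):
    def __init__(self, session_name: str) -> None:
        super().__init__("lsf", session_name)

    def run_opts(self) -> runopts:
        # declares the run configs users can set under [lsf] in .torchxconfig
        opts = runopts()
        opts.add("queue", type_=str, default="normal", help="LSF queue to submit to")
        return opts

    def _submit_dryrun(self, app: AppDef, cfg) -> AppDryRunInfo:
        # translate the AppDef into an LSF submission request
        # (e.g. a bsub command line) without actually submitting it
        raise NotImplementedError

    def schedule(self, dryrun_info: AppDryRunInfo) -> str:
        # submit the request built by _submit_dryrun and return the
        # LSF job id as the app id
        raise NotImplementedError

    def describe(self, app_id: str) -> Optional[DescribeAppResponse]:
        # query the job (e.g. bjobs) and map its state to a TorchX AppState
        raise NotImplementedError

    def _cancel_existing(self, app_id: str) -> None:
        # kill the job (e.g. bkill)
        raise NotImplementedError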

You can check out the AWS Batch and Slurm scheduler implementations for reference.
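
For step 2, registering the scheduler only for yourself via entrypoints could look like the sketch below, assuming a setuptools package; my-torchx-lsf, my_pkg, and create_scheduler are placeholder names:

# setup.py -- a sketch, not a drop-in file
from setuptools import setup

setup(
    name="my-torchx-lsf",
    packages=["my_pkg"],
    entry_points={
        # torchx discovers schedulers registered under this group;
        # the key ("lsf") becomes the scheduler name on the CLI
        "torchx.schedulers": [
            "lsf = my_pkg.lsf_scheduler:create_scheduler",
        ],
    },
)

# my_pkg/lsf_scheduler.py then exposes the factory that torchx calls:
#
#     def create_scheduler(session_name: str, **kwargs) -> LsfScheduler:
#         return LsfScheduler(session_name)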

No need to do anything special for .torchxconfig to pick up the settings for the new scheduler. You can add a section like

# .torchxconfig
[lsf]
runcfg1 = value1
runcfg2 = value2
...
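
Once the section is in place, those run configs are picked up automatically whenever you target the scheduler, e.g.:

$ torchx run -s lsf utils.echo --msg hello

(utils.echo is a TorchX builtin component; -s selects the scheduler.)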

d4l3k (Member) commented Mar 31, 2022

@ckddls1321 where are you running LSF? Are you trying to use this for work purposes?

d4l3k (Member) commented Mar 31, 2022

@ckddls1321 How are you packaging code for running on LSF? Is it using Podman/Singularity, or just a shared NFS mount like Slurm?

A friend from Oak Ridge National Laboratory pointed me to https://code.ornl.gov/olcf-analytics/summit/distributed-deep-learning-examples/-/tree/master/examples/pytorch/BERT which is an example of how to run BERT on the Summit supercomputer via LSF.

Summit supports using Podman, so it maps well to our Docker usage: https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=containers-lsf-podman

ckddls1321 (Author) commented

> @ckddls1321 where are you running LSF? Are you trying to use this for work purposes?

I am considering using TorchX both for work and for my personal research.
Thanks for the suggestion; I will take a look at Podman.
We use the same strategy as Summit does, but we use MPI to launch the distributed processes.

@d4l3k d4l3k added the scheduler-request New scheduler requests label May 12, 2022
takeshi-yoshimura (Contributor) commented

Hi @ckddls1321 @d4l3k @kiukchung,
I created an LSF scheduler with Docker/Singularity and NFS support. Please check my PR #588.

d4l3k (Member) commented Oct 10, 2022

Landed as part of 6360df3
