[Req] LSF scheduler support #441

Open

ckddls1321 opened this issue Mar 29, 2022 · 6 comments
Labels: enhancement (New feature or request) · module: runner (issues related to the torchx.runner and torchx.scheduler modules) · scheduler-request (New scheduler requests)

Comments

ckddls1321 commented Mar 29, 2022

Description

LSF scheduler support
Does the TorchX team have plans to support the LSF scheduler?
If not, is there a guide for writing scheduler extensions? I would be happy to make a PR.

Motivation/Background

Thanks for the TorchX utilities. We can target various schedulers just by configuring .torchxconfig.

Detailed Proposal

It would be great if TorchX supported the LSF scheduler.

@d4l3k d4l3k added enhancement New feature or request module: runner issues related to the torchx.runner and torchx.scheduler modules labels Mar 30, 2022
kiukchung (Contributor) commented Mar 30, 2022

Hi there, adding a new scheduler to TorchX is quite straightforward. Here are the basic steps:

  1. Subclass the torchx.schedulers.Scheduler interface. There are a few methods you need to implement - the docs on the interface describe what each API should do and what assumptions it makes (see the sketch right after this list).
  2. (Optional) Register the new scheduler implementation in the list of default schedulers. You only need to do this if you want everyone else to have access to the LSF scheduler. Otherwise you can register it only for yourself via python entrypoints as described here (sketched a bit further below).
  3. The unit tests for each file/function you add should go in the **/test directory as {file_name}_test.py; our CI picks up all *_test.py files automatically from **/test directories.
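
For step 1, a minimal sketch of what the subclass could look like, assuming the Scheduler interface in torchx.schedulers.api at the time of writing; LsfScheduler, the "queue" option, and the bsub/bjobs/bkill mapping are illustrative placeholders, so verify the abstract method names and signatures against your TorchX version:

# lsf_scheduler.py -- a minimal sketch, not a working implementation
from typing import Optional

from torchx.schedulers.api import DescribeAppResponse, Scheduler
from torchx.specs import AppDef, AppDryRunInfo, runopts


class LsfScheduler(Scheduler):
    def __init__(self, session_name: str) -> None:
        super().__init__("lsf", session_name)

    def run_opts(self) -> runopts:
        # declares the run configs users can set under [lsf] in .torchxconfig
        opts = runopts()
        opts.add("queue", type_=str, default="normal", help="LSF queue to submit to")
        return opts

    def _submit_dryrun(self, app: AppDef, cfg) -> AppDryRunInfo:
        # translate the AppDef into an LSF submission request
        # (e.g. a bsub command line) without actually submitting it
        raise NotImplementedError

    def schedule(self, dryrun_info: AppDryRunInfo) -> str:
        # submit the request built by _submit_dryrun and return the
        # LSF job id as the app id
        raise NotImplementedError

    def describe(self, app_id: str) -> Optional[DescribeAppResponse]:
        # query the job (e.g. bjobs) and map its state to a TorchX AppState
        raise NotImplementedError

    def _cancel_existing(self, app_id: str) -> None:
        # kill the job (e.g. bkill)
        raise NotImplementedError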

You can check out the AWS Batch and Slurm scheduler implementations for reference.
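
For step 2, registering the scheduler only for yourself via entrypoints could look like the sketch below, assuming a setuptools package; my-torchx-lsf, my_pkg, and create_scheduler are placeholder names:

# setup.py -- a sketch, not a drop-in file
from setuptools import setup

setup(
    name="my-torchx-lsf",
    packages=["my_pkg"],
    entry_points={
        # torchx discovers schedulers registered under this group;
        # the key ("lsf") becomes the scheduler name on the CLI
        "torchx.schedulers": [
            "lsf = my_pkg.lsf_scheduler:create_scheduler",
        ],
    },
)

# my_pkg/lsf_scheduler.py then exposes the factory that torchx calls:
#
#     def create_scheduler(session_name: str, **kwargs) -> LsfScheduler:
#         return LsfScheduler(session_name)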

No need to do anything special for .torchxconfig to pick up the settings for the new scheduler. You can add a section like

# .torchxconfig
[lsf]
runcfg1 = value1
runcfg2 = value2
...
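
Once the section is in place, those run configs are picked up automatically whenever you target the scheduler, e.g.:

$ torchx run -s lsf utils.echo --msg hello

(utils.echo is a TorchX builtin component; -s selects the scheduler.)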

d4l3k (Member) commented Mar 31, 2022

@ckddls1321 where are you running LSF? Are you trying to use this for work purposes?

d4l3k (Member) commented Mar 31, 2022

@ckddls1321 How are you packaging code for running on LSF? Is it using Podman/Singularity, or just a shared NFS mount like Slurm?

A friend from Oak Ridge National Laboratory pointed me to https://code.ornl.gov/olcf-analytics/summit/distributed-deep-learning-examples/-/tree/master/examples/pytorch/BERT which is an example of how to run BERT on the Summit supercomputer via LSF.

Summit supports using Podman, so it maps well to our Docker usage: https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=containers-lsf-podman

ckddls1321 (Author) commented

> @ckddls1321 where are you running LSF? Are you trying to use this for work purposes?

I am considering using TorchX both for work and for my personal research.
Thanks for the suggestion; I will take a look at Podman.
We use the same strategy as Summit does, but we use MPI to launch the distributed processes.

@d4l3k d4l3k added the scheduler-request New scheduler requests label May 12, 2022
takeshi-yoshimura (Contributor) commented

Hi @ckddls1321 @d4l3k @kiukchung,
I created an LSF scheduler with Docker/Singularity and NFS support. Please check my PR #588.

d4l3k (Member) commented Oct 10, 2022

Landed as part of 6360df3
