Have you read the Project Process docs?
Summary
- Add support for natively creating ComputeDomain resources for workflows that run on NVL72 or other multi-node NVLink enabled backends.
- Add support for the topology-aware scheduling introduced in KAI Scheduler v0.10, so users can schedule all tasks of a given workflow on the same NVL72 rack.
Person in Charge (PIC)
@ecolternv
Motivation
The current Blackwell generation of GPUs (GB200 and GB300), as well as announced future generations, features multi-node NVLink in NVL72 and larger configurations such as NVL144.
To get full node-to-node performance for multi-node training, workloads scheduled in Kubernetes need to take advantage of NVLink and carefully control how their pods are placed in the cluster so that they end up in the same rack.
Problem
ComputeDomain CRD:
To use multi-node NVLink in Kubernetes, you must create a ComputeDomain resource, and every pod that is part of a given training run must have a resourceClaim that points to the same ComputeDomain. OSMO does not currently support creating and destroying ComputeDomains along with the lifecycle of a workflow.
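A minimal sketch of what workflow-scoped ComputeDomain lifecycle management could look like, using the Python kubernetes client. The group/version (resource.nvidia.com/v1beta1), plural (computedomains), and spec fields (numNodes, channel.resourceClaimTemplate.name) are assumptions based on NVIDIA's GPU DRA driver CRD and should be verified against the driver version deployed in the cluster; the workflow-derived names are hypothetical.

```python
from kubernetes import client, config

# Assumed CRD coordinates for the ComputeDomain resource shipped with NVIDIA's
# GPU DRA driver; verify against the deployed driver version.
GROUP, VERSION, PLURAL = "resource.nvidia.com", "v1beta1", "computedomains"


def create_compute_domain(workflow_name: str, namespace: str, num_nodes: int) -> dict:
    """Create one ComputeDomain per workflow. Every task pod in the workflow
    then references the generated ResourceClaimTemplate so that all pods join
    the same NVLink domain."""
    config.load_kube_config()  # or load_incluster_config() inside a controller
    body = {
        "apiVersion": f"{GROUP}/{VERSION}",
        "kind": "ComputeDomain",
        "metadata": {"name": f"{workflow_name}-compute-domain"},
        "spec": {
            "numNodes": num_nodes,
            # Name of the ResourceClaimTemplate the driver generates; task pods
            # point their resourceClaims at this template.
            "channel": {
                "resourceClaimTemplate": {"name": f"{workflow_name}-nvlink-channel"}
            },
        },
    }
    return client.CustomObjectsApi().create_namespaced_custom_object(
        GROUP, VERSION, namespace, PLURAL, body
    )


def delete_compute_domain(workflow_name: str, namespace: str) -> None:
    """Tear the domain down when the workflow finishes so the claim does not leak."""
    client.CustomObjectsApi().delete_namespaced_custom_object(
        GROUP, VERSION, namespace, PLURAL, f"{workflow_name}-compute-domain"
    )
```

Each task pod would then carry a spec.resourceClaims entry pointing at the generated claim template and reference that claim from its container resources, so the scheduler and DRA driver place all of the workflow's pods into the same NVLink domain.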
Topology Aware Scheduling:
To use multi-node NVLink, pods that wish to communicate over NVLink must be placed in the same NVL72/NVL144 rack by the scheduler.
For very large training runs, users may want full control over how tasks are placed into racks. For example, a training run may use 4-way tensor parallelism, 4-way pipeline parallelism, and 4-way data parallelism. That is 64 pods total: 4 groups of 16, where each group represents a single instance of the model. A user may wish to ensure that each group of 16 pods representing a model instance lands in the same NVL72 rack.
Currently OSMO has some ability to handle topology-aware scheduling through podAffinities added to OSMO pod templates (as sketched below), but this doesn't provide the granular level of control needed for many use cases in an NVLink-enabled cluster.
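For reference, the kind of podAffinity OSMO can inject today looks roughly like the sketch below. The rack-level topology key (nvidia.com/gpu.clique here) and the OSMO label names are assumptions and depend on how nodes and pods are labeled in the cluster.

```python
from kubernetes import client

# Hypothetical rack/NVLink-domain node label; the actual key depends on how
# nodes are labeled in the cluster (e.g. by GPU feature discovery or the admin).
RACK_TOPOLOGY_KEY = "nvidia.com/gpu.clique"


def rack_packing_affinity(workflow_name: str, group_id: str) -> client.V1Affinity:
    """Pack all pods sharing a (workflow, group) label onto nodes that share the
    same rack-level topology key, i.e. the same NVL72 rack."""
    return client.V1Affinity(
        pod_affinity=client.V1PodAffinity(
            required_during_scheduling_ignored_during_execution=[
                client.V1PodAffinityTerm(
                    label_selector=client.V1LabelSelector(
                        match_labels={
                            "osmo.workflow": workflow_name,  # hypothetical label names
                            "osmo.group": group_id,
                        }
                    ),
                    topology_key=RACK_TOPOLOGY_KEY,
                )
            ]
        )
    )
```

An affinity term like this can only say "pack pods carrying this label together"; expressing "split these 64 pods into 4 groups of 16, one group per rack" requires a separate pod template per group, and nothing guarantees the scheduler can find 4 free racks atomically. That gap is what KAI Scheduler's topology-aware scheduling is meant to close.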
References