Have you read the Project Process docs?
Summary
- Add support for natively creating ComputeDomain resources for workflows that run on NVL72 or other multi-node NVLink enabled backends.
- Add support for the topology-aware scheduling introduced in KAI Scheduler v0.10, so users can schedule all tasks of a given workflow on the same NVL72 rack.
Person in Charge (PIC)
@ecolternv
Motivation
The current Blackwell generation of GPUs (GB200 and GB300), as well as announced future generations, features multi-node NVLink in NVL72 and larger configurations such as NVL144.
To get full node-to-node performance for multi-node training, workloads scheduled in Kubernetes need to take advantage of NVLink and carefully control how their pods are placed in the cluster so that they end up in the same rack.
Problem
ComputeDomain CRD:
To use multi-node NVLink in Kubernetes, you must create a ComputeDomain resource, and every pod that is part of a given training run must have a resourceClaim that points to the same ComputeDomain. OSMO does not currently support creating and destroying ComputeDomains along with the lifecycle of a workflow.
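A minimal sketch of what workflow-scoped ComputeDomain lifecycle management could look like, using the Python kubernetes client. The group/version (resource.nvidia.com/v1beta1), plural (computedomains), and spec fields (numNodes, channel.resourceClaimTemplate.name) are assumptions based on NVIDIA's GPU DRA driver CRD and should be verified against the driver version deployed in the cluster; the workflow-derived names are hypothetical.

```python
from kubernetes import client, config

# Assumed CRD coordinates for the ComputeDomain resource shipped with NVIDIA's
# GPU DRA driver; verify against the deployed driver version.
GROUP, VERSION, PLURAL = "resource.nvidia.com", "v1beta1", "computedomains"


def create_compute_domain(workflow_name: str, namespace: str, num_nodes: int) -> dict:
    """Create one ComputeDomain per workflow. Every task pod in the workflow
    then references the generated ResourceClaimTemplate so that all pods join
    the same NVLink domain."""
    config.load_kube_config()  # or load_incluster_config() inside a controller
    body = {
        "apiVersion": f"{GROUP}/{VERSION}",
        "kind": "ComputeDomain",
        "metadata": {"name": f"{workflow_name}-compute-domain"},
        "spec": {
            "numNodes": num_nodes,
            # Name of the ResourceClaimTemplate the driver generates; task pods
            # point their resourceClaims at this template.
            "channel": {
                "resourceClaimTemplate": {"name": f"{workflow_name}-nvlink-channel"}
            },
        },
    }
    return client.CustomObjectsApi().create_namespaced_custom_object(
        GROUP, VERSION, namespace, PLURAL, body
    )


def delete_compute_domain(workflow_name: str, namespace: str) -> None:
    """Tear the domain down when the workflow finishes so the claim does not leak."""
    client.CustomObjectsApi().delete_namespaced_custom_object(
        GROUP, VERSION, namespace, PLURAL, f"{workflow_name}-compute-domain"
    )
```

Each task pod would then carry a spec.resourceClaims entry pointing at the generated claim template and reference that claim from its container resources, so the scheduler and DRA driver place all of the workflow's pods into the same NVLink domain.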
Topology Aware Scheduling:
To use multi-node NVLink, pods that wish to communicate over NVLink must be placed in the same NVL72/NVL144 rack by the scheduler.
For very large training runs, users may want full control over how tasks are placed into racks. For example, a training run may use 4-way tensor parallelism, 4-way pipeline parallelism, and 4-way data parallelism. That is 64 pods total: 4 groups of 16, where each group represents a single instance of the model. A user may wish to ensure that each group of 16 pods representing a model instance lands in the same NVL72 rack.
Currently OSMO has some ability to handle topology-aware scheduling through podAffinities added to OSMO pod templates (as sketched below), but this doesn't provide the granular level of control needed for many use cases in an NVLink-enabled cluster.
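For reference, the kind of podAffinity OSMO can inject today looks roughly like the sketch below. The rack-level topology key (nvidia.com/gpu.clique here) and the OSMO label names are assumptions and depend on how nodes and pods are labeled in the cluster.

```python
from kubernetes import client

# Hypothetical rack/NVLink-domain node label; the actual key depends on how
# nodes are labeled in the cluster (e.g. by GPU feature discovery or the admin).
RACK_TOPOLOGY_KEY = "nvidia.com/gpu.clique"


def rack_packing_affinity(workflow_name: str, group_id: str) -> client.V1Affinity:
    """Pack all pods sharing a (workflow, group) label onto nodes that share the
    same rack-level topology key, i.e. the same NVL72 rack."""
    return client.V1Affinity(
        pod_affinity=client.V1PodAffinity(
            required_during_scheduling_ignored_during_execution=[
                client.V1PodAffinityTerm(
                    label_selector=client.V1LabelSelector(
                        match_labels={
                            "osmo.workflow": workflow_name,  # hypothetical label names
                            "osmo.group": group_id,
                        }
                    ),
                    topology_key=RACK_TOPOLOGY_KEY,
                )
            ]
        )
    )
```

An affinity term like this can only say "pack pods carrying this label together"; expressing "split these 64 pods into 4 groups of 16, one group per rack" requires a separate pod template per group, and nothing guarantees the scheduler can find 4 free racks atomically. That gap is what KAI Scheduler's topology-aware scheduling is meant to close.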
References