Project: Native Support for Multi-node NvLink #206

@ecolternv

Description

Have you read the Project Process docs?

  • Yes, I have read and understood the RFC docs

Summary

  • Add support for natively creating ComputeDomain CRDs for workflows that run on NVL72 or other multi-node NVLink enabled backends.
  • Add support for the topology-aware scheduling introduced in KAI Scheduler v0.10, so users can schedule all tasks of a given workflow on the same NVL72 rack.

Person in Charge (PIC)

@ecolternv

Motivation

The current Blackwell generation of GPUs (GB200 and GB300), as well as announced future GPU generations, features multi-node NVLink in NVL72 and even larger configurations such as NVL144.

To get full node-to-node performance for multi-node training, workloads scheduled in Kubernetes need to take advantage of NVLink and carefully control how their pods are placed across nodes in the cluster so that they end up in the same racks.

Problem

ComputeDomain CRD:
To use multi-node NVLink in Kubernetes, you must create a ComputeDomain CRD, and every pod that is part of a given training run must have a resourceClaim that points to the same ComputeDomain. OSMO does not currently support creating/destroying ComputeDomains tied to the lifecycle of the workflow.
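As a sketch of what native support would need to create, a minimal ComputeDomain resource and a pod that claims a channel from it could look roughly like the following. Field names follow the NVIDIA DRA GPU driver's API, but the exact apiVersion and schema may differ between driver versions, and names like `my-compute-domain` are placeholders:

```yaml
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: my-compute-domain
spec:
  numNodes: 4
  channel:
    resourceClaimTemplate:
      name: my-compute-domain-channel
---
# Every pod in the training run references a claim from the same template,
# placing all of them in the same ComputeDomain.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
spec:
  resourceClaims:
  - name: nvlink-channel
    resourceClaimTemplateName: my-compute-domain-channel
  containers:
  - name: trainer
    image: my-training-image   # placeholder
    resources:
      claims:
      - name: nvlink-channel
```

In this model, OSMO would create the ComputeDomain when the workflow starts and delete it when the workflow finishes, injecting the matching resourceClaim into each task's pod spec.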

Topology Aware Scheduling:
To use multi-node NVLink, pods that wish to communicate over NVLink must be placed in the same NVL72/NVL144 rack by the scheduler.

For very large training runs, users may want full control over the placement of tasks in racks. For example, a training run may be 4x tensor parallel, 4x pipeline parallel, and 4x data parallel. This means there will be 64 pods total: 4 groups of 16, where each group of 16 represents a single instance of the model. A user may wish to ensure that each such group of 16 pods ends up entirely in the same NVL72 rack.
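To make the grouping concrete, here is a small Python sketch that maps each of the 64 global ranks to its model instance, i.e. the group of 16 pods that should share one NVL72 rack. The tensor-parallel-fastest rank layout is an assumption for illustration; real training frameworks vary in how they order ranks:

```python
# Sketch: map a global rank to (dp, pp, tp) coordinates and a rack group,
# assuming ranks are laid out tensor-parallel fastest (an assumption; actual
# frameworks may order ranks differently).
TP, PP, DP = 4, 4, 4  # parallelism degrees from the example above

def coords(rank: int) -> tuple[int, int, int]:
    """Decompose a global rank into (dp, pp, tp) coordinates."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

def rack_group(rank: int) -> int:
    """Each model instance spans TP * PP = 16 consecutive ranks; all 16
    should land in the same NVL72 rack."""
    return rank // (TP * PP)

# Group the 64 ranks: 4 groups of 16, one group per rack.
groups: dict[int, list[int]] = {}
for r in range(TP * PP * DP):
    groups.setdefault(rack_group(r), []).append(r)
```

Under this layout, ranks 0-15 form model instance 0, ranks 16-31 form instance 1, and so on; a topology-aware scheduler would pin each group to one rack.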

Currently, OSMO has some ability to handle topology-aware scheduling via podAffinity rules added to OSMO pod templates, but this does not provide the granular level of control needed for many use cases in an NVLink-enabled cluster.
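For reference, the current podAffinity-based approach looks roughly like this. The topology key is an assumption: the actual rack-level label depends on how the cluster exposes NVLink domains (e.g. `nvidia.com/gpu.clique` or a site-specific rack label), and `training-group` is a placeholder label:

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          training-group: model-instance-0   # placeholder group label
      topologyKey: nvidia.com/gpu.clique     # assumed rack-level topology label
```

This can co-locate pods that share a label, but expressing "split these 64 pods into 4 groups of 16, one group per rack" requires managing per-group labels and pod templates by hand, which is the granularity gap described above.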
