
Commit 746feed

Update PartitionableDevices with support for multi-host
1 parent 75d3cc4 commit 746feed

2 files changed (+296, -17 lines changed)

keps/sig-node/4815-dra-partitionable-devices/README.md

+296, -17
@@ -4,13 +4,17 @@
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Dynamic allocation of Multi-Instance GPUs (MIG) on NVIDIA hardware](#dynamic-allocation-of-multi-instance-gpus-mig-on-nvidia-hardware)
  - [Multi-host Tensor Processing Unit (TPU) scheduling](#multi-host-tensor-processing-unit-tpu-scheduling)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Design Details](#design-details)
  - [Extending a device with as set of mixins](#extending-a-device-with-as-set-of-mixins)
  - [Defining device partitions in terms of consumed capacity in a composite device](#defining-device-partitions-in-terms-of-consumed-capacity-in-a-composite-device)
  - [Defining multi-host devices](#defining-multi-host-devices)
  - [Putting it all together for the MIG use-case](#putting-it-all-together-for-the-mig-use-case)
  - [Using DRA for the multi-host use-case](#using-dra-for-the-multi-host-use-case)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
@@ -72,6 +76,11 @@ partitions to be created on demand. This leads to increased resource
utilization as the size of each partitioned device can be matched in real-time
to the workload requesting it.

A device represented in DRA doesn't necessarily have to be a single unit
connected to a single machine; it can also be a logical device composed of
multiple devices connected to multiple machines. As with single-device
partitioning, users might require either the full multi-host device or only a
subset of it.

As DRA has evolved from what we now call "classic" DRA to "structured
parameters" this ability to dynamically partition devices has been lost.
This KEP proposes a method to bring this capability back within the framework
@@ -92,9 +101,12 @@ allocated, rather than requiring them to be created before.

## Motivation

-One of the primary motivating examples for supporting partitionable devices
-with DRA is to enable the dynamic allocation of Multi-Instance GPUs
-(MIG) on NVIDIA hardware. MIG devices are represented as fixed-size partitions
+We have two primary motivating examples for supporting partitionable devices
+with DRA. The first is partitioning a single GPU into smaller partitions; the
+second is multi-host scheduling of interconnected TPUs.
+
+### Dynamic allocation of Multi-Instance GPUs (MIG) on NVIDIA hardware
+
+MIG devices are represented as fixed-size partitions
of a full GPU that consume a portion of its capacity across multiple
dimensions. These dimensions include things like number of JPEG engines, number
of multiprocessors, and the allocation of a specific set of fixed-size memory
@@ -221,7 +233,60 @@ done *after* the scheduler has allocated these devices, keeping the GPU free to
be partitioned in different ways until the actual user-workload requesting them
has been submitted.

### Multi-host Tensor Processing Unit (TPU) scheduling

TPUs are connected to VMs, usually four TPUs per VM. In order to run large
workloads that require multiple TPUs, groups of TPUs can be connected over
a high-speed inter-chip interconnect, which is important for achieving the best
performance. However, not all TPUs in the group are connected to each other,
so we need to consider the topology when we make decisions about the allocation
of TPUs to workloads.

Due to the topology, only certain specific slices of TPUs can be used.
For example, in a 64-TPU node pool there will be 16 VMs, each with 4
TPUs. This allows for a number of possible multi-VM slices of different
sizes:
* 8x8 slice, which provides 64 TPUs across 16 nodes (shown in black)
* 4x8 slices, which provide 32 TPUs across 8 nodes (shown in purple)
* 4x4 slices, which provide 16 TPUs across 4 nodes (shown in green)
* 2x4 slices, which provide 8 TPUs across 2 nodes (shown in red)

![image](tpu-topology.png)

For example, a user can request a 4x4 slice of TPUs with a `ResourceClaim`
like the following:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: tpu-device
spec:
  devices:
    requests:
    - name: 4x4-tpu
      deviceClassName: tpu.google.com
      selectors:
      - cel:
          expression: "device.capacity['google-tpu'].tpus == '16'"
```

There are four "good" allocations for this request:
* All TPUs on nodes 1, 2, 5, and 6.
* All TPUs on nodes 3, 4, 7, and 8.
* All TPUs on nodes 9, 10, 13, and 14.
* All TPUs on nodes 11, 12, 15, and 16.

A request like the one above must be allocated one of the four 4x4 slices
or it should not succeed. A request asking for just 16 TPUs will likely
result in allocation of TPUs across many VMs without the interconnect,
leading to poor performance. So we need to allow users to request a
partition of a device (in this case a 4x4 slice of the full 8x8 TPU device)
and account for the fact that this uses some of the capacity required for
other slices.
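To make the alignment constraint concrete, the short Go sketch below is
illustrative only (it is not part of any API proposed in this KEP) and assumes
the row-major node numbering implied by the list above. It enumerates the
aligned 2x2 blocks of VMs in the 4x4 grid of VMs, which are exactly the four
"good" allocations listed:

```go
package main

import "fmt"

// alignedFourByFourSlices is purely illustrative. It numbers the 16 VMs 1-16
// row-major in a 4x4 grid (4 TPUs per VM) and enumerates the aligned 2x2
// blocks of VMs that can back a 4x4 TPU slice.
func alignedFourByFourSlices() [][]int {
	const gridWidth = 4 // VMs per row
	var slices [][]int
	for row := 0; row < gridWidth; row += 2 {
		for col := 0; col < gridWidth; col += 2 {
			topLeft := row*gridWidth + col + 1
			slices = append(slices, []int{
				topLeft, topLeft + 1,
				topLeft + gridWidth, topLeft + gridWidth + 1,
			})
		}
	}
	return slices
}

func main() {
	// Prints [[1 2 5 6] [3 4 7 8] [9 10 13 14] [11 12 15 16]], matching the
	// four "good" allocations listed above.
	fmt.Println(alignedFourByFourSlices())
}
```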

-With his motivating example in mind. We define the following goals and
+With these motivating examples in mind, we define the following goals and
non-goals of this KEP.

### Goals
@@ -255,9 +320,10 @@ non-goals of this KEP.
The basic idea is the following:

1. Introduce a new device type called `CompositeDevice` which has the same
-  fields as a `BasicDevice`, plus two more. The first is a field called
-  `Includes` and the second is a field called `ConsumesCapacityFrom`. Both
-  full devices and their partitions are represented as instances of this new
+  fields as a `BasicDevice`, plus four additional fields. The first is a field
+  called `Includes` and the second is a field called `ConsumesCapacityFrom`.
+  The last two fields are `NodeName` and `NodeSelector`. Both full devices and
+  their partitions are represented as instances of this new
   `CompositeDevice` type and are listed right next to one another in the
   top-level `Devices` list of a `ResourceSlice`.

@@ -278,13 +344,21 @@ The basic idea is the following:
`ConsumesCapacityFrom` list can reference devices in any `ResourceSlice` in
the same pool. References to devices in other pools are not allowed.

1. The `NodeName` and `NodeSelector` fields describe the node or set of nodes
   where the device is available. This is similar to the `NodeName`,
   `NodeSelector`, and `AllNodes` properties in the `ResourceSlice` spec, but
   this allows for associating individual devices with a node or set of nodes.
   That makes it possible to describe multi-host devices using the
   `ResourceSlice` API. The `NodeName` and `NodeSelector` fields are mutually
   exclusive, and neither can be specified if the `Spec.NodeName` or
   `Spec.NodeSelector` fields are specified on the `ResourceSlice`.
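As a sketch of the intended validation rule (using simplified stand-in types,
not the actual API types or validation code), the mutual exclusion could be
expressed as follows:

```go
package sketch

import "errors"

// Simplified stand-ins for the types discussed in this KEP.
type ResourceSliceSpec struct {
	NodeName     string
	NodeSelector *NodeSelector
	AllNodes     bool
}

type CompositeDevice struct {
	NodeName     string
	NodeSelector *NodeSelector
}

type NodeSelector struct{} // placeholder for core.NodeSelector

// validateDeviceNodeSelection captures the rule described above: a device may
// set at most one of NodeName and NodeSelector, and only when the enclosing
// ResourceSlice does not already pin its devices to nodes.
func validateDeviceNodeSelection(spec ResourceSliceSpec, dev CompositeDevice) error {
	if dev.NodeName != "" && dev.NodeSelector != nil {
		return errors.New("only one of nodeName and nodeSelector may be set on a device")
	}
	if (dev.NodeName != "" || dev.NodeSelector != nil) && !spec.AllNodes {
		return errors.New("per-device node selection requires spec.allNodes on the ResourceSlice")
	}
	return nil
}
```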

With these additions in place, the scheduler has everything it needs to support
-the dynamic allocation of both full devices and their (possibly overlapping)
-fixed-size partitions. That is to say, the scheduler now has the ability to
-"flatten" all devices by applying any mixins from their `Includes` fields as
-well as track any capacities consumed from one device by another through its
-`ConsumesCapacityFrom` field. More details on the actual algorithm the
-scheduler follows to make allocation decisions based on the
+the dynamic allocation of full devices, their (possibly overlapping)
+fixed-size partitions, and multi-host devices. That is to say, the scheduler now
+has the ability to "flatten" all devices by applying any mixins from their
+`Includes` fields as well as track any capacities consumed from one device
+by another through its `ConsumesCapacityFrom` field. More details on the
+actual algorithm the scheduler follows to make allocation decisions based on the
`ConsumesCapacityFrom` field can be found in the Design Details section below.

## Design Details
@@ -418,6 +492,26 @@ type CompositeDevice struct {
	// +listType=atomic
	ConsumesCapacityFrom []DeviceRef `json:"consumesCapacityFrom,omitempty"`

	// NodeName identifies the node where the device is available.
	//
	// Must only be set if Spec.AllNodes is set.
	// Only one or none of NodeName and NodeSelector must be set.
	//
	// +optional
	// +oneOf=DeviceNodeSelection
	NodeName string

	// NodeSelector defines the nodes where the device is available.
	//
	// Must use exactly one term.
	//
	// Must only be set if Spec.AllNodes is set.
	// Only one or none of NodeName and NodeSelector must be set.
	//
	// +optional
	// +oneOf=DeviceNodeSelection
	NodeSelector *core.NodeSelector

	// Attributes defines the set of attributes for this device.
	// The name of each attribute must be unique in that set.
	//
@@ -467,9 +561,10 @@ type DeviceRef struct {
```

As mentioned previously, the main features being added here are (1) the ability
-to include a set of mixins in a device definition, and (2) the ability to
+to include a set of mixins in a device definition, (2) the ability to
express that capacity from one device gets consumed by another device if/when
-the scheduler decides to allocate it.
+the scheduler decides to allocate it, and (3) the ability to define multi-host
+devices.

To simplify the conversation, we discuss each new feature separately, starting
with "mixins" and the new `Includes` field, which allows a set of mixins to
@@ -631,7 +726,7 @@ follows:
"sink" of capacity, pulling from "source" devices in order to satisfy its
own capacity when allocated.

-The scheduler must track the availablity of the "source" device, and
+The scheduler must track the availability of the "source" device, and
pull from it whenever it decides to allocate a "sink" device.

So long as no other devices have been allocated that reference a given "source"
@@ -660,7 +755,124 @@ device it references directly.
The API defines the `ConsumesCapacityFrom` field as a list of `DeviceRef` entries.
While we only allow a single entry in this list, effectively forcing the partitioning
of a device to form a tree of devices that consumes capacity, this can be extended
in the future to allow a device to reference multiple "source" devices.
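As an illustration of the bookkeeping this implies, here is a minimal sketch
with simplified stand-in types (not the actual scheduler implementation or API
types): allocating a "sink" device deducts its declared capacities from the
"source" device it references directly.

```go
package main

import "fmt"

// Simplified stand-in for a composite device; illustration only.
type device struct {
	name                 string
	capacity             map[string]int64
	consumesCapacityFrom string // at most one "source" in the current proposal
}

// remainingCapacity returns what is left of a "source" device's capacity after
// deducting the capacities of every allocated "sink" device that references it
// directly through consumesCapacityFrom.
func remainingCapacity(source device, all map[string]device, allocated []string) map[string]int64 {
	remaining := map[string]int64{}
	for name, qty := range source.capacity {
		remaining[name] = qty
	}
	for _, sinkName := range allocated {
		sink := all[sinkName]
		if sink.consumesCapacityFrom != source.name {
			continue
		}
		for name, qty := range sink.capacity {
			remaining[name] -= qty
		}
	}
	return remaining
}

func main() {
	all := map[string]device{
		"device-a":             {name: "device-a", capacity: map[string]int64{"memory": 16}},
		"device-a-partition-0": {name: "device-a-partition-0", capacity: map[string]int64{"memory": 8}, consumesCapacityFrom: "device-a"},
		"device-a-partition-1": {name: "device-a-partition-1", capacity: map[string]int64{"memory": 8}, consumesCapacityFrom: "device-a"},
	}
	// Allocating one partition leaves 8 of the source's 16 units, so the full
	// source device (which needs all 16) can no longer be allocated.
	fmt.Println(remainingCapacity(all["device-a"], all, []string{"device-a-partition-0"}))
}
```

Since each device names at most one source, the same accounting applies at each
level of the resulting tree of partitions.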
### Defining multi-host devices

An example of a small 4x4 TPU slice with its partitions will look like the
following:

```yaml
kind: ResourceSlice
apiVersion: resource.k8s.io/v1beta1
...
spec:
  allNodes: true
  pool:
    ...
  driver: tpu.dra.example.com
  devices:
  # 4x4 slice
  - name: tpu-4x4-1
    composite:
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-1
            - node-2
            - node-5
            - node-6
      capacity:
        tpus: "16"
      consumesCapacityFrom:
      - name: tpu-4x8-1
  # 2x4 slices
  - name: tpu-2x4-1
    composite:
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-1
            - node-2
      capacity:
        tpus: "8"
      consumesCapacityFrom:
      - name: tpu-4x4-1
  - name: tpu-2x4-2
    composite:
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-5
            - node-6
      capacity:
        tpus: "8"
      consumesCapacityFrom:
      - name: tpu-4x4-1
  # 2x2 slices
  - name: tpu-2x2-1
    composite:
      nodeName: node-1
      capacity:
        tpus: "4"
      consumesCapacityFrom:
      - name: tpu-2x4-1
  - name: tpu-2x2-2
    composite:
      nodeName: node-2
      capacity:
        tpus: "4"
      consumesCapacityFrom:
      - name: tpu-2x4-1
  - name: tpu-2x2-3
    composite:
      nodeName: node-5
      capacity:
        tpus: "4"
      consumesCapacityFrom:
      - name: tpu-2x4-2
  - name: tpu-2x2-4
    composite:
      nodeName: node-6
      capacity:
        tpus: "4"
      consumesCapacityFrom:
      - name: tpu-2x4-2
```

In the example we defined a single 4x4 slice. That means 16 TPUs, and with
4 TPUs per node the device is available across four nodes. The node selector
on the device selects the 4 nodes used by this device. In the example it does
this with the `In` operator on the `kubernetes.io/hostname` key, but it could
also be a regular selector on a single label set on all four nodes.

The `ConsumesCapacityFrom` field declares that the smaller slices are partitions
of the larger one, and as described in the previous section, this allows
the scheduler to understand that allocating a partition of a device has
the consequence of making other partitions unavailable.

When a multi-host device is requested, the workload must have a number of pods
that equals the number of nodes that make up the device. These pods will share
the device, so they must be set up with a shared ResourceClaim. When the scheduler
attempts to schedule the first pod for the workload, it will find a device that
matches the request and allocate it for the ResourceClaim. Once a device has
been allocated for the claim, this also restricts the nodes where other pods using
the device can be scheduled. To make sure that future pods do get scheduled on an
eligible node, the scheduler will use the `nodeName` or `nodeSelector` value from
the device to determine the `nodeSelector` field on the `AllocationResult`
in the `ResourceClaim`, rather than the `nodeName` or `nodeSelector` from the
`ResourceSlice`. This makes sure that all pods sharing the `ResourceClaim` will
be scheduled to the nodes that make up the device.
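To sketch this (illustrative only; the helper and the simplified device type
below are stand-ins, not the actual scheduler code), the node constraint
recorded in the `AllocationResult` could be derived from the allocated device
like this:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// Simplified stand-in for the per-device node fields proposed in this KEP.
type CompositeDevice struct {
	NodeName     string
	NodeSelector *v1.NodeSelector
}

// nodeSelectorForAllocation derives the node constraint to record in the
// ResourceClaim's AllocationResult from the allocated device itself, rather
// than from the enclosing ResourceSlice.
func nodeSelectorForAllocation(dev CompositeDevice) *v1.NodeSelector {
	if dev.NodeName != "" {
		// A single node name becomes a selector on the node object's name.
		return &v1.NodeSelector{
			NodeSelectorTerms: []v1.NodeSelectorTerm{{
				MatchFields: []v1.NodeSelectorRequirement{{
					Key:      "metadata.name",
					Operator: v1.NodeSelectorOpIn,
					Values:   []string{dev.NodeName},
				}},
			}},
		}
	}
	return dev.NodeSelector
}
```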

### Putting it all together for the MIG use-case

@@ -1664,6 +1876,73 @@ devices:
- name: memory-slices-0-7
```

### Using DRA for the multi-host use-case

In order to allocate a 2x4 TPU slice using the ResourceSlice
[shown above](#defining-multi-host-devices), a ResourceClaim like the
following can be used:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: tpu-consumer-resource-claim
spec:
  devices:
    requests:
    - name: tpu-request
      deviceClassName: tpu.google.com
      selectors:
      - cel:
          expression: "device.capacity['tpu.google.com'].tpus == '8'"
```

This simply requests a device with 8 TPUs. Since there are 4 TPUs per node, this
requires two pods, one for each node. A Deployment can be used to create the
necessary number of pods:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tpu-consumer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tpu-consumer
  template:
    metadata:
      labels:
        app: tpu-consumer
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: tpu-consumer
            topologyKey: kubernetes.io/hostname
      resourceClaims:
      - name: "tpu"
        resourceClaimName: tpu-consumer-resource-claim
      containers:
      - name: workload
        image: my-app
        command: ["/bin/program"]
        resources:
          claims:
          - name: "tpu"
```

Since the PodSpec references a ResourceClaim rather than a ResourceClaimTemplate, the
pods will share the ResourceClaim. This also restricts the pods to run on the nodes
that are targeted by the node selector on the allocated device. Now, in order to take
advantage of the TPUs that are connected to the two nodes, the pods need to be
scheduled on separate nodes. The anti-affinity stanza in the PodSpec makes sure this
happens.

### Test Plan

<!--