- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Dynamic allocation of Multi-Instance GPUs (MIG) on NVIDIA hardware](#dynamic-allocation-of-multi-instance-gpus-mig-on-nvidia-hardware)
  - [Multi-host Tensor Processing Unit (TPU) scheduling](#multi-host-tensor-processing-unit-tpu-scheduling)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Design Details](#design-details)
  - [Extending a device with as set of mixins](#extending-a-device-with-as-set-of-mixins)
  - [Defining device partitions in terms of consumed capacity in a composite device](#defining-device-partitions-in-terms-of-consumed-capacity-in-a-composite-device)
  - [Defining multi-host devices](#defining-multi-host-devices)
  - [Putting it all together for the MIG use-case](#putting-it-all-together-for-the-mig-use-case)
  - [Using DRA for the multi-host use-case](#using-dra-for-the-multi-host-use-case)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
partitions to be created on demand. This leads to increased resource
utilization as the size of each partitioned device can be matched in real-time
to the workload requesting it.

Devices represented in DRA don't necessarily have to be a single unit connected
to a single machine; they can also be logical devices composed of multiple
devices connected to multiple machines. As with single-device partitioning,
users might require either the full multi-host device or only a subset of it.

As DRA has evolved from what we now call "classic" DRA to "structured
parameters", this ability to dynamically partition devices has been lost.
This KEP proposes a method to bring this capability back within the framework
of structured parameters, allowing such partitions to be created on demand as
they are allocated, rather than requiring them to be created before.
## Motivation

We have two primary motivating examples for supporting partitionable devices
with DRA. The first is the partitioning of a single GPU into smaller
partitions; the second is multi-host scheduling of interconnected TPUs.

### Dynamic allocation of Multi-Instance GPUs (MIG) on NVIDIA hardware

MIG devices are represented as fixed-size partitions
of a full GPU that consume a portion of its capacity across multiple
dimensions. These dimensions include things like number of JPEG engines, number
of multiprocessors, and the allocation of a specific set of fixed-size memory
slices.
done *after* the scheduler has allocated these devices, keeping the GPU free to
be partitioned in different ways until the actual user-workload requesting them
has been submitted.

### Multi-host Tensor Processing Unit (TPU) scheduling

TPUs are connected to VMs, usually four TPUs per VM. In order to run large
workloads that require multiple TPUs, groups of TPUs can be connected over
a high-speed inter-chip interconnect, which is important for achieving the best
performance. However, not all TPUs in the group are connected to each other,
so we need to consider the topology when making decisions about the allocation
of TPUs to workloads.

Due to the topology, only certain specific slices of TPUs can be used.
For example, in a 64-TPU node pool there will be 16 VMs, each with 4
TPUs. This allows for a number of possible multi-VM slices of different
sizes:
* 8x8 slice, which provides 64 TPUs across 16 nodes (shown in black)
* 4x8 slices, each of which provides 32 TPUs across 8 nodes (shown in purple)
* 4x4 slices, each of which provides 16 TPUs across 4 nodes (shown in green)
* 2x4 slices, each of which provides 8 TPUs across 2 nodes (shown in red)

![image](tpu-topology.png)

For example, a user can request a 4x4 slice of TPUs with a `ResourceClaim`
like the following:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: tpu-device
spec:
  devices:
    requests:
    - name: 4x4-tpu
      deviceClassName: tpu.google.com
      selectors:
      - cel:
          expression: "device.capacity['google-tpu'].tpus == '16'"
```

There are four "good" allocations for this request:
* All TPUs on nodes 1, 2, 5, and 6.
* All TPUs on nodes 3, 4, 7, and 8.
* All TPUs on nodes 9, 10, 13, and 14.
* All TPUs on nodes 11, 12, 15, and 16.

A request like the one above must be allocated one of the four 4x4 slices,
or it should not succeed. A request asking for just 16 TPUs will likely
result in an allocation of TPUs spread across many VMs without the
interconnect, leading to poor performance. So we need to allow users to
request a partition of a device (in this case an 8x8 slice of TPUs) and
account for the fact that this uses some of the capacity required for other
slices.
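
For contrast, here is a sketch of the kind of topology-unaware request the
paragraph above warns about. This is illustrative only: it assumes a
hypothetical driver that publishes every TPU as its own single-TPU device, and
it asks for a count of such devices rather than for one device that spans a
whole slice, so nothing in the request ties the allocated TPUs to the
interconnect topology.

```yaml
# Hypothetical, topology-unaware request: 16 individual TPU devices.
# The allocated TPUs may end up spread across arbitrary nodes, with no
# guarantee that they share an inter-chip interconnect.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: any-16-tpus
spec:
  devices:
    requests:
    - name: tpus
      deviceClassName: tpu.google.com
      allocationMode: ExactCount
      count: 16
```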

With these motivating examples in mind, we define the following goals and
non-goals of this KEP.

### Goals
The basic idea is the following:

1. Introduce a new device type called `CompositeDevice` which has the same
   fields as a `BasicDevice`, plus four additional fields. The first is a field
   called `Includes` and the second is a field called `ConsumesCapacityFrom`.
   The last two fields are `NodeName` and `NodeSelector`. Both full devices and
   their partitions are represented as instances of this new `CompositeDevice`
   type and are listed right next to one another in the top-level `Devices`
   list of a `ResourceSlice`.
   `ConsumesCapacityFrom` list can reference devices in any `ResourceSlice` in
   the same pool. References to devices in other pools are not allowed.

1. The `NodeName` and `NodeSelector` fields describe the node or set of nodes
   where the device is available. This is similar to the `NodeName`,
   `NodeSelector`, and `AllNodes` properties in the `ResourceSlice` spec, but
   it allows individual devices to be associated with specific nodes. That
   makes it possible to describe multi-host devices using the `ResourceSlice`
   API. The `NodeName` and `NodeSelector` fields are mutually exclusive, and
   neither can be specified if the `Spec.NodeName` or `Spec.NodeSelector`
   fields are specified on the `ResourceSlice`.
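
As a minimal sketch of how these per-device fields might look (the device and
label names here are hypothetical; a complete example is shown in
[Defining multi-host devices](#defining-multi-host-devices) below):

```yaml
devices:
- name: single-node-device
  composite:
    # Available on exactly one node; per-device analogue of Spec.NodeName.
    nodeName: node-1
- name: multi-host-device
  composite:
    # Available on the set of nodes matching this selector; per-device
    # analogue of Spec.NodeSelector.
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: example.com/tpu-slice
          operator: In
          values:
          - slice-a
```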

With these additions in place, the scheduler has everything it needs to support
the dynamic allocation of full devices, their (possibly overlapping) fixed-size
partitions, and multi-host devices. That is to say, the scheduler now has the
ability to "flatten" all devices by applying any mixins from their `Includes`
fields, as well as track any capacities consumed from one device by another
through its `ConsumesCapacityFrom` field. More details on the actual algorithm
the scheduler follows to make allocation decisions based on the
`ConsumesCapacityFrom` field can be found in the Design Details section below.

## Design Details
```go
type CompositeDevice struct {
    // ...

    // +listType=atomic
    ConsumesCapacityFrom []DeviceRef `json:"consumesCapacityFrom,omitempty"`

    // NodeName identifies the node where the device is available.
    //
    // Must only be set if Spec.AllNodes is set.
    // Only one or none of NodeName and NodeSelector must be set.
    //
    // +optional
    // +oneOf=DeviceNodeSelection
    NodeName string

    // NodeSelector defines the nodes where the device is available.
    //
    // Must use exactly one term.
    //
    // Must only be set if Spec.AllNodes is set.
    // Only one or none of NodeName and NodeSelector must be set.
    //
    // +optional
    // +oneOf=DeviceNodeSelection
    NodeSelector *core.NodeSelector

    // Attributes defines the set of attributes for this device.
    // The name of each attribute must be unique in that set.
    //
    // ...
}

type DeviceRef struct {
    // ...
}
```

As mentioned previously, the main features being added here are (1) the ability
to include a set of mixins in a device definition, (2) the ability to
express that capacity from one device gets consumed by another device if/when
the scheduler decides to allocate it, and (3) the ability to define multi-host
devices.

To simplify the conversation, we discuss each new feature separately, starting
with "mixins" and the new `Includes` field, which allows a set of mixins to
"sink" of capacity, pulling from "source" devices in order to satisfy its
own capacity when allocated.

The scheduler must track the availability of the "source" device, and
pull from it whenever it decides to allocate a "sink" device.

So long as no other devices have been allocated that reference a given "source"
The API defines the `ConsumesCapacityFrom` field as a list of `DeviceRef` entries.
While we only allow a single entry in this list, effectively forcing the
partitioning of a device to form a tree of devices that consume capacity, this
can be extended in the future to allow a device to reference multiple "source"
devices.

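
To illustrate the tree structure this implies, here is a brief sketch with
hypothetical device names and capacities: each partition names exactly one
"source" in `ConsumesCapacityFrom`, so the devices in a pool form a tree rooted
at the full device.

```yaml
devices:
# Full device: the "source" of capacity.
- name: gpu-0
  composite:
    capacity:
      memory: 40Gi
# Partitions: "sinks" that pull their capacity from gpu-0 when allocated.
- name: gpu-0-partition-1
  composite:
    capacity:
      memory: 20Gi
    consumesCapacityFrom:
    - name: gpu-0
- name: gpu-0-partition-2
  composite:
    capacity:
      memory: 20Gi
    consumesCapacityFrom:
    - name: gpu-0
```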

### Defining multi-host devices

An example of a small 4x4 TPU slice with its partitions will look like the
following:

```yaml
kind: ResourceSlice
apiVersion: resource.k8s.io/v1beta1
...
spec:
  allNodes: true
  pool:
    ...
  driver: tpu.dra.example.com
  devices:
  # 4x4 slice
  - name: tpu-4x4-1
    composite:
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-1
            - node-2
            - node-5
            - node-6
      capacity:
        tpus: "16"
      consumesCapacityFrom:
      - name: tpu-4x8-1
  # 2x4 slices
  - name: tpu-2x4-1
    composite:
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-1
            - node-2
      capacity:
        tpus: "8"
      consumesCapacityFrom:
      - name: tpu-4x4-1
  - name: tpu-2x4-2
    composite:
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-5
            - node-6
      capacity:
        tpus: "8"
      consumesCapacityFrom:
      - name: tpu-4x4-1
  # 2x2 slices
  - name: tpu-2x2-1
    composite:
      nodeName: node-1
      capacity:
        tpus: "4"
      consumesCapacityFrom:
      - name: tpu-2x4-1
  - name: tpu-2x2-2
    composite:
      nodeName: node-2
      capacity:
        tpus: "4"
      consumesCapacityFrom:
      - name: tpu-2x4-1
  - name: tpu-2x2-3
    composite:
      nodeName: node-5
      capacity:
        tpus: "4"
      consumesCapacityFrom:
      - name: tpu-2x4-2
  - name: tpu-2x2-4
    composite:
      nodeName: node-6
      capacity:
        tpus: "4"
      consumesCapacityFrom:
      - name: tpu-2x4-2
```

In the example we define a single 4x4 slice, which means 16 TPUs; with
4 TPUs per node, the device is available across four nodes. The node selector
on each device selects the nodes used by that device. In the example it
does this with the `In` operator on the `kubernetes.io/hostname` key, but it
could also be just a regular selector on a single label set on all of the nodes.

The `ConsumesCapacityFrom` field declares that the smaller slices are partitions
of the larger one and, as described in the previous section, this will
allow the scheduler to understand that allocating a partition of a device has
the consequence of making other partitions unavailable.

When a multi-host device is requested, the workload must have a number of pods
equal to the number of nodes that make up the device. These pods will share
the device, so they must be set up with a shared ResourceClaim. When the scheduler
attempts to schedule the first pod of the workload, it will find a device that
matches the request and allocate it for the ResourceClaim. Once a device has
been allocated for the claim, this also restricts the nodes where other pods using
the device can be scheduled. To make sure that future pods do get scheduled on an
eligible node, the scheduler will use the `nodeName` or `nodeSelector` value from the
device to determine the `nodeSelector` field on the `AllocationResult`
in the `ResourceClaim`, rather than the `nodeName` or `nodeSelector` from the
`ResourceSlice`. This makes sure that all pods sharing the `ResourceClaim` will
be scheduled onto the nodes that make up the device.

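
As an illustration, the allocation recorded on the claim might then look
roughly like the following sketch (the claim, request, and pool names here are
illustrative), assuming the scheduler picked the `tpu-2x4-1` device from the
example above: the claim's `nodeSelector` is taken from the device rather than
from the `ResourceSlice`, so every pod sharing this claim is restricted to
`node-1` and `node-2`.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: tpu-claim
status:
  allocation:
    devices:
      results:
      - request: tpu-request
        driver: tpu.dra.example.com
        pool: tpu-pool          # hypothetical pool name
        device: tpu-2x4-1
    # Derived from the allocated device's nodeSelector.
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1
          - node-2
```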

### Putting it all together for the MIG use-case

```yaml
devices:
  # ...
  - name: memory-slices-0-7
```

### Using DRA for the multi-host use-case

In order to allocate a 2x4 TPU slice using the ResourceSlice
[shown above](#defining-multi-host-devices), a ResourceClaim like the
following can be used:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: tpu-consumer-resource-claim
spec:
  devices:
    requests:
    - name: tpu-request
      deviceClassName: tpu.google.com
      selectors:
      - cel:
          expression: "device.capacity['tpu.google.com'].tpus == '8'"
```

This simply requests a device with 8 TPUs. Since there are 4 TPUs per node, this requires
two pods, one for each node. A Deployment can be used to create the necessary number of
pods:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tpu-consumer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tpu-consumer
  template:
    metadata:
      labels:
        app: tpu-consumer
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: tpu-consumer
            topologyKey: kubernetes.io/hostname
      resourceClaims:
      - name: "tpu"
        resourceClaimName: tpu-consumer-resource-claim
      containers:
      - name: workload
        image: my-app
        command: ["/bin/program"]
        resources:
          claims:
          - name: "tpu"
```

Since the PodSpec references a ResourceClaim rather than a ResourceClaimTemplate, the pods
will share the ResourceClaim. This also restricts the pods to run on the nodes that are
targeted by the node selector on the allocated device. Now, in order to take advantage of
the TPUs connected to both nodes, the pods need to be scheduled on separate nodes. The
anti-affinity stanza in the PodSpec makes sure this happens. Note that
`requiredDuringSchedulingIgnoredDuringExecution` is used because spreading the pods across
nodes is a hard requirement here, not just a preference.

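
By contrast, here is a sketch of the pattern to avoid for this multi-host case
(the template name is hypothetical): referencing a ResourceClaimTemplate would
give every replica its own ResourceClaim, and therefore its own independently
allocated device, instead of two pods sharing one multi-host device.

```yaml
# Anti-pattern for the multi-host case: each replica gets a separate claim.
resourceClaims:
- name: "tpu"
  resourceClaimTemplateName: tpu-consumer-claim-template
```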

### Test Plan

<!--