KEP-4815 DRA Partitionable devices design update #5066

mortent · 2025-01-21T18:29:43Z

One-line PR description: Update the API of the feature to use the abstract pools design.

Issue link: DRA: Add support for partitionable devices #4815

Other comments: This PR also moves this feature from sig-node to sig-scheduling.

mortent · 2025-01-22T19:12:23Z

/assign @pohly
/assign @klueska
/assign @johnbelamaric

johnbelamaric · 2025-01-22T19:26:12Z

/lgtm

This change will ensure we get some implementation merged in 1.33. That said, I would like to explore a revision that moves back to the "abstract resource pools" model Kevin proposed very early on and is similar to those discussed here and here.

In particular, using that model instead will remove the confusion which we are working around in this KEP change by limiting ConsumesCapacityFrom to a single entry. Instead, every entry will be explicit as to which bucket its resources come from. More importantly, the current KEP assumes the advertised capacity of the device is identical to the capacity drawn down from the parent device. I don't think that will always be true. For example, in a different technology, there could be sharing overhead. So, you would consume 10GB but only advertise 9GB. Even in this example, there is no real need to advertise "memorySlice4" as a capacity in a MIG for end users. An abstract pool model would separate the "internal shared resources" from the "end user advertised resources".

keps/sig-node/4815-dra-partitionable-devices/README.md

mortent · 2025-02-03T21:01:34Z

@klueska This PR updates the KEP according to the discussion in https://kubernetes.slack.com/archives/C0409NGC1TK/p1737092728963179 where we discussed limiting the number of entries in the consumesCapacityFrom field to one. It also makes a few minor clarifications. Could you take a look?

thockin

Maybe surprising, but I don't hate it. :)

It's definitely hard to think about, so having some CLI to expand it all seems necessary.

One thing that might go well in this doc is a two-column view of a fully-flattened device and an equivalently mixin'ed device. I understand that would be tedious to do. Maybe add two YAML files in this repo and put the diff -u of them in the KEP in a ```diff block?

thockin · 2025-02-11T00:41:13Z

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

+  // The maximum number of pools is 32.
+  //
+  // +optional
+  // listType=atomic


In the past we debated whether to use listType=listMap (@pohly) and I forget now your rationale. I bring it up now in the context of declarative validation. We can correlate updates across a listMap because it declare the listMapKey. For a listType=atomic, we don't have enough information. If you would like to eventually use declarative (I want you to :) then we need to decide if this should properly be a listMap or if we want to do some other mode like listType=correlatedAtomic or even decompose listMap into orthogonal concepts - correlateability (has a unique key) and single-owner vs. shared (represented in SSA as an opaque thing or not).

I know listMap also relates to SSA and I think CSA, right, Joe?

@jpbetz @aaron-prindle @yongruilin

For me, the deciding factor is whether this slice has one owner or many and how it gets updated.

When using listType=map, the tracking information for who owns which entry explodes. This is unnecessary here because there is a single owner, the DRA driver. There's also no need for patching specific entries.

Right, it was the managedFields argument. This is more a note for the declarative validation effort than for this KEP.

thockin · 2025-02-11T00:45:45Z

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

+  DeviceMixins []DeviceMixin
+
+  // DeviceCapacityConsumptionMixins represents a list of capacity
+  // consumption mixins, each of which contains a set of capacities
+  // that a device will consume from a capacity pool.
+  //
+  // This makes it possible to define a set of shared capacities that
+  // are not tied to a specific pool. The pool is inferred by context
+  // in which the DeviceCapacityConsumptionMixin is referenced from
+  // the device.
+  //
+  // The total number of device mixins, device capacity consumption mixins, 
+  // capacity pool mixins, basic devices, and composite devices must be
+  // less than 128.
+  //
+  // +optional
+  // +listType=atomic
+  DeviceCapacityConsumptionMixins []DeviceCapacityConsumptionMixin
+
+  // CapacityPoolMixins represents a list of capacity pool mixins, i.e.
+  // a collection of capacities that a CapacityPool can "include"
+  // to extend the set of capacities it already defines.
+  //
+  // The main purposes of these mixins is to reduce the memory footprint
+  // of capacity pools since they can reference the mixins provided here rather
+  // than duplicate them.
+  //
+  // The total number of device mixins, device capacity consumption mixins, 
+  // capacity pool mixins, basic devices, and composite devices must be
+  // less than 128.
+  //
+  // +optional
+  // +listType=atomic
+  CapacityPoolMixins []CapacityPoolMixin


so each mixins[] would be a struct with device, capacityPool, deviceCapacityConsumption ?

I'm OK with that for the KEP and if it turns out to be unwieldy, we're not locked into it.

thockin · 2025-02-11T00:46:42Z

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

+  DeviceMixins []DeviceMixin
+
+  // DeviceCapacityConsumptionMixins represents a list of capacity
+  // consumption mixins, each of which contains a set of capacities
+  // that a device will consume from a capacity pool.
+  //
+  // This makes it possible to define a set of shared capacities that
+  // are not tied to a specific pool. The pool is inferred by context
+  // in which the DeviceCapacityConsumptionMixin is referenced from
+  // the device.
+  //
+  // The total number of device mixins, device capacity consumption mixins, 
+  // capacity pool mixins, basic devices, and composite devices must be
+  // less than 128.
+  //
+  // +optional
+  // +listType=atomic
+  DeviceCapacityConsumptionMixins []DeviceCapacityConsumptionMixin
+
+  // CapacityPoolMixins represents a list of capacity pool mixins, i.e.
+  // a collection of capacities that a CapacityPool can "include"
+  // to extend the set of capacities it already defines.
+  //
+  // The main purposes of these mixins is to reduce the memory footprint
+  // of capacity pools since they can reference the mixins provided here rather
+  // than duplicate them.
+  //
+  // The total number of device mixins, device capacity consumption mixins, 
+  // capacity pool mixins, basic devices, and composite devices must be
+  // less than 128.
+  //
+  // +optional
+  // +listType=atomic
+  CapacityPoolMixins []CapacityPoolMixin


nit: can names be any simpler? deviceCapacityConsumption -> deviceConsumption ?

thockin · 2025-02-11T00:52:18Z

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

+  // Conflicting attributes from those provided via other mixins are
+  // overwritten by the ones provided here.
+  //
+  // The maximum number of attributes and capacities combined is 32.


I ACk that limits are needed, but do we ultimately need a limit on the fully unpacked device?

Good question. I think the primary concern we try to address is the size limit in etcd, but that doesn't really apply to the fully unpacked device (since we never write that to etcd). I'm not sure if it might impact the performance of evaluating the CEL expressions, but I wouldn't think so.

I think we have more flexibility for runtime costs and thus don't need a limit for the unpacked device. If someone wants to use this feature for something complex, they have to be aware that evaluation will also become slower.

thockin · 2025-02-11T00:55:14Z

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

+  // Name refers to the name of a device mixin in the pool.
+  //
+  // +required
+  Name string


What hypothetical other fields might go here?

I don't think we have any plans that would need it, but doing it this way does allow us to add it if necessary, ref #5066 (comment). It might be that we should just use a list of strings here instead to keep it as simple as possible and then just handle it if/when we do have an actual need for any additional fields.

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

thockin · 2025-02-11T01:00:42Z

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

+  // The maximum number of attributes and capacities combined is 32.
+  //
+  // +optional
+  Capacity map[QualifiedName]DeviceCapacity


My $0.02 - it's fine as is.

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

thockin · 2025-02-11T01:18:20Z

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

+  //
+  // +required
+  // +listType=atomic
+  ConsumesCapacity []DeviceCapacityConsumption


I don't recall a strong convention. A quick survey shows the past-tense is probably more consistent, but I am open to counter-envidence.

thockin · 2025-02-11T01:24:56Z

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

-"flatten" them into something that looks exactly the same as a `Basic` device.
-The scheduler can then allocate such devices using the same algorithm it uses
-for `Basic` devices.
+"flatten" them into a set devices that pull capacity from a set of pools. The


Can we include something like a kubectl cmd (not sure where it should go) or even a non-default plugin which does the flatten operation and prints the flattened thing?

Maybe we actually want a generic verb for this which fetches related things and prints them all together? kubectl flatten gateway foo

@kubernetes/sig-cli-leads

We have a similar concern over in #5034: I definitely see a need to improve the CLI to help users make sense of the information that is available because it's not all in one object or even a single type. But not for alpha, and perhaps even as a separate KEP if it's considered non-blocking for graduation of the functionality itself.

Yes, that's fair.

@robscott ever put thought into this topic from the Gateway side?

klueska · 2025-02-11T08:00:03Z

One thing that might go well in this doc is a two-column view of a fully-flattened device and an equivalently mixin'ed device. I understand that would be tedious to do. Maybe add two YAML files in this repo and put the diff -u of them in the KEP in a ```diff block?

@thockin I had put this together for the original KEP in 1.32:
kubernetes-sigs/wg-device-management#38

When ran, it generates all of the devices necessary to represent full GPUs and their MIG partitions on a mock DGXA100 machine. It then prints both the actual slice with its mixins as well as the "flattened" slice.

This is how the long example for MIG at the end of the KEP was generated.

It's probably worthwhile to update this to the new abstractions being proposed in this KEP and update that example in the process.

alculquicondor

It looks like the contentious parts are more about the API than the scheduling algorithm, so I'm going to approve for SIG Scheduling in advance.

/approve

thockin · 2025-02-11T16:52:22Z

I guess I am OK with the API. All my comments are minor.

/approve
/lgtm

johnbelamaric · 2025-02-11T17:00:46Z

Awesome.

/approve

klueska · 2025-02-11T17:03:05Z

/hold

Let's get the mixin update and the longer example at the end in before merging.

mortent · 2025-02-12T04:34:24Z

@klueska I've updated the KEP with the changes to the mixins and I have updated the large example.

@pohly I've also adressed your other comments. If this looks good to you, can you clear the "changes requested" status?

k8s-ci-robot · 2025-02-12T07:39:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, johnbelamaric, mortent, pohly, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [johnbelamaric]
~~keps/sig-node/OWNERS~~ [johnbelamaric]
~~keps/sig-scheduling/OWNERS~~ [alculquicondor,johnbelamaric]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pohly · 2025-02-12T07:41:04Z

/lgtm

I'm keeping the hold for now because:

@klueska might also want to have another look.
I think it would be good to squash before lifting the hold.

klueska

Looking really good. Thanks for all the iterations on this @mortent. Just a few minor comments.

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

klueska · 2025-02-12T09:28:11Z

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

+
+    1. The `Device` field is a list of named `DeviceMixin`s. These define
+      attributes and capacities that can be used to extend what is defined
+      explicitly in a `CompositeDevice`. Mixins can not be allocated directly,


This is the first time that CompositeDevice is mentioned in the revised version of this KEP.#

As such, let's reword this to the following:

Suggested change

explicitly in a `CompositeDevice`. Mixins can not be allocated directly,

explicitly in a new device type called `CompositeDevice` (introduced in more detail below). Mixins cannot be allocated directly,

Side note: for now in this KEP revision we assume that we keep CompositeDevice. That's correct, no need to change anything.

In https://docs.google.com/document/d/1sCfO4hJUUhZpzld-_wEZi85XqSclEVvjUfsEe2-duZs/edit?tab=t.0 I proposed some changes that would make it safer to flatten the Device struct. This is something that we can try to describe in the next KEP update (#5069), or treat it as an implementation choice and retro-actively update the KEP after we have agreed on the implementation in kubernetes/kubernetes#129970. Given the time constraints I am leaning towards the latter.

I agree that seems like the best option.

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

mortent · 2025-02-12T15:00:51Z

@klueska @pohly I have addressed the comments and squashed the commits.

klueska

/lgtm
/unhold

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jan 21, 2025

k8s-ci-robot requested review from dchen1107 and derekwaynecarr January 21, 2025 18:29

k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 21, 2025

mortent force-pushed the ClarifyPartitionableDevices branch 3 times, most recently from 746feed to f69bf73 Compare January 22, 2025 01:33

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 22, 2025

mortent force-pushed the ClarifyPartitionableDevices branch 2 times, most recently from f23dfef to 773d91d Compare January 22, 2025 01:46

k8s-ci-robot assigned johnbelamaric, klueska and pohly Jan 22, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 22, 2025

pohly reviewed Jan 24, 2025

View reviewed changes

keps/sig-node/4815-dra-partitionable-devices/README.md Outdated Show resolved Hide resolved

mortent force-pushed the ClarifyPartitionableDevices branch from 773d91d to 8cc59aa Compare January 25, 2025 22:56

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 25, 2025

mortent commented Jan 25, 2025

View reviewed changes

keps/sig-node/4815-dra-partitionable-devices/README.md Outdated Show resolved Hide resolved

mortent force-pushed the ClarifyPartitionableDevices branch 4 times, most recently from 500252e to e5f3a54 Compare February 1, 2025 01:13

mortent mentioned this pull request Feb 1, 2025

KEP-4815 DRA Partitionable devices support for multi-host #5069

Merged

klueska mentioned this pull request Feb 3, 2025

DRA: Add support for partitionable devices #4815

Open

4 tasks

pohly mentioned this pull request Feb 10, 2025

Add KEP for DRA: Extended Resource #5136

Open

thockin reviewed Feb 11, 2025

View reviewed changes

alculquicondor reviewed Feb 11, 2025

View reviewed changes

k8s-ci-robot assigned thockin Feb 11, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 11, 2025

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 11, 2025

mortent force-pushed the ClarifyPartitionableDevices branch 2 times, most recently from f35118b to 4d980e6 Compare February 12, 2025 04:30

mortent requested a review from pohly February 12, 2025 04:31

pohly approved these changes Feb 12, 2025

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025

klueska reviewed Feb 12, 2025

View reviewed changes

KEP-4815 DRA Partitionable devices design update

d850e74

mortent force-pushed the ClarifyPartitionableDevices branch from 4d980e6 to d850e74 Compare February 12, 2025 14:58

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025

klueska reviewed Feb 12, 2025

View reviewed changes

k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 12, 2025

k8s-ci-robot merged commit 54d0e74 into kubernetes:master Feb 12, 2025
4 checks passed

k8s-ci-robot added this to the v1.33 milestone Feb 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KEP-4815 DRA Partitionable devices design update #5066

KEP-4815 DRA Partitionable devices design update #5066

mortent commented Jan 21, 2025 •

edited

Loading

mortent commented Jan 22, 2025

johnbelamaric commented Jan 22, 2025

mortent commented Feb 3, 2025

thockin left a comment

thockin Feb 11, 2025

pohly Feb 11, 2025

thockin Feb 11, 2025

thockin Feb 11, 2025

thockin Feb 11, 2025

thockin Feb 11, 2025

mortent Feb 11, 2025

pohly Feb 12, 2025

thockin Feb 11, 2025

mortent Feb 11, 2025

thockin Feb 11, 2025

thockin Feb 11, 2025

thockin Feb 11, 2025

pohly Feb 11, 2025

thockin Feb 11, 2025

klueska commented Feb 11, 2025

alculquicondor left a comment

thockin commented Feb 11, 2025

johnbelamaric commented Feb 11, 2025

klueska commented Feb 11, 2025

mortent commented Feb 12, 2025

k8s-ci-robot commented Feb 12, 2025

pohly commented Feb 12, 2025

klueska left a comment

klueska Feb 12, 2025

pohly Feb 12, 2025 •

edited

Loading

mortent Feb 12, 2025

mortent commented Feb 12, 2025

klueska left a comment

	explicitly in a `CompositeDevice`. Mixins can not be allocated directly,
	explicitly in a new device type called `CompositeDevice` (introduced in more detail below). Mixins cannot be allocated directly,

KEP-4815 DRA Partitionable devices design update #5066

KEP-4815 DRA Partitionable devices design update #5066

Conversation

mortent commented Jan 21, 2025 • edited Loading

mortent commented Jan 22, 2025

johnbelamaric commented Jan 22, 2025

mortent commented Feb 3, 2025

thockin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

klueska commented Feb 11, 2025

alculquicondor left a comment

Choose a reason for hiding this comment

thockin commented Feb 11, 2025

johnbelamaric commented Feb 11, 2025

klueska commented Feb 11, 2025

mortent commented Feb 12, 2025

k8s-ci-robot commented Feb 12, 2025

pohly commented Feb 12, 2025

klueska left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pohly Feb 12, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mortent commented Feb 12, 2025

klueska left a comment

Choose a reason for hiding this comment

mortent commented Jan 21, 2025 •

edited

Loading

pohly Feb 12, 2025 •

edited

Loading