-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-4815 DRA Partitionable devices design update #5066
KEP-4815 DRA Partitionable devices design update #5066
Conversation
mortent
commented
Jan 21, 2025
•
edited
Loading
edited
- One-line PR description: Update the API of the feature to use the abstract pools design.
- Issue link: DRA: Add support for partitionable devices #4815
- Other comments: This PR also moves this feature from sig-node to sig-scheduling.
746feed
to
f69bf73
Compare
f23dfef
to
773d91d
Compare
/assign @pohly |
/lgtm This change will ensure we get some implementation merged in 1.33. That said, I would like to explore a revision that moves back to the "abstract resource pools" model Kevin proposed very early on and is similar to those discussed here and here. In particular, using that model instead will remove the confusion which we are working around in this KEP change by limiting |
773d91d
to
8cc59aa
Compare
500252e
to
e5f3a54
Compare
@klueska This PR updates the KEP according to the discussion in https://kubernetes.slack.com/archives/C0409NGC1TK/p1737092728963179 where we discussed limiting the number of entries in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe surprising, but I don't hate it. :)
It's definitely hard to think about, so having some CLI to expand it all seems necessary.
One thing that might go well in this doc is a two-column view of a fully-flattened device and an equivalently mixin'ed device. I understand that would be tedious to do. Maybe add two YAML files in this repo and put the diff -u
of them in the KEP in a ```diff
block?
// The maximum number of pools is 32. | ||
// | ||
// +optional | ||
// listType=atomic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the past we debated whether to use listType=listMap
(@pohly) and I forget now your rationale. I bring it up now in the context of declarative validation. We can correlate updates across a listMap because it declare the listMapKey. For a listType=atomic
, we don't have enough information. If you would like to eventually use declarative (I want you to :) then we need to decide if this should properly be a listMap or if we want to do some other mode like listType=correlatedAtomic
or even decompose listMap into orthogonal concepts - correlateability (has a unique key) and single-owner vs. shared (represented in SSA as an opaque thing or not).
I know listMap also relates to SSA and I think CSA, right, Joe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me, the deciding factor is whether this slice has one owner or many and how it gets updated.
When using listType=map
, the tracking information for who owns which entry explodes. This is unnecessary here because there is a single owner, the DRA driver. There's also no need for patching specific entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it was the managedFields
argument. This is more a note for the declarative validation effort than for this KEP.
DeviceMixins []DeviceMixin | ||
|
||
// DeviceCapacityConsumptionMixins represents a list of capacity | ||
// consumption mixins, each of which contains a set of capacities | ||
// that a device will consume from a capacity pool. | ||
// | ||
// This makes it possible to define a set of shared capacities that | ||
// are not tied to a specific pool. The pool is inferred by context | ||
// in which the DeviceCapacityConsumptionMixin is referenced from | ||
// the device. | ||
// | ||
// The total number of device mixins, device capacity consumption mixins, | ||
// capacity pool mixins, basic devices, and composite devices must be | ||
// less than 128. | ||
// | ||
// +optional | ||
// +listType=atomic | ||
DeviceCapacityConsumptionMixins []DeviceCapacityConsumptionMixin | ||
|
||
// CapacityPoolMixins represents a list of capacity pool mixins, i.e. | ||
// a collection of capacities that a CapacityPool can "include" | ||
// to extend the set of capacities it already defines. | ||
// | ||
// The main purposes of these mixins is to reduce the memory footprint | ||
// of capacity pools since they can reference the mixins provided here rather | ||
// than duplicate them. | ||
// | ||
// The total number of device mixins, device capacity consumption mixins, | ||
// capacity pool mixins, basic devices, and composite devices must be | ||
// less than 128. | ||
// | ||
// +optional | ||
// +listType=atomic | ||
CapacityPoolMixins []CapacityPoolMixin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so each mixins[]
would be a struct with device
, capacityPool
, deviceCapacityConsumption
?
I'm OK with that for the KEP and if it turns out to be unwieldy, we're not locked into it.
DeviceMixins []DeviceMixin | ||
|
||
// DeviceCapacityConsumptionMixins represents a list of capacity | ||
// consumption mixins, each of which contains a set of capacities | ||
// that a device will consume from a capacity pool. | ||
// | ||
// This makes it possible to define a set of shared capacities that | ||
// are not tied to a specific pool. The pool is inferred by context | ||
// in which the DeviceCapacityConsumptionMixin is referenced from | ||
// the device. | ||
// | ||
// The total number of device mixins, device capacity consumption mixins, | ||
// capacity pool mixins, basic devices, and composite devices must be | ||
// less than 128. | ||
// | ||
// +optional | ||
// +listType=atomic | ||
DeviceCapacityConsumptionMixins []DeviceCapacityConsumptionMixin | ||
|
||
// CapacityPoolMixins represents a list of capacity pool mixins, i.e. | ||
// a collection of capacities that a CapacityPool can "include" | ||
// to extend the set of capacities it already defines. | ||
// | ||
// The main purposes of these mixins is to reduce the memory footprint | ||
// of capacity pools since they can reference the mixins provided here rather | ||
// than duplicate them. | ||
// | ||
// The total number of device mixins, device capacity consumption mixins, | ||
// capacity pool mixins, basic devices, and composite devices must be | ||
// less than 128. | ||
// | ||
// +optional | ||
// +listType=atomic | ||
CapacityPoolMixins []CapacityPoolMixin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can names be any simpler? deviceCapacityConsumption
-> deviceConsumption
?
// Conflicting attributes from those provided via other mixins are | ||
// overwritten by the ones provided here. | ||
// | ||
// The maximum number of attributes and capacities combined is 32. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ACk that limits are needed, but do we ultimately need a limit on the fully unpacked device?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I think the primary concern we try to address is the size limit in etcd, but that doesn't really apply to the fully unpacked device (since we never write that to etcd). I'm not sure if it might impact the performance of evaluating the CEL expressions, but I wouldn't think so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have more flexibility for runtime costs and thus don't need a limit for the unpacked device. If someone wants to use this feature for something complex, they have to be aware that evaluation will also become slower.
// Name refers to the name of a device mixin in the pool. | ||
// | ||
// +required | ||
Name string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What hypothetical other fields might go here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have any plans that would need it, but doing it this way does allow us to add it if necessary, ref #5066 (comment). It might be that we should just use a list of strings here instead to keep it as simple as possible and then just handle it if/when we do have an actual need for any additional fields.
// The maximum number of attributes and capacities combined is 32. | ||
// | ||
// +optional | ||
Capacity map[QualifiedName]DeviceCapacity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My $0.02 - it's fine as is.
// | ||
// +required | ||
// +listType=atomic | ||
ConsumesCapacity []DeviceCapacityConsumption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recall a strong convention. A quick survey shows the past-tense is probably more consistent, but I am open to counter-envidence.
"flatten" them into something that looks exactly the same as a `Basic` device. | ||
The scheduler can then allocate such devices using the same algorithm it uses | ||
for `Basic` devices. | ||
"flatten" them into a set devices that pull capacity from a set of pools. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we include something like a kubectl cmd (not sure where it should go) or even a non-default plugin which does the flatten operation and prints the flattened thing?
Maybe we actually want a generic verb for this which fetches related things and prints them all together? kubectl flatten gateway foo
@kubernetes/sig-cli-leads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a similar concern over in #5034: I definitely see a need to improve the CLI to help users make sense of the information that is available because it's not all in one object or even a single type. But not for alpha, and perhaps even as a separate KEP if it's considered non-blocking for graduation of the functionality itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's fair.
@robscott ever put thought into this topic from the Gateway side?
@thockin I had put this together for the original KEP in 1.32: When ran, it generates all of the devices necessary to represent full GPUs and their MIG partitions on a mock DGXA100 machine. It then prints both the actual slice with its mixins as well as the "flattened" slice. This is how the long example for MIG at the end of the KEP was generated. It's probably worthwhile to update this to the new abstractions being proposed in this KEP and update that example in the process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the contentious parts are more about the API than the scheduling algorithm, so I'm going to approve for SIG Scheduling in advance.
/approve
I guess I am OK with the API. All my comments are minor. /approve |
Awesome. /approve |
/hold Let's get the mixin update and the longer example at the end in before merging. |
f35118b
to
4d980e6
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, johnbelamaric, mortent, pohly, thockin The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm I'm keeping the hold for now because:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking really good. Thanks for all the iterations on this @mortent. Just a few minor comments.
|
||
1. The `Device` field is a list of named `DeviceMixin`s. These define | ||
attributes and capacities that can be used to extend what is defined | ||
explicitly in a `CompositeDevice`. Mixins can not be allocated directly, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first time that CompositeDevice
is mentioned in the revised version of this KEP.#
As such, let's reword this to the following:
explicitly in a `CompositeDevice`. Mixins can not be allocated directly, | |
explicitly in a new device type called `CompositeDevice` (introduced in more detail below). Mixins cannot be allocated directly, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: for now in this KEP revision we assume that we keep CompositeDevice
. That's correct, no need to change anything.
In https://docs.google.com/document/d/1sCfO4hJUUhZpzld-_wEZi85XqSclEVvjUfsEe2-duZs/edit?tab=t.0 I proposed some changes that would make it safer to flatten the Device
struct. This is something that we can try to describe in the next KEP update (#5069), or treat it as an implementation choice and retro-actively update the KEP after we have agreed on the implementation in kubernetes/kubernetes#129970. Given the time constraints I am leaning towards the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that seems like the best option.
4d980e6
to
d850e74
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/unhold