Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-4815 DRA Partitionable devices design update #5066

Merged
merged 1 commit into from
Feb 12, 2025

Conversation

mortent
Copy link
Member

@mortent mortent commented Jan 21, 2025

  • One-line PR description: Update the API of the feature to use the abstract pools design.
  • Other comments: This PR also moves this feature from sig-node to sig-scheduling.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jan 21, 2025
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 21, 2025
@mortent mortent force-pushed the ClarifyPartitionableDevices branch 3 times, most recently from 746feed to f69bf73 Compare January 22, 2025 01:33
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 22, 2025
@mortent mortent force-pushed the ClarifyPartitionableDevices branch 2 times, most recently from f23dfef to 773d91d Compare January 22, 2025 01:46
@mortent
Copy link
Member Author

mortent commented Jan 22, 2025

/assign @pohly
/assign @klueska
/assign @johnbelamaric

@johnbelamaric
Copy link
Member

/lgtm

This change will ensure we get some implementation merged in 1.33. That said, I would like to explore a revision that moves back to the "abstract resource pools" model Kevin proposed very early on and is similar to those discussed here and here.

In particular, using that model instead will remove the confusion which we are working around in this KEP change by limiting ConsumesCapacityFrom to a single entry. Instead, every entry will be explicit as to which bucket its resources come from. More importantly, the current KEP assumes the advertised capacity of the device is identical to the capacity drawn down from the parent device. I don't think that will always be true. For example, in a different technology, there could be sharing overhead. So, you would consume 10GB but only advertise 9GB. Even in this example, there is no real need to advertise "memorySlice4" as a capacity in a MIG for end users. An abstract pool model would separate the "internal shared resources" from the "end user advertised resources".

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 22, 2025
@mortent mortent force-pushed the ClarifyPartitionableDevices branch from 773d91d to 8cc59aa Compare January 25, 2025 22:56
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 25, 2025
@mortent mortent force-pushed the ClarifyPartitionableDevices branch 4 times, most recently from 500252e to e5f3a54 Compare February 1, 2025 01:13
@mortent
Copy link
Member Author

mortent commented Feb 3, 2025

@klueska This PR updates the KEP according to the discussion in https://kubernetes.slack.com/archives/C0409NGC1TK/p1737092728963179 where we discussed limiting the number of entries in the consumesCapacityFrom field to one. It also makes a few minor clarifications. Could you take a look?

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe surprising, but I don't hate it. :)

It's definitely hard to think about, so having some CLI to expand it all seems necessary.

One thing that might go well in this doc is a two-column view of a fully-flattened device and an equivalently mixin'ed device. I understand that would be tedious to do. Maybe add two YAML files in this repo and put the diff -u of them in the KEP in a ```diff block?

// The maximum number of pools is 32.
//
// +optional
// listType=atomic
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past we debated whether to use listType=listMap (@pohly) and I forget now your rationale. I bring it up now in the context of declarative validation. We can correlate updates across a listMap because it declare the listMapKey. For a listType=atomic, we don't have enough information. If you would like to eventually use declarative (I want you to :) then we need to decide if this should properly be a listMap or if we want to do some other mode like listType=correlatedAtomic or even decompose listMap into orthogonal concepts - correlateability (has a unique key) and single-owner vs. shared (represented in SSA as an opaque thing or not).

I know listMap also relates to SSA and I think CSA, right, Joe?

@jpbetz @aaron-prindle @yongruilin

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, the deciding factor is whether this slice has one owner or many and how it gets updated.

When using listType=map, the tracking information for who owns which entry explodes. This is unnecessary here because there is a single owner, the DRA driver. There's also no need for patching specific entries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it was the managedFields argument. This is more a note for the declarative validation effort than for this KEP.

Comment on lines 353 to 386
DeviceMixins []DeviceMixin

// DeviceCapacityConsumptionMixins represents a list of capacity
// consumption mixins, each of which contains a set of capacities
// that a device will consume from a capacity pool.
//
// This makes it possible to define a set of shared capacities that
// are not tied to a specific pool. The pool is inferred by context
// in which the DeviceCapacityConsumptionMixin is referenced from
// the device.
//
// The total number of device mixins, device capacity consumption mixins,
// capacity pool mixins, basic devices, and composite devices must be
// less than 128.
//
// +optional
// +listType=atomic
DeviceCapacityConsumptionMixins []DeviceCapacityConsumptionMixin

// CapacityPoolMixins represents a list of capacity pool mixins, i.e.
// a collection of capacities that a CapacityPool can "include"
// to extend the set of capacities it already defines.
//
// The main purposes of these mixins is to reduce the memory footprint
// of capacity pools since they can reference the mixins provided here rather
// than duplicate them.
//
// The total number of device mixins, device capacity consumption mixins,
// capacity pool mixins, basic devices, and composite devices must be
// less than 128.
//
// +optional
// +listType=atomic
CapacityPoolMixins []CapacityPoolMixin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so each mixins[] would be a struct with device, capacityPool, deviceCapacityConsumption ?

I'm OK with that for the KEP and if it turns out to be unwieldy, we're not locked into it.

Comment on lines 353 to 386
DeviceMixins []DeviceMixin

// DeviceCapacityConsumptionMixins represents a list of capacity
// consumption mixins, each of which contains a set of capacities
// that a device will consume from a capacity pool.
//
// This makes it possible to define a set of shared capacities that
// are not tied to a specific pool. The pool is inferred by context
// in which the DeviceCapacityConsumptionMixin is referenced from
// the device.
//
// The total number of device mixins, device capacity consumption mixins,
// capacity pool mixins, basic devices, and composite devices must be
// less than 128.
//
// +optional
// +listType=atomic
DeviceCapacityConsumptionMixins []DeviceCapacityConsumptionMixin

// CapacityPoolMixins represents a list of capacity pool mixins, i.e.
// a collection of capacities that a CapacityPool can "include"
// to extend the set of capacities it already defines.
//
// The main purposes of these mixins is to reduce the memory footprint
// of capacity pools since they can reference the mixins provided here rather
// than duplicate them.
//
// The total number of device mixins, device capacity consumption mixins,
// capacity pool mixins, basic devices, and composite devices must be
// less than 128.
//
// +optional
// +listType=atomic
CapacityPoolMixins []CapacityPoolMixin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can names be any simpler? deviceCapacityConsumption -> deviceConsumption ?

// Conflicting attributes from those provided via other mixins are
// overwritten by the ones provided here.
//
// The maximum number of attributes and capacities combined is 32.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ACk that limits are needed, but do we ultimately need a limit on the fully unpacked device?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I think the primary concern we try to address is the size limit in etcd, but that doesn't really apply to the fully unpacked device (since we never write that to etcd). I'm not sure if it might impact the performance of evaluating the CEL expressions, but I wouldn't think so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have more flexibility for runtime costs and thus don't need a limit for the unpacked device. If someone wants to use this feature for something complex, they have to be aware that evaluation will also become slower.

// Name refers to the name of a device mixin in the pool.
//
// +required
Name string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What hypothetical other fields might go here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have any plans that would need it, but doing it this way does allow us to add it if necessary, ref #5066 (comment). It might be that we should just use a list of strings here instead to keep it as simple as possible and then just handle it if/when we do have an actual need for any additional fields.

// The maximum number of attributes and capacities combined is 32.
//
// +optional
Capacity map[QualifiedName]DeviceCapacity
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My $0.02 - it's fine as is.

//
// +required
// +listType=atomic
ConsumesCapacity []DeviceCapacityConsumption
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall a strong convention. A quick survey shows the past-tense is probably more consistent, but I am open to counter-envidence.

"flatten" them into something that looks exactly the same as a `Basic` device.
The scheduler can then allocate such devices using the same algorithm it uses
for `Basic` devices.
"flatten" them into a set devices that pull capacity from a set of pools. The
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we include something like a kubectl cmd (not sure where it should go) or even a non-default plugin which does the flatten operation and prints the flattened thing?

Maybe we actually want a generic verb for this which fetches related things and prints them all together? kubectl flatten gateway foo

@kubernetes/sig-cli-leads

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a similar concern over in #5034: I definitely see a need to improve the CLI to help users make sense of the information that is available because it's not all in one object or even a single type. But not for alpha, and perhaps even as a separate KEP if it's considered non-blocking for graduation of the functionality itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's fair.

@robscott ever put thought into this topic from the Gateway side?

@klueska
Copy link
Contributor

klueska commented Feb 11, 2025

One thing that might go well in this doc is a two-column view of a fully-flattened device and an equivalently mixin'ed device. I understand that would be tedious to do. Maybe add two YAML files in this repo and put the diff -u of them in the KEP in a ```diff block?

@thockin I had put this together for the original KEP in 1.32:
kubernetes-sigs/wg-device-management#38

When ran, it generates all of the devices necessary to represent full GPUs and their MIG partitions on a mock DGXA100 machine. It then prints both the actual slice with its mixins as well as the "flattened" slice.

This is how the long example for MIG at the end of the KEP was generated.

It's probably worthwhile to update this to the new abstractions being proposed in this KEP and update that example in the process.

Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the contentious parts are more about the API than the scheduling algorithm, so I'm going to approve for SIG Scheduling in advance.

/approve

@thockin
Copy link
Member

thockin commented Feb 11, 2025

I guess I am OK with the API. All my comments are minor.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2025
@johnbelamaric
Copy link
Member

Awesome.

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 11, 2025
@klueska
Copy link
Contributor

klueska commented Feb 11, 2025

/hold

Let's get the mixin update and the longer example at the end in before merging.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 11, 2025
@mortent mortent force-pushed the ClarifyPartitionableDevices branch 2 times, most recently from f35118b to 4d980e6 Compare February 12, 2025 04:30
@mortent mortent requested a review from pohly February 12, 2025 04:31
@mortent
Copy link
Member Author

mortent commented Feb 12, 2025

@klueska I've updated the KEP with the changes to the mixins and I have updated the large example.

@pohly I've also adressed your other comments. If this looks good to you, can you clear the "changes requested" status?

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, johnbelamaric, mortent, pohly, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pohly
Copy link
Contributor

pohly commented Feb 12, 2025

/lgtm

I'm keeping the hold for now because:

  • @klueska might also want to have another look.
  • I think it would be good to squash before lifting the hold.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025
Copy link
Contributor

@klueska klueska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good. Thanks for all the iterations on this @mortent. Just a few minor comments.


1. The `Device` field is a list of named `DeviceMixin`s. These define
attributes and capacities that can be used to extend what is defined
explicitly in a `CompositeDevice`. Mixins can not be allocated directly,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first time that CompositeDevice is mentioned in the revised version of this KEP.#

As such, let's reword this to the following:

Suggested change
explicitly in a `CompositeDevice`. Mixins can not be allocated directly,
explicitly in a new device type called `CompositeDevice` (introduced in more detail below). Mixins cannot be allocated directly,

Copy link
Contributor

@pohly pohly Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: for now in this KEP revision we assume that we keep CompositeDevice. That's correct, no need to change anything.

In https://docs.google.com/document/d/1sCfO4hJUUhZpzld-_wEZi85XqSclEVvjUfsEe2-duZs/edit?tab=t.0 I proposed some changes that would make it safer to flatten the Device struct. This is something that we can try to describe in the next KEP update (#5069), or treat it as an implementation choice and retro-actively update the KEP after we have agreed on the implementation in kubernetes/kubernetes#129970. Given the time constraints I am leaning towards the latter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that seems like the best option.

@mortent mortent force-pushed the ClarifyPartitionableDevices branch from 4d980e6 to d850e74 Compare February 12, 2025 14:58
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025
@mortent
Copy link
Member Author

mortent commented Feb 12, 2025

@klueska @pohly I have addressed the comments and squashed the commits.

Copy link
Contributor

@klueska klueska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/unhold

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 12, 2025
@k8s-ci-robot k8s-ci-robot merged commit 54d0e74 into kubernetes:master Feb 12, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

9 participants