Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-5027 + 5055: DRA: admin-controlled device attributes + taints #5034

Merged

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Jan 10, 2025

/cc @johnbelamaric

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 10, 2025
@pohly
Copy link
Contributor Author

pohly commented Jan 12, 2025

/cc @KobayashiD27

For the "device priority" use case.

/cc @byako

For device health.

@k8s-ci-robot k8s-ci-robot requested a review from byako January 12, 2025 12:31
@k8s-ci-robot
Copy link
Contributor

@pohly: GitHub didn't allow me to request PR reviews from the following users: KobayashiD27.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @KobayashiD27

For the "device priority" use case.

/cc @byako

For device health.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pohly
Copy link
Contributor Author

pohly commented Jan 13, 2025

/wg device-management
/sig node

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jan 13, 2025
@pohly pohly mentioned this pull request Jan 20, 2025
4 tasks
@pohly pohly force-pushed the dra-device-attribute-overrides branch from 531a905 to cddc84f Compare January 20, 2025 14:31
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 20, 2025
@pohly pohly force-pushed the dra-device-attribute-overrides branch 2 times, most recently from c4a6f66 to 41cdbf5 Compare January 20, 2025 14:44
Copy link

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was earlier discussion of common (driver independent) tool(ing) for listing, adding and removing device taints. Would it make sense to mention something about that in the tainting KEP?

@pohly pohly changed the title KEP-5027: DRA: admin-controlled device attributes KEP-5027: DRA: admin-controlled device attributes + device taints Jan 20, 2025
Copy link
Contributor

@everpeace everpeace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deeply appreciated for your quick action for device taints/tolerations KEP!! I left some comments. PTAL.

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jan 21, 2025

During version skew where the apiserver supports the feature and the scheduler
doesn't, taints can be set without encountering errors or
warnings, but they won't have any effect.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't taints persisted? If the component responsible for turning down pods is using the newer version, scheduler would keep scheduling pods there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scheduling would succeed, but only to have pods evicted by the controller for NoExecute taints - assuming that the controller isn't also a component where the feature is off.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for the time of this version skew, scheduler will see a resource as available, but controller will constantly turn scheduled pods down. Not sure how much this is a problem, probably depending for how long such skew can last.

@pohly
Copy link
Contributor Author

pohly commented Feb 7, 2025

I'd consider deferring 5027, but go with 5055 in this cycle and use reservation API. It should be sufficient for achieving all goals. Note that 5027 is the most heavy and complex to implement.

That's because 5027 describes the patching mechanism which is also needed for 5055. The patching of attributes and capacities is done in 20 lines of code (there's a prototype).

use reservation API

I'm still not sure what you mean with that. If you are suggesting that your in-progress #5149 should be used instead as basis, then I disagree.

@dom4ha
Copy link
Member

dom4ha commented Feb 8, 2025

/lgtm
As discussed in https://kubernetes.slack.com/archives/C09TP78DV/p1739053351531189?thread_ts=1738793685.308189&cid=C09TP78DV, a possibility to use ResourceClaims for reservations is unclear and definitely not feasible to agree on it within a week, so withdrawing my suggestion.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2025
# of http://git.k8s.io/enhancements/OWNERS_ALIASES
kep-number: 5055
alpha:
approver: "@johnbelamaric"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soltysh: John suggested that you might be able to take this and it's twin, 5027, for PRR.

I'll keep John listed here for now, but let me assign it to you:

/assign @soltysh

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to @soltysh as discussed in #prod-readiness.


Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catch-22? I cannot link to a directory in the issue because this PR creates it, and I cannot set the checkmark prior to merging.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2025
Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
for sig scheduling

it is preserved even when a ResourceSlice gets removed or updated. To achieve
this, a new cluster-scoped ResourceSlicePatch type gets added. A single
ResourceSlicePatch object specifies device attributes that apply to all
devices matching a CEL expression (i.e. the same way as users select devices in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aren't we overdoing it with the CEL? (any CEL parsing and matching will be relatively expensive)
Can't we use label selector, along with the information listed next?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devices don't have labels to select on. For selecting a device by its attributes we have to use CEL.

One use case which might need a device attribute is "taint this particular device with UID 12345". That particular device might be exposed under a specific stable name in a ResourceSlice (e.g. "gpu-0" for "PCI slot 0") and the taint no longer applies once the device gets swapped out. But I am speculating. We need feedback before we can decide wether it's needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devices don't have labels to select on

Perhaps they should. But again: YAGNI should prevail. Worth discussing, but I'll leave it up to you.

Helper code which keeps an up-to-date list of devices with all patches added
to them will be provided as part of `k8s.io/dynamic-resource-allocation`. It
will be based on informers such that evaluating the filter only is
necessary when ResourceSlices or ResourceSlicePatches change.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if multiple patches match the same resourceslice and they have conflicting information?
Please clarify how that is resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is explained under DevicePatch.Priority:

    // If a ResourceSlice and a DevicePatch define the same attribute or
    // capacity, the value of the DevicePatch is used. If multiple
    // different DevicePatches match the same device, then the one with
    // the highest priority wins. If priorities are equal, the older
    // patch wins. This ensures that adding a new patch does not
    // accidentally change the effect of some existing patch unless
    // that is clearly intended according to the priority.
    //
    // +optional
    Priority *int

I was discussing with @dom4ha whether it should be "newest one wins" in #5034 (comment) and he convinced me that "don't accidentally overwrite existing patch" is the better semantic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like "the oldest patch wins", it sounds less prone to race conditions.
But it should still probably be discouraged in the documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"newest one wins

With that you mean by creation timestamp or how is newer defined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. "Last modification time" might also be useful, but we don't have that.

Clarified and added the "discouraged" part because choosing different priorities is clearer.

// +optional
// +listType=atomic
// +featureGate=DRADeviceTaints
Taints []DeviceTaint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any API markers here to enforce 8 limit?

We can add kubebuilder validations for out-of-tree controllers..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why 8?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any API markers here to enforce 8 limit?

I think such markers exist for generating CRDs, but they wouldn't have any effect in-tree.

Also, why 8?

As usual, a more or less arbitrary choice. For additional discussion, see #5034 (comment). Because ResourceSlice is size-constrained, I made it a bit smaller than suggested there.

@pohly pohly force-pushed the dra-device-attribute-overrides branch from 2bbe533 to 86824bf Compare February 11, 2025 17:00
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 11, 2025
@pohly pohly changed the title KEP-5027 + 5055: DRA: admin-controlled device attributes + device taints KEP-5027 + 5055: DRA: admin-controlled device attributes + taints Feb 11, 2025
Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PRR for both documents looks mostly good, but I left a few questions

whereas health monitoring might prefer to be specific and use a vendor-defined
unique ID. Both are supported, which creates additional complexity.

Without a kubectl extension similar to `kubectl taint nodes`, the user
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't call that a risk 😉 ,but more like an inconvenience. I do miss the risk that a device is tainted and users bypass that (you've mentioned that for testing). What happens/are you protecting from users bypassing the taint by tolerating the taint in their ResourceClaim? That's a risk that's worth describing here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't call that a risk 😉 ,but more like an inconvenience.

Building something that is unusable is a risk, but yeah, I get your point 😁

What happens/are you protecting from users bypassing the taint by tolerating the taint in their ResourceClaim?

Elsewhere (might have been in a comment, because I don't see it in the KEP) I said "users do that at their own risk", which is the same situation we have with node taints. Added here, with some more thoughts.


The intent to patch device attributes must be recorded persistently so that
it is preserved even when a ResourceSlice gets removed or updated. To achieve
this, a new cluster-scoped ResourceSlicePatch type gets added. A single
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, the ResourceSlicePatch name isn't best, imo. It's very tightly coupled with patch operation. I see there was discussion around ResourceSliceOverrides, but maybe something generic like ResourceSliceModifiers would be nice? Definitely not a blocking comment. It's will most likely return during API review.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something generic like ResourceSliceModifiers would be nice

I like that.

It will most likely return during API review.

Oh, absolutely. Naming always does.

We tried to do API reviews early for all KEPs, but this one here hasn't been reviewed by anyone API reviewer yet. Let's proceed with the risk that I will have to rename things during the implementation phase.

part of this proposal.

Perhaps `kubectl describe resourceslices` can be extended to include the
additional information. For now this is out of scope.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 that's one of the reason we have describe, to allow for a bit more expressive information.

Perhaps `kubectl describe resourceslices` can be extended to include the
additional information. For now this is out of scope.

Creating a ResourceSlicePatch is racing with on-going scheduling attempts,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it should go to Risks section below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved and added some thoughts on mitigation (client-side evaluation was chosen because of this).


type ResourceSlicePatchSpec struct {
// Devices defines how to patch device attributes and taints.
Devices DevicePatch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason you're not directly embedding the contents of DevicePatch struct here? Are there any potential extensions that would made you do it this way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for symmetry with ResourceSlice, which (theoretically) one day might get extended to describe something other than "a device". I know Tim hates this and certainly will remind me again... 😁

Helper code which keeps an up-to-date list of devices with all patches added
to them will be provided as part of `k8s.io/dynamic-resource-allocation`. It
will be based on informers such that evaluating the filter only is
necessary when ResourceSlices or ResourceSlicePatches change.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"newest one wins

With that you mean by creation timestamp or how is newer defined?


- `k8s.io/dynamic-resource-allocation/structured`: 91.3%
- `k8s.io/kubernetes/pkg/apis/resource/validation`: 98.6%
- `k8s.io/kubernetes/pkg/controller/tainteviction`: 81.8%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier in the doc you've mentioned that this will be entirely new controller, only similar to tainteviction. If so why including tainteviction controller here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to copy the tainteviction controller, then modify it. My aim is to have better coverage than the original GA tainteviction controller, so it seemed like a useful baseline.

###### What happens if we reenable the feature if it was previously rolled back?

It takes effect again for scheduling.
Running applications are not affected.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I might be missing something, but iiuc if I create a ResourceSlicePatch, which will modify a ResourceSlice with a taint forcing eviction (and in combination with the controller from the other KEP) does it mean we can potentially evict running applications? Also, a followup question, should ResourceSlicePatch be taken into account by the new devicetainteviction controller (ie. the new one introduced in the other KEP here)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and in combination with the controller from the other KEP)

This section is focused exclusively on admin-controlled attributes. Those don't cause eviction. Basically ignore that taints exist and consider only the DRAAdminControlledDeviceAttributes.

Also, a followup question, should ResourceSlicePatch be taken into account by the new devicetainteviction controller (ie. the new one introduced in the other KEP here)?

Yes, absolutely. "Taints defined by an admin in a ResourceSlicePatch get added to the set of taints defined by the DRA driver in a ResourceSlice" is specified in the 5055 KEP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KEP 5055 also has "It takes effect again for scheduling and may evict pods." under "What happens if we reenable the feature if it was previously rolled back?", i.e. exactly the thing you asked about here.

Sorry for the confusion with the two KEPs, but I still think it's better than trying to do both features in one KEP (see #5034 (comment)).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion with the two KEPs, but I still think it's better than trying to do both features in one KEP (see #5034 (comment)).

I'm fine treating them separately, in which case I'd probably say we should mention the fact that when 5055 is enabled this will evict pods. It's better to be explicit, than guess the interactions. Especially that we should assume someone is reading this document separately from the other one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine treating them separately, in which case I'd probably say we should mention the fact that when 5055 is enabled this will evict pods.

Makes sense. I've updated this section here and squashed:

Admin-controlled attributes take effect again for scheduling.
Running applications are not affected because allocations are
never updated once they are made.

Note that this is different for reenabling device taints (KEP 5055): that can
cause pod to get evicted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!


###### What specific metrics should inform a rollback?

<!--
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both here and in the other document, I'd suggest thinking through potential metrics for measuring how this affects the workloads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some from the existing tainteviction controller and I am also thinking about how to expose performance of the patching mechanism, but would like to add that to the KEPs when targeting beta.

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
the PRR

###### What happens if we reenable the feature if it was previously rolled back?

It takes effect again for scheduling.
Running applications are not affected.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion with the two KEPs, but I still think it's better than trying to do both features in one KEP (see #5034 (comment)).

I'm fine treating them separately, in which case I'd probably say we should mention the fact that when 5055 is enabled this will evict pods. It's better to be explicit, than guess the interactions. Especially that we should assume someone is reading this document separately from the other one.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, pohly, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 12, 2025
These are two different KEPs that provide two features that can be enabled and
disabled independently. However, both use the same new ResourceSliceOverride
type and thus get described and implemented together.
@pohly pohly force-pushed the dra-device-attribute-overrides branch from d096544 to 7990805 Compare February 12, 2025 16:35
@johnbelamaric
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025
@k8s-ci-robot k8s-ci-robot merged commit 783b3ac into kubernetes:master Feb 12, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.