-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-5027 + 5055: DRA: admin-controlled device attributes + taints #5034
KEP-5027 + 5055: DRA: admin-controlled device attributes + taints #5034
Conversation
/cc @KobayashiD27 For the "device priority" use case. /cc @byako For device health. |
@pohly: GitHub didn't allow me to request PR reviews from the following users: KobayashiD27. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md
Outdated
Show resolved
Hide resolved
/wg device-management |
keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md
Outdated
Show resolved
Hide resolved
531a905
to
cddc84f
Compare
c4a6f66
to
41cdbf5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was earlier discussion of common (driver independent) tool(ing) for listing, adding and removing device taints. Would it make sense to mention something about that in the tainting KEP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deeply appreciated for your quick action for device taints/tolerations KEP!! I left some comments. PTAL.
keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md
Outdated
Show resolved
Hide resolved
|
||
During version skew where the apiserver supports the feature and the scheduler | ||
doesn't, taints can be set without encountering errors or | ||
warnings, but they won't have any effect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't taints persisted? If the component responsible for turning down pods is using the newer version, scheduler would keep scheduling pods there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scheduling would succeed, but only to have pods evicted by the controller for NoExecute taints - assuming that the controller isn't also a component where the feature is off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for the time of this version skew, scheduler will see a resource as available, but controller will constantly turn scheduled pods down. Not sure how much this is a problem, probably depending for how long such skew can last.
That's because 5027 describes the patching mechanism which is also needed for 5055. The patching of attributes and capacities is done in 20 lines of code (there's a prototype).
I'm still not sure what you mean with that. If you are suggesting that your in-progress #5149 should be used instead as basis, then I disagree. |
/lgtm |
# of http://git.k8s.io/enhancements/OWNERS_ALIASES | ||
kep-number: 5055 | ||
alpha: | ||
approver: "@johnbelamaric" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to @soltysh as discussed in #prod-readiness.
keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/kep.yaml
Outdated
Show resolved
Hide resolved
keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/kep.yaml
Outdated
Show resolved
Hide resolved
keps/sig-scheduling/5055-dra-device-taints-and-tolerations/kep.yaml
Outdated
Show resolved
Hide resolved
keps/sig-scheduling/5055-dra-device-taints-and-tolerations/kep.yaml
Outdated
Show resolved
Hide resolved
|
||
Items marked with (R) are required *prior to targeting to a milestone / release*. | ||
|
||
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catch-22? I cannot link to a directory in the issue because this PR creates it, and I cannot set the checkmark prior to merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
for sig scheduling
it is preserved even when a ResourceSlice gets removed or updated. To achieve | ||
this, a new cluster-scoped ResourceSlicePatch type gets added. A single | ||
ResourceSlicePatch object specifies device attributes that apply to all | ||
devices matching a CEL expression (i.e. the same way as users select devices in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aren't we overdoing it with the CEL? (any CEL parsing and matching will be relatively expensive)
Can't we use label selector, along with the information listed next?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Devices don't have labels to select on. For selecting a device by its attributes we have to use CEL.
One use case which might need a device attribute is "taint this particular device with UID 12345". That particular device might be exposed under a specific stable name in a ResourceSlice (e.g. "gpu-0" for "PCI slot 0") and the taint no longer applies once the device gets swapped out. But I am speculating. We need feedback before we can decide wether it's needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Devices don't have labels to select on
Perhaps they should. But again: YAGNI should prevail. Worth discussing, but I'll leave it up to you.
Helper code which keeps an up-to-date list of devices with all patches added | ||
to them will be provided as part of `k8s.io/dynamic-resource-allocation`. It | ||
will be based on informers such that evaluating the filter only is | ||
necessary when ResourceSlices or ResourceSlicePatches change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if multiple patches match the same resourceslice and they have conflicting information?
Please clarify how that is resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is explained under DevicePatch.Priority
:
// If a ResourceSlice and a DevicePatch define the same attribute or
// capacity, the value of the DevicePatch is used. If multiple
// different DevicePatches match the same device, then the one with
// the highest priority wins. If priorities are equal, the older
// patch wins. This ensures that adding a new patch does not
// accidentally change the effect of some existing patch unless
// that is clearly intended according to the priority.
//
// +optional
Priority *int
I was discussing with @dom4ha whether it should be "newest one wins" in #5034 (comment) and he convinced me that "don't accidentally overwrite existing patch" is the better semantic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like "the oldest patch wins", it sounds less prone to race conditions.
But it should still probably be discouraged in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"newest one wins
With that you mean by creation timestamp or how is newer defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. "Last modification time" might also be useful, but we don't have that.
Clarified and added the "discouraged" part because choosing different priorities is clearer.
// +optional | ||
// +listType=atomic | ||
// +featureGate=DRADeviceTaints | ||
Taints []DeviceTaint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any API markers here to enforce 8 limit?
We can add kubebuilder validations for out-of-tree controllers..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why 8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any API markers here to enforce 8 limit?
I think such markers exist for generating CRDs, but they wouldn't have any effect in-tree.
Also, why 8?
As usual, a more or less arbitrary choice. For additional discussion, see #5034 (comment). Because ResourceSlice is size-constrained, I made it a bit smaller than suggested there.
2bbe533
to
86824bf
Compare
keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PRR for both documents looks mostly good, but I left a few questions
whereas health monitoring might prefer to be specific and use a vendor-defined | ||
unique ID. Both are supported, which creates additional complexity. | ||
|
||
Without a kubectl extension similar to `kubectl taint nodes`, the user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't call that a risk 😉 ,but more like an inconvenience. I do miss the risk that a device is tainted and users bypass that (you've mentioned that for testing). What happens/are you protecting from users bypassing the taint by tolerating the taint in their ResourceClaim? That's a risk that's worth describing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't call that a risk 😉 ,but more like an inconvenience.
Building something that is unusable is a risk, but yeah, I get your point 😁
What happens/are you protecting from users bypassing the taint by tolerating the taint in their ResourceClaim?
Elsewhere (might have been in a comment, because I don't see it in the KEP) I said "users do that at their own risk", which is the same situation we have with node taints. Added here, with some more thoughts.
|
||
The intent to patch device attributes must be recorded persistently so that | ||
it is preserved even when a ResourceSlice gets removed or updated. To achieve | ||
this, a new cluster-scoped ResourceSlicePatch type gets added. A single |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, the ResourceSlicePatch name isn't best, imo. It's very tightly coupled with patch operation. I see there was discussion around ResourceSliceOverrides, but maybe something generic like ResourceSliceModifiers would be nice? Definitely not a blocking comment. It's will most likely return during API review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something generic like ResourceSliceModifiers would be nice
I like that.
It will most likely return during API review.
Oh, absolutely. Naming always does.
We tried to do API reviews early for all KEPs, but this one here hasn't been reviewed by anyone API reviewer yet. Let's proceed with the risk that I will have to rename things during the implementation phase.
part of this proposal. | ||
|
||
Perhaps `kubectl describe resourceslices` can be extended to include the | ||
additional information. For now this is out of scope. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 that's one of the reason we have describe, to allow for a bit more expressive information.
Perhaps `kubectl describe resourceslices` can be extended to include the | ||
additional information. For now this is out of scope. | ||
|
||
Creating a ResourceSlicePatch is racing with on-going scheduling attempts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it should go to Risks section below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved and added some thoughts on mitigation (client-side evaluation was chosen because of this).
|
||
type ResourceSlicePatchSpec struct { | ||
// Devices defines how to patch device attributes and taints. | ||
Devices DevicePatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason you're not directly embedding the contents of DevicePatch struct here? Are there any potential extensions that would made you do it this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for symmetry with ResourceSlice, which (theoretically) one day might get extended to describe something other than "a device". I know Tim hates this and certainly will remind me again... 😁
Helper code which keeps an up-to-date list of devices with all patches added | ||
to them will be provided as part of `k8s.io/dynamic-resource-allocation`. It | ||
will be based on informers such that evaluating the filter only is | ||
necessary when ResourceSlices or ResourceSlicePatches change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"newest one wins
With that you mean by creation timestamp or how is newer defined?
|
||
- `k8s.io/dynamic-resource-allocation/structured`: 91.3% | ||
- `k8s.io/kubernetes/pkg/apis/resource/validation`: 98.6% | ||
- `k8s.io/kubernetes/pkg/controller/tainteviction`: 81.8% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Earlier in the doc you've mentioned that this will be entirely new controller, only similar to tainteviction. If so why including tainteviction controller here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to copy the tainteviction controller, then modify it. My aim is to have better coverage than the original GA tainteviction controller, so it seemed like a useful baseline.
###### What happens if we reenable the feature if it was previously rolled back? | ||
|
||
It takes effect again for scheduling. | ||
Running applications are not affected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I might be missing something, but iiuc if I create a ResourceSlicePatch, which will modify a ResourceSlice with a taint forcing eviction (and in combination with the controller from the other KEP) does it mean we can potentially evict running applications? Also, a followup question, should ResourceSlicePatch be taken into account by the new devicetainteviction controller (ie. the new one introduced in the other KEP here)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(and in combination with the controller from the other KEP)
This section is focused exclusively on admin-controlled attributes. Those don't cause eviction. Basically ignore that taints exist and consider only the DRAAdminControlledDeviceAttributes.
Also, a followup question, should ResourceSlicePatch be taken into account by the new devicetainteviction controller (ie. the new one introduced in the other KEP here)?
Yes, absolutely. "Taints defined by an admin in a ResourceSlicePatch get added to the set of taints defined by the DRA driver in a ResourceSlice" is specified in the 5055 KEP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KEP 5055 also has "It takes effect again for scheduling and may evict pods." under "What happens if we reenable the feature if it was previously rolled back?", i.e. exactly the thing you asked about here.
Sorry for the confusion with the two KEPs, but I still think it's better than trying to do both features in one KEP (see #5034 (comment)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion with the two KEPs, but I still think it's better than trying to do both features in one KEP (see #5034 (comment)).
I'm fine treating them separately, in which case I'd probably say we should mention the fact that when 5055 is enabled this will evict pods. It's better to be explicit, than guess the interactions. Especially that we should assume someone is reading this document separately from the other one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine treating them separately, in which case I'd probably say we should mention the fact that when 5055 is enabled this will evict pods.
Makes sense. I've updated this section here and squashed:
Admin-controlled attributes take effect again for scheduling.
Running applications are not affected because allocations are
never updated once they are made.
Note that this is different for reenabling device taints (KEP 5055): that can
cause pod to get evicted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
|
||
###### What specific metrics should inform a rollback? | ||
|
||
<!-- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both here and in the other document, I'd suggest thinking through potential metrics for measuring how this affects the workloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some from the existing tainteviction controller and I am also thinking about how to expose performance of the patching mechanism, but would like to add that to the KEPs when targeting beta.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
the PRR
###### What happens if we reenable the feature if it was previously rolled back? | ||
|
||
It takes effect again for scheduling. | ||
Running applications are not affected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion with the two KEPs, but I still think it's better than trying to do both features in one KEP (see #5034 (comment)).
I'm fine treating them separately, in which case I'd probably say we should mention the fact that when 5055 is enabled this will evict pods. It's better to be explicit, than guess the interactions. Especially that we should assume someone is reading this document separately from the other one.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, pohly, soltysh The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
These are two different KEPs that provide two features that can be enabled and disabled independently. However, both use the same new ResourceSliceOverride type and thus get described and implemented together.
d096544
to
7990805
Compare
/lgtm |
One-line PR description: DRA: admin-controlled device attributes + taints
Issue links:
Other comments: first revision
/cc @johnbelamaric