KEP-5233: proposal for NodeReadinessGates #5416
Conversation
ajaysundark commented on Jun 17, 2025
- One-line PR description: adding new KEP
- Issue link: Node Readiness Gates #5233
- Other comments: Including feedback from API review to include probing mechanisms as an inherent part of the design.
This design has been discussed with more folks, and below is the summary of the key feedback:
Overall, we have a lot of existing extension points to allow building something a lot like this, but out of tree.
We put in those extension points for a reason. So, I think we should:
- build this out of tree
- add our voices to calls for better add-on management
### Initial Taints without a Central API

This approach uses `--register-with-taints` to apply multiple readiness taints at startup. Each component is then responsible for removing its own taint. This is less flexible and discoverable than a formal, versioned API for defining the readiness requirements. It also adds operational complexity: every critical DaemonSet needs to tolerate every other potential readiness taint, which is unmanageable in practice when the components are managed by different teams or providers.
Suggested change:

### Initial taints (replaced), with out-of-tree controller

This approach uses `--register-with-taints` to apply a single initial taint at startup. A controller then atomically sets a set of replacement taints (configured using a custom resource) and removes the initial taint. For each replacement taint, each component is then responsible for removing its own taint. This is easier to maintain (no in-tree code) but requires people to run an additional controller in their cluster.
The out-of-tree controller option for managing the readiness conditions is discussed as the next alternative. I added this one here as another top 'non-API' option that was considered: using prefixes to identify readiness taints.
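To make the prefix-based option concrete, here is a hedged sketch of a node registering readiness taints at startup via the kubelet's existing `registerWithTaints` configuration (equivalently, the `--register-with-taints` flag); the specific taint keys are illustrative only.

```yaml
# KubeletConfiguration excerpt (kubelet.config.k8s.io/v1beta1).
# The readiness.k8s.io/ taint keys are illustrative; each owning component
# (or the proposed readiness controller) would remove its own taint once
# the corresponding readiness signal is observed.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
- key: readiness.k8s.io/network-pending
  effect: NoSchedule
- key: readiness.k8s.io/storage-pending
  effect: NoSchedule
# CLI equivalent:
#   --register-with-taints=readiness.k8s.io/network-pending:NoSchedule,readiness.k8s.io/storage-pending:NoSchedule
```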
```
key: "readiness.k8s.io/network-pending"
effect: NoSchedule
```
How about a CRD that defines a set of rules that map (custom) conditions to taints?
Yes, you can break your cluster with a single, misguided cluster-scoped policy, but we already have that in other places (eg ValidatingAdmissionPolicy).
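For illustration only, a rough sketch of what such a cluster-scoped CRD instance could look like; the API group, kind, and field names below are hypothetical and not part of the proposal.

```yaml
# Hypothetical custom resource mapping a Node condition to a readiness taint.
# A controller would keep the taint on matching nodes until the condition
# reports True, then remove it. All names here are illustrative.
apiVersion: readiness.example.com/v1alpha1
kind: NodeReadinessRule
metadata:
  name: network-readiness
spec:
  nodeSelector:                             # which nodes the rule applies to
    matchLabels:
      node-role.kubernetes.io/worker: ""
  conditionType: network.k8s.io/CNIReady    # condition that signals readiness
  taint:                                    # taint to hold until the condition is True
    key: readiness.k8s.io/network-pending
    effect: NoSchedule
```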
Hi @lmktfy, thanks for the suggestion. Yes, I discussed this CRD, but it was turned down last time due to the risk of misconfiguration. I developed a proof of concept here: https://github.com/ajaysundark/node-readiness-gate-controller, and I'm following up with SIG Node on it. I'll post updates here in the KEP.
```
Note over NA, CNI: Node-Agent Probes for Readiness
NA->>CNI: Probe for readiness (e.g., check health endpoint)
CNI-->>NA: Report Ready
NA->>N: Patch status.conditions:<br/>network.k8s.io/CNIReady=True
```
We shouldn't make an assumption that network plugins use CNI.
@lmktfy Thanks, I agree with you. CNI here was just an illustrative example, do you think it could be mentioned as a note for clarity here?
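For reference, the condition patched in the sequence above might look like the sketch below on the Node object; the condition type is the same illustrative example used in the diagram, and any network plugin (CNI or otherwise) could publish its own type.

```yaml
# Node.status.conditions excerpt after the node agent reports readiness.
# The condition type, reason, and timestamps are illustrative examples.
status:
  conditions:
  - type: network.k8s.io/CNIReady
    status: "True"
    reason: NetworkPluginReady
    message: "Network plugin reported a healthy data plane"
    lastHeartbeatTime: "2025-06-17T00:00:00Z"
    lastTransitionTime: "2025-06-17T00:00:00Z"
```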
I am not sure about that. If a node is reporting …
How sure are we that we couldn't report "this Node is all good to go, consider it healthy" using a (new) condition? That extension point already exists.
We can also tell autoscalers to expect that condition either on all nodes, or just on nodes with a certain label - and draw inferences if the condition is wholly absent.
For influencing the scheduler, we definitely have options. We can taint the node based on that condition (and optionally the label), which is the most obvious route. There are others. We can, in extremis, make a custom scheduler – I wouldn't.
To manage the new condition, we write a (new) controller. If people like it, that can become part of k-c-m.
As a supporting change, we can improve tolerations to take away any snags we find. If tolerations don't let us build this out-of-tree, let's fix tolerations. That makes it easier to teach, even if the eventual node readiness code moves in tree at some future point.
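For context on the toleration ergonomics mentioned here: with taints as the mechanism, each critical DaemonSet today has to tolerate every readiness taint explicitly, roughly as in the sketch below (taint keys illustrative); prefix- or wildcard-style tolerations are the kind of improvement being referred to.

```yaml
# DaemonSet pod template excerpt: each readiness taint key must currently be
# tolerated explicitly (or everything tolerated via an empty key with the
# Exists operator). Taint keys are illustrative.
spec:
  template:
    spec:
      tolerations:
      - key: readiness.k8s.io/network-pending
        operator: Exists
        effect: NoSchedule
      - key: readiness.k8s.io/storage-pending
        operator: Exists
        effect: NoSchedule
```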
+1 to @gnufied - we see this in the mesh/chained CNI space as well, where a cluster admin (who may or may not control the node startup conditions if they're using e.g. a cloud provider) wants to ensure new nodes aren't marked as ready until a network security daemon (like a mesh) is healthy. We've seen especially nasty situations with node restarts where the CNI/network is healthy, so the Node marks itself as ready without consulting this new daemon that was added after the cluster reached a stable state (and in the mesh case, we're secondary to the pod network). Having a way to dynamically register ourselves as a gate for readiness would alleviate some of these issues.
If taints aren't the right mechanism here, let's really clearly articulate why.
Ah right, forgot to add that. The major limitation of taints IMO is on node restart. Once the taint is removed post-readiness, something has to add it back as a gate, and a controller isn't sufficient for something like a node restart or any other sort of event where the daemon on the node may have its access to the apiserver impeded.
I thought I already did above for CSI. Tainting the node is fundamentally backward incompatible if all workloads don't need said capability (say, CSI storage). Forcing all workloads to be modified before a cluster upgrade, etc., is untenable.
There may already be a great fit for this use case: MutatingAdmissionPolicy. After a node restart, the kubelet can start up with a special taint, and then something removes that special taint (or one of several startup taints). If we want to, we can use mutating admission to update the Node object. For example, if we see an update that is removing our important startup taint, but we don't have a current heartbeat as indicated in some annotation, we can add our own second taint purely via admission. It's a flexible way to achieve outcomes if we need it.

We would have to rely on the kubelet here, but if the kubelet can't be relied on to work as designed, that's a bigger problem. Similarly, we have to rely on kube-apiserver, but that's a required element of any conformant cluster.

Given that outline, I like that we can build that. It's something we can implement using beta Kubernetes features that are available right now. At the very least, let's list that approach as an alternative in the KEP. The PR says:
…but I'm not 100% convinced. I think "Improve the way tolerations work" is a viable alternative that deserves genuine consideration.
The more mechanisms we have to signal node readiness (conditions, taints, node readiness gates, …) the harder we make it for learners. I just don't see that as fair on them. It may be more work for maintainers to make tolerations ergonomic, but that extra work scales much better than explaining this to hundreds or thousands of cluster admins.
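A heavily hedged sketch of that admission-based idea, assuming the alpha MutatingAdmissionPolicy API (`admissionregistration.k8s.io/v1alpha1`); the taint keys, the heartbeat annotation, and the CEL expressions are hypothetical, a corresponding policy binding is omitted, and field details may differ from the eventual API.

```yaml
# Hypothetical sketch: if an update removes the startup taint while no current
# readiness heartbeat annotation is present, add a second taint via admission.
# Null/omitted-field checks are elided for brevity.
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: MutatingAdmissionPolicy
metadata:
  name: node-readiness-guard
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["UPDATE"]
      resources: ["nodes"]
  matchConditions:
  - name: startup-taint-removed-without-heartbeat
    expression: >-
      oldObject.spec.taints.exists(t, t.key == 'readiness.k8s.io/startup') &&
      !object.spec.taints.exists(t, t.key == 'readiness.k8s.io/startup') &&
      !('readiness.k8s.io/heartbeat' in object.metadata.annotations)
  failurePolicy: Ignore
  mutations:
  - patchType: JSONPatch
    jsonPatch:
      expression: >-
        [JSONPatch{op: "add", path: "/spec/taints/-",
          value: Object.spec.taints{key: "readiness.k8s.io/not-verified", effect: "NoSchedule"}}]
```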
/remove-sig instrumentation
Force-pushed 5818390 to cb2f881
PRR - a few questions
Ok, I think PRR is fine now, I will approve once the SIG approval is in.
```
* **Alpha:**
  * Feature implemented behind `NodeReadinessGates` feature gate, default `false`.
  * Basic unit and integration tests implemented.
  * Initial API definition (`NodeSpec.readinessGates`) available.
```
Do we have to start with the new field? Can we try it out without the new field and see if the approach works while in alpha?
Thanks for the suggestion. I updated the KEP to not add a new API field, but to rely on the existing `--register-with-taints` mechanism for declaring readiness taints. The proposal now suggests a new readiness controller that manages removal of taints with the `readiness.k8s.io/` prefix when there are matching NodeStatus conditions.
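To illustrate the resulting convention, a hedged sketch of a Node carrying a prefixed readiness taint alongside the condition the controller would look for; the exact taint-to-condition matching rule is defined in the KEP, so the pairing below is only an example.

```yaml
# Node excerpt: the controller removes taints with the readiness.k8s.io/ prefix
# once it observes the matching condition. Taint key, condition type, and the
# way they are paired are illustrative.
spec:
  taints:
  - key: readiness.k8s.io/network-pending    # registered at startup via --register-with-taints
    effect: NoSchedule
status:
  conditions:
  - type: network.k8s.io/CNIReady            # reported by the node agent / NPD
    status: "True"
```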
The changes look good from the SIG Scheduling perspective. Since there are no scheduler changes, our approval is not needed.
Based on feedback to validate the approach in alpha first, this KEP has been updated to defer the introduction of a new API field. The proposal now leverages the existing `--register-with-taints` mechanism.
@dom4ha Thanks for the review. The proposal is now updated to not introduce a new API, but to reuse bootstrap taints for expressing readiness requirements. This will be handled by a new controller managing taints, so there's no scheduler impact. I removed SIG Scheduling from the approvers list as you suggested.
To better align with other in-flight KEPs, I removed the proposal to add a new probing framework directly to the kubelet. This KEP now focuses only on the node-readiness mechanism with a controller. NPD leveraging the new local Pod Readiness API (proposed in KEP-4188) to query and report node conditions is a more natural path for the readiness-reporting mechanism. cc: @matthyx, @briansonnenberg
/test pull-enhancements-verify
I think this KEP is worth pursuing as a valuable experiment, scoped to an external controller and NPD publishing state. I am not sure we need a KEP to track this experiment.
