KEP-5233: proposal for NodeReadinessGates #5416
Conversation
ajaysundark commented on Jun 17, 2025
- One-line PR description: adding new KEP
- Issue link: Node Readiness Gates #5233
- Other comments: Including feedback from API review to include probing mechanisms as an inherent part of the design.
This design has been discussed with more folks, and below is the summary of the key feedback:
Overall, we have a lot of existing extension points to allow building something a lot like this, but out of tree.
We put in those extension points for a reason. So, I think we should:
- build this out of tree
- add our voices to calls for better add-on management
### Initial Taints without a Central API

This approach uses `--register-with-taints` to apply multiple readiness taints at startup. Each component is then responsible for removing its own taint. This is less flexible and discoverable than a formal, versioned API for defining the readiness requirements. It also adds operational complexity: every critical DaemonSet needs to tolerate every other potential readiness taint, which is unmanageable in practice when the components are managed by different teams or providers.
Suggested change:

### Initial taints (replaced), with out-of-tree controller

This approach uses `--register-with-taints` to apply a single initial taint at startup. A controller then atomically sets a set of replacement taints (configured using a custom resource) and removes the initial taint. For each replacement taint, each component is then responsible for removing its own taint. This is easier to maintain (no in-tree code) but requires people to run an additional controller in their cluster.
The out-of-tree controller option for managing the readiness conditions is discussed as the next alternative. I added this one here as another top 'non-API' option that was considered: using prefixes to identify readiness taints.
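To make the prefix-based option concrete, here is a hedged sketch of a node registering readiness taints at startup via the kubelet's existing `registerWithTaints` configuration (equivalently, the `--register-with-taints` flag); the specific taint keys are illustrative only.

```yaml
# KubeletConfiguration excerpt (kubelet.config.k8s.io/v1beta1).
# The readiness.k8s.io/ taint keys are illustrative; each owning component
# (or the proposed readiness controller) would remove its own taint once
# the corresponding readiness signal is observed.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
- key: readiness.k8s.io/network-pending
  effect: NoSchedule
- key: readiness.k8s.io/storage-pending
  effect: NoSchedule
# CLI equivalent:
#   --register-with-taints=readiness.k8s.io/network-pending:NoSchedule,readiness.k8s.io/storage-pending:NoSchedule
```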
```
key: "readiness.k8s.io/network-pending"
effect: NoSchedule
```
How about a CRD that defines a set of rules that map (custom) conditions to taints?
Yes, you can break your cluster with a single, misguided cluster-scoped policy, but we already have that in other places (eg ValidatingAdmissionPolicy).
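For illustration only, a rough sketch of what such a cluster-scoped CRD instance could look like; the API group, kind, and field names below are hypothetical and not part of the proposal.

```yaml
# Hypothetical custom resource mapping a Node condition to a readiness taint.
# A controller would keep the taint on matching nodes until the condition
# reports True, then remove it. All names here are illustrative.
apiVersion: readiness.example.com/v1alpha1
kind: NodeReadinessRule
metadata:
  name: network-readiness
spec:
  nodeSelector:                             # which nodes the rule applies to
    matchLabels:
      node-role.kubernetes.io/worker: ""
  conditionType: network.k8s.io/CNIReady    # condition that signals readiness
  taint:                                    # taint to hold until the condition is True
    key: readiness.k8s.io/network-pending
    effect: NoSchedule
```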
Hi @lmktfy, thanks for the suggestion. Yes, I discussed this CRD, but it was turned down last time due to the risk of misconfiguration. I developed a proof of concept here: https://github.com/ajaysundark/node-readiness-gate-controller, and I'm following up with SIG Node on it. I'll post updates here in the KEP.
```
Note over NA, CNI: Node-Agent Probes for Readiness
NA->>CNI: Probe for readiness (e.g., check health endpoint)
CNI-->>NA: Report Ready
NA->>N: Patch status.conditions:<br/>network.k8s.io/CNIReady=True
```
We shouldn't make an assumption that network plugins use CNI.
@lmktfy Thanks, I agree with you. CNI here was just an illustrative example, do you think it could be mentioned as a note for clarity here?
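For reference, the condition patched in the sequence above might look like the sketch below on the Node object; the condition type is the same illustrative example used in the diagram, and any network plugin (CNI or otherwise) could publish its own type.

```yaml
# Node.status.conditions excerpt after the node agent reports readiness.
# The condition type, reason, and timestamps are illustrative examples.
status:
  conditions:
  - type: network.k8s.io/CNIReady
    status: "True"
    reason: NetworkPluginReady
    message: "Network plugin reported a healthy data plane"
    lastHeartbeatTime: "2025-06-17T00:00:00Z"
    lastTransitionTime: "2025-06-17T00:00:00Z"
```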
I am not sure about that. If a node is reporting …
How sure are we that we couldn't report "this Node is all good to go, consider it healthy" using a (new) condition? That extension point already exists.
We can also tell autoscalers to expect that condition either on all nodes, or just on nodes with a certain label - and draw inferences if the condition is wholly absent.
For influencing the scheduler, we definitely have options. We can taint the node based on that condition (and optionally the label), which is the most obvious route. There are others. We can, in extremis, make a custom scheduler – I wouldn't.
To manage the new condition, we write a (new) controller. If people like it, that can become part of k-c-m.
As a supporting change, we can improve tolerations to take away any snags we find. If tolerations don't let us build this out-of-tree, let's fix tolerations. That makes it easier to teach, even if the eventual node readiness code moves in tree at some future point.
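For context on the toleration ergonomics mentioned here: with taints as the mechanism, each critical DaemonSet today has to tolerate every readiness taint explicitly, roughly as in the sketch below (taint keys illustrative); prefix- or wildcard-style tolerations are the kind of improvement being referred to.

```yaml
# DaemonSet pod template excerpt: each readiness taint key must currently be
# tolerated explicitly (or everything tolerated via an empty key with the
# Exists operator). Taint keys are illustrative.
spec:
  template:
    spec:
      tolerations:
      - key: readiness.k8s.io/network-pending
        operator: Exists
        effect: NoSchedule
      - key: readiness.k8s.io/storage-pending
        operator: Exists
        effect: NoSchedule
```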
+1 to @gnufied - we see this in the mesh/chained CNI space as well, where a cluster admin (who may or may not control the node startup conditions if they're using e.g. a cloud provider) wants to ensure new nodes aren't marked as ready until a network security daemon (like a mesh) is healthy. We've seen especially nasty situations with node restarts where the CNI/network is healthy, so the Node marks itself as ready without consulting this new daemon that was added after the cluster reached a stable state (and in the mesh case, we're secondary to the pod network). Having a way to dynamically register ourselves as a gate for readiness would alleviate some of these issues.
If taints aren't the right mechanism here, let's really clearly articulate why.
Ah right, forgot to add that. The major limitation of taints IMO is on node restart. Once the taint is removed post-readiness, something has to add it back as a gate, and a controller isn't sufficient for something like a node restart or any other sort of event where the daemon on the node may have its access to the apiserver impeded.
I thought I already did above for CSI. Tainting the node is fundamentally backward incompatible if all workloads don't need said capability (say, CSI storage). Forcing all workloads to be modified before a cluster upgrade, etc., is untenable.
There may already be a great fit for this use case: MutatingAdmissionPolicy. After a node restart, the kubelet can start up with a special taint, and then something removes that special taint (or one of several startup taints). If we want to, we can use mutating admission to update the Node object. For example, if we see an update that is removing our important startup taint, but we don't have a current heartbeat as indicated in some annotation, we can add our own second taint purely via admission. It's a flexible way to achieve outcomes if we need it.

We would have to rely on the kubelet here, but if the kubelet can't be relied on to work as designed, that's a bigger problem. Similarly, we have to rely on kube-apiserver, but that's a required element of any conformant cluster.

Given that outline, I like that we can build that. It's something we can implement using beta Kubernetes features that are available right now. At the very least, let's list that approach as an alternative in the KEP. The PR says:
…but I'm not 100% convinced. I think "Improve the way tolerations work" is a viable alternative that deserves genuine consideration.
The more mechanisms we have to signal node readiness (conditions, taints, node readiness gates, …) the harder we make it for learners. I just don't see that as fair on them. It may be more work for maintainers to make tolerations ergonomic, but that extra work scales much better than explaining this to hundreds or thousands of cluster admins.
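A heavily hedged sketch of that admission-based idea, assuming the alpha MutatingAdmissionPolicy API (`admissionregistration.k8s.io/v1alpha1`); the taint keys, the heartbeat annotation, and the CEL expressions are hypothetical, a corresponding policy binding is omitted, and field details may differ from the eventual API.

```yaml
# Hypothetical sketch: if an update removes the startup taint while no current
# readiness heartbeat annotation is present, add a second taint via admission.
# Null/omitted-field checks are elided for brevity.
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: MutatingAdmissionPolicy
metadata:
  name: node-readiness-guard
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["UPDATE"]
      resources: ["nodes"]
  matchConditions:
  - name: startup-taint-removed-without-heartbeat
    expression: >-
      oldObject.spec.taints.exists(t, t.key == 'readiness.k8s.io/startup') &&
      !object.spec.taints.exists(t, t.key == 'readiness.k8s.io/startup') &&
      !('readiness.k8s.io/heartbeat' in object.metadata.annotations)
  failurePolicy: Ignore
  mutations:
  - patchType: JSONPatch
    jsonPatch:
      expression: >-
        [JSONPatch{op: "add", path: "/spec/taints/-",
          value: Object.spec.taints{key: "readiness.k8s.io/not-verified", effect: "NoSchedule"}}]
```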
/remove-sig instrumentation
Force-pushed 5818390 to cb2f881
PRR - a few questions
Ok, I think PRR is fine now, I will approve once the SIG approval is in.
```
* **Alpha:**
  * Feature implemented behind `NodeReadinessGates` feature gate, default `false`.
  * Basic unit and integration tests implemented.
  * Initial API definition (`NodeSpec.readinessGates`) available.
```
Do we have to start with the new field? Can we try it out without the new field and see if the approach works while in alpha?
Thanks for the suggestion. I updated the KEP to not add a new API field, but to rely on the existing `--register-with-taints` mechanism for declaring readiness taints. The proposal now suggests a new readiness controller that manages removal of taints with the `readiness.k8s.io/` prefix when there are matching NodeStatus conditions.
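To illustrate the resulting convention, a hedged sketch of a Node carrying a prefixed readiness taint alongside the condition the controller would look for; the exact taint-to-condition matching rule is defined in the KEP, so the pairing below is only an example.

```yaml
# Node excerpt: the controller removes taints with the readiness.k8s.io/ prefix
# once it observes the matching condition. Taint key, condition type, and the
# way they are paired are illustrative.
spec:
  taints:
  - key: readiness.k8s.io/network-pending    # registered at startup via --register-with-taints
    effect: NoSchedule
status:
  conditions:
  - type: network.k8s.io/CNIReady            # reported by the node agent / NPD
    status: "True"
```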
The changes look good from the SIG Scheduling perspective. Since there are no scheduler changes, our approval is not needed.
Based on feedback to validate the approach in alpha first, this KEP has been updated to defer the introduction of a new API field. The proposal now leverages the existing `--register-with-taints` mechanism.
@dom4ha Thanks for the review. The proposal is now updated to not introduce a new API, but to reuse bootstrap taints for expressing readiness requirements. This will be handled by a new controller managing taints, so there's no scheduler impact. I removed SIG Scheduling from the approvers list as you suggested.
To better align with other in-flight KEPs, I removed the proposal to add a new probing framework directly to the kubelet. This KEP now focuses only on the node-readiness mechanism with a controller. NPD leveraging the new local Pod Readiness API (proposed in KEP-4188) to query and report node conditions is a more natural path for the readiness-reporting mechanism. cc: @matthyx, @briansonnenberg
/test pull-enhancements-verify
I think this KEP is worth pursuing as a valuable experiment, scoped to an external controller and NPD publishing state. I am not sure we need a KEP to track this experiment.
