
Conversation

@ajaysundark

  • One-line PR description: adding a new KEP
  • Other comments: Including feedback from the API review to include probing mechanisms as an inherent part of the design.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 17, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 17, 2025
@ajaysundark
Author

This design has been discussed with more folks, and below is the summary of the key feedback:

  1. The current design with a new, explicit API is likely not necessary for the identified use cases. The recommended path forward is to first explore a simpler design that does not require a new API. This decision can be revisited if a POC or use cases demonstrate that the simpler approach is impractical or introduces unforeseen complexities.
  2. There was a strong preference for using a node-local probing mechanism to report readiness. This approach is favored for high-fidelity signals and a better security posture compared to granting `nodes/status` patch permissions to multiple external agents.
  3. An alternative proposal based on global control (a CRD) for node readiness is undesirable due to the risk of large-scale impact from misconfiguration.
  4. Admins typically know readiness requirements before node provisioning, so mutable readiness gates are not necessary. The conditions themselves are what may change.
  5. It is important to differentiate, when handling readiness states, between an agent that is not yet present and an agent that is failing.

Member

@lmktfy left a comment

Overall, we have a lot of existing extension points to allow building something a lot like this, but out of tree.

We put in those extension points for a reason. So, I think we should:

  • build this out of tree
  • add our voices to calls for better add-on management

Comment on lines 560 to 562
### Initial Taints without a Central API

This approach uses `--register-with-taints` to apply multiple readiness taints at startup. Each component is then responsible for removing its own taint. This is less flexible and discoverable than a formal, versioned API for defining readiness requirements. In addition, it introduces operational complexity: every critical DaemonSet needs to tolerate every other potential readiness taint, which is unmanageable in practice when the components are managed by different teams or providers.
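
For illustration, registering such taints could look like the kubelet configuration sketch below (the taint keys are hypothetical examples, not names defined by this KEP); the `--register-with-taints` flag accepts an equivalent list.

```yaml
# Illustrative sketch only: the readiness taint keys below are hypothetical.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
  - key: readiness.k8s.io/network-pending
    effect: NoSchedule
  - key: readiness.k8s.io/storage-pending
    effect: NoSchedule
```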
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Initial Taints without a Central API
This approach uses `--register-with-taints` to apply multiple readiness taints at startup. Each component is then responsible for removing its own taint. This is less flexible and discoverable than a formal, versioned API for defining readiness requirements. In addition, it introduces operational complexity: every critical DaemonSet needs to tolerate every other potential readiness taint, which is unmanageable in practice when the components are managed by different teams or providers.
### Initial taints (replaced), with out-of-tree controller
This approach uses `--register-with-taints` to apply a single initial taint at startup. A controller then atomically sets a set of replacement taints (configured using a custom resource) and removes the initial taint.
For each replacement taint, each component is then responsible for removing its own taint.
This is easier to maintain (no in-tree code) but requires people to run an additional
controller in their cluster.

Author

The out-of-tree controller option for managing the readiness conditions is discussed as the next alternative. I added it here along with another top 'non-API' option that was considered: using 'prefixes' to identify readiness taints.

key: "readiness.k8s.io/network-pending"
effect: NoSchedule
```

Member

How about a CRD that defines a set of rules that map (custom) conditions to taints?

Yes, you can break your cluster with a single, misguided cluster-scoped policy, but we already have that in other places (eg ValidatingAdmissionPolicy).
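
For illustration, such a rule object might look something like the sketch below (the API group, kind, and fields are hypothetical, not an existing or proposed API):

```yaml
# Hypothetical sketch only: neither this API group nor this kind exists.
apiVersion: readiness.example.k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: network-readiness
spec:
  # While the named node condition is not True, ensure the taint below is
  # present; remove the taint once the condition becomes True.
  conditionType: network.k8s.io/CNIReady
  taint:
    key: readiness.k8s.io/network-pending
    effect: NoSchedule
```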

Author

Hi @lmktfy, thanks for the suggestion. Yes, I discussed this CRD, but it was turned down last time due to the risk of misconfiguration. I developed a proof-of-concept on this here: https://github.com/ajaysundark/node-readiness-gate-controller, and I am following up with SIG-Node on this further. I'll update the KEP accordingly.

```mermaid
Note over NA, CNI: Node-Agent Probes for Readiness
NA->>CNI: Probe for readiness (e.g., check health endpoint)
CNI-->>NA: Report Ready
NA->>N: Patch status.conditions:<br/>network.k8s.io/CNIReady=True
```
Member

We shouldn't make an assumption that network plugins use CNI.

Author

@lmktfy Thanks, I agree with you. CNI here was just an illustrative example; do you think it could be mentioned as a note for clarity here?
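
For context, the kind of condition a node-local agent would report might look like the sketch below (the condition type is illustrative only, echoing the example in the diagram above):

```yaml
# Illustrative only: an agent-reported Node condition; the condition type is an
# example from the diagram above, not a defined API constant.
status:
  conditions:
    - type: network.k8s.io/CNIReady
      status: "True"
      reason: PluginHealthy
      message: "network plugin health endpoint responded OK"
      lastHeartbeatTime: "2025-06-20T00:00:00Z"
      lastTransitionTime: "2025-06-20T00:00:00Z"
```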

@gnufied
Member

gnufied commented Jun 20, 2025

Overall, we have a lot of existing extension points to allow building something a lot like this, but out of tree.

I am not sure about that. If a node is reporting Ready from the kubelet and is fully initialized by the cloud provider (if there is one), then the scheduler, autoscaler, etc. will already consider the node ready for scheduling. At that point, it may already be too late to wait for the initialization of certain components (CSI drivers, in our case) for scheduling purposes. The only other option is a taint, which requires modifying all other components (including user workloads) that don't need the CSI driver to tolerate the taint. This is why we (sig-storage) would appreciate solving this via a proper in-tree solution.
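
To make the compatibility concern concrete, every workload (even one that never uses CSI volumes) would need to carry a toleration like the sketch below; the taint key is hypothetical.

```yaml
# Illustrative only: hypothetical taint key. Without this toleration, a workload
# that never touches CSI volumes would still be blocked from scheduling.
tolerations:
  - key: readiness.k8s.io/storage-pending
    operator: Exists
    effect: NoSchedule
```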

@lmktfy
Member

lmktfy commented Jun 20, 2025 via email

@ajaysundark ajaysundark mentioned this pull request Jul 8, 2025
@keithmattix
Member

+1 to @gnufied - we see this in the mesh/chained CNI space as well, where a cluster admin (who may or may not control the node startup conditions if they're using e.g. a cloud provider) wants to ensure new nodes aren't marked as ready until a network security daemon (like a mesh) is healthy. We've seen especially nasty situations with node restarts where the CNI/network is healthy, so the Node marks itself as ready without consulting this new daemon that was added after the cluster reached a stable state (and in the mesh case, we're secondary to the pod network). Having a way to dynamically register ourselves as a gate for readiness would alleviate some of these issues.

@lmktfy
Member

lmktfy commented Aug 28, 2025

+1 to @gnufied - we see this in the mesh/chained CNI space as well, where a cluster admin (who may or may not control the node startup conditions if they're using e.g. a cloud provider) wants to ensure new nodes aren't marked as ready until a network security daemon (like a mesh) is healthy. We've seen especially nasty situations with node restarts where the CNI/network is healthy, so the Node marks itself as ready without consulting this new daemon that was added after the cluster reached a stable state (and in the mesh case, we're secondary to the pod network). Having a way to dynamically register ourselves as a gate for readiness would alleviate some of these issues.

If taints aren't the right mechanism here, let's really clearly articulate why.

@keithmattix
Member

Ah right, forgot to add that. The major limitation of taints IMO is on node restart. Once the taint is removed post-readiness, something has to add it back as a gate, and a controller isn't sufficient for something like a node restart or any other sort of event where the daemon on the node may have its access to the apiserver impeded.

@gnufied
Member

gnufied commented Aug 28, 2025

If taints aren't the right mechanism here, let's really clearly articulate why.

I thought I already did above for CSI. Tainting the node is fundamentally backward-incompatible if not all workloads need the capability in question (say, CSI storage). Forcing all workloads to be modified before a cluster upgrade, etc., is untenable.

@lmktfy
Member

lmktfy commented Aug 28, 2025

The major limitation of taints IMO is on node restart. Once the taint is removed post-readiness, something has to add it back as a gate, and a controller isn't sufficient for something like a node restart or any other sort of event where the daemon on the node may have its access to the apiserver impeded.

There is a mechanism that may already be a great fit for this use case: MutatingAdmissionPolicy.

After a node restart, the kubelet can start up with a special taint, and then something removes that special taint (or, one of several startup taints). If we want to, we can use mutating admission to update the Node object. For example, if we see an update that is removing our important startup taint, but we don't have a current heartbeat as indicated in some annotation, we can add our own second taint purely via admission. It's a flexible way to achieve outcomes if we need it.

We would have to rely on kubelet here, but if kubelet can't be relied on to work as designed, that's a bigger problem. Similarly, we have to rely on kube-apiserver, but that's a required element of any conformant cluster.

Given that outline, I like that we can build that. It's something we can implement using beta Kubernetes features that are available right now.
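
A rough sketch of what that could look like, assuming a hypothetical startup taint key, gate taint key, and heartbeat annotation (a MutatingAdmissionPolicyBinding would also be needed, and the exact CEL may need adjustment):

```yaml
# Sketch only: taint keys and the heartbeat annotation are hypothetical.
apiVersion: admissionregistration.k8s.io/v1beta1  # v1alpha1 on clusters before the beta
kind: MutatingAdmissionPolicy
metadata:
  name: guard-startup-taint-removal
spec:
  failurePolicy: Fail
  reinvocationPolicy: Never
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["UPDATE"]
        resources: ["nodes"]
  matchConditions:
    # Fire only when the update removes the startup taint...
    - name: startup-taint-being-removed
      expression: >-
        has(oldObject.spec.taints) &&
        oldObject.spec.taints.exists(t, t.key == 'example.com/node-startup') &&
        (!has(object.spec.taints) ||
         !object.spec.taints.exists(t, t.key == 'example.com/node-startup'))
    # ...and our agent has not set its heartbeat annotation
    # (a real policy might also check how recent the heartbeat is).
    - name: no-agent-heartbeat
      expression: >-
        !has(object.metadata.annotations) ||
        !('example.com/agent-heartbeat' in object.metadata.annotations)
  mutations:
    # Append a second gate taint so the node stays unschedulable for us.
    - patchType: JSONPatch
      jsonPatch:
        expression: >-
          [JSONPatch{op: "add", path: "/spec/taints/-",
                     value: Object.spec.taints{key: "example.com/agent-pending",
                                               effect: "NoSchedule"}}]
```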

At the very least, let's list that approach as an alternative for the KEP. The PR says:

This approach uses --register-with-taints to apply multiple readiness taints at startup. Each component is then responsible for removing its own taint. This is less flexible and discoverable than a formal, versioned API for defining readiness requirements. In addition, it introduces operational complexity: every critical DaemonSet needs to tolerate every other potential readiness taint, which is unmanageable in practice when the components are managed by different teams or providers.

…but I'm not 100% convinced. I think "Improve the way tolerations work" is a viable alternative that deserves genuine consideration.


Why? https://xkcd.com/927/

XKCD comic "standards"

The more mechanisms we have to signal node readiness (conditions, taints, node readiness gates, …) the harder we make it for learners. I just don't see that as fair on them.

It may be more work for maintainers to make tolerations ergonomic. But that extra work scales much better than explaining this to hundreds or thousands of cluster admins.

@k8s-ci-robot k8s-ci-robot added area/enhancements Issues or PRs related to the Enhancements subproject sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Oct 6, 2025
@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Oct 6, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Oct 6, 2025
@k8s-ci-robot k8s-ci-robot added sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Oct 6, 2025
@k8s-ci-robot k8s-ci-robot added the sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. label Oct 6, 2025
@k8s-ci-robot k8s-ci-robot removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Oct 6, 2025
@ajaysundark
Author

/remove-sig instrumentation

@k8s-ci-robot k8s-ci-robot removed the sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. label Oct 6, 2025
@ajaysundark ajaysundark force-pushed the kep_5233_nodereadinessgates_v134 branch from 5818390 to cb2f881 Compare October 8, 2025 18:58
Member

@johnbelamaric left a comment

PRR - a few questions

@helayoty helayoty moved this to In Progress in SIG Scheduling Oct 15, 2025
@johnbelamaric
Member

Ok, I think PRR is fine now, I will approve once the SIG approval is in.

* **Alpha:**
  * Feature implemented behind `NodeReadinessGates` feature gate, default `false`.
  * Basic unit and integration tests implemented.
  * Initial API definition (`NodeSpec.readinessGates`) available.
Member

Do we have to start with the new field? Can we try it out without the new field and see if the approach works while in alpha?

Author

Thanks for the suggestion. I updated the KEP to not introduce a new API, but to rely on the existing `--register-with-taints` mechanism for declaring readiness taints. The proposal now suggests a new readiness controller that removes taints with the `readiness.k8s.io/` prefix when there are matching `NodeStatus.Conditions`.
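
For illustration, the pairing could look like the sketch below (the `storage` suffix is a hypothetical example): the node registers with a prefixed taint, a node-local agent reports a condition of the same name, and the controller removes the taint once the condition is `True`.

```yaml
# Illustrative only: hypothetical readiness taint and matching condition.
spec:
  taints:
    - key: readiness.k8s.io/storage       # registered via --register-with-taints
      effect: NoSchedule
status:
  conditions:
    - type: readiness.k8s.io/storage      # reported by a node-local agent (e.g. NPD)
      status: "True"
      reason: CSIDriverReady
```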

Member

@dom4ha left a comment

The changes look good from the sig-scheduling perspective. Since there are no scheduler changes, our approval is not needed.

@ajaysundark
Author

Based on feedback to validate the approach in Alpha first, this KEP has been updated to defer the introduction of a new API field. The proposal now leverages the existing `--register-with-taints` mechanism to implement readiness gates, allowing the core functionality to be tested without immediate API changes.

@ajaysundark
Author

ajaysundark commented Oct 16, 2025

The changes look good from the sig-scheduling perspective. Since there are no scheduler changes, our approval is not needed.

@dom4ha Thanks for the review. The proposal is now updated to not introduce a new API, but to reuse bootstrap taints for expressing readiness requirements. This will be handled by a new controller managing taints; there is no scheduler impact. I removed SIG-Scheduling from the approvers list as you suggested.

@ajaysundark
Author

To better align with other in-flight KEPs, I removed the proposal to add a new probing framework directly to the kubelet. This KEP focuses only on the NodeReadiness mechanism with a controller. NPD leveraging the new local Pod Readiness API (proposed in KEP-4188) to query and report node conditions is the more natural, preferred path for the readiness-reporting mechanism.

cc: @matthyx , @briansonnenberg

@ajaysundark
Author

/test pull-enhancements-verify

Member

@SergeyKanzhelev left a comment

I think this KEP is worth pursuing as a valuable experiment, scoped to an external controller and NPD publishing state. I am not sure we need a KEP to track this experiment.
