With the current state of implementation in Kubernetes, the available workarounds to make the cluster aware of changes in a node's capacity are restarting the node or, at minimum, restarting the kubelet; neither approach has an established set of best practices to follow.
One of the major drawbacks of the current approach is timing: detecting the newly available resources is coupled to a kubelet restart. For example, in a cloud environment where a node administrator can scale up a node, it is hard to time the kubelet restart to the completion of the resize operation. Additionally, some cloud SDKs provide APIs to perform the resize operation, and coupling those APIs with a kubelet restart is not seamless.
However, this approach carries a few drawbacks:
- Introducing downtime for the existing and to-be-scheduled workloads on the cluster until the node is available again.
- For bare-metal clusters, it takes a significant amount of time for the nodes to become available again.
Hence, it is necessary to handle capacity updates gracefully across the cluster, rather than resetting the cluster components to achieve the same outcome.
Also, given that the Linux and Windows kernels already support live-resizing a node, making the kubelet aware of the underlying changes in the node's compute capacity removes the manual actions otherwise required of the Kubernetes administrator.
Node resource hot plugging proves advantageous in scenarios such as:
- Efficiently managing resource demands with a limited number of nodes by increasing the capacity of existing nodes instead of provisioning new ones, which also reduces overhead on the control plane.
- Provisioning new nodes is considerably more time-intensive than expanding the capacity of existing ones.
- Improved inter-pod network latency, since inter-node traffic is reduced when more pods can be hosted on a single node.
- Mitigating some of the existing limitations and issues associated with a node or kubelet restart.
Implementing this KEP will empower nodes to recognize and adapt to changes in their compute configuration, facilitating the efficient and effective deployment of pod workloads to nodes capable of meeting their compute demands.
Once the node has transitioned to the NotReady state, it will revert to the Ready state once the node's capacity is reconfigured to match or exceed the last valid configuration.
Here, a valid configuration refers to either the previous hot-plug capacity or, if there is no hot-plug history, the initial capacity of the node.
For workloads that are already running, if there are not enough resources available to accommodate them after a hot-unplug, the workloads may underperform, transition to the "Pending" state, or be migrated to a suitable node that meets their resource requirements.
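The recovery rule above can be sketched as follows. This is a hypothetical illustration, not the KEP's actual implementation: the function names, the plain-dict capacity representation, and the integer resource units are all assumptions made for the sketch.

```python
# Hypothetical sketch of the Ready/NotReady decision after a hot-unplug.
# A node recovers once its current capacity matches or exceeds the
# "last valid configuration": the previous hot-plug capacity, or the
# initial capacity if the node was never hot-plugged.

def last_valid_configuration(initial, hotplug_history):
    """Return the capacity the node must reach to become Ready again."""
    return hotplug_history[-1] if hotplug_history else initial

def can_transition_to_ready(current, initial, hotplug_history):
    """True if every resource matches or exceeds the last valid configuration."""
    target = last_valid_configuration(initial, hotplug_history)
    return all(current.get(res, 0) >= amount for res, amount in target.items())

# Node started with 4 CPUs / 8 GiB, was hot-plugged to 8 CPUs / 16 GiB,
# then hot-unplugged down to 6 CPUs / 16 GiB -> it stays NotReady.
initial = {"cpu": 4, "memory_gib": 8}
history = [{"cpu": 8, "memory_gib": 16}]
print(can_transition_to_ready({"cpu": 6, "memory_gib": 16}, initial, history))  # False
print(can_transition_to_ready({"cpu": 8, "memory_gib": 16}, initial, history))  # True
```

Note that the comparison is per-resource: restoring memory alone is not enough if CPU capacity is still below the last valid configuration.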