Commit 5b27492

Address review comments
Co-authored-by: kishen-v <[email protected]>
1 parent 157a1ca commit 5b27492

1 file changed

+8
-2
lines changed
  • keps/sig-node/3953-node-resource-hot-plug

keps/sig-node/3953-node-resource-hot-plug/README.md

Lines changed: 8 additions & 2 deletions
@@ -111,6 +111,10 @@ to be force migrated to a different node, causing a temporary spike in the CPU/M
111111
With the current implementation in Kubernetes, the available workarounds for making the cluster aware of changes in its capacity are
112112
restarting the node or at least restarting the kubelet, for which there is no established set of best practices to follow.
113113

114+
One of the major drawbacks of the current system is the timing and coupling of the kubelet restart needed to detect the available resources. For example, in a cloud environment, if a node admin scales up the node,
115+
it may be hard to time the completion of the resize operation for the added capacity and the kubelet restart that must follow it.
116+
Also, although a cloud SDK may provide a few APIs to perform the resize operation, coupling it with a kubelet restart is not seamless.
117+
114118
However, this approach does carry a few drawbacks such as
115119
- Introducing a downtime for the existing/to-be-scheduled workloads on the cluster until the node is available.
116120
- For bare-metal clusters, it takes a significant amount of time for the nodes to become available.
@@ -121,16 +125,16 @@ However, this approach does carry a few drawbacks such as
121125
- https://github.com/kubernetes/kubernetes/issues/125579
122126
- https://github.com/kubernetes/kubernetes/issues/127793
123127

128+
124129
Hence, it is necessary to handle capacity updates gracefully across the cluster, rather than resetting the cluster components to achieve the same outcome.
125130

126131
Also, given that the capability to live-resize a node exists in the Linux and Windows kernels, enabling the kubelet to be aware of the underlying changes in the node's compute capacity will reduce the actions that must be taken
127132
by the Kubernetes administrator.
128133

129134
Node resource hot plugging proves advantageous in scenarios such as:
130-
- Efficiently managing resource demands with a limited number of nodes by increasing the capacity of existing nodes instead of provisioning new ones.
135+
- Efficiently managing resource demands with a limited number of nodes by increasing the capacity of existing nodes instead of provisioning new ones, which places less overhead on the control plane.
131136
- The procedure of establishing new nodes is considerably more time-intensive than expanding the capabilities of current nodes.
132137
- Improved inter-pod network latencies as the inter-node traffic can be reduced if more pods can be hosted on a single node.
133-
- Easier to manage the cluster with fewer nodes, which brings less overhead on the control-plane
134138
- Mitigate a few of the existing limitations/issues that are associated with a node/kubelet restart.
135139

136140
Implementing this KEP will empower nodes to recognize and adapt to changes in their compute configurations, facilitating the efficient and effective deployment of pod workloads to nodes capable of meeting the required compute demands.
@@ -347,6 +351,8 @@ is lesser than the initial capacity of the node. This is only to point at the fa
347351
Once the node has transitioned to the NotReady state, it will revert to the Ready state once the node's capacity is reconfigured to match or exceed the last valid configuration.
348352
In this case, valid configuration refers to either the previous hot-plug capacity or, if there is no history of hot-plug, the initial capacity.
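
The readiness rule in this hunk can be sketched as a small comparison. This is only an illustration under stated assumptions, not the kubelet's implementation: the `Capacity` type, its field names, and the functions `last_valid_capacity` and `node_is_ready` are hypothetical, and "match or exceed" is assumed to be evaluated per resource.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capacity:
    """Node compute capacity. Field names are illustrative, not a Kubernetes API."""
    cpu_millis: int
    memory_bytes: int

    def covers(self, other: "Capacity") -> bool:
        # True when every resource matches or exceeds the one in `other`.
        return (self.cpu_millis >= other.cpu_millis
                and self.memory_bytes >= other.memory_bytes)


def last_valid_capacity(initial: Capacity,
                        last_hot_plug: Optional[Capacity]) -> Capacity:
    # The previous hot-plug capacity if one was recorded,
    # otherwise the capacity the node started with.
    return last_hot_plug if last_hot_plug is not None else initial


def node_is_ready(current: Capacity,
                  initial: Capacity,
                  last_hot_plug: Optional[Capacity] = None) -> bool:
    # Assumed rule: the node is Ready only while its current capacity
    # matches or exceeds the last valid configuration.
    return current.covers(last_valid_capacity(initial, last_hot_plug))
```

With no hot-plug history the comparison is against the initial capacity; once a hot-plug has been recorded, dropping back to the initial capacity would keep the node NotReady under this assumed rule.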
349353

354+
If workloads are already running and there are not enough resources available to accommodate them after a hot-unplug, the workloads may underperform, transition to the "Pending" state, or be migrated to a suitable node that meets the workload's resource requirements.
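
Whether running workloads still fit after a hot-unplug can be checked by comparing summed resource requests against the reduced capacity. The sketch below is a simplified illustration, not the KEP's algorithm: `request_shortfall` is a hypothetical helper, resources are plain integers (CPU in millicores, memory in bytes), and it compares against raw capacity, ignoring any system reservations.

```python
from typing import Dict, Iterable


def request_shortfall(pod_requests: Iterable[Dict[str, int]],
                      capacity: Dict[str, int]) -> Dict[str, int]:
    """Return, per resource, how far total requests exceed the capacity.

    An empty result means every running pod's requests still fit.
    """
    totals: Dict[str, int] = {}
    for requests in pod_requests:
        for name, amount in requests.items():
            totals[name] = totals.get(name, 0) + amount
    # Keep only the resources whose total request exceeds what is available.
    return {name: total - capacity.get(name, 0)
            for name, total in totals.items()
            if total > capacity.get(name, 0)}
```

A non-empty result identifies the resources for which the outcomes described above (underperformance, a "Pending" state, or migration) become possible; which outcome occurs is outside the scope of this sketch.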
355+
350356
#### Flow Control
351357

352358
```
