Description
In the current code, canResizePod() only checks the node's allocatable resources and then calls canAdmitPod(); it does not check the node's available resources.
For a container resize, if the node's available resources are insufficient, the resize should enter the Deferred status and stop there. With the current logic, it still proceeds to canAdmitPod, which allocates CPU and memory for the container.
In addition, with cpu-manager-policy=static, canAdmitPod allocates a cpuset for a Guaranteed pod first, and only afterwards are the node's available resources verified.
This creates a potential error for Guaranteed pod resizes when the node's available resources are insufficient.
Consider the following scenario:
Test configuration
- Node total CPUs: 20
- Reserved CPUs: 0, 11
- Allocatable CPUs: 18
- Total CPU requests on the node: 1100 m
- Available CPUs on the node: 18000 m − 1100 m = 16900 m
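The arithmetic in this configuration can be sketched as a small Go program. This is only an illustration of the numbers above: `availableMilliCPU` and `fitsNode` are hypothetical helper names, not kubelet functions.

```go
package main

import "fmt"

// availableMilliCPU returns the CPU not yet requested on a node, in millicores.
// Hypothetical helper for illustration; not part of the kubelet.
func availableMilliCPU(allocatableMilli, requestedMilli int64) int64 {
	return allocatableMilli - requestedMilli
}

// fitsNode reports whether a request of wantMilli fits into availableMilli.
// Hypothetical helper for illustration; not part of the kubelet.
func fitsNode(wantMilli, availableMilli int64) bool {
	return wantMilli <= availableMilli
}

func main() {
	// 18 allocatable CPUs, 1100m already requested by other pods.
	avail := availableMilliCPU(18*1000, 1100)
	fmt.Println(avail) // 16900

	// A 17-core request does not fit into 16900m.
	fmt.Println(fitsNode(17000, avail)) // false
}
```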
Step 1: Create container #0 with a CPU request of 16 cores. The assigned cpuset is 1-8,11-18.
Step 2: Resize container #0, increasing the request from 16 cores to 17 cores.
- canAdmitPod allocates a new cpuset 1-9,11-18 and writes it to the container's cgroup (cpuset.cpus: 1-9,11-18).
- Afterwards, the node‑resource check fails because the required 17000 m exceeds the available 16900 m, causing the resize to enter Deferred status.
The problem is that, although the resize is deferred due to insufficient node resources, the new cpuset has already been applied to the container’s cgroup. This leaves the container with a resource allocation that the node cannot actually provide, which is undesirable.
Suggested improvements:
Perform the node available resource check before canAdmitPod, so that the resize enters the Deferred status early. This avoids unnecessary work in canAdmitPod and resolves the potential error in Guaranteed pod resizes.
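The difference between the two orderings can be sketched with a toy model in Go. This is not kubelet code: the `node` type, the `admit` method standing in for canAdmitPod, and the `Admitted` status are all assumptions made for illustration. Only the Deferred status and the numbers come from the scenario above.

```go
package main

import "fmt"

type resizeStatus string

const (
	statusDeferred resizeStatus = "Deferred"
	statusAdmitted resizeStatus = "Admitted" // hypothetical placeholder status
)

// node is a toy model, not a kubelet type.
type node struct {
	availableMilli int64  // CPU available to the resized pod, in millicores
	cpuset         string // cpuset currently written to the container's cgroup
}

// admit stands in for canAdmitPod under the static CPU manager policy.
// Its side effect is that the new cpuset is applied.
func (n *node) admit(newCPUSet string) {
	n.cpuset = newCPUSet
}

// resizeCurrent mirrors the order described in the issue:
// admit first, check available resources afterwards.
func (n *node) resizeCurrent(wantMilli int64, newCPUSet string) resizeStatus {
	n.admit(newCPUSet)
	if wantMilli > n.availableMilli {
		return statusDeferred
	}
	return statusAdmitted
}

// resizeProposed checks available resources first and only then admits.
func (n *node) resizeProposed(wantMilli int64, newCPUSet string) resizeStatus {
	if wantMilli > n.availableMilli {
		return statusDeferred
	}
	n.admit(newCPUSet)
	return statusAdmitted
}

func main() {
	a := &node{availableMilli: 16900, cpuset: "1-8,11-18"}
	fmt.Println(a.resizeCurrent(17000, "1-9,11-18"), a.cpuset)
	// Deferred 1-9,11-18  (cpuset changed although the resize was deferred)

	b := &node{availableMilli: 16900, cpuset: "1-8,11-18"}
	fmt.Println(b.resizeProposed(17000, "1-9,11-18"), b.cpuset)
	// Deferred 1-8,11-18  (cpuset left untouched)
}
```

In this sketch both orderings return Deferred, but only the proposed ordering leaves the original cpuset in place.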