Questions on AEP-4016: Support for in place updates in VPA #7722

maxcao13 · 2025-01-17T20:08:03Z

Which component are you using?:
/area vertical-pod-autoscaler

Now that support for feature gated InPlacePodVerticalScaling has almost reached Beta graduation in the kubernetes API, it draws nearer that the details of AEP-4016 should be clear, since early development has already started in implementing the AEP and bringing the in-place updates to the VPA: #7673

There are some questions I still have about the AEP, namely about disruptionless/disruptive updates.

InPlaceOnly and InPlaceOrRecreate will attempt to apply a disruption-free update in place if it meets at least one of the following conditions:

Quick OOM,

Outside recommended range,

Significant change.

InPlaceOnly and InPlaceOrRecreate will attempt to apply updates that are not disruption-free in place under the same conditions that apply to updates in the Recreate mode.

Since the conditions are different for both types of actuation, how can we actually make sure the first set of actions will actually be disruption free? I think this problem requires the need for a third resizePolicy MustNotRestart as mentioned in the AEP. However, I asked in the sig-node-inplace-pod-resize slack channel about the possibility of this, and it seems it is impossible for now from a response from Vinay:

Since the actuation of container resize is handled by the container runtime, there's no way to guarantee that the runtime won't restart the container in the context of a resize request.

There is also no way ahead of time to know if your resize request will cause a container restart either.

I believe causes some problems about the AEP-4016 details, as there is only best-effort disrupionless updates. If I am understanding correctly, we should always assume worst case disruptive updates, meaning we will have to apply in-place updates only under the same conditions that apply in Recreate mode, and that this requires amendment to the AEP.

Please let me know if I assuming incorrectly, or if there are any thoughts on this.

The text was updated successfully, but these errors were encountered:

adrianmoisey · 2025-01-18T13:57:13Z

/area vertical-pod-autoscaler

adrianmoisey · 2025-01-19T13:20:05Z

However, I asked in the sig-node-inplace-pod-resize slack channel about the possibility of this, and it seems it is impossible for now from a response from Vinay:

Since the actuation of container resize is handled by the container runtime, there's no way to guarantee that the runtime won't restart the container in the context of a resize request.

Link: https://kubernetes.slack.com/archives/C06FSK01BGU/p1737064355023329

adrianmoisey · 2025-01-19T13:40:43Z

Since the conditions are different for both types of actuation, how can we actually make sure the first set of actions will actually be disruption free?

Could you clarify which conditions are you referring to for which types of actuations? Are you referring to the conditions for Auto and Recreate vs InPlaceOnly and InPlaceOrRecreate ?

Ie, CanEvict is checked for Auto and Recreate but not for InPlaceOnly and InPlaceOrRecreate?

There is also no way ahead of time to know if your resize request will cause a container restart either.

I get the feeling that the AEP wasn't written with this in mind. I'm of the opinion that it needs to be updated.

I see this text in the KEP:

Note: NotRequired restart policy for resize does not guarantee that a container
won't be restarted. The runtime may choose to stop the container if it is unable to
apply the new resources without restarts.

Which confirms what Vinay says.

maxcao13 · 2025-01-20T06:49:24Z

Thanks for the response. After re-reading the AEP, I think I realized where my confusion is coming from.

The reason why I thought we should know an update would restart or not is because I thought the decision of disruption/disruption-less dictated the conditions of applying the updates.

The disruptionless actuation conditions are actually a subset of the disruptive actuation conditions. When we "attempt" a disruptionless update, it would potentially actually be disruptive. So it's possible that this disruptive update will "sneak through", regardless of the other disruptive actuation conditions were actually met (namely CanEvict being true, and long-lived pods with significant change). But according to the AEP, I guess we shouldn't really care? I guess I am just confused about if we should care.

Could you clarify which conditions are you referring to for which types of actuations? Are you referring to the conditions for Auto and Recreate vs InPlaceOnly and InPlaceOrRecreate ?

I guess I am mostly referring to InPlaceOrRecreate disruptive vs. disruptionless conditions. Disruptionless conditoins are:
Quick OOM, Outside recommended range, Significant change and disruptive is the same as Auto/Recreate which is basically the same as disruptionless except with CanEvict == true and the significant change has to be on a "long-lived" pod.

EDIT: Disruptive updates can apparently also ignore pod disruption budgets, so that is another point of contention on whether the kubelet or the VPA should potentially handle this.

maxcao13 · 2025-01-20T06:59:57Z

And another question is related to the test plan:

Test Plan
The following test scenarios will be added to e2e tests. Both InPlaceOnly and InPlaceOrRecreate modes will be tested and they should behave the same:

Disruption-free in-place update applied to all containers of a pod (request in recommendation bounds).

Does this mean we have to guarantee that this update is disruption free in the test (cause we can't)? Or are we just simply checking that it wasn't updated through eviction, and that the actuation conditions for disruption-free were met (meaning that containers could've restarted!)

And same thing with this test:

Disruptive in-place update applied to all containers of a pod (request out of recommendation bounds).

Does this just mean that all the containers will have RestartContainer resize policies and we have to check all containers actually restarted? And why does it matter if the request was out of recommendation bounds? This condition is in both disruption-free vs disruptionless: are we actually supposed to write a test matrix with every condition?

k8s-ci-robot added the area/vertical-pod-autoscaler label Jan 18, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Questions on AEP-4016: Support for in place updates in VPA #7722

Questions on AEP-4016: Support for in place updates in VPA #7722

maxcao13 commented Jan 17, 2025 •

edited

Loading

adrianmoisey commented Jan 18, 2025

adrianmoisey commented Jan 19, 2025

adrianmoisey commented Jan 19, 2025

maxcao13 commented Jan 20, 2025 •

edited

Loading

maxcao13 commented Jan 20, 2025 •

edited

Loading

Questions on AEP-4016: Support for in place updates in VPA #7722

Questions on AEP-4016: Support for in place updates in VPA #7722

Comments

maxcao13 commented Jan 17, 2025 • edited Loading

adrianmoisey commented Jan 18, 2025

adrianmoisey commented Jan 19, 2025

adrianmoisey commented Jan 19, 2025

maxcao13 commented Jan 20, 2025 • edited Loading

maxcao13 commented Jan 20, 2025 • edited Loading

maxcao13 commented Jan 17, 2025 •

edited

Loading

maxcao13 commented Jan 20, 2025 •

edited

Loading

maxcao13 commented Jan 20, 2025 •

edited

Loading