Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions on AEP-4016: Support for in place updates in VPA #7722

Open
maxcao13 opened this issue Jan 17, 2025 · 5 comments
Open

Questions on AEP-4016: Support for in place updates in VPA #7722

maxcao13 opened this issue Jan 17, 2025 · 5 comments

Comments

@maxcao13
Copy link

maxcao13 commented Jan 17, 2025

Which component are you using?:
/area vertical-pod-autoscaler

Now that support for feature gated InPlacePodVerticalScaling has almost reached Beta graduation in the kubernetes API, it draws nearer that the details of AEP-4016 should be clear, since early development has already started in implementing the AEP and bringing the in-place updates to the VPA: #7673

There are some questions I still have about the AEP, namely about disruptionless/disruptive updates.

InPlaceOnly and InPlaceOrRecreate will attempt to apply a disruption-free update in place if it meets at least one of the following conditions:

  • Quick OOM,
  • Outside recommended range,
  • Significant change.

InPlaceOnly and InPlaceOrRecreate will attempt to apply updates that are not disruption-free in place under the same conditions that apply to updates in the Recreate mode.

Since the conditions are different for both types of actuation, how can we actually make sure the first set of actions will actually be disruption free? I think this problem requires the need for a third resizePolicy MustNotRestart as mentioned in the AEP. However, I asked in the sig-node-inplace-pod-resize slack channel about the possibility of this, and it seems it is impossible for now from a response from Vinay:

Since the actuation of container resize is handled by the container runtime, there's no way to guarantee that the runtime won't restart the container in the context of a resize request.

There is also no way ahead of time to know if your resize request will cause a container restart either.

I believe causes some problems about the AEP-4016 details, as there is only best-effort disrupionless updates. If I am understanding correctly, we should always assume worst case disruptive updates, meaning we will have to apply in-place updates only under the same conditions that apply in Recreate mode, and that this requires amendment to the AEP.

Please let me know if I assuming incorrectly, or if there are any thoughts on this.

@adrianmoisey
Copy link
Member

/area vertical-pod-autoscaler

@adrianmoisey
Copy link
Member

However, I asked in the sig-node-inplace-pod-resize slack channel about the possibility of this, and it seems it is impossible for now from a response from Vinay:

Since the actuation of container resize is handled by the container runtime, there's no way to guarantee that the runtime won't restart the container in the context of a resize request.

Link: https://kubernetes.slack.com/archives/C06FSK01BGU/p1737064355023329

@adrianmoisey
Copy link
Member

Since the conditions are different for both types of actuation, how can we actually make sure the first set of actions will actually be disruption free?

Could you clarify which conditions are you referring to for which types of actuations? Are you referring to the conditions for Auto and Recreate vs InPlaceOnly and InPlaceOrRecreate ?

Ie, CanEvict is checked for Auto and Recreate but not for InPlaceOnly and InPlaceOrRecreate?

There is also no way ahead of time to know if your resize request will cause a container restart either.

I get the feeling that the AEP wasn't written with this in mind. I'm of the opinion that it needs to be updated.

I see this text in the KEP:

Note: NotRequired restart policy for resize does not guarantee that a container
won't be restarted. The runtime may choose to stop the container if it is unable to
apply the new resources without restarts.

Which confirms what Vinay says.

@maxcao13
Copy link
Author

maxcao13 commented Jan 20, 2025

Thanks for the response. After re-reading the AEP, I think I realized where my confusion is coming from.

The reason why I thought we should know an update would restart or not is because I thought the decision of disruption/disruption-less dictated the conditions of applying the updates.

The disruptionless actuation conditions are actually a subset of the disruptive actuation conditions. When we "attempt" a disruptionless update, it would potentially actually be disruptive. So it's possible that this disruptive update will "sneak through", regardless of the other disruptive actuation conditions were actually met (namely CanEvict being true, and long-lived pods with significant change). But according to the AEP, I guess we shouldn't really care? I guess I am just confused about if we should care.

Could you clarify which conditions are you referring to for which types of actuations? Are you referring to the conditions for Auto and Recreate vs InPlaceOnly and InPlaceOrRecreate ?

I guess I am mostly referring to InPlaceOrRecreate disruptive vs. disruptionless conditions. Disruptionless conditoins are:
Quick OOM, Outside recommended range, Significant change and disruptive is the same as Auto/Recreate which is basically the same as disruptionless except with CanEvict == true and the significant change has to be on a "long-lived" pod.

EDIT: Disruptive updates can apparently also ignore pod disruption budgets, so that is another point of contention on whether the kubelet or the VPA should potentially handle this.

@maxcao13
Copy link
Author

maxcao13 commented Jan 20, 2025

And another question is related to the test plan:

Test Plan
The following test scenarios will be added to e2e tests. Both InPlaceOnly and InPlaceOrRecreate modes will be tested and they should behave the same:

  • Disruption-free in-place update applied to all containers of a pod (request in recommendation bounds).

Does this mean we have to guarantee that this update is disruption free in the test (cause we can't)? Or are we just simply checking that it wasn't updated through eviction, and that the actuation conditions for disruption-free were met (meaning that containers could've restarted!)

And same thing with this test:

Disruptive in-place update applied to all containers of a pod (request out of recommendation bounds).

Does this just mean that all the containers will have RestartContainer resize policies and we have to check all containers actually restarted? And why does it matter if the request was out of recommendation bounds? This condition is in both disruption-free vs disruptionless: are we actually supposed to write a test matrix with every condition?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants