
Conversation


@esotsal esotsal commented Sep 21, 2025

  • One-line PR description: Create new KEP 5554: In place update pod resources alongside static cpu manager policy
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Sep 21, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 21, 2025
@esotsal esotsal force-pushed the ippvs-alognside-static-cpu-policy-KEP branch from 4c5c393 to 1240d58 Compare September 21, 2025 18:24
@esotsal esotsal changed the title KEP 5554: In place update pod resources alongside static cpu manager policy KEP 5554: In place update pod resources alongside static cpu manager policy KEP creation Sep 21, 2025
@esotsal esotsal changed the title KEP 5554: In place update pod resources alongside static cpu manager policy KEP creation [WIP] KEP 5554: In place update pod resources alongside static cpu manager policy KEP creation Sep 21, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 21, 2025
@esotsal
Author

esotsal commented Sep 21, 2025

@k8s-ci-robot
Contributor

@esotsal: GitHub didn't allow me to request PR reviews from the following users: Chunxia202410.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @natasha41575 @tallclair @pravk03 @Chunxia202410 @ffromani

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@esotsal esotsal force-pushed the ippvs-alognside-static-cpu-policy-KEP branch from 1240d58 to 8973b16 Compare September 26, 2025 10:59
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 26, 2025
@esotsal esotsal force-pushed the ippvs-alognside-static-cpu-policy-KEP branch 10 times, most recently from 24bfb5c to a6c437b Compare September 26, 2025 17:19
@esotsal esotsal force-pushed the ippvs-alognside-static-cpu-policy-KEP branch 10 times, most recently from 9812852 to b3065a4 Compare October 8, 2025 14:23
@esotsal
Author

esotsal commented Oct 8, 2025

/assign @deads2k For PRR

It will make things easier for the PRR reviewer if we try to get as many of the open comments addressed as possible prior to PRR freeze

Thanks, I tried to address as many as I could. I didn't have time to complete the comments about Deferred; in general I agree with the comment, I just need more time to describe it properly, and I don't think it will be ready by tomorrow.

@esotsal esotsal force-pushed the ippvs-alognside-static-cpu-policy-KEP branch from b3065a4 to 2608740 Compare October 9, 2025 13:21
automations, so be extremely careful here.
-->

No.
Contributor

would resize requests that previously failed start succeeding?

Contributor

Sorry, that one is my bad: #5555 (comment). I didn't realize that resize requests that previously failed starting to succeed was what was meant by the default behavior changing.

@esotsal please feel free to change it back to what you had originally

Author

Absolutely! Acknowledged, I will update accordingly within today’s business day.

Contributor

Sorry, that one is my bad: #5555 (comment). I didn't realize that resize requests that previously failed starting to succeed was what was meant by the default behavior changing.

Usually this would be a new field, so no one could have latent data that suddenly starts being acted upon. Since this field already exists and could hold data that is not being acted upon, a kubelet restart could suddenly cause it to be acted upon, so that should be noted.

Author

Ack, done. PTAL

@deads2k
Contributor

deads2k commented Oct 13, 2025

some PRR questions/clarifications, but the PRR looks good otherwise.

Contributor

@ffromani ffromani left a comment

The general direction LGTM.
Mostly inline clarification questions.
I don't think we should detail the implementation so much in the KEP, and I think it is fair to refine the implementation at the code review stage once the KEP is approved.
That said, I'm supportive overall; I still need to review the metrics and the observability side.
I'm confident I can LGTM once the comments are addressed.

* Topology manager policies can be adhered to.
* CPUs allocated upon creation of Guaranteed QoS pods are maintained.

This proposal will leverage [KEP-5328 (Node Declared Features framework)](https://github.com/pravk03/enhancements/tree/9de1ba74b4b71883f0ceabf8580d5d817bda36d4/keps/sig-node/5328-node-declared-features) (if implemented) to overcome feature rollout challenges with the [Version Skew Strategy](#version-skew-strategy):
Contributor

I'm completely fine with this approach, but it would mean that we have two KEPs dependent on each other to be delivered in the same cycle, which can be nontrivial to get through. I'm just highlighting a challenge here; again, I'm fine with this approach.

Author

Ack, I think it is worth trying since it will make the implementation cleaner.

Author

@esotsal esotsal Oct 16, 2025

Since this KEP is postponed, I think your concern about the two KEPs being squeezed into the same cycle should be relaxed now. Please let me know if you are OK to resolve this conversation.

* must keep the “promised” CPUs of a running container of a Guaranteed Pod.
  + With the term “promised” we mean the CPUs allocated upon creation of the Guaranteed Pod, checkpointed in the CPU Manager's checkpoint file; please refer to [Promised CPUs checkpoint](#promised-cpus-checkpoint) for more details of the proposed implementation using local storage.
* attempts to allocate additional CPUs should adhere to the combination of Topology/Memory/CPU/Device/kubelet reservation policies
  + All 40 possible combinations should work
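
For illustration only (this is not the KEP's committed design): a minimal sketch of what a per-container “promised CPUs” checkpoint could look like, assuming hypothetical names such as PromisedCPUsCheckpoint and GetPromisedCPUSet:

package cpumanager

import "fmt"

// PromisedCPUsCheckpoint is a hypothetical structure recording, per pod UID and
// container name, the CPU set string allocated when the Guaranteed pod was
// first admitted (its "promised" CPUs).
type PromisedCPUsCheckpoint struct {
	// Entries maps pod UID -> container name -> cpuset string (e.g. "2-3,6").
	Entries  map[string]map[string]string `json:"entries"`
	Checksum uint64                       `json:"checksum"`
}

// GetPromisedCPUSet returns the checkpointed cpuset string for a container, if any.
func (c *PromisedCPUsCheckpoint) GetPromisedCPUSet(podUID, containerName string) (string, error) {
	if cpus, ok := c.Entries[podUID][containerName]; ok {
		return cpus, nil
	}
	return "", fmt.Errorf("no promised CPU set checkpointed for container %q in pod %q", containerName, podUID)
}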
Contributor

Exploring the full combination matrix is surely the most comprehensive option, but I'm fairly confident that strict-cpu-reservation can be safely elided with some targeted sentinel e2e tests. This alone halves the test combinations.


### What happens on CPU Resize failures

When the topology manager cannot generate a topology hint which satisfies the topology manager policy, the pod resize is marked as [Deferred](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources#resize-status). This means that while the node theoretically has enough CPU for the resize, it's not currently possible but can be re-evaluated later, potentially becoming feasible as pod churn frees up capacity.
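
For illustration, such a held-back resize surfaces to users and controllers through the pod's resize conditions defined by KEP-1287; the exact values below are a hypothetical example, not output copied from an implementation:

status:
  conditions:
  - type: PodResizePending
    status: "True"
    reason: Deferred
    message: "Not enough aligned CPUs are currently available; the resize will be re-evaluated."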
Contributor

It is indeed important to double-check, but at a glance the topology manager admit handler should be reasonably ready, thanks to its fundamental property of being stateless. I think we may need to review/refine the error reporting, but the basic flow should be OK.

Consider including folks who also work outside the SIG or subproject.
-->

Bugs in handling CPU resizing can result in incorrect CPU pinning, affecting the performance of the pod being resized as well as other pods on the node.
Contributor

Resizes, especially downsizes, cause resource fragmentation, which will make future resource requests harder to accommodate, including new admissions.
(Unrelated to risks?) We may need to review the allocation algorithm in general to better tolerate fragmentation. Random thought example: defaulting to distributed allocation (vs. the current default packed allocation) MAY make resizes easier.

Contributor

We should include the risk of a Pod remaining in a deferred state indefinitely in scenarios where topology or policy constraints would never allow the allocation, or provide a mitigation strategy for such cases.

Contributor

We should include the risk of a Pod remaining in a deferred state indefinitely in scenarios where topology or policy constraints would never allow the allocation, or provide a mitigation strategy for such cases.

Such resizes should not be marked as deferred; they should be marked as infeasible: #5555 (comment)

Author

Ack, good point. Done, PTAL


Please refer to https://github.com/kubernetes/kubernetes/pull/129719/files# for details.

With the goal of keeping the implementation specific to the CPU manager, the implementation modifies the following files
Contributor

This is welcome but too low level. You don't need to call out the source code files, and IMHO this backfires a little because it ties the KEP design, which should be as timeless as possible, to a certain state of the source code more than necessary. If we are short on time, and/or if other reviewers are not bothered, you can keep it like this, but please consider a more high-level/less coupled implementation description in the future.

Contributor

I agree with the above suggestion. KEPs are ideally more stable/long-lived than code. If the code layout changes the KEP should not become stale.

Contributor

+1 that laying out the exact source code structure file-by-file is likely to become stale. To offer a little more guidance, the clearer "implementation" sections that I've seen usually describe how the request moves through the different components (in this case the allocation manager, topology/CPU manager, etc.), and give high-level descriptions of what each of those components does and their interactions with each other.

But agree that this is nonblocking and maybe can be fixed in the beta update if there is not enough time now

Author

Ack, I think we are short on time, but I will try to update.

Author

@esotsal esotsal Oct 16, 2025

Will update the KEP, describing the implementation behavior, and remove the links to source code. Since we have more time now, I will keep this conversation unresolved and ping you when it's done so you can take another look.

type getPromisedCPUSetError struct {
	PodUID        string
	ContainerName string
}
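
As a purely illustrative sketch (not the KEP's committed design), such an error type would typically satisfy the error interface along these lines; the message format is an assumption:

package cpumanager

import "fmt"

// getPromisedCPUSetError mirrors the struct quoted above; it is repeated here
// only to keep the sketch self-contained.
type getPromisedCPUSetError struct {
	PodUID        string
	ContainerName string
}

// Error returns an assumed, human-readable description of the lookup failure.
func (e getPromisedCPUSetError) Error() string {
	return fmt.Sprintf("no promised CPU set found for container %q in pod %q",
		e.ContainerName, e.PodUID)
}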
Contributor

Again, this is too low level to be codified in the KEP. We should leave ourselves more freedom at the implementation stage as to how we actually code the algorithm and the design we outline in the KEP.
The general direction seems sensible, but a separate code review will be needed later anyway.

Author

Ack, I think we are short on time, but I will try to update.

Author

Will update the KEP, describing the implementation behavior, and remove the links to source code. Since we have more time now, I will keep this conversation unresolved and ping you when it's done so you can take another look.


For example (not fully decided, to be concluded in Beta)

"cpu_manager_alognside_ippvs_pinning_errors_total".
Contributor

I'd avoid acronyms in metrics. What about for example

cpu_manager_inplace_resize_pinning_errors_total

?
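
For context only, a hedged sketch of how a kubelet counter with that name might be declared using component-base metrics; the package name, subsystem, help text, and registration details are assumptions, not an agreed design:

package kubeletmetrics

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// cpuManagerInplaceResizePinningErrors counts CPU pinning failures observed
// while handling in-place pod resizes (illustrative declaration only).
var cpuManagerInplaceResizePinningErrors = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem:      "kubelet",
		Name:           "cpu_manager_inplace_resize_pinning_errors_total",
		Help:           "Cumulative number of CPU pinning errors during in-place pod resize.",
		StabilityLevel: metrics.ALPHA,
	},
)

func init() {
	legacyregistry.MustRegister(cpuManagerInplaceResizePinningErrors)
}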

Author

Ack, done. As per Natasha's comment below, I have removed the example metrics at this stage. We can take this up in the beta updates later, since metrics will be a graduation criterion.

Contributor

@ffromani ffromani left a comment

Finished reviews; PTAL at one comment about using cpu_manager_state (we should not). Other than that, my previous summary applies.


Usage metrics for better observability of this KEP will be implemented and could be used.

For example (not fully decided, to be concluded in Beta)
Contributor

I'm fine deferring this decision to beta or to alpha2 (and PRR seems to be fine as well)

Author

Ack, done. PTAL

Author

@ffromani can you please check if this also applies to #5555 (comment)?


Please refer to https://github.com/kubernetes/kubernetes/pull/129719/files# for details.

With the goal of keeping the implementation specific to the CPU manager, the implementation modifies the following files
Contributor

I agree with the above suggestion. KEPs are ideally more stable/long-lived than code. If the code layout changes the KEP should not become stale.

* pull-kubernetes-node-kubelet-serial-cpu-manager
* pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2

At the time of writing please check
Contributor

It would be good to capture more detail around the scenarios that will be tested. For instance:

  1. Successful resize when CPU request/limit is updated from 2 to 4.
  2. Resize failure when CPU < promised CPUs.

Essentially, these could mirror the cases you’ve illustrated in the diagram under the “Use Cases for CPU Resize” section.

Author

Ack, done, but I will try to update more as per Natasha's comments below as well.

Author

For clarity: the link to the running tests from the existing PR which I have added actually covers plenty of tests; 73 tests have been added for this KEP (193 total, up from 123), with more to come. Some redo existing IPPR e2e tests alongside the CPU Manager (to ensure it doesn't break existing functionality); others cover multi-pod/container scale up and scale down, negative tests with Burstable/Guaranteed pods, promised CPUs, and Infeasible resizes (i.e. asking for a large number of CPUs) as well as Deferred ones. My plan is to add tests with multiple pods being created and deleted in later phases. I will follow the KEP template on this and add them, described with reasoning.

I didn't have time to explicitly write out all current 73 of them; I assumed a link to the proof of the running tests would be sufficient. My fault. Now that the time issue is solved by postponing this KEP to v1.36, I will update in detail.

Comment on lines +45 to +49
feature-gates:
  - name: InPlacePodVerticalScalingExclusiveCPUs
    components:
      - kubelet
    disable-supported: true
Contributor

Please check if we need to capture the feature gates that InPlacePodVerticalScalingExclusiveCPUs depends on, e.g. the feature gates corresponding to in-place resource resize, CPU Manager, and Topology Manager, here for the purpose of tracking.

Author

Can you please elaborate? I don't understand this comment. Could you perhaps add a suggestion?


### What happens on CPU Resize failures

When the topology manager cannot generate a topology hint which satisfies the topology manager policy, the pod resize is marked as [Deferred](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources#resize-status). This means that while the node theoretically has enough CPU for the resize, it's not currently possible but can be re-evaluated later, potentially becoming feasible as pod churn frees up capacity.
Contributor

Marking the Pod as deferred makes sense when the request could eventually be satisfied on the node once resources free up. But how should we handle cases where topology or policy constraints would never permit the allocation (e.g., multiple NUMA zones or a CPU request exceeding a single NUMA node’s capacity)?

Contributor

@natasha41575 natasha41575 Oct 14, 2025

This is sort of what I had in the back of my mind when I asked this question: #5555 (comment)

For Deferred resizes, the KEP should provide examples of the ways that resources can be freed up later so that the resize can succeed. Deferred is not an appropriate condition to use if we know that the resize will never succeed.

For nonrecoverable constraints, the KEP should enumerate them and explicitly state that they are marked as Infeasible. Infeasible resizes are never retried.

Contributor

This is a good point. Thinking more about it, deferred seems most likely when a pod is Terminating, so we know those resources will soon be free, and taking them into account we could THEN satisfy an allocation.
But the topology manager currently doesn't do speculative allocation (what-if scenarios); it only operates on a snapshot in time of the pod state.

For myself, I'm not against exploring this path, but this alone is probably KEP-worthy because of its scope.
If we consider only a node state snapshot, then the feasibility state is a binary yes/no, with no "maybe" (deferred) I can think of. This is a result of the fact that the resource managers DO NOT do resource compaction, because it would violate the pinning/static guarantees.

Thinking about the future, it is however possible to explore more integration with the eviction manager and/or the descheduler to get controller-driven resource compaction.

Circling back to the original point, I think it would be beneficial to point out more clearly in which scenarios we would expect a Deferred status, as it seems harder to achieve.

Author

@esotsal esotsal Oct 16, 2025

Yes, I agree. I will add more details on all possible cases I can think of, with a suggested status (Infeasible/Deferred) and the reasoning why.

I just want to highlight, because I get the impression there is a feeling that this KEP is complicated: it is not as complicated as it seems, I believe.

I understand the concerns; I would recommend checking the accompanying PR and its tests to see that things are actually less complicated.

I gave the Deferred present/future some thought and I have an idea how to test this on the PR with a multi-pod test. I will add the test code to the PR and describe it in the KEP tests section. Hopefully this will make things clearer.

I am relieved that this point was raised before the v1.35 deadline and that we all agreed on postponing this KEP to v1.36.

Contributor

@natasha41575 natasha41575 Oct 16, 2025

I strongly prefer what @ffromani brought up as an option below: #5555 (comment)

Specifically:

I think that IF we limit the scope of this work to only cpumanager enablement, THEN we will have mostly Unfeasible responses which can be turned into Deferred later once topology manager enhancement is done. This is my own reading, I don't know if this is the intended/desired outcome longterm.

To me this sounds like a much better path forward. I'd rather the KEP just say that all errors from topology manager will be treated as Infeasible. We can change them to Deferred later as a follow-up if there is a use case.

I just want to highlight, because I get the impression there is a feeling that this KEP is complicated: it is not as complicated as it seems, I believe.

It is complicated, but only because you have added a proposal for a new state for Deferred. Almost all of my concerns go away if we just stick to Infeasible rather than introducing new Deferred modes.

I gave the Deferred present/future some thought and I have an idea how to test this on the PR with a multi-pod test. I will add the test code to the PR and describe it in the KEP tests section. Hopefully this will make things clearer.

Having tests doesn't address the concerns I brought up in #5555 (comment). The KEP is still not explaining or designing reasonable behavior that can work well with the higher level controllers that this feature is designed to empower. If these concerns about Deferred are not addressed nor is Deferred removed from the scope of the KEP, I am not sure that I feel confident that this will make it to 1.36 either. If we keep Deferred in the scope, we will have to consider how that affects capacity requests.

Author

@esotsal esotsal Oct 17, 2025

The KEP is still not explaining or designing reasonable behavior that can work well with the higher level controllers that this feature is designed to empower. If these concerns about Deferred are not addressed nor is Deferred removed from the scope of the KEP, I am not sure that I feel confident that this will make it to 1.36 either. If we keep Deferred in the scope, we will have to consider how that affects capacity requests.

+1, the concerns will be addressed and the KEP will be updated, taking into account the higher-level controllers that this feature is designed to empower, including the impact on capacity requests. Thanks @swatisehgal, @pravk03, @natasha41575 and @ffromani for raising awareness of this important requirement, which I had missed.


Please refer to [Prerequisite testing updates](#prerequisite-testing-updates) for more details.

![alt_text](./use_cases_for_cpu_resize.png "Use Cases for CPU Resize")
Contributor

@swatisehgal swatisehgal Oct 14, 2025

This is very helpful and has clarified the proposal for me. It could be useful to include a table of “expected outcomes” for common cases like: small CPU increase, large CPU increase, CPU request decrease, and static policy scenarios with multiple NUMA zones. It would also be valuable to document what guarantees (or lack thereof) exist for existing workloads.

Author

+1, will try to create such a table.

Consider including folks who also work outside the SIG or subproject.
-->

Bugs in handling CPU resizing can result in incorrect CPU pinning, affecting the performance of the pod being resized as well as other pods on the node.
Contributor

We should include the risk of a Pod remaining in a deferred state indefinitely in scenarios where topology or policy constraints would never allow the allocation, or provide a mitigation strategy for such cases.


* Ensure that the QoS class of the pod does not change during resize.
* Preserve the properties of the configured Topology Manager and CPU Manager during resize.
* Support CPU limit increases without container restart, keeping the CPU values originally configured during pod creation on the Node
Contributor

It might be worth mentioning here, or elsewhere in the KEP, that since we are discussing the CPU Manager static policy for Guaranteed QoS Pods, CPU request equals limit, and this equality should be maintained when the Pod is resized.
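
As a small illustration of that point (the container name and values are hypothetical), resizing such a Guaranteed pod via the resize subresource would keep CPU requests equal to limits, e.g. moving from 2 to 4 exclusive CPUs:

spec:
  containers:
  - name: app
    resources:
      requests:
        cpu: "4"
        memory: 2Gi
      limits:
        cpu: "4"
        memory: 2Gi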

Contributor

@natasha41575 natasha41575 Oct 14, 2025

Perhaps clarify in the goals or non-goals that we want parity with all other limitations that IPPR currently has.

Author

Ack, done.

For the comment about the risk, I think this use case is also applicable to general in-place update of pod resources. Today, in such a case, the mitigation is for the resize to be reverted by the user, as far as I know. Added in the risks section.

PTAL

For example (not fully decided, to be concluded in Beta)

- "cpu_manager_alognside_ippvs_pinning_errors_total".
- "cpu_manager_alognside_ippvs_requests_total".
Contributor

I think it would be better to leave out the example metrics for now (here and elsewhere). Metrics will need review from sig-instrumentation, and I think if you leave them here now, they might distract the conversation from the things that need to be nailed down for alpha.

Author

Ack, done. PTAL


Please refer to https://github.com/kubernetes/kubernetes/pull/129719/files# for details.

With the goal of keeping the implementation specific to the CPU manager, the implementation modifies the following files
Contributor

+1 that laying out the exact source code structure file-by-file is likely to become stale. To offer a little more guidance, the clearer "implementation" sections that I've seen usually describe how the request moves through the different components (in this case the allocation manager, topology/CPU manager, etc.), and give high-level descriptions of what each of those components does and their interactions with each other.

But agree that this is nonblocking and maybe can be fixed in the beta update if there is not enough time now

- a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html)
-->

- N/A
Contributor

Provide a brief justification (e.g. all cases are covered by unit and e2e tests)

Author

Ack, done. PTAL


For Static CPU Management Policy support, the following end-to-end tests must be introduced or updated:

* Under pkg/kubelet/test/e2e_node
Contributor

It is not sufficient (and IMO not necessary) to list where the e2e tests will be written and run. The e2e test plan should describe what behavior we are testing. It can be at a high level, but we need to be able to cross-check desired coverage against actual coverage when it comes time for beta/GA promotion.

I think the pod-level resources KEP is a good example to start from.

Author

Ack, I will take a look and try to update. I think we are short on time, but I will give it a try.

Author

For clarity, please also check my response at #5555 (comment).

The particularity of this KEP is that it was first suggested as a fix, so a lot of discussion about tests already happened in the PR.

For historical reasons I want to remind that the IPPR e2e node tests started from this PR and were later refactored, and the need to refactor the CPU manager tests was also highlighted to the community, so there is a good reason why the location is mentioned: it is the only test location which can perform this KEP's tests.

Author

@esotsal esotsal Oct 16, 2025

Now that we have more time after the decision to postpone this KEP to v1.36, I will update the contents of this section to improve clarity. I will keep this comment unresolved and ping you when it's ready for review.

Comment on lines 702 to 647
* User feedback (ideally from at least two distinct users) is green,
* 2 examples of real-world usage
* 2 installs
Contributor

We are targeting 1.36 for beta and also setting these as beta graduation requirements. I am a little skeptical that we are going to be able to get 2 examples of real-world usage and "allow time for feedback" (as stated in the line below) for a feature that is planned to be in alpha-stage for just a couple of months before we have to make the decision to promote to beta. I feel like we are setting ourselves up for failure.

@SergeyKanzhelev @tallclair do you have general sig node guidance for this?

@esotsal esotsal force-pushed the ippvs-alognside-static-cpu-policy-KEP branch from 2608740 to af67a05 Compare October 14, 2025 18:05
Contributor

@natasha41575 natasha41575 left a comment

There are still many open/unresolved discussions from my initial review two weeks ago; please make sure to address those too.

The most important ones:


Resource fragmentation could potentially be caused by pod resizes, especially downsizes. Such fragmentation could mean that future resource requests become harder to accommodate, including new admissions. In such a case, a possible mitigation would be to review the allocation algorithm in general in the future to better tolerate fragmentation.

A Pod could remain in a deferred state indefinitely in scenarios where topology or policy constraints would never allow the allocation, and it is not possible to predict this in order to set an Infeasible state. In such a case, the condition's `message` will include details of why the resize has not been admitted; a possible mitigation would be for the resize attempt to be reverted by the user.
Contributor

@natasha41575 natasha41575 Oct 14, 2025

Why is it not possible to predict this? @swatisehgal gave a few scenarios in her comment #5555 (comment); what makes it impossible for the topology/CPU manager to detect such scenarios?

For such scenarios, we should explicitly list the ones we can think of and if (and this is a big if) it is impossible to make the distinction, we should explain why the limitation is there.

cc @swatisehgal @ffromani @tallclair

I want to avoid setting the precedent of just using "Deferred" as the backup for all potential failures in scenarios where we know the resize can't succeed. "Deferred" sends a message to the user and autoscalers built on top of IPPR (such as VPA) that the resize can be retried if room is made. Autoscalers are then going to evict or scale down other pods to make room for the resize, but the resize is never going to succeed. This is hugely disruptive and undesirable.

I don't want to end up in a scenario where VPA etc have to learn how to distinguish between Deferred (impossible) or Deferred (can succeed later). That's what Infeasible is for. A human user having to step in and revert the resize is not a reasonable mitigation IMO, because it doesn't work at scale.

Author

@esotsal esotsal Oct 14, 2025

The Topology Manager involves the Memory Manager and the Device Manager as well, on top of the CPU Manager. This KEP's intention is only CPU; if we want to make such an effort it will require Topology Manager modifications, which, as agreed, are out of scope, I believe.

Contributor

Maybe I am missing something, but I am not sure that I follow how distinguishing between deferred and infeasible cases means we'd be affecting the memory manager side of things. We would still have a check to make sure that resizes that change memory alongside a static memory manager policy are marked infeasible.

if we want to make such an effort it will require Topology Manager modifications, which, as agreed, are out of scope

When/where did we decide Topology Manager modifications would be out of scope? That seems like an unrealistic expectation for a KEP that is proposing changes to its behavior...

Author

@swatisehgal's use case is possibly also applicable to today's InPlacePodVerticalScaling functionality without exclusive CPUs: it could happen there too, and the risk of being stuck in a deferred state exists. Reading KEP-1287, my understanding is that the only way to get out of this is to revert the resize, so it is up to the user or the autoscaler algorithm to handle it. The same applies to this KEP, I believe.

Contributor

@pravk03 pravk03 Oct 14, 2025

Maybe all the Infeasible cases can be statically determined without requiring changes in the topology manager itself?

Some cases that I could think of:

  1. The Topology Manager policy is single-numa-node and we are requesting more CPUs than a NUMA node contains.
  2. A CPU decrease below the initial allocation (< promised_CPUs).
  3. The CPU Manager policy option is full-pcpus-only, HT is enabled, and we are scaling up/down to an odd number of CPUs.

To me, it feels like there aren't many Infeasible cases, and we could do static checks in canResizePod() before attempting hint generation.

We can incorporate similar checks for memory when we introduce IPPR support with a static memory policy.
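
A hedged sketch of such static pre-checks, mirroring the three cases listed above; the function name, parameters, and the way policy options are passed are illustrative assumptions, not an agreed design:

package cpumanager

import "strings"

// staticResizeInfeasibleReason returns a human-readable reason and true when a
// CPU resize can be statically ruled out before any topology hint generation.
func staticResizeInfeasibleReason(requestedCPUs, promisedCPUs, largestNUMACapacity int,
	topologyPolicy, cpuPolicyOptions string, smtEnabled bool) (string, bool) {

	// Case 1: single-numa-node can never fit more CPUs than one NUMA node holds.
	if topologyPolicy == "single-numa-node" && requestedCPUs > largestNUMACapacity {
		return "requested CPUs exceed the capacity of a single NUMA node", true
	}
	// Case 2: shrinking below the CPUs promised at admission is never allowed.
	if requestedCPUs < promisedCPUs {
		return "requested CPUs are below the CPUs promised at pod admission", true
	}
	// Case 3: full-pcpus-only with SMT enabled requires whole physical cores
	// (assuming 2 hyperthreads per core).
	if strings.Contains(cpuPolicyOptions, "full-pcpus-only") && smtEnabled && requestedCPUs%2 != 0 {
		return "full-pcpus-only with SMT enabled requires an even number of CPUs", true
	}
	return "", false
}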

Contributor

@natasha41575 natasha41575 Oct 15, 2025

Thanks @ffromani for your comment.

I have no concerns if topology manager wants to output Infeasible state so that we limit the change only to cpu manager, and I am happy to think about Deferred as a future enhancement. My concern was that the KEP as written stated that if topology manager could not generate a hint, the resize would be marked as deferred - which would break the contract we have with higher level controllers today.

Should we wait with this KEP until you have tackled this, then? What do you think @tallclair?

This is only relevant when we want topology manager to be able to defer a resize. For what @ffromani suggested above:

I think it depends on how we want to partition the work. I've interpreted this KEP as focusing on making the cpumanager able to resize which, alongside the other resource managers, is an enabler to a further extension to topology manager like we are discussing.

With that in mind, extending the cpumanager alone with no changes to topology manager is likely to produce only Unfeasible states (IOW: I can't easily picture Deferred states without topology manager enhancements)

I think this is completely reasonable as a first step - only produce Infeasible states when the resize cannot be performed. I don't see capacity requests as needing to block this. But what @ffromani said here is very different from what is written in the KEP at the moment, and is very different from your responses to my questions regarding deferred states thus far.

VPA or the user should take care of how to act upon repeated deferred resizes. If there are plans for IPPR to handle this, by introducing some kind of max retries or another similar method, then a KEP itself is needed for that; IPPR, planned for GA in v1.35, is not built for this.

I completely disagree with this. IPPR exists almost solely to be an enabler for autoscalers, with VPA being the primary driving stakeholder. If you read the discussion on #5562, you will see that the main blocking question for GA graduation was "Have we done enough to support VPA?"

Contributor

For the sake of being clear, bordering on obvious: I want to emphasize that this is my understanding of this KEP, which comes from reading the content and the comments, and my own personal assessment. I may be missing some key details. Thanks for bringing these subtopics to my attention.

Author

Since Deferred is discussed in several threads in this KEP, please also check my answer at #5555 (comment).

Author

Overall I am relieved that this point was raised before the v1.35 deadline and that we all agreed upon postponing this KEP to v1.36.

Member

How is this handled at pod creation time? Would these infeasible states cause the pod to be rejected? Or would the Kubelet just make some best-effort attempt at alignment?

@esotsal
Author

esotsal commented Oct 14, 2025

Thanks for the reviews; I got plenty in the last hours, and I don't think I have time to answer them all / update the KEP within 24 hours, i.e. before the 1.35 deadline on 16th October. Please indicate if you believe your comments are blocking for 1.35, so we can move the milestone to 1.36 for alpha.

@natasha41575
Contributor

Thanks for the reviews; I got plenty in the last hours, and I don't think I have time to answer them all / update the KEP within 24 hours, i.e. before the 1.35 deadline on 16th October. Please indicate if you believe your comments are blocking for 1.35, so we can move the milestone to 1.36 for alpha.

If by this you mean you don't intend to address the open comments before the 1.35 deadline, then I'll just share that I don't feel good about giving this KEP my LGTM as it currently is. Of course @ffromani has the final judgment on the details regarding the CPU/Topology Manager side of things, but on the IPPR side there are still many open discussions.

The biggest one, and the one that I feel should be blocking, is around deferred / infeasible resize handling:

Deferred and Infeasible mean something, and are acted on by controllers differently. It is important that we at least try to get it right, or document the limitations. What I was hoping the outcome of these comments would be:

  • A better explanation of why the “Deferred” cases are deferred, how they might succeed in the future, and how the kubelet should know when it is the right time to try again.
  • A list of scenarios where we know the resize would never succeed, with those resizes being marked as “Infeasible” (or a sufficient explanation as to why we would not be able to distinguish infeasible cases from deferred cases).
  • More clarity on what your “error types” are for, because it is not obvious how they fit into the design.

Besides that, there are a few other open discussions; listing just a few of them here:

Maybe individually some of these don't need to block, but put together they give me a general feeling that not everything has been thought through sufficiently.

@esotsal
Author

esotsal commented Oct 15, 2025

If by this you mean you don't intend to address the open comments before the 1.35 deadline,

Yes, this is what I mean. I am short on time to answer all those comments by tomorrow, and I anticipate that an exception request would not be sufficient either. We share the same opinion; let's postpone to v1.36 to have time to answer the comments and improve this KEP. I have informed the assigned KEP wrangler about this: #5554 (comment).

@ffromani
Contributor

ffromani commented Oct 15, 2025

Thanks for the reviews; I got plenty in the last hours, and I don't think I have time to answer them all / update the KEP within 24 hours, i.e. before the 1.35 deadline on 16th October. Please indicate if you believe your comments are blocking for 1.35, so we can move the milestone to 1.36 for alpha.

If by this you mean you don't intend to address the open comments before the 1.35 deadline, then I'll just share that I don't feel good about giving this KEP my LGTM as it currently is. Of course @ffromani has the final judgment on the details regarding the CPU/Topology Manager side of things, but on the IPPR side there are still many open discussions.

Considering the comments in the last few hours, I agree it is better to push to 1.36 and clarify the intended use cases and whether we need a scope change.
[EDIT] But if there's a consensus forming around how to move forward and we "just" need to iron out some details (which doesn't seem to be the case), I'm also willing to push through a deadline extension and help from my side.

@esotsal
Author

esotsal commented Oct 16, 2025

[EDIT] But if there's a consensus forming around how to move forward and we "just" need to iron out some details (which doesn't seem to be the case), I'm also willing to push through a deadline extension and help from my side.

Thanks @ffromani, I think there is a consensus that we should think it through more thoroughly and update the KEP to improve clarity. I think a one-release postponement is needed and worth it; I don't see it as an analysis-paralysis situation, we are just in ❤️ with the problem, not the solution, as we should be 😁.
