CA DRA: process unschedulable Pods through ClusterSnapshot #7686

towca commented Jan 9, 2025

Which component are you using?:

/area cluster-autoscaler
/area core-autoscaler
/wg device-management

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

ClusterSnapshot was extended significantly for the DRA autoscaling MVP. In addition to Nodes and scheduled Pods, it now tracks the state of all DRA objects in the cluster. Some of these DRA objects are owned by unschedulable Pods, yet the unschedulable Pods themselves are still tracked and processed outside ClusterSnapshot.

So the state for unschedulable Pods is effectively kept in two places:

  • The unschedulable Pods themselves are just a slice variable in StaticAutoscaler.RunOnce() that gets processed by PodListProcessor and then passed to ScaleUp.
  • The ResourceClaims owned by the unschedulable Pods are tracked and modified in dynamicresources.Snapshot inside ClusterSnapshot.

As pointed out by @MaciekPytel during the MVP review, this makes it easy for the two data sources to diverge. For example, a PodListProcessor could inject a "fake" unschedulable Pod into the list without injecting the Pod's ResourceClaims into the ClusterSnapshot, as sketched below.
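
A minimal sketch of that failure mode, assuming a simplified PodListProcessor signature (the real interface also takes an autoscaling context) and a hypothetical injector processor:

```go
package example

import (
	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// fakePodInjector stands in for a PodListProcessor implementation.
type fakePodInjector struct{}

// Process appends a synthetic unschedulable Pod to the list that ScaleUp will see.
func (p *fakePodInjector) Process(unschedulablePods []*apiv1.Pod) ([]*apiv1.Pod, error) {
	// Imagine this Pod's spec references a ResourceClaim via spec.resourceClaims.
	fakePod := &apiv1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "fake-pod", Namespace: "default"},
	}
	// The Pod is now visible to ScaleUp through the returned slice, but nothing
	// forces the processor to also register the Pod's ResourceClaims with the
	// dynamicresources.Snapshot inside ClusterSnapshot - the two views of the
	// cluster silently diverge.
	return append(unschedulablePods, fakePod), nil
}
```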

Describe the solution you'd like.:

  • Move unschedulable Pods inside ClusterSnapshot.
    • Make ClusterSnapshot.SetClusterState() take all Pods in the cluster and divide them into scheduled and unschedulable internally.
    • We can probably implement the tracking (including correctly handling Fork()/Commit()/Revert()) pretty easily by putting the unschedulable Pods on a special meta-NodeInfo in the existing ClusterSnapshotStore implementations.
    • Pods move between the scheduled and unschedulable states during SchedulePod()/UnschedulePod() calls.
    • Add methods for obtaining and processing unschedulable Pods to ClusterSnapshot. We need at least ListUnschedulablePods(), AddUnschedulablePod(pod *framework.PodInfo), RemoveUnschedulablePod(name, namespace string) (see the interface sketch after this list).
    • We also need a way to mark some unschedulable Pods as ignored during scale-up without actually removing them from the ClusterSnapshot. Their ResourceClaims could technically be partially allocated (so the Pod can't schedule yet, but it already reserves some Devices), and removing the Pod from the ClusterSnapshot would mean simulating those allocated Devices as free. This could be implemented via ClusterSnapshot.IgnoreUnschedulablePod(name, namespace string), but it might also fit better in the ScaleUp code itself.
  • Refactor ScaleUp to take the unschedulable Pods from ClusterSnapshot.ListUnschedulablePods() instead of a method parameter.
  • Refactor PodListProcessor and its implementations to modify the unschedulable Pods via the new ClusterSnapshot methods instead of modifying and returning a method parameter.
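
A rough sketch of what the new ClusterSnapshot surface could look like, using the method names proposed above; the package layout, the framework.PodInfo parameter type, and grouping the methods into a separate interface are assumptions for illustration, not a settled design:

```go
package clustersnapshot

import (
	"k8s.io/autoscaler/cluster-autoscaler/simulator/framework"
)

// unschedulablePodTracking collects the methods proposed in this issue; they would
// be added to the existing ClusterSnapshot interface next to the Node/scheduled-Pod
// and DRA tracking methods.
type unschedulablePodTracking interface {
	// ListUnschedulablePods returns the unschedulable Pods currently tracked by
	// the snapshot, respecting Fork()/Commit()/Revert().
	ListUnschedulablePods() ([]*framework.PodInfo, error)
	// AddUnschedulablePod starts tracking a Pod (e.g. a "fake" Pod injected by a
	// PodListProcessor) together with its ResourceClaims.
	AddUnschedulablePod(pod *framework.PodInfo) error
	// RemoveUnschedulablePod stops tracking the Pod and its owned ResourceClaims.
	RemoveUnschedulablePod(name, namespace string) error
	// IgnoreUnschedulablePod keeps the Pod (and any partially allocated
	// ResourceClaims) in the snapshot but excludes it from scale-up simulation;
	// as noted above, this might fit better in the ScaleUp code itself.
	IgnoreUnschedulablePod(name, namespace string) error
}
```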

Additional context.:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.
