cluster-api: node template in scale-from-0-nodes scenario with DRA #7724

Closed
ttsuuubasa opened this issue Jan 20, 2025 · 4 comments · Fixed by #7804
Labels
area/cluster-autoscaler · area/provider/cluster-api · kind/feature

Comments

@ttsuuubasa
Contributor

Which component are you using?:

/area cluster-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Cluster Autoscaler was extended significantly for DRA, and there are parts that cluster-api, as a Cloud Provider, also needs to address.

Cluster Autoscaler has a scale-up path called the "scale-from-0-nodes" scenario, in which a NodeGroup has no existing Nodes and a new Node is spawned from it.
In this case, each Cloud Provider is responsible for providing a template node (NodeInfo) via TemplateNodeInfo(), which carries the node's resource information such as CPU, memory, and GPU.
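To make the contract concrete, here is a minimal sketch (not the actual cluster-api code; all names and values are illustrative) of the kind of template Node a Cloud Provider has to fabricate for a zero-node group:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildTemplateNode is a hypothetical helper: it fabricates a Node for a
// NodeGroup that currently has zero nodes, so the autoscaler can simulate
// scheduling pending pods against it.
func buildTemplateNode(nodeGroup string) *corev1.Node {
	capacity := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("16"),
		corev1.ResourceMemory: resource.MustParse("128Gi"),
		// Device-Plugin-style GPU capacity. DRA devices cannot be
		// expressed in this ResourceList, which is the gap this
		// issue describes.
		"nvidia.com/gpu": resource.MustParse("2"),
	}
	return &corev1.Node{
		ObjectMeta: metav1.ObjectMeta{Name: nodeGroup + "-template"},
		Status: corev1.NodeStatus{
			Capacity:    capacity,
			Allocatable: capacity,
		},
	}
}

func main() {
	node := buildTemplateNode("gpu-machinedeployment")
	fmt.Println(node.Name, node.Status.Allocatable)
}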

In the era of Device Plugins, cluster-api provided the node template through Annotations added to the NodeGroup, such as a MachineSet or MachineDeployment.[1]
However, for DRA, cluster-api has not yet implemented the logic to create a ResourceSlice template from such information.
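For reference, the existing Device-Plugin-era annotations documented in [1] look like this (values illustrative):

capacity.cluster-autoscaler.kubernetes.io/cpu: "16"
capacity.cluster-autoscaler.kubernetes.io/memory: "128G"
capacity.cluster-autoscaler.kubernetes.io/gpu-type: "nvidia.com/gpu"
capacity.cluster-autoscaler.kubernetes.io/gpu-count: "2"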

For example, consider users who want to spawn a node whose GPUs are exposed as DRA resources in a NodeGroup that has no existing nodes. Even though a pending pod requests the GPU through a ResourceClaim, they have no way to describe the device information, so the scale-up is likely not to work.

Therefore, cluster-api needs a feature that lets users specify the devices in the ResourceSlice of a node to be spawned in the scale-from-0-nodes scenario.

[1] https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/README.md#scale-from-zero-support

Describe the solution you'd like.:

The simplest idea is to add more annotations to the NodeGroup to serve as the basis of the ResourceSlice, like the following:

capacity.cluster-autoscaler.kubernetes.io/dra-driver: gpu.nvidia.com
capacity.cluster-autoscaler.kubernetes.io/dra-pool: <pool-name>
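As a rough illustration of how such annotations could feed the node template, the following sketch builds a ResourceSlice for the simulated node. The dra-device-count key and all helper names are hypothetical, added here only for illustration; error handling is elided:

package main

import (
	"fmt"
	"strconv"

	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const prefix = "capacity.cluster-autoscaler.kubernetes.io/"

// resourceSliceFromAnnotations is a hypothetical helper that turns the
// proposed NodeGroup annotations into a ResourceSlice template for a
// node that does not exist yet.
func resourceSliceFromAnnotations(nodeName string, ann map[string]string) *resourceapi.ResourceSlice {
	count, _ := strconv.Atoi(ann[prefix+"dra-device-count"]) // hypothetical key
	devices := make([]resourceapi.Device, 0, count)
	for i := 0; i < count; i++ {
		devices = append(devices, resourceapi.Device{
			Name:  fmt.Sprintf("gpu-%d", i),
			Basic: &resourceapi.BasicDevice{},
		})
	}
	return &resourceapi.ResourceSlice{
		ObjectMeta: metav1.ObjectMeta{Name: nodeName + "-template-slice"},
		Spec: resourceapi.ResourceSliceSpec{
			Driver:   ann[prefix+"dra-driver"],
			NodeName: nodeName,
			Pool: resourceapi.ResourcePool{
				Name:               ann[prefix+"dra-pool"],
				ResourceSliceCount: 1,
			},
			Devices: devices,
		},
	}
}

func main() {
	slice := resourceSliceFromAnnotations("gpu-md-template", map[string]string{
		prefix + "dra-driver":       "gpu.nvidia.com",
		prefix + "dra-pool":         "gpu-pool",
		prefix + "dra-device-count": "2", // hypothetical
	})
	fmt.Println(slice.Spec.Driver, len(slice.Spec.Devices))
}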
@ttsuuubasa ttsuuubasa added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 20, 2025
@elmiko
Contributor

elmiko commented Jan 29, 2025

we are discussing this issue in the cluster-api office hours today, and it sounds like there is general agreement. we would like to do a little research on how this might be included as part of the normal API in addition to annotations.

in general, this seems like a +1 from the community.

@enxebre
Member

enxebre commented Jan 30, 2025

/area provider/cluster-api

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2025
@elmiko
Contributor

elmiko commented Apr 30, 2025

i believe this is still in progress.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2025