-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRA API: bump maximum size of ReservedFor to 256 #129543
base: master
Are you sure you want to change the base?
Conversation
/hold This and #129544 both need all necessary approvals before merging either of them. |
Thanks @pohly. As mentioned in the description we got preliminary support for this (unprecedented) update from @thockin and @liggitt in https://kubernetes.slack.com/archives/CJUQN3E4T/p1734593174791519 /lgtm |
LGTM label has been added. Git tree hash: f554f25029d1e3e9f7df66e32cb0154e6d9c9fd1
|
The original limit of 32 seemed sufficient for a single GPU on a node. But for shared non-local resources it is too low. For example, a ResourceClaim might be used to allocate an interconnect channel that connects all pods of a workload running on several different nodes, in which case the number of pods can be considerably larger. 256 is high enough for currently planned systems. If we need something even higher in the future, an alternative approach might be needed to avoid scalability problems. Normally, increasing such a limit would have to be done incrementally over two releases. In this case we decided on Slack (https://kubernetes.slack.com/archives/CJUQN3E4T/p1734593174791519) to make an exception and apply this change to current master for 1.33 and backport it to the next 1.32.x patch release for production usage. This breaks downgrades to a 1.32 release without this change if there are ResourceClaims with a number of consumers > 32 in ReservedFor. In practice, this breakage is very unlikely because there are no workloads yet which need so many consumers and such downgrades to a previous patch release are also unlikely. Downgrades to 1.31 already weren't supported when using DRA v1beta1.
800dc29
to
1cee368
Compare
/triage accepted fixing paperwork |
/lgtm After fixing the go doc too. |
LGTM label has been added. Git tree hash: ba118ff1ceaa34eda749c437d9cb8f4e50b29cc3
|
/approve For some reason I thought we got rid of v1a1. |
We support upgrades from 1.31 with v1alpha3 and it can still be enabled as API group if someone absolutely wants to try a DRA driver written for 1.31, but mostly it's just there to enable writing version skew tests. It's not the default storage version (that got bumped to v1beta1), so removing v1alpha3 can be done at any time. |
Cherry-pick is approved, ready to go... /hold cancel |
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project. |
/hold Let me test one slow test more thoroughly which didn't run in the presubmit job... |
/retest |
This needs a bit more time. I'll continue tomorrow. |
We want to be sure that the maximum number of pods per claim are actually scheduled concurrently. Previously the test just made sure that they ran eventually. Running 256 pods only works on more than 2 nodes, so network-attached resources have to be used. This is what the increased limit is meant for anyway. Because of the tightened validation of node selectors in 1.32, the E2E test has to use MatchExpressions because they allow listing node names.
New changes are detected. LGTM label has been removed. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: klueska, pohly, thockin The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
The original limit of 32 seemed sufficient for a single GPU on a node. But for shared non-local resources it is too low. For example, a ResourceClaim might be used to allocate an interconnect channel that connects all pods of a workload running on several different nodes, in which case the number of pods can be considerably larger.
256 is high enough for currently planned systems. If we need something even higher in the future, an alternative approach might be needed to avoid scalability problems.
Which issue(s) this PR fixes:
Required for NVIDIA GB200 and potentially Google TPU use cases.
Special notes for your reviewer:
Normally, increasing such a limit would have to be done incrementally over two releases. In this case we decided on
Slack (https://kubernetes.slack.com/archives/CJUQN3E4T/p1734593174791519) to make an exception and apply this change to current master for 1.33 and backport it to the next 1.32.x patch release for production usage.
This breaks downgrades to a 1.32 release without this change if there are ResourceClaims with a number of consumers > 32 in ReservedFor. In practice, this breakage is very unlikely because there are no workloads yet which need so many consumers and such downgrades to a previous patch release are also unlikely. Downgrades to 1.31 already weren't supported when using DRA v1beta1.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/assign @thockin
/cc @liggitt @johnbelamaric @klueska