Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRA API: bump maximum size of ReservedFor to 256 #129543

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Jan 9, 2025

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

The original limit of 32 seemed sufficient for a single GPU on a node. But for shared non-local resources it is too low. For example, a ResourceClaim might be used to allocate an interconnect channel that connects all pods of a workload running on several different nodes, in which case the number of pods can be considerably larger.

256 is high enough for currently planned systems. If we need something even higher in the future, an alternative approach might be needed to avoid scalability problems.

Which issue(s) this PR fixes:

Required for NVIDIA GB200 and potentially Google TPU use cases.

Special notes for your reviewer:

Normally, increasing such a limit would have to be done incrementally over two releases. In this case we decided on
Slack (https://kubernetes.slack.com/archives/CJUQN3E4T/p1734593174791519) to make an exception and apply this change to current master for 1.33 and backport it to the next 1.32.x patch release for production usage.

This breaks downgrades to a 1.32 release without this change if there are ResourceClaims with a number of consumers > 32 in ReservedFor. In practice, this breakage is very unlikely because there are no workloads yet which need so many consumers and such downgrades to a previous patch release are also unlikely. Downgrades to 1.31 already weren't supported when using DRA v1beta1.

Does this PR introduce a user-facing change?

DRA API: the maximum number of pods which can use the same ResourceClaim is now 256 instead of 32. Beware that downgrading a cluster where this relaxed limit is in use to Kubernetes 1.32.0 is not supported because 1.32.0 would refuse to update ResourceClaims with more than 32 entries in the status.reservedFor field.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/issues/4381

/assign @thockin
/cc @liggitt @johnbelamaric @klueska

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. labels Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/code-generation sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 9, 2025
@pohly
Copy link
Contributor Author

pohly commented Jan 9, 2025

/hold

This and #129544 both need all necessary approvals before merging either of them.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2025
@klueska
Copy link
Contributor

klueska commented Jan 9, 2025

Thanks @pohly.

As mentioned in the description we got preliminary support for this (unprecedented) update from @thockin and @liggitt in https://kubernetes.slack.com/archives/CJUQN3E4T/p1734593174791519

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 9, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: f554f25029d1e3e9f7df66e32cb0154e6d9c9fd1

The original limit of 32 seemed sufficient for a single GPU on a node. But for
shared non-local resources it is too low. For example, a ResourceClaim might be
used to allocate an interconnect channel that connects all pods of a workload
running on several different nodes, in which case the number of pods can be
considerably larger.

256 is high enough for currently planned systems. If we need something even
higher in the future, an alternative approach might be needed to avoid
scalability problems.

Normally, increasing such a limit would have to be done incrementally over two
releases. In this case we decided on
Slack (https://kubernetes.slack.com/archives/CJUQN3E4T/p1734593174791519) to
make an exception and apply this change to current master for 1.33 and backport
it to the next 1.32.x patch release for production usage.

This breaks downgrades to a 1.32 release without this change if there are
ResourceClaims with a number of consumers > 32 in ReservedFor. In practice,
this breakage is very unlikely because there are no workloads yet which need so
many consumers and such downgrades to a previous patch release are also
unlikely. Downgrades to 1.31 already weren't supported when using DRA v1beta1.
@pohly pohly force-pushed the dra-reserved-for-limit branch from 800dc29 to 1cee368 Compare January 9, 2025 13:27
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 9, 2025
@k8s-ci-robot k8s-ci-robot requested a review from thockin January 9, 2025 13:27
@ffromani
Copy link
Contributor

ffromani commented Jan 9, 2025

/triage accepted
/priority important-soon

fixing paperwork

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 9, 2025
@sttts
Copy link
Contributor

sttts commented Jan 9, 2025

/lgtm

After fixing the go doc too.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 9, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: ba118ff1ceaa34eda749c437d9cb8f4e50b29cc3

@thockin
Copy link
Member

thockin commented Jan 9, 2025

/approve

For some reason I thought we got rid of v1a1.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 9, 2025
@pohly
Copy link
Contributor Author

pohly commented Jan 9, 2025

We support upgrades from 1.31 with v1alpha3 and it can still be enabled as API group if someone absolutely wants to try a DRA driver written for 1.31, but mostly it's just there to enable writing version skew tests.

It's not the default storage version (that got bumped to v1beta1), so removing v1alpha3 can be done at any time.

@pohly
Copy link
Contributor Author

pohly commented Jan 9, 2025

Cherry-pick is approved, ready to go...

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2025
@k8s-triage-robot
Copy link

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@pohly
Copy link
Contributor Author

pohly commented Jan 9, 2025

/hold

Let me test one slow test more thoroughly which didn't run in the presubmit job...

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2025
@liggitt
Copy link
Member

liggitt commented Jan 9, 2025

/retest

@pohly
Copy link
Contributor Author

pohly commented Jan 9, 2025

Let me test one slow test more thoroughly which didn't run in the presubmit job...

This needs a bit more time. I'll continue tomorrow.

We want to be sure that the maximum number of pods per claim are actually
scheduled concurrently. Previously the test just made sure that they ran
eventually.

Running 256 pods only works on more than 2 nodes, so network-attached resources
have to be used. This is what the increased limit is meant for anyway. Because
of the tightened validation of node selectors in 1.32, the E2E test has to
use MatchExpressions because they allow listing node names.
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jan 10, 2025
@k8s-ci-robot k8s-ci-robot requested a review from sttts January 10, 2025 08:50
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: klueska, pohly, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/code-generation area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Development

Successfully merging this pull request may close these issues.

8 participants