@alaypatel07
Contributor

@alaypatel07 alaypatel07 commented Oct 3, 2025

  • One-line PR description:
    KEP-5304: Adding downward API for DRA Device Attributes to Pod
  • Other comments:

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Oct 3, 2025
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 3, 2025
@alaypatel07
Contributor Author

/wg-device-management

@alaypatel07 alaypatel07 force-pushed the kep-5304 branch 3 times, most recently from 3c92417 to 3557abc Compare October 3, 2025 16:27
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alaypatel07
Once this PR has been reviewed and has the lgtm label, please assign dchen1107, johnbelamaric for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 3, 2025
@alaypatel07 alaypatel07 force-pushed the kep-5304 branch 2 times, most recently from 92f13db to 96e3424 Compare October 3, 2025 17:03
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 3, 2025
@alaypatel07 alaypatel07 force-pushed the kep-5304 branch 2 times, most recently from f1d195c to 6568154 Compare October 3, 2025 17:04
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 3, 2025
@kannon92
Contributor

kannon92 commented Oct 3, 2025

Please combine PRR into this PR.

We usually review as one PR.

@alaypatel07
Contributor Author

@kannon92 This PR already has the PRR. If a single PR is to be reviewed, all I need to do is close the other PR

// DRADeviceFieldRef selects a DRA-resolved device attribute for a given claim+request.
// +featureGate=DRADownwardDeviceAttributes
// +structType=atomic
type DRADeviceFieldRef struct {
Member

This can map to more than one device, since each request might ask for multiple devices. How is the data surfaced in the env variables or the volume? The DRA device names don't necessarily map to any identifier that is known to the consumer of this information.

Contributor Author

Added a section, Multi-device requests, under Kubelet Implementation.

Multi-device requests:

When deviceIndex is unset, kubelet resolves the attribute across all allocated devices for the request, preserving allocation order, and joins values with a comma (",") into a single string. Devices that do not report the attribute are skipped. If no devices provide the attribute, the value is considered not ready.

When deviceIndex is set, kubelet selects the device at that zero-based index from the allocation results and resolves the attribute for that device only. If the index is out of range or the attribute is missing on that device, the value is considered not ready.
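
The multi-device semantics quoted above can be sketched in Go. This is only an illustration under stated assumptions, not the kubelet implementation: the `allocatedDevice` type, the `errNotReady` sentinel, and the `resolveAttribute` helper are hypothetical names, and attribute values are simplified to strings.

```go
package main

import (
	"errors"
	"strings"
)

// allocatedDevice is a hypothetical, simplified view of one allocated device:
// its name plus the attributes published for it in a ResourceSlice.
type allocatedDevice struct {
	Name       string
	Attributes map[string]string
}

// errNotReady is a hypothetical sentinel for "the value is considered not ready".
var errNotReady = errors.New("attribute value not ready")

// resolveAttribute resolves one attribute for a request.
// devices must be in allocation order; deviceIndex is nil when unset.
func resolveAttribute(devices []allocatedDevice, attribute string, deviceIndex *int) (string, error) {
	if deviceIndex != nil {
		// deviceIndex set: resolve for exactly one device (zero-based).
		i := *deviceIndex
		if i < 0 || i >= len(devices) {
			return "", errNotReady // index out of range
		}
		v, ok := devices[i].Attributes[attribute]
		if !ok {
			return "", errNotReady // attribute missing on that device
		}
		return v, nil
	}

	// deviceIndex unset: collect values in allocation order,
	// skipping devices that do not report the attribute.
	var values []string
	for _, d := range devices {
		if v, ok := d.Attributes[attribute]; ok {
			values = append(values, v)
		}
	}
	if len(values) == 0 {
		return "", errNotReady // no device provides the attribute
	}
	return strings.Join(values, ","), nil
}
```

Because devices lacking the attribute are skipped when `deviceIndex` is unset, a position in the joined string does not necessarily correspond to the device's index in the allocation results.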

3. Watches ResourceSlices: Resolves standardized attributes from `spec.devices[*].attributes` for the matching device name
4. Maintains Cache: Keeps a per-Pod map of `(claimName, requestName) -> {attribute: value}` with a readiness flag

Resolution Semantics:
Member

What about error handling here? I'm wondering what happens if:

  • A claim or request can't be found?
  • The requested attribute is not available for one or more of the allocated devices?

Contributor Author

See this: #5606 (comment)

Contributor Author

See the section Failure on missing data

Member

@mortent mortent left a comment

Just a few comments, but overall this looks good to me.

Member

@johnbelamaric johnbelamaric left a comment

a few minor things to resolve, otherwise LGTM including PRR

- Older kubelet without the feature will ignore `resourceSliceAttributeRef` (it is dropped during decoding)
- Containers still start; env vars/volumes referencing `resourceSliceAttributeRef` will not be populated
- **Risk**: Workloads relying on these values may misbehave
- **Mitigation**: Avoid relying on the field until all kubelets are upgraded; gate scheduling to upgraded nodes (e.g., using node labels/taints) or keep the feature gate disabled on the API server until nodes are updated
Member

consider integrating with #5328

Contributor Author

+1, does this need any work in the KEP or should this be an implementation issue?

Member

If you plan to do it, I would note it here. Non-blocking though.

Signed-off-by: Alay Patel <[email protected]>
@alaypatel07 alaypatel07 force-pushed the kep-5304 branch 2 times, most recently from 71d47e6 to c96efd1 Compare October 16, 2025 03:24
@johnbelamaric
Member

PRR is OK now, just need approval from SIG Node. cc @klueska @mrunalp @dchen1107 @SergeyKanzhelev

@SergeyKanzhelev
Member

@klueska is marked as a primary approver in kep.yaml and I would really like to hear if this is what is needed.

### Goals

- Provide a stable Downward API path for device attributes associated with `pod.spec.resourceClaims[*]` requests
- Support device attributes from `ResourceSlice` that are requested by user in pod spec
Member

Do we need to provide the DRA driver name in addition to attributes? Are there fungibility scenarios where the pod doesn't even know which DRA driver gave it a resource?

Contributor Author

Pods will always know which DRA driver gave them a resource through the combination of resource claim name + request name + device name.

However, with the prioritized list feature, a user can ask for a device through a oneOf (set of requests) mechanism. In that case the user might have to make sure all the requests have the attribute. This is a user/UX problem, however.

Member

I don't think it is a UX problem. If we envision some kind of one-of, then the user will need to know which one was picked. So the information about the DRA plugin needs to be exposed via the downward API as well. At least this is how I understand the scenario here.

Member

Interesting points @SergeyKanzhelev. Yes, in the prioritized list, if you get different devices some will have the attribute and some might not. If one is a GPU and one is a CPU, for example.

On the one hand, it's not clear that passing attributes via the downward API makes sense in those use cases. So, should that block doing it in a use case where it does make sense? On the other hand, these kinds of "works with this but not that" situations create real friction for users. Hmm.

In the fungibility use cases with prioritized list, there are two current strategies for the containers: we can run a single container that is able to look at which devices it got and adjust appropriately, or we can run multiple containers and have one sleep if it's the wrong one. If we go with "missing attribute causes pod failure at container start time", these two features won't work together, because pods will fail if the selected device doesn't have the attribute.

We have also considered adding binding conditions in the resource claim and then allowing a controller to mutate the container image in PreBind. If we could also mutate the downward API spec, we could make them work together with that. But I don't think we can mutate that (I suspect it's not mutable).

So, two possible solutions:

  1. Somehow tie the downward API spec to the Device Request rather than the container config (seems totally wrong)
  2. Don't fail if attributes are missing

Contributor Author

@alaypatel07 alaypatel07 Oct 16, 2025

There is another solution to this: the attribute mentioned in the pod spec downward API could be added as a required attribute to the resource claim request. This will make sure the scheduler only picks devices that have this attribute. However, we still have to manage runtime issues like "resource slices going missing during pod creation".

Member

but that doesn't work in the fungibility case. it should be legit for us to pick a device that only works for one of two containers, based on the "two container" strategy above. And with the "one container" strategy, we would need to know which attributes to publish from which devices, based on the request choice.

Member

ie, if we put it in the selectors of the subrequest, then we would only want to run the container that will make use of that subrequest, but we have no way to specify such a thing today


### Non-Goals

- Expose the entirety of `ResourceClaim`/`ResourceSlice` objects in the Downward API
Member

so realistically a Pod can only rely on attributes that were specified in CEL expression? Other attributes may not exist.

Contributor Author

Sorry, I don't follow the question. The pod has the link to the resource claim which provided the device. The resource claim status has the request name + device name, and the device name in the resource slice has the attribute.

Member

no, the question is kind of similar to #5606 (comment)

If one wants to follow defensive programming and ensure that all attributes the Pod asks for are present on the devices that DRA gave it, the only way to do it is to specify those attributes somehow in a CEL expression.

One can rely on the "knowledge of devices". But it is not very reliable and makes containers crash long after scheduling, making it harder to investigate

### Notes/Constraints/Caveats (Optional)

- Environment variables are set at container start time: Once a container starts, its environment variables are immutable. If device attributes change after container start, env vars will not reflect the change.
- Resolution timing: Attributes are resolved at container start time (not at allocation time). There is no scheduler-side copying of attributes into `ResourceClaim`.
Member

please note the kubelet restart or container crash scenarios. The behavior must be declared in those cases. Ideally something aligned with #3721

We need to clearly articulate that the crash/restart of a container may leave it unable to start if the attribute or resource has disappeared

Contributor Author

We need to clearly articulate that the crash/restart of a container may leave it unable to start if the attribute or resource has disappeared

I agree, I will add this.


- Environment variables are set at container start time: Once a container starts, its environment variables are immutable. If device attributes change after container start, env vars will not reflect the change.
- Resolution timing: Attributes are resolved at container start time (not at allocation time). There is no scheduler-side copying of attributes into `ResourceClaim`.
- ResourceSlice churn: Resolution uses the contents of the matching `ResourceSlice` at container start. If the `ResourceSlice` (or the requested attribute) is missing at that time, kubelet records an event and fails the pod start.
Member

Why does the Pod fail to start? Do you mean container fails to start and pod follows whatever the restart policy is? Or do you explicitly want to change the Pod error handling for this case?

Contributor Author

Do you mean container fails to start and pod follows whatever the restart policy is?

I mean this, yes. I will clarify. There are two factors here, the resource claim is shared with all containers, but the downward API is for specific containers inside the pod. So this needs to be clearly stated.


## Motivation

Workloads that need to interact with DRA-allocated devices (like KubeVirt virtual machines) require access to device-specific identifiers such as PCIe bus addresses or mediated device UUIDs. In order to fetch the attributes from allocated device, users first have to go to ResourceClaimStatus, find the request and device name, and then look up the resource slice with device name to get the attribute value. Ecosystem project like KubeVirt must resort to custom controllers that watch these objects and inject attributes via annotations/labels or other custom mechanisms, often leading to fragile, error-prone and racy designs.
Member

how do people implement it today with Device Plugin and why DRA requirements are new?

Member

This may need to be filled in under the Alternatives section of this KEP.

Contributor Author

how do people implement it today with Device Plugin and why DRA requirements are new?

With device plugins, the drivers are expected to populate this environment variable based on the name of the device.

However, in the case of DRA, since there is an indirection in the API, it takes three lookups to get from the pod to the attribute value: look up the Pod spec to find the resource claim name, look up the resource claim status to find the device name, and look up the resource slice to find the attribute value. So there are two options:

  1. Ask the drivers to implement an env variable PCI_CLAIMNAME_REQUESTNAME_DEVICENAME=<attribute_value>. As you can see, the env API is very constricted; with three levels of indirection, it is very hard for drivers to generate this information and for workloads to find it.
  2. Write custom controllers to infer this value and populate the env variable. This is how it is implemented in KubeVirt now, as an alpha feature; however, it requires setting the attribute value in the KubeVirt CR status. This creates problems when KubeVirt tries to migrate a VM from one node to another, where the attribute values have to change and be coordinated. Ideally, if the pod has the env variable, it can just come up on the new node and find its device metadata.
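
To make the three lookups concrete, here is a minimal Go sketch of the chain a consumer has to walk today. The types are simplified, hypothetical stand-ins (not the real `k8s.io/api` types), device names are assumed unique across the given slices, and only the first device allocated to a request is returned.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for the parts of a Pod that matter here.
type pod struct {
	// ResourceClaims maps the pod-local claim name to the ResourceClaim object name.
	ResourceClaims map[string]string
}

// allocationResult mirrors one entry of a claim's allocation results.
type allocationResult struct {
	Request string
	Device  string
}

// resourceClaim is a hypothetical stand-in holding only allocation results.
type resourceClaim struct {
	Results []allocationResult
}

// resourceSlice is a hypothetical stand-in: device name -> attributes.
type resourceSlice struct {
	Devices map[string]map[string]string
}

// lookupAttribute walks Pod -> ResourceClaim -> ResourceSlice.
func lookupAttribute(p pod, claims map[string]resourceClaim, slices []resourceSlice,
	podClaimName, request, attribute string) (string, error) {
	// Step 1: Pod spec/status gives the ResourceClaim name.
	claimObjName, ok := p.ResourceClaims[podClaimName]
	if !ok {
		return "", fmt.Errorf("pod has no resource claim %q", podClaimName)
	}
	claim, ok := claims[claimObjName]
	if !ok {
		return "", fmt.Errorf("resource claim %q not found", claimObjName)
	}
	// Step 2: the claim's allocation results give the device name for the request.
	for _, r := range claim.Results {
		if r.Request != request {
			continue
		}
		// Step 3: the ResourceSlice gives the attribute value for that device.
		for _, s := range slices {
			attrs, found := s.Devices[r.Device]
			if !found {
				continue
			}
			if v, has := attrs[attribute]; has {
				return v, nil
			}
			return "", fmt.Errorf("device %q has no attribute %q", r.Device, attribute)
		}
		return "", fmt.Errorf("device %q not found in any resource slice", r.Device)
	}
	return "", fmt.Errorf("request %q not found in claim %q", request, claimObjName)
}
```

Every step can fail independently, which is part of why doing this from a driver or a custom controller is awkward.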

Member

  1. Ask the drivers to implement an env variable PCI_CLAIMNAME_REQUESTNAME_DEVICENAME=<attribute_value>. As you can see, the env API is very constricted; with three levels of indirection, it is very hard for drivers to generate this information and for workloads to find it.

The DRA driver is publishing those attributes in the first place, so it should know what to inject. Or is this for scenarios where attributes are not controlled by the DRA plugin?

Contributor Author

This is for the driver that is publishing those attributes, correct, but the issue is with fulfilling the contract between the driver and KubeVirt. For KubeVirt to generate the right domxml for the device, it needs to know the GPU name that is configured, see this: https://github.com/kubevirt/kubevirt/blob/559fae099c734c7ba61332caef06567e9f572ddf/pkg/virt-launcher/virtwrap/device/hostdevice/dra/gpu_hostdev.go#L78-L83

However, this name exists purely in the KubeVirt workload spec; it is not available to the driver. So once the consumer discovers the PCI_CLAIMNAME_REQUESTNAME_DEVICENAME env variable set by the driver, it has to do a reverse lookup in the VMI spec to find the device name for it. If this is instead implemented as a contract between the pod spec and KubeVirt, then it is much easier to discover a device's attributes from inside the pod.

Member

@SergeyKanzhelev SergeyKanzhelev Oct 16, 2025

So the issue is that one of the attributes is not injected as an env variable today, and the assumption is that it is better to declare which attributes are needed in the pod spec than to update the DRA plugin to inject more attributes into all containers?

If this is the main scenario, it may be interesting to explore whether all env vars must be declared this way as the best practice. Having a mix of auto-injected and declared env vars sounds like trouble.

Contributor Author

Yes it will make the contract between the consumers and producers of device metadata information much simpler.

Member

Why not use the NodePrepareResources hook to plumb the information down to the driver? We are already passing the Claim to the driver there.


#### Story 2

As a DRA driver author, I want my driver to remain unchanged while allowing applications to consume device attributes (like `resource.kubernetes.io/pcieRoot` or `dra.kubervirt.io/mdevUUID`) through the native Kubernetes Downward API.
Member

versioning story between DRA plugin and Pods will be interesting here. New DRA driver needs to fully rollout before new attributes can be consumed. However Pod has no means to check the DRA driver version on the node when scheduling.

Is this something we want to handle in this KEP? Could some kind of CEL expression solve this problem, like a semver check of the DRA driver version?

Contributor Author

versioning story between DRA plugin and Pods will be interesting here. New DRA driver needs to fully rollout before new attributes can be consumed. However Pod has no means to check the DRA driver version on the node when scheduling.

I agree with this; however, this is a separate problem that will surface in other places, like the usage of attributes in CEL expressions. IMHO this should be a separate effort.

Member

Usage in CEL should not affect whether pod can start. Or are you saying that a CEL expression can be used to determine the DRA version for proper allocation of the Pod on nodes with the "fresh" DRA version? If so, it is worth mentioning in the KEP.

Contributor Author

Usage in CEL should not affect whether pod can start.

CEL usage in ResourceClaim does affect pod startup, in the sense that if a user requests, through a CEL expression, an attribute that is not present due to a driver upgrade, the pod will be stuck in the scheduling state forever. So what I am saying is that versioning of attributes is a completely separate unsolved problem; we have discussed this in the #wg-device-management meeting, but unfortunately it isn't tracked anywhere.

Member

Yes, we are saying the same thing. I just prefer things not being scheduled over them occupying space and crashlooping. So the solution here may be that the Pod is not scheduled until the scheduler sees these attributes on a device. Do you see a scenario where it's best to schedule the Pod and let it wait for the attribute to appear on a device?


####

### Kubelet Implementation
Member

do we want the presence of attributes to be a runtime failure as described? Or make it pod admission failure and proactively check the whole Pod's containers ahead of time when pod sandbox is being created?

The bad thing about a runtime failure: for pods with restart policy Always, it will mean that the Pod gets stuck in crash loop backoff.

Contributor Author

@alaypatel07 alaypatel07 Oct 16, 2025

do we want the presence of attributes to be a runtime failure as described? Or make it pod admission failure and proactively check the whole Pod's containers ahead of time when pod sandbox is being created?

Do these proactive checks get retried, or do they drive the pod into a terminal state? I am worried about the slow-informer case, where the attributes arrive later than the pre-sandbox check.

Member

it should very much be the odd and exceptional case that the RS is late or gone. Remember, it had to already exist to be selected. I am not worried about races where somehow the driver runs, publishes the resource slice, pod gets scheduled, the kubelet picks it up, but it hasn't yet seen the RS. That seems highly unlikely. The RS informer would have to be way way behind the Pod informer on the same kubelet and apiserver connection, which AFAIK seems unrealistic?

Contributor Author

@johnbelamaric for the attribute resolution to happen, we need both the RS informer and the ResourceClaim (RC) informer to be caught up. While I agree with you that a slow RS informer seems unrealistic (and probably not worth solving for, assuming a lot of other things will fail at that point), the RC informer, which provides the latest RC that was created for this pod, could be behind as well, leading to the issues I mentioned above.

Member

So imagine you have a large training job with restartPolicy=Never. Could this slowdown of informers, which can happen because of the scale of the cluster, affect the job so that some pods simply do not start?

Since restartPolicy=Never, the container will not try to restart after the first failure, so the whole job will be jeopardized.

- Failure on missing data: If the `ResourceSlice` is not found, or the attribute is absent on any allocated device at container start, kubelet records a warning event and returns an error to the sync loop. The pod start fails per standard semantics (e.g., `restartPolicy` governs restarts; Jobs will fail the pod).
- Multi-device requests: Kubelet resolves the attribute across all allocated devices for the request, preserving allocation order, and joins values with a comma (",") into a single string. If any allocated device does not report the attribute, resolution fails (pod start error).
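
The stricter semantics in the two bullets above (any missing attribute fails resolution) might look like the following Go sketch. `device` and `joinStrict` are hypothetical names introduced for illustration; the real kubelet code path, event recording, and error types are not shown.

```go
package main

import (
	"fmt"
	"strings"
)

// device is a hypothetical, simplified view of one allocated device.
type device struct {
	Name       string
	Attributes map[string]string
}

// joinStrict joins the attribute values of all allocated devices in
// allocation order and fails if any device does not report the attribute.
func joinStrict(devices []device, attribute string) (string, error) {
	if len(devices) == 0 {
		// Assumption: no resolvable devices (e.g. missing ResourceSlice) is an error.
		return "", fmt.Errorf("no allocated devices found for attribute %q", attribute)
	}
	values := make([]string, 0, len(devices))
	for _, d := range devices {
		v, ok := d.Attributes[attribute]
		if !ok {
			// The caller would surface this as a warning event and a start error.
			return "", fmt.Errorf("device %q does not report attribute %q", d.Name, attribute)
		}
		values = append(values, v)
	}
	return strings.Join(values, ","), nil
}
```

Under this variant a single device without the attribute is enough to block container start, which is the interaction with prioritized lists discussed elsewhere in this thread.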

Security & RBAC:
Member

Are all attributes that DRA exposes OK for a Pod to consume? Is there any sensitive information that cluster administrators may need to keep away from users in multi-tenant environments?

Contributor Author

See this: #5606 (comment)

Member

Can you please add this to the non-goals then, to avoid confusion?


- Name: `DRADownwardDeviceAttributes`
- Stage: Alpha (v1.35)
- Components: kube-apiserver, kubelet, kube-scheduler
Member

how kube-scheduler is affected?

Contributor Author

kube-scheduler is not affected here; I have the feature gate there just for consistency with other DRA features. I can remove it, however. Extra flags are of no use.

Member

@SergeyKanzhelev SergeyKanzhelev Oct 16, 2025

OK, this is aligned with my understanding then. Please remove it from here. I was very confused about what changes the scheduler would need here.

participating-sigs: []
status: implementable
creation-date: 2025-10-02
reviewers:
Member

need sig node reviewer here. You can use my name

Contributor Author

updated.

@SergeyKanzhelev
Member

SergeyKanzhelev commented Oct 16, 2025

@klueska is marked as a primary approver in kep.yaml and I would really like to hear if this is what is needed.

I added a review of the KEP mechanics. Clarifying those will be great before merging. When clarifying behaviors, please mention the corresponding tests in the tests section.

On the semantics side I do not have a strong understanding. I added a couple of comments on this, but @klueska's feedback will be very useful.

@SergeyKanzhelev
Member

I tried to summarize decisions needed in this KEP:

This KEP introduces a new failure mode (missing attribute) at the "late" stage of container start. This pushes the complexity of implementing fungibility scenarios, and of ensuring a version match between the DRA plugin and the Pod, up to the control plane. At the same time, this decision opens opportunities for future scenarios like late "discovery" of attributes. However, DRA today is in a state where all attributes are static and known to the DRA plugin. This will likely force developers to write additional CEL conditions for each attribute they use in the downward API, to make sure their Pods are never scheduled on a node where they would crash continuously. Moreover, the KEP does not attempt to eliminate or discourage the automatic env var injection by plugins that works today, which makes the state of things more confusing.

If we believe that long term we will need the flexibility of attributes being sourced from different controllers, and the ability to schedule a Pod so that it waits for attribute availability, this KEP will enable those.

If we believe that attributes will more or less stay static, then moving the failure to earlier stages, all the way to scheduling, would make the most sense.

Lastly, the scenario driver today is the fact that the DRA plugin is not injecting all attributes into the Pod. If we believe DRA plugins will continue increasing the number of attributes and most of them will only be meaningful to a subset of workloads, this KEP makes sense. If we see that there are only a handful of attributes any workload ever needs, and the DRA plugin is OK to inject them all, this KEP is not bringing much value.

Comment on lines 199 to 201
2. Watches `ResourceClaim` objects in the Pod's namespace to retrieve allocation information
3. Watches `ResourceSlice` objects for the node and driver to resolve device attributes
4. Maintains a per-Pod cache of `(claimName, requestName) -> {attribute: value}` mappings
Member

This has a serious scalability impact

Contributor

@klueska klueska Oct 20, 2025

I don't think we want the kubelet to be in the business of watching all ResourceSlices. We would need a way of allowing the kubelet to quickly look up the device in a specific resource slice on-demand (possibly caching it upon first look up). Either that, or make sure that it only opened a watch for ResourceSlices matching the current node.
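
One way to picture the on-demand lookup with caching suggested here is the Go sketch below. It is an assumption-laden illustration, not a proposed kubelet design: `deviceKey`, `fetchFunc`, and `attributeCache` are hypothetical, the fetch hook stands in for whatever API call would read a single ResourceSlice, and cache invalidation is deliberately left out.

```go
package main

import "sync"

// deviceKey identifies one device; the field choice is illustrative.
type deviceKey struct {
	Driver, Pool, Device string
}

// fetchFunc is a hypothetical hook that performs the on-demand lookup
// of a device's attributes (for example, by reading one ResourceSlice).
type fetchFunc func(key deviceKey) (map[string]string, error)

// attributeCache remembers attributes after the first successful lookup,
// so no long-lived watch on all ResourceSlices is needed.
type attributeCache struct {
	mu      sync.Mutex
	fetch   fetchFunc
	entries map[deviceKey]map[string]string
}

func newAttributeCache(fetch fetchFunc) *attributeCache {
	return &attributeCache{
		fetch:   fetch,
		entries: make(map[deviceKey]map[string]string),
	}
}

// get returns cached attributes, fetching them only on the first request.
// Failed fetches are not cached, so a later call retries the lookup.
func (c *attributeCache) get(key deviceKey) (map[string]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if attrs, ok := c.entries[key]; ok {
		return attrs, nil
	}
	attrs, err := c.fetch(key)
	if err != nil {
		return nil, err
	}
	c.entries[key] = attrs
	return attrs, nil
}
```

The sketch holds the lock during the fetch, which serializes lookups, and it never refreshes an entry, so a real design would still need to decide how stale values are detected when a ResourceSlice changes.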

Comment on lines 287 to 292
The kubelet runs a local DRA attributes controller that:

1. Watches Pods: Identifies Pods on the node with `pod.spec.resourceClaims` and tracks their `pod.status.resourceClaimStatuses` to discover generated ResourceClaim names
2. Watches ResourceClaims: For each relevant claim, reads `status.allocation.devices.results[*]` and maps entries by request name
3. Watches ResourceSlices: Resolves standardized attributes from `spec.devices[*].attributes` for the matching device name
4. Maintains Cache: Keeps a per-Pod map of `(claimName, requestName) -> {attribute: value}` with a readiness flag
Member

@aojea aojea Oct 19, 2025

Can you please elaborate on the implementation?
Does it open a new watch per ResourceClaim and ResourceSlice?
The Kubelet already does a Get on the NodePrepareResources hook and passes the claim to the driver

claimName: pgpu-claim
requestName: pgpu-request
attribute: resource.kubernetes.io/pcieRoot
# If multiple devices are allocated for this request, values are joined with "," in allocation order.
Contributor

This feels fragile

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 22, 2025
Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

9 participants