DRA: Handle permanent driver failures #5322

### Enhancement Description

DRA drivers may encounter errors that make it impossible for the `NodePrepareResources` gRPC call to ever succeed for the devices that kube-scheduler allocated to a pod. Currently, pods in that state are retried forever, wasting CPU cycles in the kubelet and the DRA driver. This proposal describes a way to break that cycle when the retries are known to fail.

/sig node
/wg device-management
/assign @nojnhuh
/cc @pohly @lauralorenz @SergeyKanzhelev 

- One-line enhancement description (can be used as a release note): DRA: Handle permanent driver failures
- Kubernetes Enhancement Proposal: https://github.com/nojnhuh/enhancements/blob/5322-dra-perma-err/keps/sig-node/5322-dra-driver-permanent-failure/README.md
- Discussion Link: 
  - [13 May 2025 SIG Node meeting](https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit?usp=sharing)
  - https://github.com/kubernetes/kubernetes/issues/125542
- Primary contact (assignee): @nojnhuh
- Responsible SIGs: SIG Node
- Enhancement target (which target equals which milestone):
  - Alpha release target (x.y):
  - Beta release target (x.y):
  - Stable release target (x.y):
- [ ] Alpha
  - [ ] KEP (`k/enhancements`) update PR(s):
    - https://github.com/kubernetes/enhancements/pull/5549 
  - [ ] Code (`k/k`) update PR(s):
  - [ ] Docs (`k/website`) update PR(s):



_Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently._
