DRA: Handle permanent driver failures #5322

@nojnhuh

Enhancement Description

DRA drivers may encounter errors that make it impossible for the devices kube-scheduler allocated to a pod to ever be prepared successfully by the NodePrepareResources gRPC call to the driver. Currently, pods in that state are retried indefinitely, wasting CPU cycles in both the kubelet and the DRA driver. This proposal describes a way to break that cycle when the retries are known to fail.

/sig node
/wg device-management
/assign @nojnhuh
/cc @pohly @lauralorenz @SergeyKanzhelev

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

Metadata

Assignees

Labels

sig/node: Categorizes an issue or PR as relevant to SIG Node.

tracked/no: Denotes an enhancement issue is NOT actively being tracked by the Release Team.

wg/device-management: Categorizes an issue or PR as relevant to WG Device Management.

Type

No type

Projects

Status

🏗 In progress

Status

Removed

Status

Removed from Milestone

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
