Pod schedule failure when restarting device plugin #812
Comments
Will a new pod be started if you use a Deployment, for example? Or will it have the same issue?
A Deployment would fix the issue, and that's what we have done. Reducing the devices to 0 would evict the currently running pods, which is undesirable. We would prefer to keep pods running while we change what resources are available. I have filed a bug with k/k here to see if there's something smart they can do: The best fix would be to find a way to change resources without having to restart the device plugin. Do you happen to know if it would be easy to make the plugin more modular? We'd like to be able to just update the ConfigMap and send a signal to the plugin process, where the plugin then re-reads the files and updates the ListAndWatch data structures.
That is something we would like to have in the device plugin. BTW, I am working on a drain improvement where we will switch the steps to
this way we may not even need to cordon the node.
I believe the config daemon comes with a switch, "disableDrain: true", which we use.
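For reference, the switch mentioned above is set on the operator's `SriovOperatorConfig` custom resource. A sketch of what that looks like, assuming the default CR name and the upstream namespace (your namespace may differ, e.g. `openshift-sriov-network-operator` on OCP):

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  # Skip draining/cordoning the node when the config daemon applies changes.
  disableDrain: true
```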
Do you have a single-node cluster?
Hi @thatSteveFan, I opened a PR on u/s k8snetworkplumbingwg/sriov-network-device-plugin#502. Can you give this image a try?
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. /close
@openshift-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
When the device plugin is restarted at [1], kubelet marks the resource as unhealthy, but still reports the resource as existing for a grace period (5 minutes) [2][3]. If a pod is scheduled before the device plugin comes back up, pod creation fails without a retry loop, with the error message [4]:

Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices <DEVICE_NAME>, which is unexpected

To reproduce: cause the device plugin to restart, then schedule a pod that uses a resource the plugin registers. If your pod creation is fast enough, the pod will crash. The window is narrow (~8 s in our environment), so it is hard to hit.
[1] https://github.com/openshift/sriov-network-operator/blob/master/pkg/daemon/daemon.go#L645
[2] https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/devicemanager/manager.go#L417
[3] https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/devicemanager/endpoint.go#L76
[4] https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/devicemanager/manager.go#L594