Pod schedule failure when restarting device plugin #812
Comments
Will a new pod be started if you use a Deployment, for example? Or will it have the same issue?
A Deployment would fix the issue, and that's what we have done. Reducing the devices to 0 would evict the currently running pods, which is undesirable. We would prefer to keep pods running while we change what resources are available. I have filed a bug with k/k here to see if there's something smart they can do: The best fix would be to find a way to change resources without having to restart the device plugin. Do you happen to know if it would be easy to make the plugin more modular? We'd like to be able to just update the ConfigMap and send a signal to the plugin process, where the plugin then re-reads the files and updates the ListAndWatch data structures.
That is something we would like to have in the device plugin. BTW, I am working on a drain improvement where we will switch the steps to
this way we may not even need to cordon the node.
I believe the config daemon comes with a switch, "disableDrain: true", which we use.
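For reference, the switch mentioned above is set on the operator's `SriovOperatorConfig` custom resource. A sketch of what that looks like, assuming the default CR name and the upstream namespace (your namespace may differ, e.g. `openshift-sriov-network-operator` on OCP):

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  # Skip draining/cordoning the node when the config daemon applies changes.
  disableDrain: true
```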
Do you have a single-node cluster?
Hi @thatSteveFan, I opened a PR on u/s k8snetworkplumbingwg/sriov-network-device-plugin#502. Can you give this image a try?
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. /close
@openshift-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
When the device plugin is restarted at [1], kubelet marks the resource as unhealthy, but still reports the resource as existing for a grace period (5 minutes) [2][3]. If a pod is scheduled before the device plugin comes back up, pod creation fails without a retry loop, with the error message [4]:

Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices <DEVICE_NAME>, which is unexpected

To reproduce: cause the device plugin to restart, then schedule a pod that uses a resource the plugin registers. If your pod creation is fast enough, the pod will crash. The window is narrow (~8 s in our environment), so it is hard to hit.
[1] https://github.com/openshift/sriov-network-operator/blob/master/pkg/daemon/daemon.go#L645
[2] https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/devicemanager/manager.go#L417
[3] https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/devicemanager/endpoint.go#L76
[4] https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/devicemanager/manager.go#L594