The active pods considered by gpu-admission and gpu-manager are inconsistent

Hi @mYmNeo, I have noticed that when a GPU pod is created while other GPU pods are being deleted or are terminating, `UnexpectedAdmissionError` appears somewhat more frequently. The logic `gpu-admission` uses to get the *active GPU pods* on a node differs from that of `gpu-manager`. When `gpu-admission` gets the active pods, it seems to treat pods that are being deleted as still occupying GPUs, whereas `gpu-manager` excludes them. So I think their logic for getting active pods should be consistent, to reduce the occurrence of `UnexpectedAdmissionError` caused by inconsistent GPU selection.

- gpu-admission:
https://github.com/tkestack/gpu-admission/blob/47d56ae99ef7f24f2c9c4d33d17567e2e52f3ba2/pkg/predicate/gpu_predicate.go#L213-L215

- gpu-manager:
https://github.com/tkestack/gpu-manager/blob/c961e77c3e65ef68299d0ba8ccb945b063896a03/pkg/services/watchdog/watchdog.go#L137-L138


The active pods considered by gpu-admission and gpu-manager are inconsistent #23


	if (pod.Spec.NodeName == node.Name || predicateNode == node.Name) &&
		pod.Status.Phase != corev1.PodSucceeded &&
		pod.Status.Phase != corev1.PodFailed {
