Hi @mYmNeo, I've noticed that when a GPU pod is created while other GPU pods are being deleted or are terminating, UnexpectedAdmissionError appears somewhat more frequently. It looks like gpu-admission and gpu-manager use different logic to get the active GPU pods on a node: gpu-admission seems to treat pods that are being deleted as still occupying GPUs, while gpu-manager excludes them. I think the two should use the same logic for getting active pods, to reduce the UnexpectedAdmissionError occurrences caused by inconsistent GPU selection.
gpu-admission:
gpu-admission/pkg/predicate/gpu_predicate.go
Lines 213 to 215 in 47d56ae
gpu-manager:
https://github.com/tkestack/gpu-manager/blob/c961e77c3e65ef68299d0ba8ccb945b063896a03/pkg/services/watchdog/watchdog.go#L137-L138
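
To illustrate what I mean by consistent logic, here is a minimal sketch of a single active-pod check that both components could share. It is not the code from either repository: `Pod`, `IsActive`, `ActiveGPUPods`, `samplePods`, and the `GPURequest` field are hypothetical names, and I am assuming that "being deleted" means the pod's deletion timestamp is set and that finished pods (Succeeded or Failed) should also be skipped.

```go
package main

import (
	"fmt"
	"time"
)

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes pod.
// It is not the type used by gpu-admission or gpu-manager.
type Pod struct {
	Name              string
	Phase             string     // "Pending", "Running", "Succeeded", or "Failed"
	DeletionTimestamp *time.Time // non-nil once deletion has been requested
	GPURequest        int        // hypothetical: number of GPU units requested
}

// IsActive reports whether a pod should still be counted as holding GPUs.
// Assumption: pods being deleted and finished pods are treated as inactive,
// which matches the behaviour described above for gpu-manager.
func IsActive(p Pod) bool {
	if p.DeletionTimestamp != nil {
		return false // terminating: its GPUs are about to be released
	}
	return p.Phase != "Succeeded" && p.Phase != "Failed"
}

// ActiveGPUPods returns the pods that request GPUs and pass IsActive.
// If both components filtered through one function like this, they would
// agree on which GPUs are occupied.
func ActiveGPUPods(pods []Pod) []Pod {
	var active []Pod
	for _, p := range pods {
		if p.GPURequest > 0 && IsActive(p) {
			active = append(active, p)
		}
	}
	return active
}

// samplePods builds a small illustrative node state.
func samplePods() []Pod {
	now := time.Now()
	return []Pod{
		{Name: "train-a", Phase: "Running", GPURequest: 1},
		{Name: "train-b", Phase: "Running", DeletionTimestamp: &now, GPURequest: 1},
		{Name: "done-c", Phase: "Succeeded", GPURequest: 1},
		{Name: "web-d", Phase: "Running"}, // no GPU request
	}
}

func main() {
	for _, p := range ActiveGPUPods(samplePods()) {
		fmt.Println(p.Name) // prints "train-a"
	}
}
```

With this sample, only `train-a` is counted: `train-b` is terminating, `done-c` has finished, and `web-d` requests no GPU. Whether terminating pods should be excluded (as gpu-manager does) or included (as gpu-admission seems to) is up to the maintainers. The point is only that both sides apply the same rule.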