[BUG] Scale down of deployment left pod references undeleted #483

Open
smoshiur1237 opened this issue Jun 20, 2024 · 6 comments · May be fixed by #556

@smoshiur1237

Describe the bug
We are seeing a situation in our test cluster where pod references are left undeleted during a scale-down operation on a deployment, and the number of leftover references grows if we repeat the scale-down/up cycle. After scaling the deployment down from 200 replicas to 1, most of the pods stay in the Terminating state and take a long time to be deleted from the cluster. Even after the deletion completes, some of the pod references are still visible, and they can accumulate over multiple scale up/down cycles. Here are some queries run during the process:

Scale down from 200 to 1, while most of the pods are still in the Terminating state:

date && kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  2.2.2.0-24 -o yaml | grep -c podref 
Thu Jun 20 06:56:08 UTC 2024
161
date && kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  3.3.3.0-24 -o yaml | grep -c podref
Thu Jun 20 06:53:34 UTC 2024
121

date && kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  4.4.4.0-24 -o yaml | grep -c podref 
Thu Jun 20 06:53:55 UTC 2024
66

date && kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  5.5.5.0-24 -o yaml | grep -c podref
Thu Jun 20 06:54:36 UTC 2024
11

kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  6.6.6.0-24 -o yaml | grep -c podref 
Thu Jun 20 06:54:56 UTC 2024
1

kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  7.7.7.0-24 -o yaml | grep -c podref 
Thu Jun 20 06:55:15 UTC 2024
1

date && kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  8.8.8.0-24 -o yaml | grep -c podref 
Thu Jun 20 06:55:35 UTC 2024
1

kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  9.9.9.0-24 -o yaml | grep -c podref 
Thu Jun 20 06:55:55 UTC 2024
1

After the scale down from 200 to 1 has completed and all deleted replicas are gone:

date && kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  2.2.2.0-24 -o yaml | grep -c podref
Thu Jun 20 08:30:53 UTC 2024
2
date && kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  3.3.3.0-24 -o yaml | grep -c podref
Thu Jun 20 08:32:41 UTC 2024
2
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  4.4.4.0-24 -o yaml | grep -c podref
2
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  5.5.5.0-24 -o yaml | grep -c podref
1
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  6.6.6.0-24 -o yaml | grep -c podref
1
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  7.7.7.0-24 -o yaml | grep -c podref
1
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  8.8.8.0-24 -o yaml | grep -c podref
1
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  9.9.9.0-24 -o yaml | grep -c podref
1

One extra pod reference is visible in the 2.2.2.0/24, 3.3.3.0/24 and 4.4.4.0/24 ranges. The leftover count grows if we perform multiple scale up/down operations on the deployment.
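
For convenience, the per-pool counts above can be gathered in one pass with a small shell loop over the pool names used in this setup (a minimal sketch, assuming the same pool object names and namespace as in the queries above):

# Count the recorded pod references in every pool in one pass.
for pool in 2.2.2.0-24 3.3.3.0-24 4.4.4.0-24 5.5.5.0-24 6.6.6.0-24 7.7.7.0-24 8.8.8.0-24 9.9.9.0-24; do
  echo "$pool: $(kubectl get ippools.whereabouts.cni.cncf.io -n kube-system "$pool" -o yaml | grep -c podref)"
done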

Expected behavior
All pod references belonging to deleted pods should be removed from the pools, so that only the references of running pods remain visible. For example, with only one pod running after the scale down, the expected output in this case would be:

date && kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  2.2.2.0-24 -o yaml | grep -c podref
Thu Jun 20 08:30:53 UTC 2024
1
date && kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  3.3.3.0-24 -o yaml | grep -c podref
Thu Jun 20 08:32:41 UTC 2024
1
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  4.4.4.0-24 -o yaml | grep -c podref
1
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  5.5.5.0-24 -o yaml | grep -c podref
1
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  6.6.6.0-24 -o yaml | grep -c podref
1
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  7.7.7.0-24 -o yaml | grep -c podref
1
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  8.8.8.0-24 -o yaml | grep -c podref
1
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system  9.9.9.0-24 -o yaml | grep -c podref
1

To Reproduce
Steps to reproduce the behavior:

  1. Set up the whereabouts test environment with kind by running make kind (1 control plane and 2 workers).
  2. Apply NetworkAttachmentDefinitions with different ranges, e.g. range1, range2, ... range8.
  3. Apply a deployment and scale it up to 200 replicas.
  4. Scale the deployment down to 1 and wait for all pods to be removed completely. The pods hang in the Terminating state for a long time before they are eventually removed.
  5. Check the IP pools with kubectl get ippools.whereabouts.cni.cncf.io -n kube-system 9.9.9.0-24 -o yaml | grep -c podref. Extra pod references remain visible that should have been removed when the pods were deleted (see the check sketched below).
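
To confirm which of the remaining entries are stale, the recorded pod references can be cross-checked against the pods that still exist. This is a minimal sketch, assuming each allocation in the pool YAML carries a podref: <namespace>/<pod-name> field, as the grep output above suggests:

# Print every podref recorded in a pool and flag the ones whose pod no longer exists.
kubectl get ippools.whereabouts.cni.cncf.io -n kube-system 2.2.2.0-24 -o yaml \
  | awk '/podref:/ {print $2}' \
  | while IFS=/ read -r ns name; do
      kubectl get pod -n "$ns" "$name" >/dev/null 2>&1 || echo "stale: $ns/$name"
    done

Any entry reported as stale corresponds to an allocation that should have been reclaimed when its pod was deleted.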

Environment:

  • Whereabouts version: latest
  • Kubernetes version (use kubectl version): 1.30
  • Network-attachment-definition: 8 ranges applied, 2.2.2.0/24 through 9.9.9.0/24
  • Whereabouts configuration (on the host): N/A
  • OS (e.g. from /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Others: N/A

Additional info / context
N/A

@smoshiur1237
Author

/cc @mlguerrero12

@adilGhaffarDev

@mlguerrero12 any update regarding this issue?

@mlguerrero12
Collaborator

No, I haven't worked on this yet. Will do soon.

@pallavi-mandole

Any update on this issue? We faced it again in our setup.

@adilGhaffarDev

I have done some investigation. The issue happens because leader election fails when we call cleanupFunc (which is IPManagement):

if _, err := pc.cleanupFunc(context.TODO(), types.Deallocate, *ipamConfig, wbClient); err != nil {

It fails with this error:
31 leaderelection.go:336] error initially creating leader election record : the server does not allow this method on the requested resource

It seems garbageCollectPodIPs is not working properly: any pod for which I see the following log has left stale pod refs behind after it was deleted:
[verbose] stale allocation to cleanup: {ContainerID:b75419a947086fc3e0243193a8c32111397cba337c10b6424986109e9caab3d4 PodRef:default/n-dep-9dc75fc9b-jgfhp IfName:net4}
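
To see whether the control loop flagged and attempted cleanup for a given pod, the reconciler logs can be searched for the message above together with any leader-election errors. A rough sketch, assuming whereabouts runs as the whereabouts daemonset in kube-system as in the upstream manifests:

# Grep one whereabouts pod's logs for stale-allocation and leader-election messages.
kubectl logs -n kube-system daemonset/whereabouts | grep -E 'stale allocation to cleanup|leaderelection'

Since kubectl logs on a daemonset picks only one pod, the same check may need to be repeated for the pod on each node.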

cc @maiqueb: since this was committed by you, some help would be appreciated.
I am also trying to understand the code and debug more.

cc @dougbtv @s1061123 please check.

@adilGhaffarDev linked a pull request Feb 6, 2025 that will close this issue
@adilGhaffarDev

/assign @adilGhaffarDev
Assigning this to myself. I have opened a PR, please check: #556
