@camrynl camrynl commented May 1, 2025

Reason for Change:

Update the release testing CI: restart the Cilium daemonset after node restarts to clean up stale state and endpoints before the state file check runs.
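
For illustration, the restart step could look roughly like the sketch below. This is not the script changed in this PR; the function name `restart_cilium_if_installed`, the daemonset name `cilium`, the `kube-system` namespace, and the 300s timeout are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: daemonset name, namespace, and timeout are assumptions,
# not values taken from this PR.
restart_cilium_if_installed() {
  local ns="${1:-kube-system}" ds="${2:-cilium}"

  # Skip quietly when the daemonset is absent (cluster does not run Cilium).
  if ! kubectl get daemonset "$ds" -n "$ns" >/dev/null 2>&1; then
    echo "cilium daemonset not found in $ns, skipping restart"
    return 0
  fi

  kubectl rollout restart "daemonset/$ds" -n "$ns"
  # Block until the rollout finishes so later checks see a settled state.
  kubectl rollout status "daemonset/$ds" -n "$ns" --timeout=300s
}
```

Waiting on `kubectl rollout status` before the state file check is a design choice in this sketch: it avoids validating while agents are still coming back up.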

Issue Fixed:

Requirements:

Notes:

@camrynl camrynl added the ci Infra or tooling. label May 1, 2025
@Copilot Copilot AI review requested due to automatic review settings May 1, 2025 22:17
@camrynl camrynl requested a review from a team as a code owner May 1, 2025 22:17
@camrynl camrynl requested a review from snguyen64 May 1, 2025 22:17
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

camrynl commented May 1, 2025

/azp run Azure Container Networking PR

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

done
fi
# Restart cilium if it is installed, bpf maps and endpoint states can be stale after a node restart (versions < v1.17)
Contributor

If the node is restarting, won't that automatically restart Cilium?

Contributor Author

Yes, Cilium is initially restarted, but somehow not all of the old endpoints/states are reset.
The validate state scenario is failing: the clusters have 1-2 pods marked Unknown, and their Cilium endpoints are left in waiting-for-identity.
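
One way to spot endpoints in this condition is sketched below. It assumes the `CiliumEndpoint` custom resource exposes its state at `.status.state` and that `ready` is the healthy value; the function name `list_unready_endpoints` is hypothetical.

```shell
# Sketch only: print "<namespace>/<name> <state>" for every CiliumEndpoint
# whose state is not "ready". Assumes the state lives at .status.state.
list_unready_endpoints() {
  kubectl get ciliumendpoints --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,STATE:.status.state' \
    | awk '$3 != "ready" { print $1 "/" $2, $3 }'
}
```

An empty result would mean every endpoint reports `ready`; any line printed points at a pod worth inspecting.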

The agent logs suggest it is stuck in a create/delete loop:

time="2025-04-30T16:54:27Z" level=warning msg="Cancelled endpoint create request due to receiving endpoint delete request" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=daemon
time="2025-04-30T16:54:27Z" level=warning msg="Unable to release endpoint ID" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x cniAttachmentID="d94a49c4398b79442842dfc1d17793d6a9abf9dc5e618f769a62123fc5d9b0e3:eth0" error="Unable to release endpoint ID 219" state=waiting-for-identity subsys=endpoint-manager
time="2025-04-30T16:54:27Z" level=info msg="Removed endpoint" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=endpoint
time="2025-04-30T16:54:27Z" level=warning msg="Ignoring error while deleting endpoint" endpointID=219 error="Unable to delete key 192.168.7.61:0 from /sys/fs/bpf/tc/globals/cilium_lxc: unable to delete element 192.168.7.61:0 from map cilium_lxc: delete: key does not exist" subsys=daemon
time="2025-04-30T16:54:27Z" level=warning msg="Error changing endpoint identity" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 error="unable to resolve identity: exponential backoff cancelled via context: context canceled" identityLabels="k8s:io.cilium.k8s.namespace.labels.addonmanager.kubernetes.io/mode=Reconcile,k8s:io.cilium.k8s.namespace.labels.control-plane=true,k8s:io.cilium.k8s.namespace.labels.kubernetes.azure.com/managedby=aks,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/cluster-service=true,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=metrics-server,k8s:io.kubernetes.pod.namespace=kube-system,k8s:k8s-app=metrics-server,k8s:kubernetes.azure.com/managedby=aks" ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=endpoint

There is an attempt to delete the endpoint, but it doesn't exist in the map, so the error is ignored and the agent keeps trying to create the endpoint again.
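
To gauge how often an endpoint goes around this loop, the cancelled-create warnings can be counted per endpoint ID. The sketch below reads agent log lines on stdin in the format shown above; the function name `count_cancelled_creates` is hypothetical.

```shell
# Sketch only: count "Cancelled endpoint create request" warnings per
# endpoint ID. Reads agent log lines on stdin and prints
# "<endpointID> <count>", one endpoint per line.
count_cancelled_creates() {
  grep 'msg="Cancelled endpoint create request' \
    | sed -n 's/.* endpointID=\([0-9][0-9]*\).*/\1/p' \
    | sort -n | uniq -c \
    | awk '{ print $2, $1 }'
}
```

It could be fed from `kubectl logs` for the affected agent pod; a count that keeps growing for the same ID would be consistent with the loop described here.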

When I delete the broken pods or restart Cilium, the issue goes away and the endpoints are restored.

Contributor

Can we investigate why the pod status is Unknown? My concern is that we are now restarting CA twice just to make the test pass. There might be an underlying issue that we are missing here.

somehow the old endpoints/states are not all reset.

Weird, looks like an issue in CA then.

Contributor Author

Yes, will investigate further. Marking this change as draft for now.

@camrynl camrynl marked this pull request as draft May 2, 2025 16:47
@github-actions

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label May 17, 2025
@camrynl camrynl removed the stale Stale due to inactivity. label May 21, 2025
github-actions bot commented Jun 5, 2025

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Jun 5, 2025
@github-actions

Pull request closed due to inactivity.

@github-actions github-actions bot closed this Jun 13, 2025
@github-actions github-actions bot deleted the camrynl/restartnode branch June 13, 2025 00:02
@rbtr rbtr restored the camrynl/restartnode branch June 23, 2025 17:54
@rbtr rbtr reopened this Jun 23, 2025
@rbtr rbtr added exempt-stale Keep this fresh and removed stale Stale due to inactivity. labels Jun 23, 2025

Labels

ci Infra or tooling. exempt-stale Keep this fresh

4 participants