Introduce Leader Election and use Deployments for agents #61
Conversation
This pull request is now in conflict. Could you fix it @davidepasquero? 🙏
This change introduces leader election to the VM DHCP agent, mirroring the mechanism used in the controller. This ensures that only one instance of the agent is active at a time, preventing potential conflicts and ensuring stable operation in a clustered environment. The implementation utilizes the `github.com/rancher/wrangler/pkg/leader` package. A `noLeaderElection` flag has been added to the agent's command-line options to allow bypassing leader election for development or testing purposes. Signed-off-by: davidepasquero <[email protected]>
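For illustration, a minimal sketch of how the agent's startup could be gated behind wrangler's leader election, honoring the new `noLeaderElection` flag. The `run` function, lock name, and wiring are assumptions, not the actual agent code:

```go
package main

import (
	"context"

	"github.com/rancher/wrangler/pkg/leader"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// run gates the agent's main loop behind a leader election lock so that
// only one replica is active at a time. With leader election disabled,
// the callback runs directly, which is handy for development and testing.
func run(ctx context.Context, namespace string, noLeaderElection bool) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	start := func(ctx context.Context) {
		// Start the agent's controllers and the DHCP server here.
	}

	if noLeaderElection {
		start(ctx)
	} else {
		// Blocks while competing for the lock; start runs only on the leader.
		leader.RunOrDie(ctx, namespace, "vm-dhcp-agent", client, start)
	}

	<-ctx.Done()
	return nil
}
```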
Refactors the Helm chart to deploy the VM DHCP agent as a separate Kubernetes Deployment, rather than having it managed directly by the controller. This change aligns with standard Kubernetes practices and improves resilience and manageability. Key changes include:
- Added a new Deployment manifest for the agent (`agent-deployment.yaml`).
- Introduced dedicated RBAC resources (ServiceAccount, ClusterRole, ClusterRoleBinding) for the agent to enable leader election and dynamic IPPool discovery.
- Updated `values.yaml` to provide configuration options for the new agent deployment and its associated resources.
- Modified the controller's Deployment manifest to remove the direct management and launching of the agent.

This chart refactoring is a prerequisite for upcoming Go code changes in the agent to implement IPPool watching and in the controller to remove agent process management. The agent now relies on its own leader election mechanism, implemented in a previous change. Signed-off-by: davidepasquero <[email protected]>
This commit removes the controller's responsibility for creating, managing, and monitoring dedicated agent pods for IPPools. This functionality is shifted to a separate agent deployment. Changes include:
- Removed DeployAgent and MonitorAgent logic from the IPPool controller.
- Deleted `status.agentPodRef` and the AgentReady condition from IPPool types.
- Removed CLI flags and configuration options related to agent images, namespaces, service accounts, and the no-dhcp flag for agents.
- Cleaned up associated helper functions, structs, and unused code.

Signed-off-by: davidepasquero <[email protected]>
The agent was previously hardcoded to use the 'kube-system' namespace for its leader election lock. This caused permission errors as the agent's ServiceAccount was not authorized for that namespace. This commit changes the agent to use its own namespace (obtained via the Downward API through the POD_NAMESPACE environment variable) for leader election. The Helm chart has been updated to provide this environment variable to the agent deployment. Signed-off-by: davidepasquero <[email protected]>
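A small sketch of the namespace resolution described above (the fallback value is illustrative, not taken from the actual code):

```go
package main

import "os"

// leaderElectionNamespace returns the namespace for the agent's leader
// election lock. POD_NAMESPACE is injected by the Helm chart via the
// Downward API; the "default" fallback is only for out-of-cluster runs.
func leaderElectionNamespace() string {
	if ns := os.Getenv("POD_NAMESPACE"); ns != "" {
		return ns
	}
	return "default"
}
```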
This commit updates the labels in `agent-clusterrole.yaml` to use the common labels helper and override the component, making it consistent with the approach used in `agent-clusterrolebinding.yaml`. This is a speculative change unlikely to fix a persistent 'forbidden' error if RBAC rules were otherwise correctly defined and applied, but it improves chart consistency. Signed-off-by: davidepasquero <[email protected]>
The webhook was failing TLS verification because its certificate was not valid for the FQDN the API server was using (e.g., `webhook-name.namespace.svc`). This was due to the underlying `harvester/webhook` library not receiving the correct namespace during certificate generation, leading to a malformed SAN (e.g., `webhook-name..svc`). This commit adds the `--namespace {{ .Release.Namespace }}` argument to the webhook container startup, ensuring the certificate is generated with the correct SANs. Note: A potential issue was observed where the webhook deployment specifies a ServiceAccount name (`{{fullname}}-webhook-webhook`) that might not be explicitly created by the chart. This may need further investigation if webhook pods encounter permission issues or fail to start due to SA problems, though it's unrelated to the certificate SAN issue. Signed-off-by: davidepasquero <[email protected]>
This commit changes the metadata.name for the agent's ClusterRole and ClusterRoleBinding to use a simpler naming convention based directly on `{{ .Release.Name }}`. This is an attempt to eliminate potential issues from complex name templating for these RBAC resources, in response to persistent 'configmaps forbidden' errors for the agent. The ServiceAccount name itself remains unchanged as it correctly matches the user identified in the error logs. Signed-off-by: davidepasquero <[email protected]>
This commit refactors the system to allow the main vm-dhcp-controller to dynamically update the vm-dhcp-agent's Deployment based on IPPool resources. When an IPPool is detected, the controller now:
1. Parses `IPPool.spec.networkName` to determine the target NetworkAttachmentDefinition (NAD).
2. Updates the agent Deployment's pod template to include the `k8s.v1.cni.cncf.io/networks` annotation pointing to this NAD.
3. Updates the `--nic` argument of the agent container to use the interface name associated with the Multus attachment (e.g., `net1`).

Changes include:
- Added RBAC permissions for the controller to get, list, watch, and patch Deployments in its namespace.
- Modified `pkg/controller/ippool/controller.go` to implement the logic for fetching and updating the agent Deployment, as sketched below.
- Updated `pkg/config/context.go` and `cmd/controller/run.go` to provide the necessary KubeClient and namespace information to the controller logic.
- Reverted the previous static Multus configuration in `values.yaml` and the agent's Helm template, as this is now handled dynamically.

The controller relies on an `AGENT_DEPLOYMENT_NAME` environment variable (set in its own deployment manifest) to locate the agent deployment for updates. Signed-off-by: davidepasquero <[email protected]>
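A rough sketch of what this update could look like; the `Handler` fields and the hardcoded container name are simplifications of the real controller code, not its actual shape:

```go
package ippool

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const agentContainerName = "vm-dhcp-agent" // illustrative; made configurable in a later commit

// Handler is a simplified stand-in for the real IPPool controller handler.
type Handler struct {
	kubeClient kubernetes.Interface
	namespace  string
}

// syncAgentDeployment attaches the agent pod to the NAD referenced by
// IPPool.spec.networkName and points --nic at the Multus interface.
func (h *Handler) syncAgentDeployment(ctx context.Context, nadNamespace, nadName string) error {
	name := os.Getenv("AGENT_DEPLOYMENT_NAME")
	dep, err := h.kubeClient.AppsV1().Deployments(h.namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Multus names the attached secondary interface after the "@" suffix.
	if dep.Spec.Template.Annotations == nil {
		dep.Spec.Template.Annotations = map[string]string{}
	}
	dep.Spec.Template.Annotations["k8s.v1.cni.cncf.io/networks"] =
		fmt.Sprintf("%s/%s@net1", nadNamespace, nadName)

	// Rewrite the agent container's --nic argument to the Multus interface.
	for i := range dep.Spec.Template.Spec.Containers {
		c := &dep.Spec.Template.Spec.Containers[i]
		if c.Name != agentContainerName {
			continue
		}
		for j := 0; j+1 < len(c.Args); j++ {
			if c.Args[j] == "--nic" {
				c.Args[j+1] = "net1"
			}
		}
	}

	// Requires update/patch on deployments in the controller's Role.
	_, err = h.kubeClient.AppsV1().Deployments(h.namespace).Update(ctx, dep, metav1.UpdateOptions{})
	return err
}
```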
This commit addresses several Go build errors in `pkg/controller/ippool/controller.go`:
- Removed unused imports for `k8s.io/api/apps/v1` (appsv1) and `encoding/json`.
- Added the missing import for the `os` package.
- Modified the `getAgentDeploymentName` function to take no arguments, since its implementation uses `os.Getenv` and never used the parameter.
- Updated the call to `getAgentDeploymentName` accordingly.

Signed-off-by: davidepasquero <[email protected]>
Removes a duplicate import of `k8s.io/apimachinery/pkg/apis/meta/v1` aliased as `metav1` in `pkg/controller/ippool/controller.go` to fix the 'metav1 redeclared in this block' build error. Signed-off-by: davidepasquero <[email protected]>
This commit extends the dynamic agent deployment modification logic. When the IPPool controller reconciles an IPPool and updates the agent deployment for NAD attachment (the Multus annotation and `--nic` argument), it now also updates the `IPPOOL_REF` environment variable in the agent's container. This ensures that the agent is not only connected to the correct network but is also configured to manage and serve IPs for the intended IPPool. Changes:
- Modified `syncAgentDeployment` in `pkg/controller/ippool/controller.go` to find and update, or add, the `IPPOOL_REF` env var (see the sketch below).
- Added the `corev1` import for `corev1.EnvVar`.
- Ensured `IPPOOL_REF` is defined in the agent deployment template (`chart/templates/agent-deployment.yaml`) so the controller can update it.

Signed-off-by: davidepasquero <[email protected]>
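The find-or-update-or-add pattern for the env var might look like this sketch (the helper name and return semantics are illustrative, not the actual code):

```go
package ippool

import corev1 "k8s.io/api/core/v1"

// setEnvVar updates or appends an environment variable on a container and
// reports whether anything changed, so callers can feed a needsUpdate flag.
func setEnvVar(container *corev1.Container, name, value string) bool {
	for i := range container.Env {
		if container.Env[i].Name == name {
			if container.Env[i].Value == value {
				return false // already up to date
			}
			container.Env[i].Value = value
			return true
		}
	}
	container.Env = append(container.Env, corev1.EnvVar{Name: name, Value: value})
	return true
}
```

Called as, for example, `needsUpdate = setEnvVar(c, "IPPOOL_REF", ipPool.Namespace+"/"+ipPool.Name) || needsUpdate`.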
This commit removes the `envVarUpdated` variable from the `syncAgentDeployment` function in `pkg/controller/ippool/controller.go` as it was declared but its value was not subsequently used. The `needsUpdate` flag already correctly handles the logic for determining if the agent deployment requires an update. Signed-off-by: davidepasquero <[email protected]>
This commit addresses two issues preventing the controller from dynamically updating the agent deployment:
1. Agent container name: the controller now uses an `AGENT_CONTAINER_NAME` environment variable (set to `{{ .Chart.Name }}-agent` in its own deployment manifest) to reliably find the agent container within the agent's deployment, replacing a hardcoded constant that could be mismatched (see the sketch below).
2. Deployment update RBAC: the controller's Role for managing deployments was granted the `update` verb in addition to `patch`, resolving the RBAC error raised when client-go attempts to update the agent deployment resource.

Signed-off-by: davidepasquero <[email protected]>
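A sketch of the env-var-based lookup replacing the hardcoded constant; the fallback value is illustrative, and the real code exposes this as a method on the handler rather than a plain function:

```go
package ippool

import "os"

// getAgentContainerName resolves the agent container's name from the
// AGENT_CONTAINER_NAME env var set on the controller's own Deployment,
// avoiding a hardcoded constant that can drift from the chart.
func getAgentContainerName() string {
	if name := os.Getenv("AGENT_CONTAINER_NAME"); name != "" {
		return name
	}
	return "vm-dhcp-agent" // illustrative fallback
}
```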
This commit corrects a log message in `pkg/controller/ippool/controller.go` that still referenced the removed constant `AgentContainerNameDefault`. The log message now uses the `h.getAgentContainerName()` method to obtain the agent container name, resolving the build error. Signed-off-by: davidepasquero <[email protected]>
This commit implements the functionality for the DHCP agent to self-configure a static IP address (the `serverIP` from the IPPool) on its Multus-attached network interface, allowing the DHCP server running in the agent to use the correct source IP for its replies. Changes include:
- Fixed the `--nic` argument update logic in the IPPool controller to prevent malformed arguments in the agent deployment.
- Added `--server-ip` and `--cidr` command-line flags to the agent.
- Updated `config.AgentOptions` to include these new fields.
- Modified the IPPool controller to extract `serverIP` and `cidr` from the `IPPool` spec and pass them as arguments to the agent deployment.
- Granted the `NET_ADMIN` capability to the agent container to allow network interface modification.
- Implemented a `configureInterface` method in `pkg/agent/agent.go` which uses `ip address add` and `ip link set up` commands to assign the static IP to the agent's target NIC before the DHCP server starts (sketched below).

Signed-off-by: davidepasquero <[email protected]>
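A simplified sketch of the interface configuration step; idempotency concerns (e.g., the address already being assigned after a restart) are glossed over:

```go
package agent

import (
	"fmt"
	"os/exec"
)

// configureInterface assigns the IPPool's serverIP to the Multus-attached
// NIC and brings the link up, so DHCP replies originate from the expected
// source address. The container needs NET_ADMIN for these commands to work.
func configureInterface(nic, serverIP, prefixLen string) error {
	addr := fmt.Sprintf("%s/%s", serverIP, prefixLen) // e.g. "192.168.100.2/24"
	if out, err := exec.Command("ip", "address", "add", addr, "dev", nic).CombinedOutput(); err != nil {
		return fmt.Errorf("ip address add %s dev %s: %v: %s", addr, nic, err, out)
	}
	if out, err := exec.Command("ip", "link", "set", nic, "up").CombinedOutput(); err != nil {
		return fmt.Errorf("ip link set %s up: %v: %s", nic, err, out)
	}
	return nil
}
```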
Removes a duplicate and misplaced import block that caused a `syntax error: imports must appear before other declarations` build error. The necessary packages were already correctly imported in the main import block at the top of the file. Signed-off-by: davidepasquero <[email protected]>
This commit addresses an issue where a newly started or newly elected leader DHCP agent pod might report "NO LEASE FOUND" for existing leases. This occurred because the DHCP server could start before its internal lease cache was fully synchronized with the IPPool custom resource's status. Changes:
1. Introduced an `InitialSyncDone` channel in the agent's IPPool event handler (`pkg/agent/ippool/event.go` and its local `controller.go`).
2. The local controller now populates the `DHCPAllocator`'s leases from the `IPPool.Status.IPv4.Allocated` map and then signals on the `InitialSyncDone` channel.
3. The main agent logic (`pkg/agent/agent.go`) now waits for this signal before starting the DHCP server, ensuring the cache is primed (see the sketch below).
4. The cache population logic in the local controller handles both adding new leases and removing stale ones based on the IPPool status.

Signed-off-by: davidepasquero <[email protected]>
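Conceptually, the gating looks like this sketch. The channel name follows the description above, but whether the real code closes or sends on it is not shown here, and the timeout is an added assumption:

```go
package agent

import (
	"context"
	"fmt"
	"time"
)

// InitialSyncDone is closed by the agent's local IPPool controller once the
// DHCPAllocator's lease cache has been primed from IPPool.Status.IPv4.Allocated.
var InitialSyncDone = make(chan struct{})

// waitForInitialSync blocks DHCP server startup until the lease cache is
// primed, so a new leader never answers "NO LEASE FOUND" for leases that
// already exist in the IPPool status.
func waitForInitialSync(ctx context.Context) error {
	select {
	case <-InitialSyncDone:
		return nil
	case <-time.After(2 * time.Minute):
		return fmt.Errorf("timed out waiting for initial IPPool sync")
	case <-ctx.Done():
		return ctx.Err()
	}
}
```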
This commit addresses several build errors in `pkg/agent/ippool/controller.go` and corrects logic within its `Update` method:
- Added missing imports for `sync` and `github.com/harvester/vm-dhcp-controller/pkg/util`.
- Removed the unused variable `currentLeasesInAllocator`.
- Corrected the logic for checking lease existence using `c.dhcpAllocator.GetLease()`.
- Ensured correct handling of the `LeaseTime` type conversion for `dhcpAllocator.AddLease()`.
- Resolved a method redeclaration error by deleting the duplicate file `pkg/agent/ippool/ippool.go`.

These changes fix the build and improve the correctness of the DHCP lease cache synchronization logic within the agent. Signed-off-by: davidepasquero <[email protected]>
This commit addresses multiple build errors in the agent's local ippool controller (`pkg/agent/ippool/controller.go`) and refines its lease synchronization logic:
- Corrected `LeaseTime` type handling when calling `dhcpAllocator.AddLease`, passing `specConf.LeaseTime` directly since it is assumed to be `*int` based on compiler errors.
- Added missing `sync` and `util` package imports.
- Removed unused variables.
- Corrected the logic for checking lease existence before deletion.
- Deleted the conflicting `pkg/agent/ippool/ippool.go` file that caused a method redeclaration error.

These changes fix the build and ensure the agent's DHCP lease cache is correctly populated from the IPPool status, particularly after failovers or restarts. Signed-off-by: davidepasquero <[email protected]>
Context

This pull request implements the approach discussed in the previous PR #54 (comment) and aims to provide a robust, Kubernetes-native solution to the agent lifecycle management problem.

Problem Solved

The previous agent management via single Pods had a critical flaw: in the event of a node crash, the agent pod would get stuck in a "Terminating" state on that node. This prevented it from being rescheduled onto a healthy node without manual intervention, causing a service disruption.

Implemented Solution

Following the feedback received, this PR introduces two fundamental changes:
1. The agent is deployed via a Kubernetes Deployment instead of controller-managed Pods, so Kubernetes itself handles scheduling, recovery, and rollout of the agent.
2. The agent performs leader election, mirroring the mechanism used in the controller, so only one replica is active at a time.

Benefits of this Approach

- If a node fails, the Deployment automatically reschedules the agent onto a healthy node, with no manual intervention required.
- The controller no longer has to create, monitor, and clean up agent pods, which simplifies its code and its RBAC footprint.
Key Technical Changes

1. Fixed Agent Leader Election Namespace: the agent now derives its leader election lock namespace from the POD_NAMESPACE environment variable (Downward API) instead of the hardcoded `kube-system`, resolving permission errors.
2. Fixed Webhook Certificate SAN Mismatch: the webhook container is now started with `--namespace {{ .Release.Namespace }}` so its certificate is generated with the correct SANs.
3. Dynamic Agent Network Interface Management (Multus NAD Attachment): the controller updates the agent Deployment's Multus annotation, `--nic` argument, and `IPPOOL_REF` environment variable based on the IPPool spec.
4. Agent DHCP Lease Cache Synchronization on Startup/Failover: the DHCP server now waits for the lease cache to be primed from the IPPool status before serving requests.
5. Miscellaneous Build Fixes: removed duplicate imports, unused variables, and a conflicting file that caused a method redeclaration error.
Hi @davidepasquero. Could you help organize your commits and provide a concise summary of what this PR is trying to achieve and how it works? Thank you.
Hi,
Certainly, please allow me a little time. I will begin working on this next week.
Thank you very much.
Davide
The agents have been changed from Pods to a Deployment so that their lifecycle is managed by Kubernetes.