Fix MinorUpdateOVNDataplane race when OVN image unchanged between versions#1864
Fix MinorUpdateOVNDataplane race when OVN image unchanged between versions#1864stuggi wants to merge 1 commit intoopenstack-k8s-operators:mainfrom
Conversation
OpenStackControlPlane CRD Size Report
Threshold reference
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abays, stuggi The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hmm, looks like the minor update got stuck: Let's try again and see if it reproduces. /test openstack-operator-build-deploy-kuttl-4-18 |
/test openstack-operator-build-deploy-kuttl-4-18 |
Same error as before: Might be a legitimate issue? |
Looks like a problem with the data plane reconciling an I didn't realize we used Metal3 provisioning in our Prow jobs. Is something messing up that is making the |
rabi
left a comment
There was a problem hiding this comment.
This is getting really messy 😄 We do create nodset/deployment to validate the CRs[1] after update. Probably we need to move the edpm_deploy_cleanup before we check for openstackctlplane ready in L195.
| serviceCache := make(map[string]*dataplanev1.OpenStackDataPlaneService) | ||
|
|
||
| for _, deployment := range deployments.Items { | ||
| // Skip completed deployments |
There was a problem hiding this comment.
There could be earlier failed deployments containing ovn. We should check if there is a running deployment. Also, I think probably a would still be a race here.
| } | ||
|
|
||
| if slices.Contains(svc.Spec.ContainerImageFields, containerImageField) { | ||
| return true, nil |
There was a problem hiding this comment.
We can probably just check serviceType 'ovn' and avoid the harcoded check for the image name.
There was a problem hiding this comment.
is there always a service type, which is stable and not custom?
There was a problem hiding this comment.
yes, serviceType is always fixed and the service name can change.
|
New changes are detected. LGTM label has been removed. |
|
I have updated the logic which I think addresses checking if the job runs/was running. re the ci issue, wondering if we should stop testing edpm updates in prow as we now have the update job in zuul job which has a proper edpm node? |
Yeah we can drop that if we want. |
…sions Commit 63fd705 removed the nodeset.IsReady() check from DataplaneNodesetsOVNControllerImagesMatch to fix the minor update workflow getting stuck when unrelated deployments were running. That check was too strict — it blocked when any deployment was in progress, even if the OVN update had already completed. However, the remaining pure image comparison is too loose. When the OVN controller image does not change between two OpenStack versions, the nodeset already has the matching image from the previous update. The OpenStackVersion controller then sets MinorUpdateOVNDataplane=True immediately, before the edpm-ovn-update deployment finishes. This causes the subsequent minor update steps (controlplane update, edpm services update) to proceed while the OVN dataplane deployment is still running — resulting in both dataplane deployments running concurrently. Fix this with two mechanisms: 1. IsDataplaneDeploymentRunningForServiceType: a new generic function that lists all in-progress OpenStackDataPlaneDeployment resources, resolves which services each deploys (from ServicesOverride or the nodeset's service list), and checks the service's EDPMServiceType to identify deployments of a given type (e.g. "ovn"). EDPMServiceType is a fixed identifier on the service spec, independent of the service or deployment resource name. 2. Saved condition state tracking: when the OVN image is unchanged between the deployed and target versions, image comparison alone cannot distinguish "already updated" from "not yet updated." To handle this, we use the saved condition state (captured before Init resets conditions each reconciliation) to track whether a running OVN deployment has been observed during this update cycle: - Running OVN deployment seen → set condition False(RequestedReason) - No running deployment + previous condition was RequestedReason → deployment completed, proceed - No running deployment + previous condition was NOT RequestedReason → deployment not created yet, keep waiting The OpenStackVersion controller watches nodesets, so when a deployment starts (nodeset becomes not-Ready) and completes (nodeset becomes Ready), reconciliation is triggered, ensuring we observe the running deployment at least once. When the OVN image differs between versions, the existing image match check is sufficient — the nodeset's ContainerImages are only updated on successful deployment completion, so a match proves the deployment ran. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Martin Schuppert <mschuppert@redhat.com>
| var services []string | ||
| if len(deployment.Spec.ServicesOverride) != 0 { | ||
| services = deployment.Spec.ServicesOverride | ||
| } else { | ||
| services = nodeset.Spec.Services | ||
| } |
There was a problem hiding this comment.
Tangential, but @rabi I wonder if this should be propagated into the deployment.Status? Seems like a pretty low cost way to avoid this if / else dance being added for a third time?
❯ rg deployment.Spec.ServicesOverride
internal/controller/dataplane/openstackdataplanenodeset_controller.go
605: if len(deployment.Spec.ServicesOverride) != 0 {
606: services = deployment.Spec.ServicesOverride
internal/dataplane/util/ansible_execution.go
280: if len(deployment.Spec.ServicesOverride) > 0 {
281: a.ExtraVars["edpm_services_override"] = json.RawMessage([]byte(fmt.Sprintf("\"%s\"", deployment.Spec.ServicesOverride)))
In that first one, you could set it to instance.Status.Services.
Commit 63fd705 removed the nodeset.IsReady() check from
DataplaneNodesetsOVNControllerImagesMatch to fix the minor update
workflow getting stuck when unrelated deployments were running. That
check was too strict — it blocked when any deployment was in progress,
even if the OVN update had already completed.
However, the remaining pure image comparison is too loose. When the
OVN controller image does not change between two OpenStack versions,
the nodeset already has the matching image from the previous update.
The OpenStackVersion controller then sets MinorUpdateOVNDataplane=True
immediately, before the edpm-ovn-update deployment finishes. This
causes the subsequent minor update steps (controlplane update, edpm
services update) to proceed while the OVN dataplane deployment is
still running — resulting in both dataplane deployments running
concurrently.
Fix this with two mechanisms:
IsDataplaneDeploymentRunningForServiceType: a new generic function
that lists all in-progress OpenStackDataPlaneDeployment resources,
resolves which services each deploys (from ServicesOverride or the
nodeset's service list), and checks the service's EDPMServiceType
to identify deployments of a given type (e.g. "ovn").
EDPMServiceType is a fixed identifier on the service spec,
independent of the service or deployment resource name.
Saved condition state tracking: when the OVN image is unchanged
between the deployed and target versions, image comparison alone
cannot distinguish "already updated" from "not yet updated." To
handle this, we use the saved condition state (captured before
Init resets conditions each reconciliation) to track whether a
running OVN deployment has been observed during this update cycle:
deployment completed, proceed
RequestedReason → deployment not created yet, keep waiting
The OpenStackVersion controller watches nodesets, so when a
deployment starts (nodeset becomes not-Ready) and completes
(nodeset becomes Ready), reconciliation is triggered, ensuring
we observe the running deployment at least once.
When the OVN image differs between versions, the existing image match
check is sufficient — the nodeset's ContainerImages are only updated
on successful deployment completion, so a match proves the deployment
ran.