OCPBUGS-56274: add datacenter consistency check#212

Open
RomanBednar wants to merge 1 commit into openshift:main from RomanBednar:OCPBUGS-56274
@RomanBednar
Contributor

In zonal vSphere deployments of OpenShift, a datacenter referenced by a failure domain in the Infrastructure CR (infrastructure.config.openshift.io/cluster) can be missing from the cloud provider config (cloud-provider-config ConfigMap in openshift-config). When that happens, the CSI driver silently fails to find VMs in that zone and the cluster degrades. The vSphere Problem Detector (VPD) had no check for this misconfiguration.

This fix adds a new cluster-level check, CheckDatacenterConsistency, which compares each failure domain's required datacenter against the datacenters listed in the parsed cloud.conf (ctx.VMConfig.Config.VirtualCenter[server].Datacenters). When a datacenter is absent, VPD emits a WARNING that names the missing datacenter and the affected failure domain, and instructs the administrator to update the cloud-provider-config ConfigMap in the openshift-config namespace.
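
For readers who want the comparison spelled out, here is a minimal sketch of the idea, not the code in this PR. The `FailureDomain` type, the `missingDatacenters` function, and the map from vCenter server to a comma-separated datacenters string are hypothetical simplifications. The real check reads the Infrastructure CR and the parsed cloud.conf instead.

```go
package main

import (
	"fmt"
	"strings"
)

// FailureDomain is a hypothetical, simplified stand-in for a vSphere failure
// domain entry in the Infrastructure CR.
type FailureDomain struct {
	Name       string
	Server     string
	Datacenter string
}

// missingDatacenters returns one message per failure domain whose datacenter
// is not listed for its vCenter. The configured map (vCenter server to a
// comma-separated datacenters string) is an assumption made for illustration.
func missingDatacenters(fds []FailureDomain, configured map[string]string) []string {
	var problems []string
	for _, fd := range fds {
		// Build the set of datacenters configured for this failure domain's vCenter.
		listed := map[string]bool{}
		for _, dc := range strings.Split(configured[fd.Server], ",") {
			if dc = strings.TrimSpace(dc); dc != "" {
				listed[dc] = true
			}
		}
		if !listed[fd.Datacenter] {
			problems = append(problems, fmt.Sprintf(
				"failure domain %q requires datacenter %q on vCenter %q, but it is not listed in the cloud provider config",
				fd.Name, fd.Datacenter, fd.Server))
		}
	}
	return problems
}

func main() {
	// Values taken from the cluster setup described in this PR.
	fds := []FailureDomain{
		{Name: "us-east-1", Server: "232-15-184-10.in-addr.arpa", Datacenter: "nested-devqedatacenter-1"},
		{Name: "us-west-1", Server: "232-15-184-10.in-addr.arpa", Datacenter: "nested-devqedatacenter-2"},
	}
	configured := map[string]string{
		"232-15-184-10.in-addr.arpa": "nested-devqedatacenter-1",
	}
	for _, p := range missingDatacenters(fds, configured) {
		fmt.Println(p)
	}
}
```

With the inputs above, only the us-west-1 failure domain is reported, because nested-devqedatacenter-2 is absent from the configured list.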

Cluster Setup

Two failure domains configured:

  • us-east-1 → datacenter nested-devqedatacenter-1
  • us-west-1 → datacenter nested-devqedatacenter-2

Both on vCenter 232-15-184-10.in-addr.arpa.

Simulating the Bug

The datacenter nested-devqedatacenter-2 was removed from cloud-provider-config:

# Edit cloud-provider-config to remove nested-devqedatacenter-2
oc -n openshift-config edit configmap cloud-provider-config
# Changed: datacenters = nested-devqedatacenter-1,nested-devqedatacenter-2
# To:      datacenters = nested-devqedatacenter-1

# Verified propagation to vsphere-csi-config-secret:
oc -n openshift-cluster-csi-drivers get secret/vsphere-csi-config-secret \
  -o jsonpath='{.data.cloud\.conf}' | base64 -d
# Output confirmed: datacenters = nested-devqedatacenter-1
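
The same verification can be done programmatically. The sketch below assumes the decoded cloud.conf is INI-style text with a `datacenters = ...` line, as in the comments above. The function name `datacentersFromINI` and the sample section header in `main` are hypothetical, and VPD itself uses its own config parser instead of a line scan like this.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// datacentersFromINI returns the values of the first "datacenters" key found
// in INI-style text. Surrounding double quotes and whitespace are stripped.
// This is an illustrative line scan, not a full INI parser.
func datacentersFromINI(conf string) []string {
	scanner := bufio.NewScanner(strings.NewReader(conf))
	for scanner.Scan() {
		key, value, found := strings.Cut(scanner.Text(), "=")
		if !found || strings.TrimSpace(key) != "datacenters" {
			continue
		}
		value = strings.Trim(strings.TrimSpace(value), `"`)
		var out []string
		for _, dc := range strings.Split(value, ",") {
			if dc = strings.TrimSpace(dc); dc != "" {
				out = append(out, dc)
			}
		}
		return out
	}
	return nil
}

func main() {
	// Hypothetical sample resembling the decoded secret after the edit above.
	conf := "[VirtualCenter \"232-15-184-10.in-addr.arpa\"]\n" +
		"datacenters = nested-devqedatacenter-1\n"
	fmt.Println(datacentersFromINI(conf))
}
```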

Unpatched Behaviour (openshift/main)

export KUBECONFIG=/Users/MAC/openshift/clusters/vsphere/cluster-01/auth/kubeconfig
git checkout openshift/main && make
./vsphere-problem-detector start -v 5 \
  --kubeconfig=$KUBECONFIG \
  --namespace=openshift-cluster-storage-operator

Relevant log lines:

I0219 16:17:18.909862   17481 infra_config.go:15] Checking infrastructure and cloud provider config for consistency.
I0219 16:17:18.909897   17481 vsphere_check.go:302] CheckInfraConfig passed
I0219 16:17:24.169406   17481 vsphere_check.go:109] Finished running all vSphere specific checks in the cluster
I0219 16:17:24.307163   17481 event.go:377] ... type: 'Normal' reason: 'SucceededVSphereCheckInfraConfig' Check succeeded

No warning or error about the missing datacenter nested-devqedatacenter-2.

Patched Behaviour (OCPBUGS-56274)

git checkout OCPBUGS-56274 && make
./vsphere-problem-detector start -v 5 \
  --kubeconfig=$KUBECONFIG \
  --namespace=openshift-cluster-storage-operator

Relevant log lines:

I0219 16:23:24.680681   32885 datacenter_consistency.go:16] Checking datacenter consistency between failure domains and cloud provider config.
W0219 16:23:24.680821   32885 datacenter_consistency.go:50] Datacenter-Consistency: failure domain "us-west-1" (infrastructure.config.openshift.io/cluster) requires datacenter "nested-devqedatacenter-2" on vCenter "232-15-184-10.in-addr.arpa", but it is not listed in the cloud provider config (datacenters = "nested-devqedatacenter-1" in vsphere-csi-config-secret, namespace openshift-cluster-csi-drivers). Add "nested-devqedatacenter-2" to the datacenters list in the cloud-provider-config ConfigMap in the openshift-config namespace.
I0219 16:23:24.680835   32885 vsphere_check.go:299] CheckDatacenterConsistency failed: Datacenter-Consistency: failure domain "us-west-1" ...
I0219 16:23:30.292865   32885 event.go:377] ... type: 'Warning' reason: 'FailedVSphereCheckDatacenterConsistency' Datacenter-Consistency: failure domain "us-west-1" (infrastructure.config.openshift.io/cluster) requires datacenter "nested-devqedatacenter-2" on vCenter "232-15-184-10.in-addr.arpa" ...

WARNING emitted, explicitly naming nested-devqedatacenter-2 as missing, with remediation instructions.

@openshift-ci-robot added the jira/valid-reference and jira/invalid-bug labels on Feb 19, 2026
@openshift-ci-robot
Contributor

@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci bot requested review from dfajmon and mpatlasov on Feb 19, 2026, 15:41
@openshift-ci
Contributor

openshift-ci bot commented Feb 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RomanBednar


@openshift-ci bot added the approved label on Feb 19, 2026
@RomanBednar
Contributor Author

/jira refresh

@openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label on Feb 19, 2026
@openshift-ci-robot
Contributor

@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (wduan@redhat.com), skipping review request.


@openshift-ci
Contributor

openshift-ci bot commented Feb 19, 2026

@RomanBednar: all tests passed!


@RomanBednar
Contributor Author

/assign @gnufied

For review.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.

3 participants