OCPEDGE-2303: update test logic for degraded cluster run #30649

Neilhamza · 2026-01-05T11:27:44Z

While working on Two Node Fencing (TNF) in degraded mode, I ran the e2e tests and these 3 cert tests failed:

[sig-arch][Late][Jira:"kube-apiserver"] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel]
[sig-arch][Late][Jira:"kube-apiserver"] all tls artifacts must be registered [Suite:openshift/conformance/parallel]
[sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel]

The lane that runs TNF in degraded mode is defined in this config:
https://github.com/openshift/release/blob/master/ci-operator/config/openshift/release/openshift-release-master__nightly-4.22.yaml#L587

An example run in which these 3 tests failed (the other failures in this run have already been handled; running locally I hit the same timeout error, and with this change applied the run succeeded):
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/72974/rehearse-72974-periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ovn-two-node-fencing-degraded-techpreview/2005532130192396288

These 3 tests failed because the before-all section of this test targeted "all control plane nodes" without considering their state (ready/degraded). A small adjustment fixes that without modifying any other logic in these tests.
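
The adjustment described above amounts to filtering the control plane node list by readiness before using it for on-disk collection. A minimal sketch of that idea follows; it is not the code in this PR. The `node` type is a hypothetical stand-in for `corev1.Node`, and `readyNodes` is an illustrative name.

```go
package main

import "fmt"

// node is a hypothetical stand-in for corev1.Node, reduced to what this sketch needs.
type node struct {
	Name  string
	Ready bool // mirrors the NodeReady condition having status True
}

// readyNodes returns only the nodes whose Ready flag is set, preserving order.
func readyNodes(nodes []node) []node {
	ready := make([]node, 0, len(nodes))
	for _, n := range nodes {
		if n.Ready {
			ready = append(ready, n)
		}
	}
	return ready
}

func main() {
	masters := []node{
		{Name: "master-0", Ready: true},
		{Name: "master-1", Ready: false}, // the unavailable node in a degraded TNF cluster
	}
	for _, n := range readyNodes(masters) {
		fmt.Println("collecting on-disk certificates from", n.Name)
	}
}
```

In the real test the readiness check would read the node's status conditions rather than a boolean field.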

Signed-off-by: nhamza <[email protected]>

openshift-ci-robot · 2026-01-05T11:27:47Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot · 2026-01-05T11:27:49Z

@Neilhamza: This pull request references OCPEDGE-2303 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

while working on Two Node Fencing in a degraded mode, i ran e2e tests and 3 of these cert tests failed:

[sig-arch][Late][Jira:"kube-apiserver"] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel]

[sig-arch][Late][Jira:"kube-apiserver"] all tls artifacts must be registered [Suite:openshift/conformance/parallel]
[sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel]

the reason these 3 failed because in the before all section in this test we were targeting "all control plane nodes" without considering their state (ready/degraded) so a small adjustment fixed that // without modifying any other logic in these tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Neilhamza · 2026-01-05T12:03:49Z

/retest

Neilhamza · 2026-01-05T13:29:23Z

/retest

Neilhamza · 2026-01-05T14:36:25Z

/retest

jaypoulz

Overall, I like the change.
That said, I don't think we should skip the test if there are no ready control plane nodes. I believe no ready control plane nodes is a valid reason to fail and provide an error. Filtering out not-ready nodes, however, is a valid case for degraded mode.

I would check in with the test authors to see if they prefer that we stick with the current behavior for non-degraded mode tests. It's helpful for tests to fail if your cluster is in an unhealthy state, so we don't want to mask issues. We can even refine the adjustment so we only allow 1 control-plane node to be not-ready, and only in degraded mode when using TNF topology.
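
One way to picture the refinement suggested here is as a tolerance policy: fail when no control-plane node is ready, and tolerate a single not-ready node only for a degraded TNF run. The sketch below is illustrative only; `cpNode`, `allowedNotReady`, and `selectReady` are hypothetical names, not functions from origin.

```go
package main

import (
	"errors"
	"fmt"
)

// cpNode is a hypothetical stand-in for a control-plane node object.
type cpNode struct {
	Name  string
	Ready bool
}

// allowedNotReady returns how many not-ready control-plane nodes to tolerate:
// one for a degraded TNF run, zero otherwise.
func allowedNotReady(isTNF, degraded bool) int {
	if isTNF && degraded {
		return 1
	}
	return 0
}

// selectReady returns the ready nodes, or an error when no node is ready or
// more nodes are not ready than the policy tolerates.
func selectReady(nodes []cpNode, maxNotReady int) ([]cpNode, error) {
	var ready []cpNode
	notReady := 0
	for _, n := range nodes {
		if n.Ready {
			ready = append(ready, n)
		} else {
			notReady++
		}
	}
	if len(ready) == 0 {
		return nil, errors.New("no ready control-plane nodes")
	}
	if notReady > maxNotReady {
		return nil, fmt.Errorf("%d control-plane nodes not ready, at most %d tolerated", notReady, maxNotReady)
	}
	return ready, nil
}

func main() {
	nodes := []cpNode{{"master-0", true}, {"master-1", false}}

	// Healthy-cluster policy: the not-ready node is reported as an error.
	_, err := selectReady(nodes, allowedNotReady(true, false))
	fmt.Println("non-degraded:", err)

	// Degraded TNF policy: the not-ready node is filtered out.
	ready, err := selectReady(nodes, allowedNotReady(true, true))
	fmt.Println("degraded TNF:", ready, err)
}
```

Keeping the tolerance as an explicit number means a non-degraded run still fails on any not-ready node, so unhealthy clusters are not masked.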

dgoodwin · 2026-01-06T14:40:06Z

What suite are you running for this job putting the cluster into a degraded state, because from the sounds of it I would expect the standard suite to throw a whole pile of errors beyond just these. Can I see a job run where this occurred?

Neilhamza · 2026-01-06T14:53:47Z

What suite are you running for this job putting the cluster into a degraded state, because from the sounds of it I would expect the standard suite to throw a whole pile of errors beyond just these. Can I see a job run where this occurred?

@dgoodwin
We degrade the cluster manually via a workflow that we developed, and yes, we also manually skip a bunch of tests that are expected to fail, while also disabling some monitoring tests.
You can check the lane structure under the "as: e2e-metal-ovn-two-node-fencing-degraded-techpreview" entry
in ci-operator/config/openshift/release/openshift-release-master__nightly-4.22.yaml

However, these 3 tests aren't supposed to fail just because the cluster is degraded, so a small adjustment could fix that.

dgoodwin · 2026-01-06T14:59:41Z

Links for the config please @Neilhamza to aid the reviewers, and I could still use a link directly to a job run failing on these tests.

Neilhamza · 2026-01-06T15:07:43Z

@dgoodwin thanks for the note, I updated the PR description to contain the relevant links.

xueqzhan

Are you expecting this group of tests to run in your job at all? I assume one of the nodes should be healthy. But I don't see the tests running in the subsequent jobs. For example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/72974/rehearse-72974-periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ovn-two-node-fencing-degraded-techpreview/2008094895578812416

xueqzhan · 2026-01-06T15:30:46Z

test/extended/operators/certs.go

 		// Skip metal jobs if test image pullspec cannot be determined
 		if jobType.Platform != "metal" || err == nil {
 			o.Expect(err).NotTo(o.HaveOccurred())
-			onDiskPKIContent, err = fetchOnDiskCertificates(ctx, kubeClient, oc.AdminConfig(), masters, openshiftTestImagePullSpec)


So the presence of unready masters on line 124 does not cause any issue?

I wondered that as well! It turns out that one master being NotReady shouldn't affect gatherCertsFromPlatformNamespaces on line 124. It collects all its data from the cluster API and uses the masters slice only for name rewriting and cleanup. Unlike the on-disk collection path, it doesn't pin any helper pods to nodes or reach out to kubelets.

Neilhamza · 2026-01-06T16:02:23Z

Are you expecting this group of tests to run in your job at all? I assume one of the nodes should be healthy. But I don't see the tests running in the subsequent jobs. For example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/72974/rehearse-72974-periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ovn-two-node-fencing-degraded-techpreview/2008094895578812416

@xueqzhan in this specific run I skipped these 3 tests manually in the lane,
but yes, these tests are part of the e2e conformance suite, so they should be run on a degraded TNF cluster.

dgoodwin · 2026-01-06T16:29:16Z

@Neilhamza I'm worried the skip-tests mechanism you're using is too brittle: a single character change in a test name is going to take out your jobs and fire regressions on the release board we monitor, should TNF make it to blocking. I would encourage a more sophisticated approach within origin itself, via a custom suite that largely mimics the main suites you use but can exclude things more dynamically. TRT can help, but I believe there's richer Ginkgo test labelling available now that may be usable to limit tests; ideally, label the tests you want skipped on TNF.

Neilhamza · 2026-01-07T10:26:07Z

@dgoodwin I like this approach. I'll start investigating relevant labeling for the tests I am skipping,
but that would require a new Jira story/epic.

In the meantime, the 3 tests I am targeting in this PR should not be skipped, so could I have your approval here please?
Also, can you please tell me if you agree with @jaypoulz's comment here so I can update the code?

thank you

Neilhamza · 2026-01-07T10:44:19Z

/retest

openshift-ci-robot · 2026-01-08T14:29:19Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

jaypoulz · 2026-01-08T14:38:43Z

@dgoodwin IMHO, the regex skip list approach is the best approach for these kinds of operations.
Introducing a new suite:

- Changes the names of the existing tests, which can break other tests for the very reason you identified
- Increases the number of test suite definitions that need to be maintained. We want to run all relevant e2e-parallel tests, so it makes sense for that to be the e2e-parallel suite.

Introducing a new tag:

- Changes the names of the existing tests, which can break other tests for the very reason you identified
- Needs to be documented somewhere for others to use properly when adding new tests. To my knowledge, we don't have documentation for the existing tags besides the upstream ones, so this is already an issue.
- Needs to be maintained over time, so we'll need to remove (or add) the tag as we fix these tests, which is more disruptive than modifying the regex skips (which are job specific).

Additionally, regarding the concern about this breaking component readiness:

- We (edge-payload-managers) want to identify breakages that affect TNF configurations. TNF is its own topology in component readiness, so this helps us find and respond to real issues.
- There should already be ways to filter out some topologies when examining regressions. If this is a concern at the release-readiness level, we should look into that. I know for a fact that this is what we did for multi-arch.

Finally, none of these jobs are linked to the release controller, so none of these results should affect payload acceptance.

jaypoulz · 2026-01-08T14:43:28Z

test/extended/operators/certs.go

+			topo, topoErr := exutil.GetControlPlaneTopology(oc)
+			o.Expect(topoErr).NotTo(o.HaveOccurred())
+
+			if *topo == configv1.DualReplicaTopologyMode {


I feel like we should be able to refine this even further. Maybe what we need here is to add a flag to openshift-tests that allows us to specify that we are running the suite in degraded mode. We know we're in degraded mode because the test pod has an env var set. What we'd need to do is change the invocation of the test suite so that we check for that env var and pass it to the test suite as extra configuration/context. Then this test could make an exception for our topology only when that configuration is set.

pushed changes

jaypoulz

I'm happy with this. :) We should make a note in one of our team docs that is loaded into NotebookLM that we need to set the DEGRADED_NODE env var to test this locally.
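
For local testing, the gate discussed in this thread can be pictured roughly as below. This is a sketch under assumptions, not the merged code: the env var name DEGRADED_NODE comes from this review, but treating any non-empty value as "degraded" is an assumption, the `"DualReplica"` string is assumed to mirror `configv1.DualReplicaTopologyMode`, and `degradedExceptionApplies` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
)

const (
	degradedNodeEnv     = "DEGRADED_NODE" // env var name mentioned in this review
	dualReplicaTopology = "DualReplica"   // assumed string value of configv1.DualReplicaTopologyMode
)

// degradedExceptionApplies reports whether not-ready control-plane nodes may be
// filtered out: only on the TNF topology and only when the env var is set.
// Assumption: any non-empty value enables degraded mode.
func degradedExceptionApplies(topology string, lookupEnv func(string) (string, bool)) bool {
	value, ok := lookupEnv(degradedNodeEnv)
	return ok && value != "" && topology == dualReplicaTopology
}

func main() {
	// os.LookupEnv has the signature the helper expects, so it can be passed directly.
	fmt.Println(degradedExceptionApplies(dualReplicaTopology, os.LookupEnv))
}
```

Passing the lookup function in, instead of calling `os.LookupEnv` inside the helper, keeps the gate easy to exercise in unit tests without mutating the process environment.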

jaypoulz · 2026-01-08T18:32:26Z

/lgtm

openshift-ci · 2026-01-08T18:33:27Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jaypoulz, Neilhamza
Once this PR has been reviewed and has the lgtm label, please assign petr-muller for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

update test logic for degraded tests

66455e1

Signed-off-by: nhamza <[email protected]>

openshift-ci-robot added the jira/valid-reference label Jan 5, 2026

openshift-ci bot requested review from bparees and sdodson January 5, 2026 11:28

jaypoulz reviewed Jan 6, 2026

xueqzhan reviewed Jan 6, 2026

Neilhamza requested a review from xueqzhan January 7, 2026 12:36

update test to keep logic as-is

3717fab

Signed-off-by: nhamza <[email protected]>

jaypoulz reviewed Jan 8, 2026

add env var gate for degraded mode

657fd3a

Signed-off-by: nhamza <[email protected]>

Neilhamza requested a review from jaypoulz January 8, 2026 16:38

jaypoulz approved these changes Jan 8, 2026

openshift-ci bot assigned jaypoulz Jan 8, 2026

openshift-ci bot added the lgtm label Jan 8, 2026
