@Neilhamza
Contributor

@Neilhamza Neilhamza commented Jan 5, 2026

While working on Two Node Fencing (TNF) in degraded mode, I ran the e2e tests and three of the cert tests failed:

[sig-arch][Late][Jira:"kube-apiserver"] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel]
[sig-arch][Late][Jira:"kube-apiserver"] all tls artifacts must be registered [Suite:openshift/conformance/parallel]
[sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel]

The lane that runs TNF in degraded mode can be found in the config:
https://github.com/openshift/release/blob/master/ci-operator/config/openshift/release/openshift-release-master__nightly-4.22.yaml#L587

An example run in which these three failed (the other failed tests in this run have already been handled; testing locally, I get the same timeout error, and applying this change resulted in a successful run):
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/72974/rehearse-72974-periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ovn-two-node-fencing-degraded-techpreview/2005532130192396288

These three tests failed because the before-all section of this test targeted "all control plane nodes" without considering their state (ready/degraded). A small adjustment fixes that without modifying any other logic in these tests.
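
For illustration only, the idea of narrowing "all control plane nodes" down to the Ready ones could be sketched as follows. This is a minimal sketch, not the code in this PR: `Node` and `NodeCondition` are simplified stand-ins for the `k8s.io/api/core/v1` types, and `isNodeReady` and `filterReadyNodes` are hypothetical helper names.

```go
package main

// NodeCondition and Node are simplified stand-ins for the corev1 types
// (assumption: used only to keep this sketch self-contained).
type NodeCondition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False" or "Unknown"
}

type Node struct {
	Name       string
	Conditions []NodeCondition
}

// isNodeReady reports whether the node has a Ready condition with status "True".
// A node without a Ready condition is treated as not ready.
func isNodeReady(n Node) bool {
	for _, c := range n.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

// filterReadyNodes returns only the Ready nodes, preserving their order.
func filterReadyNodes(nodes []Node) []Node {
	ready := make([]Node, 0, len(nodes))
	for _, n := range nodes {
		if isNodeReady(n) {
			ready = append(ready, n)
		}
	}
	return ready
}
```

With a filter of this shape, anything that has to schedule work on a node (such as the on-disk certificate collection) would only see nodes that can actually run it.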

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 5, 2026
@openshift-ci-robot
openshift-ci-robot commented Jan 5, 2026

@Neilhamza: This pull request references OCPEDGE-2303 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

@openshift-ci openshift-ci bot requested review from bparees and sdodson January 5, 2026 11:28
@Neilhamza
Contributor Author

/retest

Contributor

@jaypoulz jaypoulz left a comment

Overall, I like the change.
That said, I don't think we should skip the test if there are no ready control plane nodes. I believe having no ready control plane nodes is a valid reason to fail and report an error. Filtering out not-ready nodes, however, is valid for degraded mode.

I would check in with the test authors to see if they prefer that we stick with the current behavior for non-degraded mode tests. It's helpful for tests to fail if your cluster is in an unhealthy state, so we don't want to mask issues. We can even refine the adjustment so that we only allow one control-plane node to be not-ready, and only in degraded mode when using the TNF topology.
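
To make the suggested policy concrete, here is a rough Go sketch under stated assumptions: `checkControlPlane` is a hypothetical helper, and the `degradedTNF` flag stands for "TNF topology running in degraded mode", however that ends up being detected. It fails when no control plane node is ready, and tolerates a single not-ready node only in that mode.

```go
package main

import (
	"errors"
	"fmt"
)

// checkControlPlane validates control plane readiness counts.
// degradedTNF is an assumed flag meaning "TNF topology in degraded mode".
func checkControlPlane(ready, notReady int, degradedTNF bool) error {
	// No ready control plane node is always a failure, never a skip.
	if ready == 0 {
		return errors.New("no ready control plane nodes")
	}
	// Outside degraded TNF, every control plane node must be ready.
	allowed := 0
	if degradedTNF {
		allowed = 1
	}
	if notReady > allowed {
		return fmt.Errorf("%d control plane node(s) not ready, at most %d allowed", notReady, allowed)
	}
	return nil
}
```

This keeps the current behavior for healthy-cluster jobs, so an unexpectedly not-ready node still surfaces as a failure there.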

@dgoodwin
Contributor

dgoodwin commented Jan 6, 2026

What suite are you running for this job putting the cluster into a degraded state, because from the sounds of it I would expect the standard suite to throw a whole pile of errors beyond just these. Can I see a job run where this occurred?

@Neilhamza
Contributor Author

Neilhamza commented Jan 6, 2026

> What suite are you running for this job putting the cluster into a degraded state, because from the sounds of it I would expect the standard suite to throw a whole pile of errors beyond just these. Can I see a job run where this occurred?

@dgoodwin
We degrade the cluster manually via a workflow that we developed, and yes, we also manually skip a number of tests that are expected to fail, and we disable some monitoring tests.
You can check the lane structure at - as: e2e-metal-ovn-two-node-fencing-degraded-techpreview
in ci-operator/config/openshift/release/openshift-release-master__nightly-4.22.yaml

However, these three tests aren't supposed to fail just because the cluster is degraded, so a small adjustment could fix that.

@dgoodwin
Contributor

dgoodwin commented Jan 6, 2026

Links for the config please @Neilhamza to aid the reviewers, and I could still use a link directly to a job run failing on these tests.

@Neilhamza
Contributor Author

@dgoodwin Thanks for the note. I updated the PR description to contain the relevant links.

Contributor

@xueqzhan xueqzhan left a comment

Are you expecting this group of tests to run in your job at all? I assume one of the nodes should be healthy. But I don't see the tests running in the subsequent jobs. For example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/72974/rehearse-72974-periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ovn-two-node-fencing-degraded-techpreview/2008094895578812416

// Skip metal jobs if test image pullspec cannot be determined
if jobType.Platform != "metal" || err == nil {
	o.Expect(err).NotTo(o.HaveOccurred())
	onDiskPKIContent, err = fetchOnDiskCertificates(ctx, kubeClient, oc.AdminConfig(), masters, openshiftTestImagePullSpec)
Contributor

So the presence of unready masters on line 124 does not cause any issue?

Contributor Author

I wondered that as well! It turns out that one master being NotReady shouldn't affect gatherCertsFromPlatformNamespaces on line 124. It collects all its data from the cluster API and uses the masters slice only for name rewriting and cleanup. Unlike the on-disk collection path, it doesn't pin any helper pods to nodes or reach out to kubelets.

@Neilhamza
Contributor Author

> Are you expecting this group of tests to run in your job at all? I assume one of the nodes should be healthy. But I don't see the tests running in the subsequent jobs. For example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/72974/rehearse-72974-periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ovn-two-node-fencing-degraded-techpreview/2008094895578812416

@xueqzhan In this specific run I skipped these three tests manually in the lane,
but yes, these tests are part of the e2e conformance suite, so they should be run on a degraded TNF cluster.

@dgoodwin
Contributor

dgoodwin commented Jan 6, 2026

@Neilhamza I'm worried the skip-tests mechanism you're using is too brittle: a single-character change in a test name is going to take out your jobs and fire regressions on the release board we monitor, should TNF make it to blocking. I would encourage a more sophisticated approach within origin itself, via a custom suite that largely mimics the main suites you use but can exclude things more dynamically. TRT can help, but I believe there's richer Ginkgo test labelling available now that may be possible to use to limit tests; ideally, label the tests you want skipped on TNF.
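
As a rough illustration of the brittleness concern, the Go sketch below contrasts a name-based skip with a label-based one. It is only a model of the argument, not how openshift-tests selects tests: `TestCase`, `exactNamePatterns`, `skippedByName`, and `skippedByLabel` are hypothetical, and it assumes labels are stored separately from the test name.

```go
package main

import "regexp"

// TestCase is a hypothetical stand-in for a test with a name and labels.
type TestCase struct {
	Name   string
	Labels []string
}

// exactNamePatterns builds one anchored, escaped pattern per test name,
// mimicking a skip list that spells out full test names.
func exactNamePatterns(names []string) []*regexp.Regexp {
	patterns := make([]*regexp.Regexp, 0, len(names))
	for _, n := range names {
		patterns = append(patterns, regexp.MustCompile("^"+regexp.QuoteMeta(n)+"$"))
	}
	return patterns
}

// skippedByName reports whether any skip pattern matches the test name.
func skippedByName(t TestCase, patterns []*regexp.Regexp) bool {
	for _, p := range patterns {
		if p.MatchString(t.Name) {
			return true
		}
	}
	return false
}

// skippedByLabel reports whether the test carries the given label.
func skippedByLabel(t TestCase, label string) bool {
	for _, l := range t.Labels {
		if l == label {
			return true
		}
	}
	return false
}
```

In this model, renaming "all tls artifacts must be registered" to "all TLS artifacts must be registered" silently stops the name-based skip from matching, while a test that keeps its label is still excluded.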

@Neilhamza
Contributor Author

@dgoodwin I like this approach. I'll start investigating relevant labeling for the tests I am skipping,
but that would require a new Jira story/epic.

In the meantime, the three tests I am targeting in this PR should not be skipped, so could I have your approval here, please?
Also, can you please tell me if you agree with @jaypoulz's comment here so I can update the code?

Thank you

@Neilhamza
Contributor Author

/retest

@Neilhamza Neilhamza requested a review from xueqzhan January 7, 2026 12:36
@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@jaypoulz
Contributor

jaypoulz commented Jan 8, 2026

@dgoodwin IMHO, the regex skip list approach is the best approach for these kinds of operations.
Introducing a new suite:

  1. Changes the names of the existing tests, which can break other tests for the very reason you identified
  2. Increases the number of test suite definitions that need to be maintained. We want to run all relevant e2e-parallel tests, so it makes sense for that to be the e2e-parallel suite.

Introducing a new tag:

  1. Changes the names of the existing tests, which can break other tests for the very reason you identified
  2. Needs to be documented somewhere for others to use properly when adding new tests. To my knowledge, we don't have documentation for the existing tags besides the upstream ones, so this is already an issue.
  3. Needs to be maintained over time, so we'll need to remove (or add) the tag as we fix these tests, which is more disruptive than modifying the regex skips (which are job specific).

Additionally, the concern about this breaking component readiness:

  • We (edge-payload-managers) want to identify breakages that affect TNF configurations. TNF is its own topology in component readiness, so this helps us find and respond to real issues.
  • There should already be ways to filter-out some topologies when examining regressions. If this is a concern at the release-readiness level, we should look into that. I know for a fact that this is what we did for multi-arch.

Finally, none of these jobs are linked to the release controller, so none of these results should affect payload acceptance.

topo, topoErr := exutil.GetControlPlaneTopology(oc)
o.Expect(topoErr).NotTo(o.HaveOccurred())

if *topo == configv1.DualReplicaTopologyMode {
Contributor

I feel like we should be able to refine this even further. Maybe what we need here is to add a flag to openshift-tests that allows us to specify that we are running the suite in degraded mode. We know we're in degraded mode because the test pod has an env var set. What we'd need to do is change the invocation of the test suite so that we check for that env var and pass it to the test suite as extra configuration/context. Then this test could make an exception for our topology only when that configuration is set.
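
A minimal Go sketch of that gating, under assumptions: the env var name `DEGRADED_NODE` is the one mentioned later in this thread, any non-empty value is treated as "degraded mode requested" (the real semantics may differ), and `"DualReplica"` is assumed to be the string value of `configv1.DualReplicaTopologyMode`. The helper names are hypothetical.

```go
package main

import "os"

const (
	// Assumed string value of configv1.DualReplicaTopologyMode.
	dualReplicaTopology = "DualReplica"
	// Env var named in this thread; its exact semantics are an assumption.
	degradedNodeEnvVar = "DEGRADED_NODE"
)

// degradedModeRequested reports whether the env var is set to a non-empty value.
// The lookup function is injected so the check can be tested without touching
// the real process environment.
func degradedModeRequested(lookup func(string) (string, bool)) bool {
	v, ok := lookup(degradedNodeEnvVar)
	return ok && v != ""
}

// allowNotReadyControlPlane makes the exception only when both conditions hold:
// the cluster uses the dual-replica (TNF) topology and degraded mode was requested.
func allowNotReadyControlPlane(topology string, lookup func(string) (string, bool)) bool {
	return topology == dualReplicaTopology && degradedModeRequested(lookup)
}

// allowNotReadyFromEnv is the convenience form that reads the real environment.
func allowNotReadyFromEnv(topology string) bool {
	return allowNotReadyControlPlane(topology, os.LookupEnv)
}
```

Requiring both the topology and the explicit opt-in means a TNF cluster that is unexpectedly degraded still fails the test.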

Contributor Author

pushed changes

@Neilhamza Neilhamza requested a review from jaypoulz January 8, 2026 16:38
Contributor

@jaypoulz jaypoulz left a comment

I'm happy with this. :) We should make a note in one of our team docs that is loaded into NotebookLM that we need to set the DEGRADED_NODE env var to test this locally.

@jaypoulz
Contributor

jaypoulz commented Jan 8, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2026
@openshift-ci
Contributor

openshift-ci bot commented Jan 8, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jaypoulz, Neilhamza
Once this PR has been reviewed and has the lgtm label, please assign petr-muller for approval. For more information see the Code Review Process.
