-
Notifications
You must be signed in to change notification settings - Fork 4.8k
OCPEDGE-2303: update test logic for degraded cluster run #30649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: nhamza <[email protected]>
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@Neilhamza: This pull request references OCPEDGE-2303 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@Neilhamza: This pull request references OCPEDGE-2303 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
2 similar comments
|
/retest |
|
/retest |
jaypoulz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I like the change.
That said, I don't think we should skip the test if there are no ready control plane nodes. I believe no ready control plane nodes is a valid reason to fail and provide an error. That said, filtering out not-ready nodes is a valid case for degraded mode.
I would check in with the test authors to see if they prefer that stick with the current behavior for non-degraded mode tests. It's helpful for tests to fail if your cluster is in an unhealthy state, so we don't what to mask issues. We can even refine the adjustment so we only allow 1 control-plane node to be not-ready and only in degraded mode when using TNF topology.
|
What suite are you running for this job putting the cluster into a degraded state, because from the sounds of it I would expect the standard suite to throw a whole pile of errors beyond just these. Can I see a job run where this occurred? |
@dgoodwin however these 3 tests aren't supposed to fail just because the cluster is degraded so a small adjustment could fix that |
|
Links for the config please @Neilhamza to aid the reviewers, and I could still use a link directly to a job run failing on these tests. |
|
@Neilhamza: This pull request references OCPEDGE-2303 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@Neilhamza: This pull request references OCPEDGE-2303 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@dgoodwin thanks for the note, update the PR description to contain the relevant links |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you expecting this group of tests run in your job at all? I assume one of the nodes should be healthy. But I don't see the tests running in the subsequent jobs. For example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/72974/rehearse-72974-periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ovn-two-node-fencing-degraded-techpreview/2008094895578812416
| // Skip metal jobs if test image pullspec cannot be determined | ||
| if jobType.Platform != "metal" || err == nil { | ||
| o.Expect(err).NotTo(o.HaveOccurred()) | ||
| onDiskPKIContent, err = fetchOnDiskCertificates(ctx, kubeClient, oc.AdminConfig(), masters, openshiftTestImagePullSpec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the presence of unready masters on line 124 does not cause any issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wondered that aswell! turns out The fact that one master is NotReady shouldn’t affect gatherCertsFromPlatformNamespaces in line 124. It collects all its data from the cluster API and uses the masters slice just for name-rewriting and cleanup. Unlike the on-disk collection path, it doesn’t pin any helper pods to nodes or reach out to kubelets.
@xueqzhan in this specific run i skipped these 3 tests manually in the lane |
|
@Neilhamza I'm worried the skip tests mechanism you're using is too brittle, a single character change in test names is going to take out your jobs and fire regressions on the release board we monitor should TNF make it to blocking. I would encourage a more sophisticated approach within origin itself via a custom suite that largely mimics the main suites you use, but can exclude things more dynamically. TRT can help but I believe there's richer ginko test labelling available now that may be possible to use to limit tests, ideally label the tests you want skipped on TNF. |
|
@dgoodwin i like this approach i'll start investegating relevant labeling for these tests i am skipping in the mean time - these 3 tests i am targeting in this PR should not be skipped, so could i have your approval here please? thank you |
|
/retest |
Signed-off-by: nhamza <[email protected]>
|
Scheduling required tests: |
|
@dgoodwin IMHO, the regex skip list approach is the best approach for these kinds of operations.
Introducing a new tag:
Additionally, the concern about this breaking component readiness:
Finally, none of these jobs are linked to the release controller, so none of these results should affect payload acceptance. |
test/extended/operators/certs.go
Outdated
| topo, topoErr := exutil.GetControlPlaneTopology(oc) | ||
| o.Expect(topoErr).NotTo(o.HaveOccurred()) | ||
|
|
||
| if *topo == configv1.DualReplicaTopologyMode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we should be able to refine this even further. Maybe what we need here is to add a flag to openshift-tests that allows us to specificy that we are running the suite in degraded mode. We know we're in degraded mode because the test pod has an env var set. What we'd need to do is to change the invocation of the test suite so that we check for that env var and pass it to the test suite as extra configuration/context. Then this test could make an except on our topology only when that configuration is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pushed changes
Signed-off-by: nhamza <[email protected]>
jaypoulz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with this. :) We should make a note in one of our team docs that is loaded into NotebookLM that we need to set the DEGRADED_NODE env var to test this locally.
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jaypoulz, Neilhamza The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
while working on Two Node Fencing in a degraded mode, i ran e2e tests and 3 of these cert tests failed:
[sig-arch][Late][Jira:"kube-apiserver"] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel]
[sig-arch][Late][Jira:"kube-apiserver"] all tls artifacts must be registered [Suite:openshift/conformance/parallel]
[sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel]
the lane that runs TNF in a degraded mode can be found in the config:
https://github.com/openshift/release/blob/master/ci-operator/config/openshift/release/openshift-release-master__nightly-4.22.yaml#L587
an example run in which these 3 failed: (the other failed tests in this run has been handled already - testing locally i receive same timeout error however applying this change resulted in a succeed run)
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/72974/rehearse-72974-periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ovn-two-node-fencing-degraded-techpreview/2005532130192396288
the reason these 3 failed because in the before all section in this test we were targeting "all control plane nodes" without considering their state (ready/degraded) so a small adjustment fixed that // without modifying any other logic in these tests