CORENET-6066: test(e2e): add e2e test for zero-worker HyperShift clusters in daemonset rollout#8176
CORENET-6066: test(e2e): add e2e test for zero-worker HyperShift clusters in daemonset rollout#8176weliang1 wants to merge 8 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@weliang1: This pull request references CORENET-6064 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughA new e2e test Sequence Diagram(s)sequenceDiagram
participant TestHarness as Test Harness
participant HostAPI as HostedCluster API
participant CPDeploy as ovnkube-control-plane Deployment
participant NodeDS as ovnkube-node DaemonSet
participant GuestAPI as Guest Kube API (hosted)
participant ClusterOp as network ClusterOperator
TestHarness->>HostAPI: Derive hosted control-plane namespace
TestHarness->>CPDeploy: Wait for Deployment Available / ReadyReplicas>0
TestHarness->>NodeDS: Check DaemonSet presence
alt DaemonSet missing
Note right of TestHarness: acceptable
else DaemonSet present
TestHarness->>NodeDS: Assert DesiredNumberScheduled, NumberAvailable, NumberUnavailable == 0
TestHarness->>NodeDS: Assert ObservedGeneration == Generation
end
alt Upgrade image provided and differs
TestHarness->>HostAPI: Patch hostedCluster.spec.release.image
TestHarness->>CPDeploy: Wait for rollout (generation, ready/updated == desired)
TestHarness->>CPDeploy: Verify container image changed
Note right of TestHarness: For supported HyperShift versions also wait for control-plane rollout and ControlPlaneVersion
end
TestHarness->>GuestAPI: Create guest kube client
loop Poll until success
GuestAPI->>ClusterOp: Get network ClusterOperator (unstructured)
ClusterOp-->>GuestAPI: Conditions (Available/Progressing/Degraded)
end
TestHarness->>CPDeploy: Final readiness check (ReadyReplicas>0)
TestHarness->>NodeDS: Final absence or zero desired pods check
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: weliang1 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@weliang1: This pull request references CORENET-6066 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@weliang1: This pull request references CORENET-6066 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
b6f7f5d to
c2fa99d
Compare
|
/jira refresh |
|
@weliang1: This pull request references CORENET-6066 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@weliang1: This pull request references CORENET-6066 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8176 +/- ##
==========================================
+ Coverage 32.16% 39.70% +7.54%
==========================================
Files 766 774 +8
Lines 91957 94891 +2934
==========================================
+ Hits 29575 37675 +8100
+ Misses 59855 54514 -5341
- Partials 2527 2702 +175 see 257 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
…in daemonset rollout Verifies that OVN control plane components can successfully upgrade in HyperShift clusters with zero worker nodes. This test validates: - Initial OVN deployment readiness with zero workers - OVN DaemonSet behavior (not created or reports 0 desired) - Control plane upgrade from version X to Y - OVN pod rollout during upgrade - All control plane components complete rollout - Network ClusterOperator remains healthy - No degradation or pod crashes The test addresses scenarios such as: - Data plane hibernation (workers scaled to zero for cost savings) - Autoscaling from zero (no workers until workload arrives) - Management cluster updates when worker nodes are unreachable Validated on live cluster: - Cluster: hypershift-ci-373084 - Upgrade: 4.22.0-223038 → 051707 - Workers: 0 throughout test - Duration: ~10 minutes - Result: All 8 steps passed, 0 pod restarts Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
c2fa99d to
dc2b23a
Compare
|
/test all |
|
/remove-label do-not-merge/work-in-progress |
|
@weliang1: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@weliang1: This pull request references CORENET-6066 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
…rker test Addressed CodeRabbit review feedback: 1. Use cancellable ctx instead of testContext in WaitForGuestClient 2. Add safe type assertions with comma-ok checks for condition parsing 3. Fix confusing log output by removing negated booleans Framework fix: 4. Use NonePlatform instead of globalOpts.Platform to skip framework validation that expects worker nodes. This matches the approach used by TestHAEtcdChaos for zero-worker scenarios. The test validates OVN control plane behavior with zero workers, which is platform-agnostic. NonePlatform allows the test to focus on OVN-specific validation without requiring cloud provider resources or worker nodes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
7e0206f to
997a620
Compare
|
@weliang1: This pull request references CORENET-6066 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
/pipeline required |
|
Scheduling tests matching the |
NonePlatform does not deploy OVN-Kubernetes components, causing the test to fail when looking for ovnkube-control-plane deployment. The test needs a real platform (AWS) that deploys OVN networking components. The framework validation correctly handles zero-worker clusters through clusterOpts.ExpectedNodeCount(), adjusting condition expectations for clusters without worker nodes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
/test e2e-aws |
|
@weliang1: This pull request references CORENET-6066 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/e2e/ovn_control_plane_zero_workers_test.go (1)
126-131: Don't skip the entire test when only the upgrade image is missing.
Line 130turns the whole test intoSKIP, which also drops the non-upgrade coverage from Steps 1-2 and the post-upgrade-independent health checks later in the test. It would be better to gate only the upgrade-specific steps (or split them into a subtest) so zero-worker OVN validation still runs in jobs withoutLatestReleaseImage.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/ovn_control_plane_zero_workers_test.go` around lines 126 - 131, The test currently calls t.Skip() when upgradeImage (globalOpts.LatestReleaseImage) is empty or equal to baselineImage, which skips the entire test; instead, change the flow so only upgrade-specific steps are gated: check upgradeImage and if missing/equal only skip or return from the upgrade-related block (the steps that perform the upgrade and post-upgrade validation) or move those steps into a subtest (t.Run("upgrade", ...)) that is skipped, while allowing the initial zero-worker OVN validation and post-upgrade-independent health checks to always run; update references to upgradeImage, baselineImage and any t.Skip calls accordingly so the rest of the test is still executed when LatestReleaseImage is not provided.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/ovn_control_plane_zero_workers_test.go`:
- Around line 154-210: The rollout predicate for the "ovnkube-control-plane"
Deployment can return true on the pre-upgrade revision; update the Eventually
check in the goroutine that reads deployment (the block creating deployment :=
&appsv1.Deployment{} inside g.Eventually) to also verify the pod image has
changed from the recorded baselineImage before returning true: after checking
ready==desired, updated==desired and observedGeneration==generation, fetch the
first container image from deployment.Spec.Template.Spec.Containers[0].Image
and, if baselineImage is non-empty, require newImage != baselineImage (or skip
the image check only when baselineImage is empty) so Eventually only succeeds
once the Deployment rollout actually reflects the new image.
---
Nitpick comments:
In `@test/e2e/ovn_control_plane_zero_workers_test.go`:
- Around line 126-131: The test currently calls t.Skip() when upgradeImage
(globalOpts.LatestReleaseImage) is empty or equal to baselineImage, which skips
the entire test; instead, change the flow so only upgrade-specific steps are
gated: check upgradeImage and if missing/equal only skip or return from the
upgrade-related block (the steps that perform the upgrade and post-upgrade
validation) or move those steps into a subtest (t.Run("upgrade", ...)) that is
skipped, while allowing the initial zero-worker OVN validation and
post-upgrade-independent health checks to always run; update references to
upgradeImage, baselineImage and any t.Skip calls accordingly so the rest of the
test is still executed when LatestReleaseImage is not provided.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 4b2e3eab-4f92-42a7-a589-99ea89428359
📒 Files selected for processing (1)
test/e2e/ovn_control_plane_zero_workers_test.go
Address CodeRabbit finding: The rollout predicate could return true on the pre-upgrade revision if the deployment was already ready with the old image. Changes: - Capture baseline generation in addition to baseline image - Verify deployment.Generation has changed from baseline - Verify container image has changed from baseline - Only return true when both generation and image have changed AND all replicas are ready/updated This ensures Eventually waits for the actual upgrade rollout to complete rather than returning immediately on the pre-upgrade state. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
/test e2e-aws |
1 similar comment
|
/test e2e-aws |
…tests The standard Execute() method runs EnsureHostedCluster validation in the after() phase, which incorrectly defaults hasWorkerNodes=true for private or non-public clusters. This causes ValidateHostedClusterConditions to expect worker-dependent conditions (DataPlaneConnectionAvailable, ControlPlaneConnectionAvailable, ClusterVersionAvailable) that cannot be satisfied in zero-worker cluster configurations. This commit adds ExecuteWithoutEnsureValidation() method that: - Skips the problematic after() validation (EnsureHostedCluster) - Still runs before() validation which correctly uses opts.ExpectedNodeCount() - Allows tests to provide their own comprehensive validation - Is specifically designed for non-standard cluster configurations The TestOVNControlPlaneZeroWorkers test is updated to use this new method, as it already provides comprehensive Steps 1-8 validation for OVN components in zero-worker clusters. This fixes the CI failure where the test timed out waiting for conditions that cannot be met without worker nodes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
/test verify-deps |
|
/cc @kyrtapz |
|
@enxebre @devguyio |
Integrate OVN zero-worker validation as a subtest in TestUpgradeControlPlane instead of creating a separate test with a new HostedCluster. Changes: - Remove standalone TestOVNControlPlaneZeroWorkers test - Add "Validate OVN control plane with zero workers" subtest to TestUpgradeControlPlane - Scale NodePool to zero after upgrade completion - Verify ovnkube-node DaemonSet reports DesiredNumberScheduled == 0 - Verify ovnkube-control-plane Deployment remains healthy - Verify network ClusterOperator remains healthy with zero workers This approach: - Saves CI resources by reusing the upgraded cluster - Still validates CORENET-6066 fix (CNO handling DesiredNumberScheduled == 0) - Tests realistic "data plane hibernation after upgrade" scenario Fixes: https://redhat.atlassian.net/browse/CORENET-6066 Related: openshift/cluster-network-operator#2897
|
/test e2e-aws |
After validating OVN with zero workers, scale the NodePool back to its original replica count before the framework's EnsureHostedCluster validation runs. This prevents framework validation failures due to worker-dependent cluster operators (image-registry, ingress) and connectivity checks being unavailable with zero workers. Flow: 1. Complete upgrade with normal worker count ✓ 2. Scale NodePool to zero 3. Validate OVN control plane with zero workers ✓ 4. Scale NodePool back up (NEW) 5. Wait for nodes to become ready (NEW) 6. Framework validation passes ✓ Fixes the issue where EnsureHostedCluster failed with: - DataPlaneConnectionAvailable=Unknown: NoWorkerNodesAvailable - ClusterVersionSucceeding=False: Cluster operators image-registry, ingress not available Related: https://redhat.atlassian.net/browse/CORENET-6066
|
/test e2e-aws |
|
@weliang1: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/test e2e-aws |
Switch TestUpgradeControlPlane to use ExecuteWithoutEnsureValidation to avoid HostedCluster condition validation race after scaling workers back from zero. After the zero-worker validation completes and workers are scaled back to 2 replicas, cluster operators (image-registry, ingress) need additional time to reconcile before HostedCluster conditions reflect healthy state. Node Ready status does not guarantee operator availability. The ExecuteWithoutEnsureValidation method was created specifically for this scenario but was not being used, causing test timeouts on the EnsureHostedCluster validation step. Fixes: openshift#8176 (comment)
|
Now I have all the evidence needed. Let me compile the final report. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryBoth Enterprise Contract (EC) checks fail with the same 2 violations of the Root CauseThe Konflux Enterprise Contract policy The fix for this was committed to the PR #8176's branch ( This is confirmed by cross-checking 11 other open PRs:
Recommendations
Evidence
|
@enxebre Your feedback was addressed as integrating the test into TestUpgradeControlPlane. cc: @devguyio @sjenning |
What this PR does / why we need it:
Adds comprehensive e2e test for OVN control plane with zero workers to verify control plane upgrade capability without worker nodes.
This test validates that OVN control plane components can successfully deploy and upgrade in HyperShift clusters with zero worker nodes, addressing scenarios such as:
Test coverage:
Which issue(s) this PR fixes:
Fixes https://redhat.atlassian.net/browse/CORENET-6066
Special notes for your reviewer:
Checklist:
🤖 Generated with Claude Code
Summary by CodeRabbit