
CNTRLPLANE-3222: Port v1 lifecycle tests to v2 Ginkgo framework#8527

Open
bryan-cox wants to merge 11 commits into
openshift:mainfrom
bryan-cox:CNTRLPLANE-3222

@bryan-cox
Member

@bryan-cox bryan-cox commented May 15, 2026

Summary

Ports 4 remaining v1 lifecycle tests to the v2 Ginkgo framework with platform-agnostic labels, enabling them to run in the Azure self-managed CI pipeline (and any future platform CI).

  • NodePool autoscaling (nodepool-autoscaling): scale-up/down and multi-NodePool balancing
  • NodePool lifecycle (nodepool-lifecycle): 13 sub-tests covering machineconfig, NTO, upgrades (replace/in-place/rolling), previous releases, mirror configs, trust bundles, performance profiles, auto-repair, and disk encryption
  • Control plane upgrade (control-plane-upgrade): N-1 → latest release upgrade with component and version rollout verification
  • Etcd chaos (etcd-chaos): 5 sub-tests for single member recovery, random/all member kills, WAL corruption, and missing member recovery

Also includes:

  • Widened 12 e2eutil helper functions from *testing.T to testing.TB for Ginkgo compatibility (GinkgoTB() returns testing.TB, not *testing.T)
  • create-guests binary for CI cluster creation (wraps hypershift CLI, waits for availability)
  • 6 new env var registrations for release images, Azure creds, and DES ID

Design decisions

  • Upgrade + etcd-chaos share a single HA cluster (created at N-1, upgraded, then chaos-tested)
  • Autoscaling + lifecycle run on the existing public cluster
  • AutoRepair and DiskEncryption are skeleton tests (need cloud SDK dependencies)
  • Functions using t.Run() (e.g. EnsureNoCrashingPods) remain as TODOs since testing.TB doesn't expose Run()
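The testing.TB widening and the t.Run() limitation described above can be sketched as follows. This is a minimal illustration, not code from the PR: EnsureReplicas and recorder are hypothetical names, and recorder merely stands in for whatever value GinkgoTB() would supply.

```go
package main

import (
	"fmt"
	"testing"
)

// recorder is a hypothetical stand-in for a testing.TB supplied by a framework.
// Embedding the testing.TB interface satisfies its unexported method; only the
// methods overridden below are safe to call on it.
type recorder struct {
	testing.TB
	failures []string
}

// Helper is a no-op here; a real TB uses it to adjust reported line numbers.
func (r *recorder) Helper() {}

// Errorf records the formatted failure instead of failing a real test.
func (r *recorder) Errorf(format string, args ...any) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
}

// EnsureReplicas is a hypothetical helper widened from *testing.T to testing.TB,
// so it accepts *testing.T, *testing.B, or any other TB implementation.
func EnsureReplicas(t testing.TB, got, want int) bool {
	t.Helper()
	if got != want {
		t.Errorf("replicas: got %d, want %d", got, want)
		return false
	}
	// A call such as t.Run("sub", ...) would not compile here: Run is a method
	// of *testing.T and is not part of the testing.TB interface.
	return true
}

func main() {
	r := &recorder{}
	fmt.Println(EnsureReplicas(r, 3, 3))
	fmt.Println(EnsureReplicas(r, 2, 3))
	fmt.Println(r.failures)
}
```

Helpers that only report and fail can be widened this way without touching v1 callers, since *testing.T already satisfies testing.TB. Helpers that create subtests need a different structure, which is consistent with the TODO noted above.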

Companion PR needed: openshift/release changes for CI wiring (create/destroy/dump scripts, label filters, job config)

Test plan

  • make verify passes
  • make e2e builds clean (v1 backwards-compatible after testing.TB change)
  • make e2ev2 builds clean
  • go build -tags e2ev2 ./test/e2e/v2/cmd/create-guests/ compiles
  • go vet -tags e2ev2 ./test/e2e/v2/... clean
  • CI: /test e2e-azure-v2-self-managed (existing tests still pass)
  • CI: full lifecycle test run after release repo companion PR lands

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests

    • Expanded e2e coverage with suites for control-plane upgrades, etcd chaos, nodepool autoscaling, and comprehensive nodepool lifecycle scenarios; registered many new test cases and helpers.
  • New Features

    • Added v2 e2e helper programs to create, run, destroy, and collect dumps from guest clusters; introduced a test orchestrator to run groups of tests concurrently and sequence upgrade/chaos flows.
  • Chores

    • Added Makefile targets, Docker image updates, and new environment variables to support the v2 e2e workflows.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests are triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 15, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 15, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented May 15, 2026

@bryan-cox: This pull request references CNTRLPLANE-3222 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/needs-area area/ai Indicates the PR includes changes related to AI - Claude agents, Cursor rules, etc. area/api Indicates the PR includes changes for the API area/ci-tooling Indicates the PR includes changes for CI or tooling area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/control-plane-pki-operator Indicates the PR includes changes for the control plane PKI operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 15, 2026
@coderabbitai
Contributor

coderabbitai Bot commented May 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds v2 Ginkgo E2E suites and orchestration tooling: control-plane upgrade, etcd chaos (five scenarios), nodepool autoscaling, and nodepool lifecycle tests. Widened e2e helper signatures to testing.TB, registered additional lifecycle env vars, added e2ev2 CLI binaries (create-guests, run-tests, destroy-guests, dump-guests), and wired Makefile and Dockerfile to build and package those binaries.

Sequence Diagram(s)

sequenceDiagram
  participant CreateCLI as create-guests
  participant HypershiftCLI as hypershift
  participant MgmtAPI as ManagementCluster API
  participant HostedCluster as HostedCluster CR

  CreateCLI->>HypershiftCLI: exec "hypershift create cluster azure" (args per variant)
  HypershiftCLI->>MgmtAPI: submit HostedCluster CR / related resources
  MgmtAPI->>HostedCluster: create/update HostedCluster
  HostedCluster->>MgmtAPI: status updates (Available, Version History)
  MgmtAPI->>CreateCLI: status observed via controller-runtime watch

Possibly related PRs

  • openshift/hypershift#7152: Prior work on the v2 ginkgo-based e2e-v2 test infrastructure used by the new orchestration and env var wiring.

Suggested reviewers

  • sjenning
  • jparrill
🚥 Pre-merge checks | ✅ 7 | ❌ 5

❌ Failed checks (5 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 31.09% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | One Expect assertion lacks a meaningful error message in control_plane_upgrade_test.go:68. Additionally, cleanup patterns are inconsistent between tests: some use defer while others use DeferCleanup. | Add assertion message to line 68 Expect call. Standardize cleanup patterns to use DeferCleanup consistently throughout tests, which is idiomatic for Ginkgo v2. |
| Microshift Test Compatibility | ⚠️ Warning | Tests use MicroShift-unavailable APIs (MachineConfig, KubeletConfig, PerformanceProfile, etcd pods) without protection mechanisms like [Skipped:MicroShift] or [apigroup:...] labels. | Add [Skipped:MicroShift] or [apigroup:...] tags to tests using unavailable APIs: MachineConfig, KubeletConfig, PerformanceProfile, and etcd chaos tests. |
| Single Node Openshift (Sno) Test Compatibility | ⚠️ Warning | Tests assume multi-node/HA clusters without SNO skip guards: NodePoolRollingUpgradeTest (2 replicas), AutoscalingTests (3-4 nodes), EtcdChaosTests (multi-member etcd). | Add [Skipped:SingleReplicaTopology] labels or exutil.IsSingleNode() skip checks to: NodePoolRollingUpgradeTest, AutoscalingScaleUpDownTest, AutoscalingBalancingTest, all EtcdChaosTests (5 tests). |
| Ipv6 And Disconnected Network Test Compatibility | ⚠️ Warning | The test in nodepool_lifecycle_test.go (lines 1283, 1354) hardcodes external registry.access.redhat.com image pulls, failing IPv6-only disconnected CI requirements. | Replace hardcoded registry.access.redhat.com/ubi9/ubi-minimal:latest with environment variable or cluster-internal registry reference for DaemonSet verification images to support disconnected environments. |
✅ Passed checks (7 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: porting v1 lifecycle tests to v2 Ginkgo framework. It is specific, concise, and directly relates to the primary objective. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All Ginkgo test titles are stable. No dynamic information (UUIDs, timestamps, pod/node names, IP addresses) found in test names. All titles use static semantic descriptions. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Not applicable: PR contains only test code and build system changes. Check requires "deployment manifests, operator code, or controllers"; none present here. |
| Ote Binary Stdout Contract | ✅ Passed | No violations found. test-e2e-v2 redirects all process-level output to stderr via PrintEnvVarHelp() to os.Stderr and BeforeSuite() zap logging. New test files have no problematic stdout writes. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8527 May 15, 2026 13:13 Inactive
@codecov

codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.40%. Comparing base (a7d68da) to head (d428dc1).
⚠️ Report is 21 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8527      +/-   ##
==========================================
+ Coverage   40.34%   40.40%   +0.06%     
==========================================
  Files         755      755              
  Lines       93167    93235      +68     
==========================================
+ Hits        37587    37675      +88     
+ Misses      52877    52858      -19     
+ Partials     2703     2702       -1     

see 3 files with indirect coverage changes

Flag Coverage Δ
cmd-support 34.44% <ø> (+0.13%) ⬆️
cpo-hostedcontrolplane 41.76% <ø> (ø)
cpo-other 40.31% <ø> (+0.17%) ⬆️
hypershift-operator 50.72% <ø> (ø)
other 31.54% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 10

🧹 Nitpick comments (9)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go (2)

1543-1544: 💤 Low value

Misleading %w in log message.

The %w verb is a fmt directive for error wrapping, but r.Log.Error() doesn't perform formatting on the message string. The error is already passed as the first argument and will be logged correctly; the %w is printed literally.

Suggested fix
 if err != nil {
-    r.Log.Error(err, "failed to remove service ca annotation and secret: %w")
+    r.Log.Error(err, "failed to remove service ca annotation and secret")
 }
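To illustrate why the %w is printed literally, here is a minimal sketch using only the standard library. logError is a hypothetical stand-in for a structured logger's Error method (it is not the logger used in the controller); the assumption it encodes is that the message is recorded verbatim and the error is attached as a separate field.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel error used only for this example.
var errNotFound = errors.New("secret not found")

// logError mimics a structured logger: msg is emitted as-is, with no fmt-style
// expansion, and err travels alongside it as its own field.
func logError(err error, msg string) string {
	return fmt.Sprintf("msg=%q error=%q", msg, err.Error())
}

func main() {
	// %w is meaningful to fmt.Errorf, which wraps the error so errors.Is can find it.
	wrapped := fmt.Errorf("failed to remove service ca annotation and secret: %w", errNotFound)
	fmt.Println(errors.Is(wrapped, errNotFound))

	// The message never passes through fmt formatting, so "%w" stays in the output.
	fmt.Println(logError(errNotFound, "failed to remove service ca annotation and secret: %w"))

	// Dropping the verb loses nothing: the error is still logged as its own field.
	fmt.Println(logError(errNotFound, "failed to remove service ca annotation and secret"))
}
```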
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`
around lines 1543 - 1544, The log call in hostedcontrolplane_controller.go uses
the fmt verb "%w" in the message string (`r.Log.Error(err, "failed to remove
service ca annotation and secret: %w")`), which is meaningless for the logger
and will be printed literally; fix it by removing the "%w" from the message so
the error is passed as the first argument and logged correctly (e.g.,
`r.Log.Error(err, "failed to remove service ca annotation and secret")`), or
alternatively keep the error and add structured context via key/value pairs to
the same `r.Log.Error` call.

1721-1729: 💤 Low value

Same %w issue and unnecessary intermediate variable.

The log messages at lines 1722 and 1755 have the same misleading %w issue as flagged earlier. Additionally, the z variable at lines 1725-1726 and 1757-1759 is unnecessary—the reconcile function result can be returned directly.

Suggested fixes
 if err := removeServiceCAAnnotationAndSecret(ctx, r.Client, AzureDiskCsiDriverOperatorService, AzureDiskCsiDriverOperatorServingCert); err != nil {
-    r.Log.Error(err, "failed to remove service ca annotation and secret: %w")
+    r.Log.Error(err, "failed to remove service ca annotation and secret")
 }
 if _, err := createOrUpdate(ctx, r, AzureDiskCsiDriverOperatorServingCert, func() error {
-    z := pki.ReconcileAzureDiskCsiDriverOperatorMetricsServingCertSecret(AzureDiskCsiDriverOperatorServingCert, rootCASecret, p.OwnerRef)
-    return z
+    return pki.ReconcileAzureDiskCsiDriverOperatorMetricsServingCertSecret(AzureDiskCsiDriverOperatorServingCert, rootCASecret, p.OwnerRef)
 }); err != nil {

Apply the same pattern at lines 1754-1759 for the Azure File CSI driver operator.

Also applies to: 1752-1762

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`
around lines 1721 - 1729, The r.Log.Error calls use a printf-style %w
placeholder incorrectly and should be called as r.Log.Error(err, "failed to
remove service ca annotation and secret") (i.e., pass the error as the first
argument to r.Log.Error without formatting), and the reconciliation closures
create an unnecessary intermediate variable (z); replace constructs like "z :=
pki.ReconcileAzureDiskCsiDriverOperatorMetricsServingCertSecret(...); return z"
with "return
pki.ReconcileAzureDiskCsiDriverOperatorMetricsServingCertSecret(...)" inside the
createOrUpdate call (apply the same fixes for AzureDiskCsiDriverOperator* and
the analogous AzureFileCsiDriverOperator* usages, and for functions
removeServiceCAAnnotationAndSecret, createOrUpdate, and
pki.ReconcileAzure...ServingCertSecret).
api/karpenter/v1/kubelet_config.go (1)

50-50: 💤 Low value

Consider simplifying or documenting the complex eviction threshold validation.

The XValidation rule on line 50 is correct but extremely complex (>300 characters). It validates that evictionSoft >= evictionHard for matching signals, handling both percentage strings and quantity types. While the logic is sound, this level of complexity in a single CEL expression makes it difficult to:

  • Review for correctness
  • Debug when validation fails
  • Maintain if requirements change

Consider adding an inline comment above the validation explaining the high-level intent, or explore whether this could be split into separate validation rules (though CEL may not permit that given the cross-field nature).
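As a reading aid for the intent of the rule, the same check can be sketched in plain Go. This is a deliberately simplified illustration, not a translation of the CEL expression: it handles only percentage strings (quantity values such as "100Mi" are rejected), and violations and parsePercent are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parsePercent converts a string such as "10%" to 10. Anything without a
// trailing "%" is rejected because quantities are out of scope for this sketch.
func parsePercent(s string) (float64, error) {
	if !strings.HasSuffix(s, "%") {
		return 0, fmt.Errorf("%q is not a percentage", s)
	}
	return strconv.ParseFloat(strings.TrimSuffix(s, "%"), 64)
}

// violations returns, in sorted order, the signals present in both maps whose
// soft threshold is lower than the hard threshold.
func violations(soft, hard map[string]string) ([]string, error) {
	var bad []string
	for signal, s := range soft {
		h, ok := hard[signal]
		if !ok {
			continue // no matching hard threshold, nothing to compare
		}
		sv, err := parsePercent(s)
		if err != nil {
			return nil, err
		}
		hv, err := parsePercent(h)
		if err != nil {
			return nil, err
		}
		if sv < hv {
			bad = append(bad, signal)
		}
	}
	sort.Strings(bad)
	return bad, nil
}

func main() {
	soft := map[string]string{"memory.available": "5%", "nodefs.available": "15%"}
	hard := map[string]string{"memory.available": "10%", "nodefs.available": "10%"}
	fmt.Println(violations(soft, hard))
}
```

A short comment above the annotation stating this intent in one sentence would give reviewers the same orientation without changing the rule.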

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/karpenter/v1/kubelet_config.go` at line 50, The XValidation CEL
expression on the kubebuilder annotation that enforces evictionSoft >=
evictionHard for matching signals is correct but unreadable; add a clear inline
comment directly above the annotation describing the high-level intent (e.g.,
"Ensure evictionSoft thresholds are greater than or equal to evictionHard for
the same signal, handling percentage strings and resource quantities") and
briefly note the two cases handled (percentage strings vs. resource quantities),
or alternatively factor the check into multiple, named XValidation annotations
if feasible; reference the annotation (the +kubebuilder:validation:XValidation
line) and the related fields evictionSoft and evictionHard so reviewers can find
and understand the rule quickly.
api/hypershift/v1beta1/hostedcluster_types.go (1)

676-681: ⚡ Quick win

Clarify pull secret propagation mechanism for in-place updates.

Lines 678-680 state that updating the Secret's data in place "does not trigger that rollout," but then immediately say the changes "will still propagate the updated credentials down to the guest cluster and kubelet config" for AWS/Azure Replace strategy. This appears contradictory—how do the credentials propagate without triggering a rollout?

Consider clarifying:

  • Does "propagate" mean existing nodes are updated in place without replacement?
  • Is this via a separate controller watching the Secret?
  • Does "rollout" specifically mean node replacement, while propagation is a different mechanism?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/hypershift/v1beta1/hostedcluster_types.go` around lines 676 - 681, The
comment is ambiguous about how Secret data changes "do not trigger that rollout"
yet "will still propagate" for AWS/Azure Replace NodePools; update the paragraph
(the NodePool/pull secret description) to explicitly state that “rollout” refers
to NodePool replacement/in-place upgrade actions (node recreation), and that
updating the referenced Secret's data in-place does not automatically trigger
node replacement, but cloud-provider-specific controllers or credential
reconciliation (e.g., the cloud provider credential controller or kubelet secret
refresh mechanisms) will propagate updated credentials to the guest cluster and
kubelet config for AWS/Azure Replace strategy; mention whether propagation is
in-place (no node replacement) and name the controller or mechanism if
available, or note “a separate controller/watch reconciler” if unspecified, so
readers understand the distinct behaviors for secret updates vs. NodePool
rollouts.
control-plane-operator/controllers/azureprivatelinkservice/controller.go (1)

863-881: 💤 Low value

Good refactoring of deletion logic into focused helpers.

The extraction of deletion logic into dedicated helper functions improves maintainability and testability. Using azPLS.Status.DNSZoneName to avoid dependency on the HostedControlPlane during deletion is a solid design decision.

The verbose log at line 873 when DNSZoneName is not set uses log.V(1), which may not be visible by default. Consider whether this should be a regular log.Info given that it represents a potentially unexpected condition during cleanup.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/azureprivatelinkservice/controller.go`
around lines 863 - 881, The log for the case where azPLS.Status.DNSZoneName is
empty uses log.V(1).Info which may be too quiet for an unexpected cleanup
condition; update the logging in the controller deletion path (the dnsZoneName
check in the reconcile/finalizer routine that calls deleteDNSResources and
deleteBaseDomainResources) to use log.Info (or a higher visibility) instead of
log.V(1).Info so the message "DNSZoneName not set in status, skipping DNS
cleanup" is visible by default when DNSZoneName is missing during deletion.
control-plane-operator/controllers/azureprivatelinkservice/controller_test.go (1)

1747-1774: ⚡ Quick win

Consolidate duplicated errMsgQualifier coverage.

Line 1747 adds TestDNSZoneConfigErrMsgQualifier, but this is functionally duplicating TestErrMsgQualifier later in the file (same inputs and expected outputs). Keeping one avoids drift and redundant maintenance.

Suggested cleanup
-func TestDNSZoneConfigErrMsgQualifier(t *testing.T) {
-	t.Parallel()
-	tests := []struct {
-		name      string
-		logPrefix string
-		expected  string
-	}{
-		{
-			name:      "When logPrefix is empty, it should return empty string",
-			logPrefix: "",
-			expected:  "",
-		},
-		{
-			name:      "When logPrefix is set, it should return prefix followed by a space",
-			logPrefix: "base domain",
-			expected:  "base domain ",
-		},
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			t.Parallel()
-			g := NewGomegaWithT(t)
-			cfg := dnsZoneConfig{logPrefix: tt.logPrefix}
-			g.Expect(cfg.errMsgQualifier()).To(Equal(tt.expected))
-		})
-	}
-}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/azureprivatelinkservice/controller_test.go`
around lines 1747 - 1774, Remove the duplicated unit test
TestDNSZoneConfigErrMsgQualifier and keep the existing TestErrMsgQualifier to
avoid redundant coverage; locate the duplicate test function named
TestDNSZoneConfigErrMsgQualifier in the controller_test.go file (it constructs
dnsZoneConfig{logPrefix: ...} and calls cfg.errMsgQualifier()) and delete that
entire test block so only the original TestErrMsgQualifier remains exercising
dnsZoneConfig.errMsgQualifier.
contrib/repo_metrics/weekly_pr_report.py (2)

965-973: ⚡ Quick win

Add strict=True to zip() for safety.

Line 965 zips repos_to_fetch and results which should have the same length by construction, but adding strict=True makes this assumption explicit and catches bugs if the invariant is violated.

Note: strict= requires Python 3.10+.

🔒 Proposed fix
-        for (owner, name), result in zip(repos_to_fetch, results):
+        for (owner, name), result in zip(repos_to_fetch, results, strict=True):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@contrib/repo_metrics/weekly_pr_report.py` around lines 965 - 973, The loop
that pairs repos_to_fetch with results uses zip(repos_to_fetch, results) and
assumes equal lengths; change it to zip(repos_to_fetch, results, strict=True) to
enforce that invariant at runtime and fail fast if lengths differ; update the
loop in the block that assigns to repo_fetch_status and repo_prs (the for
(owner, name), result in zip(...) loop) so it uses strict=True, ensuring any
mismatch between repos_to_fetch and results is caught immediately.

402-403: 💤 Low value

Consider using immutable types for class constants.

Lines 402-403 use mutable list and set for class constants that are never modified. Converting to tuple and frozenset would make the immutability explicit and silence the static analysis warning.

♻️ Proposed refactor
-    BOT_PATTERNS = ['-bot', '-robot', '[bot]']
-    BOT_LOGINS = {'coderabbitai', 'hypershift-jira-solve-ci'}
+    BOT_PATTERNS = ('-bot', '-robot', '[bot]')
+    BOT_LOGINS = frozenset({'coderabbitai', 'hypershift-jira-solve-ci'})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@contrib/repo_metrics/weekly_pr_report.py` around lines 402 - 403,
BOT_PATTERNS and BOT_LOGINS are declared as mutable list/set but never modified;
change BOT_PATTERNS = ['-bot', '-robot', '[bot]'] to an immutable tuple
BOT_PATTERNS = ('-bot', '-robot', '[bot]') and change BOT_LOGINS =
{'coderabbitai', 'hypershift-jira-solve-ci'} to a frozenset: BOT_LOGINS =
frozenset({'coderabbitai', 'hypershift-jira-solve-ci'}), and update any
membership checks (e.g., "in BOT_LOGINS" or iterating BOT_PATTERNS) which will
continue to work without other code changes.
.claude/commands/pr-report.md (1)

732-755: 💤 Low value

Consider documenting alternative environment variable names for backward compatibility.

The Python script (lines 69-70) accepts both JIRA_API_TOKEN/JIRA_USERNAME and JIRA_TOKEN/JIRA_EMAIL, but the documentation only mentions the latter. Consider adding a note about the alternative names for users with existing configurations.

📝 Suggested documentation addition

Add a note under the environment variables table:

 | `GITHUB_TOKEN` | No | GitHub token (falls back to `gh auth token` if not set) |
+
+**Note:** For backward compatibility, the script also accepts `JIRA_USERNAME` (alias for `JIRA_EMAIL`) and `JIRA_API_TOKEN` (alias for `JIRA_TOKEN`).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/commands/pr-report.md around lines 732 - 755, Update the environment
variables docs to mention the alternative legacy names accepted by the script:
add a brief note under the variables table stating that JIRA_TOKEN/JIRA_EMAIL
are equivalent to the older JIRA_API_TOKEN/JIRA_USERNAME (the script checks both
pairs around lines where env vars are read), so existing configurations using
JIRA_API_TOKEN or JIRA_USERNAME will continue to work; reference the variable
names JIRA_TOKEN, JIRA_EMAIL, JIRA_API_TOKEN, and JIRA_USERNAME in the note.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/commands/pr-report.md:
- Around line 115-119: The detection patterns in .claude/commands/pr-report.md
don’t match the actual runtime messages; update the checks to use partial or
regex matching against the real strings — e.g., match "Fetching Jira data via
REST API" (prefix of "Fetching Jira data via REST API ({len(tickets)}
tickets)...") and "Jira credentials not set" (prefix of "Jira credentials not
set (need JIRA_USERNAME/JIRA_API_TOKEN or JIRA_EMAIL/JIRA_TOKEN), loading from
cache") instead of the exact old phrases, so the script recognizes both messages
with variable content.
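The partial-matching idea can be sketched as below. The two marker strings are taken from the comment above; classify and its return values are hypothetical, and the sketch assumes substring matching is acceptable because the messages carry variable content after the stable part.

```go
package main

import (
	"fmt"
	"strings"
)

// Stable parts of the runtime messages quoted in the review comment.
const (
	restMarker  = "Fetching Jira data via REST API"
	cacheMarker = "Jira credentials not set"
)

// classify reports which data source a log line refers to, matching on the
// stable part of each message rather than on the full, variable text.
func classify(line string) string {
	switch {
	case strings.Contains(line, restMarker):
		return "rest"
	case strings.Contains(line, cacheMarker):
		return "cache"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(classify("Fetching Jira data via REST API (12 tickets)..."))
	fmt.Println(classify("Jira credentials not set (need JIRA_USERNAME/JIRA_API_TOKEN or JIRA_EMAIL/JIRA_TOKEN), loading from cache"))
	fmt.Println(classify("some unrelated line"))
}
```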

In `@api/.golangci.yml`:
- Around line 322-342: The review asks to fix the API types instead of adding
kubeapilinter exclusions: update the KubeletConfiguration and
OpenshiftEC2NodeClassSpec types to satisfy kube-apilinter rules—specifically
change CPUCFSQuota (in KubeletConfiguration) from a *bool to a string enum with
meaningful constants; add a kubebuilder:validation:MinProperties marker where
the KubeletConfiguration object is used (referenced by
OpenshiftEC2NodeClassSpec.Kubelet) and make OpenshiftEC2NodeClassSpec.Kubelet a
pointer if it must be optional to avoid the optionalfields warning; replace map
values that use Go Durations (EvictionSoftGracePeriod) with integer types whose
names include units (e.g., EvictionSoftGracePeriodSeconds) and, for the nomaps
warnings on SystemReserved, KubeReserved, EvictionHard, EvictionSoft, refactor
map fields into well-defined structured types or resource-list fields (or
concrete slice/struct representations) so they no longer use raw maps; run make
api-lint-fix and iterate until kube-apilinter passes.

In `@api/karpenter/v1/kubelet_config.go`:
- Around line 196-212: HasTypedFields currently lists typed fields manually
(KubeletConfiguration.HasTypedFields) and can get out of sync with the struct,
causing IsZero/omitzero bugs; replace the manual checks with a reflection-based
implementation or add a unit test that fails if any exported non-overflow struct
field is not checked: implement HasTypedFields by reflecting over
KubeletConfiguration, skipping overflow/XXX fields and detecting any non-zero
typed field (or alternatively write a test that compares the set of field names
in kubeletConfigKnownFields/reflection against the ones asserted in
HasTypedFields to force updates), and ensure IsZero/omitzero behavior remains
correct for all current and future typed fields.
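One possible shape for the reflection-based variant is sketched below. KubeletConfig is a hypothetical, trimmed-down stand-in for KubeletConfiguration, and the Overflow field name is an assumption standing in for whatever field holds unknown keys in the real type.

```go
package main

import (
	"fmt"
	"reflect"
)

// KubeletConfig is a hypothetical stand-in with a few typed fields and one
// overflow field that must not count as a typed field.
type KubeletConfig struct {
	MaxPods     *int32
	ClusterDNS  []string
	CPUCFSQuota *bool
	Overflow    map[string]string
}

// HasTypedFields reports whether any exported field other than Overflow holds
// a non-zero value. New typed fields are picked up without editing this method.
func (k KubeletConfig) HasTypedFields() bool {
	v := reflect.ValueOf(k)
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if !f.IsExported() || f.Name == "Overflow" {
			continue
		}
		// IsZero treats nil pointers, nil slices, and nil maps as zero; an
		// empty but non-nil slice counts as set.
		if !v.Field(i).IsZero() {
			return true
		}
	}
	return false
}

func main() {
	maxPods := int32(110)
	fmt.Println(KubeletConfig{}.HasTypedFields())
	fmt.Println(KubeletConfig{Overflow: map[string]string{"x": "y"}}.HasTypedFields())
	fmt.Println(KubeletConfig{MaxPods: &maxPods}.HasTypedFields())
}
```

The trade-off is a small reflection cost per call in exchange for not having to keep a hand-written field list in sync; the unit-test alternative mentioned above keeps the manual checks and fails when the list drifts.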

In `@cmd/cluster/core/create.go`:
- Around line 476-484: The code currently returns early when opts.PausedUntil ==
"true" so HostedCluster.Spec.PausedUntil is never set; change the logic in the
function handling opts.PausedUntil to: if opts.PausedUntil is empty return nil;
if opts.PausedUntil == "true" set cluster.Spec.PausedUntil = &opts.PausedUntil
and return nil; otherwise parse opts.PausedUntil with time.RFC3339, return a
wrapped error on parse failure and set cluster.Spec.PausedUntil =
&opts.PausedUntil on success (update the unit test expectation for
HostedCluster.Spec.PausedUntil when "--pausedUntil=true" accordingly).
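
The branch order described above could look roughly like the following sketch. `clusterSpec` and `applyPausedUntil` are hypothetical names used for illustration; only the `"true"` literal and the time.RFC3339 parse come from the comment itself.

```go
package main

import (
	"fmt"
	"time"
)

// clusterSpec is a hypothetical stand-in for HostedCluster.Spec.
type clusterSpec struct {
	PausedUntil *string
}

// applyPausedUntil (hypothetical name) sets PausedUntil for both accepted
// forms: the literal "true" and an RFC3339 timestamp.
func applyPausedUntil(spec *clusterSpec, pausedUntil string) error {
	if pausedUntil == "" {
		return nil // flag not provided, leave the field unset
	}
	if pausedUntil != "true" {
		if _, err := time.Parse(time.RFC3339, pausedUntil); err != nil {
			return fmt.Errorf("invalid pausedUntil %q: must be \"true\" or an RFC3339 timestamp: %w", pausedUntil, err)
		}
	}
	spec.PausedUntil = &pausedUntil
	return nil
}

func main() {
	for _, in := range []string{"", "true", "2026-05-15T00:00:00Z", "tomorrow"} {
		spec := &clusterSpec{}
		err := applyPausedUntil(spec, in)
		fmt.Printf("input=%q set=%t err=%v\n", in, spec.PausedUntil != nil, err)
	}
}
```

Folding the `"true"` case into the same assignment as the timestamp case avoids the early return that left the field unset.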

In `@cmd/fix/dr_oidc_iam.go`:
- Around line 524-568: Add unit tests covering the new OIDC helper branch logic:
write table-driven unit tests for ensureOIDCDocuments and ensureOIDCProvider
that exercise combinations of DryRun true/false, ForceRecreate true/false, and
existing-state flags (oidcDocsExist and exists) to assert correct behavior
(no-op, dry-run messages, calls into generateAndUploadOIDCDocuments,
deleteOIDCProviderIfExists, and createOIDCProvider). Mock/stub client
interactions (k8sClient, s3Client, iamClient) and the helper methods
generateAndUploadOIDCDocuments, deleteOIDCProviderIfExists, createOIDCProvider
to verify they are or are not invoked, and assert returned errors are
propagated/wrapped as expected; include tests for success and error propagation
paths.
- Around line 496-511: The check currently collapses all S3 HeadObject failures
into a false result; change checkOIDCDocumentsExist to return (bool, error)
instead of bool, have it return (false, nil) only for a genuine NotFound/404 and
return (false, err) for AccessDenied/network/transient errors so callers can
distinguish causes, then update checkOIDCState to handle the propagated error
from checkOIDCDocumentsExist (inspect error != nil to log/return the error or
abort rather than printing "documents missing") while preserving the existing
branches for oidcDocsExist and o.ForceRecreate; update any call sites of
checkOIDCDocumentsExist accordingly.

In `@Containerfile.cli`:
- Line 22: The Docker base image reference "FROM
registry.redhat.io/ubi9/nginx-124:latest" is using the mutable :latest tag;
update that FROM line to pin to a specific immutable version tag or image digest
(e.g., a released version tag or sha256 digest) so builds are reproducible and
stable—locate the FROM statement in Containerfile.cli and replace ":latest" with
the chosen version tag or digest, then commit the change and optionally add a
short comment noting the pinned version.

In `@contrib/repo_metrics/weekly_pr_report.py`:
- Around line 144-145: In _fetch_json, replace the blocking time.sleep(30) with
await asyncio.sleep(30) so the coroutine yields to the event loop instead of
stalling it; make the change in the same fallback path that currently does
return await self._fetch_json(...), and ensure asyncio is imported at the module
top if not already present.

In `@control-plane-operator/controllers/azureprivatelinkservice/controller.go`:
- Around line 863-1053: The new deletion helpers lack unit tests; add
table-driven unit tests for deleteDNSResources, deleteVNetLink, deleteDNSZone,
deleteBaseDomainResources, deleteBaseDomainARecords, and deletePrivateEndpoint
(and exercise hasSiblingCR behavior) that cover: (1) Azure API failures
returning errors and asserting the wrapper returns the expected formatted error,
(2) NotFound responses are treated as success (no error), (3) correct resource
name derivation by asserting the Azure client is called with expected names
(e.g., kasARecordName/appsARecordName,
vnetLinkName(crName)/baseDomainVNetLinkName, privateEndpointName(crName), and
base domain A record prefixes), and (4) sibling-CR detection flow where
deleteBaseDomainResources deletes or skips the base zone/records depending on
hasSiblingCR; implement mocks/stubs for RecordSets, VirtualNetworkLinks,
PrivateDNSZones, PrivateEndpoints, and the hasSiblingCR dependency and use
context timeouts/poller behavior to simulate BeginDelete/PollUntilDone success,
error, and nil-poller cases.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go`:
- Around line 61-72: The helper assertCertVolumeCount should fail fast when the
returned slice lengths don't match expectations to avoid downstream panics: in
function assertCertVolumeCount (which calls certVolumesFromMonitors) replace the
t.Errorf checks for volume/mount count mismatches with t.Fatalf so the test
stops on precondition failures rather than allowing callers to index invalid
slices; keep the existing error check for certVolumesFromMonitors as-is and
ensure callers can rely on asserted counts when the helper returns.

---

Nitpick comments:
In @.claude/commands/pr-report.md:
- Around line 732-755: Update the environment variables docs to mention the
alternative legacy names accepted by the script: add a brief note under the
variables table stating that JIRA_TOKEN/JIRA_EMAIL are equivalent to the older
JIRA_API_TOKEN/JIRA_USERNAME (the script checks both pairs where it reads the
env vars), so existing configurations using JIRA_API_TOKEN or JIRA_USERNAME
will continue to work; reference the variable names JIRA_TOKEN, JIRA_EMAIL,
JIRA_API_TOKEN, and JIRA_USERNAME in the note.

In `@api/hypershift/v1beta1/hostedcluster_types.go`:
- Around line 676-681: The comment is ambiguous about how Secret data changes
"do not trigger that rollout" yet "will still propagate" for AWS/Azure Replace
NodePools; update the paragraph (the NodePool/pull secret description) to
explicitly state that “rollout” refers to NodePool replacement/in-place upgrade
actions (node recreation), and that updating the referenced Secret's data
in-place does not automatically trigger node replacement, but
cloud-provider-specific controllers or credential reconciliation (e.g., the
cloud provider credential controller or kubelet secret refresh mechanisms) will
propagate updated credentials to the guest cluster and kubelet config for
AWS/Azure Replace strategy; mention whether propagation is in-place (no node
replacement) and name the controller or mechanism if available, or note “a
separate controller/watch reconciler” if unspecified, so readers understand the
distinct behaviors for secret updates vs. NodePool rollouts.

In `@api/karpenter/v1/kubelet_config.go`:
- Line 50: The XValidation CEL expression on the kubebuilder annotation that
enforces evictionSoft >= evictionHard for matching signals is correct but
unreadable; add a clear inline comment directly above the annotation describing
the high-level intent (e.g., "Ensure evictionSoft thresholds are greater than or
equal to evictionHard for the same signal, handling percentage strings and
resource quantities") and briefly note the two cases handled (percentage strings
vs. resource quantities), or alternatively factor the check into multiple, named
XValidation annotations if feasible; reference the annotation (the
+kubebuilder:validation:XValidation line) and the related fields evictionSoft
and evictionHard so reviewers can find and understand the rule quickly.

In `@contrib/repo_metrics/weekly_pr_report.py`:
- Around line 965-973: The loop that pairs repos_to_fetch with results uses
zip(repos_to_fetch, results) and assumes equal lengths; change it to
zip(repos_to_fetch, results, strict=True) to enforce that invariant at runtime
and fail fast if lengths differ; update the loop in the block that assigns to
repo_fetch_status and repo_prs (the for (owner, name), result in zip(...) loop)
so it uses strict=True, ensuring any mismatch between repos_to_fetch and results
is caught immediately.
- Around line 402-403: BOT_PATTERNS and BOT_LOGINS are declared as a mutable
list and set but never modified; change BOT_PATTERNS = ['-bot', '-robot', '[bot]']
to an immutable tuple BOT_PATTERNS = ('-bot', '-robot', '[bot]') and change
BOT_LOGINS = {'coderabbitai', 'hypershift-jira-solve-ci'} to a frozenset:
BOT_LOGINS = frozenset({'coderabbitai', 'hypershift-jira-solve-ci'}). Existing
membership checks (e.g., "in BOT_LOGINS" or iterating BOT_PATTERNS) continue
to work without other code changes.

In
`@control-plane-operator/controllers/azureprivatelinkservice/controller_test.go`:
- Around line 1747-1774: Remove the duplicated unit test
TestDNSZoneConfigErrMsgQualifier and keep the existing TestErrMsgQualifier to
avoid redundant coverage; locate the duplicate test function named
TestDNSZoneConfigErrMsgQualifier in the controller_test.go file (it constructs
dnsZoneConfig{logPrefix: ...} and calls cfg.errMsgQualifier()) and delete that
entire test block so only the original TestErrMsgQualifier remains exercising
dnsZoneConfig.errMsgQualifier.

In `@control-plane-operator/controllers/azureprivatelinkservice/controller.go`:
- Around line 863-881: The log for the case where azPLS.Status.DNSZoneName is
empty uses log.V(1).Info which may be too quiet for an unexpected cleanup
condition; update the logging in the controller deletion path (the dnsZoneName
check in the reconcile/finalizer routine that calls deleteDNSResources and
deleteBaseDomainResources) to use log.Info (or a higher visibility) instead of
log.V(1).Info so the message "DNSZoneName not set in status, skipping DNS
cleanup" is visible by default when DNSZoneName is missing during deletion.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`:
- Around line 1543-1544: The log call in hostedcontrolplane_controller.go uses
the fmt verb "%w" in the message string (`r.Log.Error(err, "failed to remove
service ca annotation and secret: %w")`), which is meaningless for the logger
and will be printed literally; fix it by removing the "%w" from the message so
the error is passed as the first argument and logged correctly (e.g.,
`r.Log.Error(err, "failed to remove service ca annotation and secret")`), or
alternatively keep the error and add structured context via key/value pairs to
the same `r.Log.Error` call.
- Around line 1721-1729: The r.Log.Error calls use a printf-style %w placeholder
incorrectly and should be called as r.Log.Error(err, "failed to remove service
ca annotation and secret") (i.e., pass the error as the first argument to
r.Log.Error without formatting), and the reconciliation closures create an
unnecessary intermediate variable (z); replace constructs like "z :=
pki.ReconcileAzureDiskCsiDriverOperatorMetricsServingCertSecret(...); return z"
with "return
pki.ReconcileAzureDiskCsiDriverOperatorMetricsServingCertSecret(...)" inside the
createOrUpdate call (apply the same fixes for AzureDiskCsiDriverOperator* and
the analogous AzureFileCsiDriverOperator* usages, and for functions
removeServiceCAAnnotationAndSecret, createOrUpdate, and
pki.ReconcileAzure...ServingCertSecret).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 36192739-e1b9-421b-85fb-9abe5b9a217a

📥 Commits

Reviewing files that changed from the base of the PR and between 10fd799 and beea747.

⛔ Files ignored due to path filters (37)
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AAA_ungated.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ClusterUpdateAcceptRisks.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ClusterVersionOperatorConfiguration.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDC.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUIDAndExtraClaimMappings.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUpstreamParity.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/GCPPlatform.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/HCPEtcdBackup.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/HyperShiftOnlyDynamicResourceAllocation.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ImageStreamImportMode.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/KMSEncryptionProvider.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/OpenStack.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/TLSAdherence.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/karpenter/v1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • client/applyconfiguration/karpenter/v1/kubeletconfiguration.go is excluded by !client/**
  • client/applyconfiguration/karpenter/v1/openshiftec2nodeclassspec.go is excluded by !client/**
  • client/applyconfiguration/utils.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-Default.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_Private_KAS_LoadBalancer.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_Private_Route.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_PublicAndPrivate_Route.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_Public_Route.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/AROSwift/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/GCP/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • karpenter-operator/controllers/karpenter/assets/karpenter.hypershift.openshift.io_openshiftec2nodeclasses.yaml is excluded by !karpenter-operator/controllers/karpenter/assets/*.yaml
  • karpenter-operator/controllers/karpenter/assets/zz_generated.crd-manifests/openshiftec2nodeclasses.crd.yaml is excluded by !**/zz_generated.crd-manifests/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1/karpenter_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1/kubelet_config.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (156)
  • .claude/agents/api-sme.md
  • .claude/commands/pr-report.md
  • .claude/rules/webhook-validation.md
  • .github/workflows/dependabot-commit-fix-reusable.yaml
  • .github/workflows/docs-deploy.yaml
  • .golangci.yml
  • .tekton/hypershift-gomaxprocs-webhook-pull-request.yaml
  • .tekton/hypershift-gomaxprocs-webhook-push.yaml
  • AGENTS.md
  • CLAUDE.md
  • CLAUDE.md
  • Containerfile.cli
  • Containerfile.control-plane
  • Containerfile.operator
  • Makefile
  • api/.golangci.yml
  • api/AGENTS.md
  • api/CLAUDE.md
  • api/hypershift/v1beta1/hostedcluster_types.go
  • api/karpenter/v1/karpenter_types.go
  • api/karpenter/v1/kubelet_config.go
  • api/karpenter/v1/kubelet_config_test.go
  • cmd/cluster/core/create.go
  • cmd/cluster/core/create_test.go
  • cmd/fix/dr_oidc_iam.go
  • cmd/infra/aws/create.go
  • cmd/infra/azure/create.go
  • cmd/infra/gcp/iam-bindings.json
  • cmd/infra/gcp/iam.go
  • cmd/infra/gcp/iam_test.go
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/assets/hypershift_operator_test.go
  • cmd/install/install.go
  • cmd/install/install_render.go
  • cmd/install/install_render_test.go
  • cmd/install/install_test.go
  • contrib/ai/adding-marketplace-plugins.md
  • contrib/repo_metrics/weekly_pr_report.py
  • control-plane-operator/CLAUDE.md
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
  • control-plane-operator/controllers/azureprivatelinkservice/controller.go
  • control-plane-operator/controllers/azureprivatelinkservice/controller_test.go
  • control-plane-operator/controllers/gcpprivateserviceconnect/observer.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
  • control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go
  • control-plane-operator/controllers/hostedcontrolplane/kas/service.go
  • control-plane-operator/controllers/hostedcontrolplane/kas/service_test.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/control-plane-pki-operator/role.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/auth.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/auth_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/drainer/drainer.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/globalps/globalps.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/globalps/setup.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/globalps/setup_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/machine/machine.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/nodecount/controller.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/kas/admissionpolicies.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/registry/admissionpolicies.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
  • control-plane-operator/main.go
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go
  • docs/content/how-to/ci/docs-preview.md
  • docs/content/how-to/ci/github-actions.md
  • docs/content/how-to/common/global-pull-secret.md
  • docs/content/how-to/gcp/configure-image-registry.md
  • docs/content/how-to/gcp/create-gcp-hosted-cluster.md
  • docs/content/how-to/gcp/index.md
  • docs/content/how-to/kubevirt/configuring-vm-with-jsonpatch.md
  • docs/mkdocs.yml
  • hack/github-actions-runner/README.md
  • hack/github-actions-runner/cache-warming-cronjob.yaml
  • hack/kubelet-ratcheting-gen/main.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go
  • hypershift-operator/controllers/etcdbackup/reconciler.go
  • hypershift-operator/controllers/etcdbackup/reconciler_test.go
  • hypershift-operator/controllers/hostedcluster/etcd_recovery.go
  • hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
  • hypershift-operator/controllers/hostedcluster/karpenter.go
  • hypershift-operator/controllers/hostedcluster/karpenter_test.go
  • hypershift-operator/controllers/hostedcluster/metrics/metrics.go
  • hypershift-operator/controllers/hostedcluster/network_policies.go
  • hypershift-operator/controllers/hostedcluster/network_policies_test.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_validation_controller.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/platform/aws/controller.go
  • hypershift-operator/controllers/platform/aws/controller_test.go
  • hypershift-operator/controllers/scheduler/aws/autoscaler.go
  • hypershift-operator/controllers/scheduler/aws/autoscaler_test.go
  • hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes.go
  • hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes_test.go
  • hypershift-operator/controllers/scheduler/azure/controller.go
  • hypershift-operator/main.go
  • ignition-server/controllers/local_ignitionprovider.go
  • ignition-server/controllers/local_ignitionprovider_test.go
  • karpenter-operator/controllers/karpenter/assets/tests/openshiftec2nodeclasses.karpenter.hypershift.openshift.io/stable.openshiftec2nodeclasses.kubelet-field-promotion.testsuite.yaml
  • karpenter-operator/controllers/karpenter/assets/tests/openshiftec2nodeclasses.karpenter.hypershift.openshift.io/stable.openshiftec2nodeclasses.kubelet-ratcheting.testsuite.yaml
  • karpenter-operator/controllers/karpenter/assets/tests/openshiftec2nodeclasses.karpenter.hypershift.openshift.io/stable.openshiftec2nodeclasses.kubelet.testsuite.yaml
  • karpenter-operator/controllers/karpenter/karpenter_controller.go
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go
  • karpenter-operator/controllers/nodeclass/ec2_nodeclass_controller.go
  • karpenter-operator/controllers/nodeclass/karpenter_util.go
  • karpenter-operator/controllers/nodeclass/karpenter_util_test.go
  • support/api/scheme.go
  • support/controlplane-component/CLAUDE.md
  • support/karpenter/karpenter.go
  • support/karpenter/karpenter_test.go
  • support/podspec/containers.go
  • support/podspec/containers_test.go
  • support/validations/authentication.go
  • support/validations/authentication_test.go
  • test/e2e/create_cluster_test.go
  • test/e2e/karpenter_kubelet_checker_pod.yaml
  • test/e2e/karpenter_test.go
  • test/e2e/util/dump/journals.go
  • test/e2e/util/hypershift_framework.go
  • test/e2e/util/reqserving/verifycp.go
  • test/e2e/util/reqserving/verifyenv.go
  • test/e2e/util/util.go
  • test/e2e/v2/backuprestore/cleanup.go
  • test/e2e/v2/cmd/create-guests/main.go
  • test/e2e/v2/internal/env_vars.go
  • test/e2e/v2/tests/backup_restore_test.go
  • test/e2e/v2/tests/control_plane_upgrade_test.go
  • test/e2e/v2/tests/etcd_chaos_test.go
  • test/e2e/v2/tests/hosted_cluster_image_registry_test.go
  • test/e2e/v2/tests/nodepool_autoscaling_test.go
  • test/e2e/v2/tests/nodepool_lifecycle_test.go
  • test/integration/control_plane_pki_operator.go
💤 Files with no reviewable changes (9)
  • control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/drainer/drainer.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/machine/machine.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/registry/admissionpolicies.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/nodecount/controller.go
  • control-plane-operator/controllers/gcpprivateserviceconnect/observer.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/globalps/globalps.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/kas/admissionpolicies.go

@openshift-ci
Contributor

openshift-ci Bot commented May 15, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

Add Label("lifecycle") to all lifecycle test Describes so general v2
jobs can exclude them with a single !lifecycle filter instead of
enumerating each label individually.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bryan-cox
Member Author

/pipeline required

@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@clebs
Member

clebs commented May 21, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 21, 2026
@openshift-merge-bot
Contributor

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

@bryan-cox
Member Author

/retest

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057393342471213056 | Cost: $3.306202 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@bryan-cox
Member Author

/test e2e-aks

@bryan-cox
Member Author

/test e2e-aws

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2057393352541736960 | Cost: $4.9720505 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@bryan-cox
Member Author

/test e2e-azure-v2-self-managed

@bryan-cox
Member Author

/test e2e-azure-self-managed

@bryan-cox
Member Author

/retest

@bryan-cox
Member Author

/test e2e-aws

@bryan-cox
Member Author

/override "Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main"

@bryan-cox
Member Author

/override "Red Hat Konflux / hypershift-operator-enterprise-contract / hypershift-operator-main"

@openshift-ci
Contributor

openshift-ci Bot commented May 21, 2026

@bryan-cox: Overrode contexts on behalf of bryan-cox: Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main

Details

In response to this:

/override "Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@bryan-cox
Member Author

/verified by e2e

@openshift-ci-robot added the verified label on May 21, 2026
@openshift-ci-robot

@bryan-cox: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 21, 2026

@bryan-cox: Overrode contexts on behalf of bryan-cox: Red Hat Konflux / hypershift-operator-enterprise-contract / hypershift-operator-main

Details

In response to this:

/override "Red Hat Konflux / hypershift-operator-enterprise-contract / hypershift-operator-main"


@bryan-cox
Member Author

/retest

One more time

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2057548713995276288 | Cost: $3.42 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@bryan-cox
Member Author

/retest

@openshift-ci
Contributor

openshift-ci Bot commented May 22, 2026

@bryan-cox: The following test failed. Say /retest to rerun all failed tests, or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws | d428dc1 | link | true | /test e2e-aws |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented May 22, 2026


Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Failed to wait for 1 nodes to become ready in 45m0s: context deadline exceeded
observed invalid **v1.Node state after 45m0s
 - observed **v1.Node /ip-10-0-143-79.ec2.internal invalid:
   expected node OS image "Red Hat Enterprise Linux CoreOS 9.8.20260519-0 (Plow)"
   to contain one of [9.8.20260520-0 10.2.20260521-0]
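The check behind this error can be pictured as a substring match of the node's reported OS image against the list of expected versions. The following is a minimal sketch under that assumption; `osImageMatches` is a hypothetical helper written for illustration and is not taken from the HyperShift test code.

```go
package main

import (
	"fmt"
	"strings"
)

// osImageMatches reports whether the node's OS image string contains at
// least one of the expected OS version substrings.
// Hypothetical helper for illustration only.
func osImageMatches(osImage string, expected []string) bool {
	for _, v := range expected {
		if strings.Contains(osImage, v) {
			return true
		}
	}
	return false
}

func main() {
	// Values copied from the error message above.
	observed := "Red Hat Enterprise Linux CoreOS 9.8.20260519-0 (Plow)"
	expected := []string{"9.8.20260520-0", "10.2.20260521-0"}

	fmt.Println(osImageMatches(observed, expected)) // prints "false"
}
```

With the values from the log, the observed image carries build 20260519 while both expected versions are newer builds, so the match fails for as long as the old node is still present.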

Summary

The TestKarpenterUpgradeControlPlane test failed because Karpenter did not replace a drifted node within the 45-minute timeout. After a successful control plane upgrade, Karpenter correctly detected drift on NodeClaim on-demand-mb596 (in 3s) and the control plane rolled out successfully (in 22m39s), but the old node ip-10-0-143-79.ec2.internal was never replaced — it remained on OS version 9.8.20260519-0 instead of upgrading to one of [9.8.20260520-0, 10.2.20260521-0]. This is a pre-existing Karpenter node replacement issue, not caused by the PR's Ginkgo v2 porting changes (the PR does not touch Karpenter test code). The two Konflux enterprise-contract failures are also unrelated to the PR — they are pre-existing infrastructure checks that consistently show NEUTRAL on all recently merged PRs (they are not required gates).

Root Cause

e2e-aws (TestKarpenterUpgradeControlPlane): The root cause is a Karpenter node replacement failure. The test sequence was:

  1. ✅ Created hosted cluster with Karpenter NodePool (20s)
  2. ✅ 1 node became ready (7m42s) — node ip-10-0-143-79.ec2.internal running RHCOS 9.8.20260519-0
  3. ✅ Triggered control plane upgrade to a release containing OS versions [9.8.20260520-0, 10.2.20260521-0]
  4. ✅ Karpenter detected drift on NodeClaim on-demand-mb596 (3s)
  5. ✅ Control plane rollout completed (22m39s)
  6. Node was never replaced — waited 45 minutes but the old node stayed on OS 9.8.20260519-0

Karpenter's drift detection worked correctly, but its node replacement/disruption mechanism failed to act on the drift within 45 minutes. The node remained healthy (Ready=True, no pressure conditions) throughout, suggesting Karpenter never initiated the disruption/replacement workflow despite marking the NodeClaim as drifted. This could be caused by:

  • Disruption budget constraints preventing replacement
  • AWS capacity issues preventing a new node from being provisioned
  • A race condition in Karpenter's disruption controller
  • Consolidation or scheduling constraints blocking the replacement

This failure is unrelated to PR #8527's changes. The PR modifies test/e2e/v2/ lifecycle tests (control plane upgrade, etcd chaos, nodepool autoscaling/lifecycle) and does not touch test/e2e/karpenter_control_plane_upgrade_test.go or any Karpenter-related code.

Konflux enterprise-contract failures: Both Konflux checks (hypershift-operator-main-enterprise-contract and hypershift-operator-enterprise-contract) failed with 2 out of 256 policy checks failing. These are pre-existing failures unrelated to the PR — all recently merged PRs (#8563, #8561, #8557, #8552, #8532, #8519) show these same checks as NEUTRAL (not blocking), confirming they are not required merge gates and represent a known ongoing issue in the Konflux enterprise-contract configuration.

Recommendations
  1. Retest the e2e-aws job — The TestKarpenterUpgradeControlPlane failure is a Karpenter node replacement issue unrelated to this PR's changes. A /retest should confirm this is a flaky/transient failure.

  2. Ignore the Konflux enterprise-contract failures — These are pre-existing infrastructure issues that affect all PRs equally (all recently merged PRs show NEUTRAL for these checks). They are not merge-blocking gates.

  3. If the Karpenter test fails again on retry, investigate whether there's a systemic issue with Karpenter's disruption controller not acting on drifted NodeClaims in the CI environment. Check Karpenter controller logs for disruption budget violations or scheduling constraints.

  4. No code changes needed for this PR — The Ginkgo v2 porting changes do not affect the failing test.

Evidence
| Evidence | Detail |
| --- | --- |
| Failing test | TestKarpenterUpgradeControlPlane/Main: 2 failures out of 600 tests |
| Test duration | 4533.83s (75 min) for Main subtest, 6229.74s (103 min) total |
| Drift detected | NodeClaim on-demand-mb596 marked drifted in 3s ✅ |
| Control plane rollout | Completed in 22m39s ✅ |
| Node not replaced | ip-10-0-143-79.ec2.internal stayed on OS 9.8.20260519-0 for 45 min |
| Expected OS versions | [9.8.20260520-0, 10.2.20260521-0] |
| Node health | Ready=True, no pressure conditions; node was healthy |
| PR files changed | test/e2e/v2/ only; no Karpenter test code modified |
| Karpenter test file | test/e2e/karpenter_control_plane_upgrade_test.go; not in PR diff |
| Konflux check 1 | hypershift-operator-main-enterprise-contract: 254 pass, 24 warn, 2 fail |
| Konflux check 2 | hypershift-operator-enterprise-contract: 254 pass, 24 warn, 2 fail |
| Konflux on merged PRs | All show NEUTRAL; confirms pre-existing, not PR-caused |
| CI step failed | e2e-aws-hypershift-aws-run-e2e-nested after 1h44m24s |


Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/ai: Indicates the PR includes changes related to AI - Claude agents, Cursor rules, etc.
area/api: Indicates the PR includes changes for the API
area/ci-tooling: Indicates the PR includes changes for CI or tooling
area/cli: Indicates the PR includes changes for CLI
area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
area/control-plane-pki-operator: Indicates the PR includes changes for the control plane PKI operator - in an OCP release
area/documentation: Indicates the PR includes changes for documentation
area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
area/karpenter-operator: Indicates the PR includes changes related to the Karpenter operator
area/platform/aws: PR/issue for AWS (AWSPlatform) platform
area/platform/azure: PR/issue for Azure (AzurePlatform) platform
area/platform/gcp: PR/issue for GCP (GCPPlatform) platform
area/platform/kubevirt: PR/issue for KubeVirt (KubevirtPlatform) platform
area/testing: Indicates the PR includes changes for e2e testing
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
verified: Signifies that the PR passed pre-merge verification criteria

4 participants