OCPBUGS-77268: reconcile HCP when pull secret is unavailable#8352
OCPBUGS-77268: reconcile HCP when pull secret is unavailable#8352rafael-azevedo wants to merge 1 commit into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughWhen pull-secret fetching fails during HostedCluster reconciliation, the controller logs the failure and attempts a recovery reconcile of the HostedControlPlane. The recovery recomputes annotation-driven state (autoscaling, AWS node-termination handler, and KAS serving-cert hashes) while disabling stale-certificate checks, then performs a createOrUpdate of the HostedControlPlane with those annotations. If annotation computation fails, that error is returned; if the HostedControlPlane reconcile fails it is logged but the original pull-secret error is still returned. New unit and e2e tests cover missing or corrupted pull-secret scenarios. Sequence Diagram(s)sequenceDiagram
participant Reconciler
participant API_Server as "API Server (Secrets/CRs)"
participant AnnotationLogic as "Annotation Computation"
participant HostedControlPlane as "HostedControlPlane (createOrUpdate)"
Reconciler->>API_Server: Get pull secret
API_Server-->>Reconciler: error (pull secret missing/corrupt)
Reconciler-->>Reconciler: log "failed to get pull secret"
Reconciler->>AnnotationLogic: compute autoscaling/node-termination/cert-hash annotations (disable stale cert check)
AnnotationLogic-->>Reconciler: annotations or error
alt annotations computed
Reconciler->>HostedControlPlane: reconcile HostedControlPlane with annotations
HostedControlPlane-->>Reconciler: success or error (logged)
Reconciler-->>Reconciler: return original pull-secret error
else annotation computation failed
AnnotationLogic-->>Reconciler: error
Reconciler-->>Reconciler: return annotation error
end
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (3 warnings, 1 inconclusive)
✅ Passed checks (8 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8352 +/- ##
==========================================
+ Coverage 40.40% 40.43% +0.02%
==========================================
Files 755 755
Lines 93235 93256 +21
==========================================
+ Hits 37675 37708 +33
+ Misses 52858 52850 -8
+ Partials 2702 2698 -4
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go (1)
668-693: Reduce duplication and avoid shadowing the outerhcp.The new fallback block largely repeats the HCP-reconcile logic at lines 1825–1844 (autoscaler check, NTH check,
createOrUpdate+reconcileHostedControlPlanewith the same cert-renewal closure). It also redeclareshcpwith:=(line 672), shadowing the outerhcpfrom line 385; the main path at line 1833 uses=to reassign. Consider extracting a small helper (e.g.,r.reconcileHostedControlPlaneObject(ctx, log, hcluster, controlPlaneNamespace, defaultToControlPlaneV2)) and calling it from both sites — this would also make it easier to keep the V2 flag and error-aggregation behavior consistent across paths.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go` around lines 668 - 693, The fallback error path duplicates the HCP-reconcile logic and shadows the outer hcp; extract a helper method on the reconciler (e.g., r.reconcileHostedControlPlaneObject(ctx, log, hcluster, controlPlaneNamespace, defaultToControlPlaneV2)) that performs the autoscaler and NTH checks (isAutoscalingNeeded, isAWSNodeTerminationHandlerNeeded), calls createOrUpdate and reconcileHostedControlPlane with the same cert-renewal closure (shouldCheckForStaleCerts, kasServingCertHashFromSecret, kasServingCertHashFromEndpoint, kasHostAndPortFromHCP), and returns any errors; replace the duplicated block with a call to that helper and ensure you assign to the existing hcp variable (use = not :=) so you don’t shadow the outer variable and so V2 flag and error aggregation remain consistent across both code paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`:
- Around line 342-343: The test currently only asserts the presence of the
annotation on updatedHCP using
HaveKey(hyperv1.RequestServingNodeAdditionalSelectorAnnotation); update the
assertion to verify the annotation value as well (e.g., use Gomega's
HaveKeyWithValue or compare
updatedHCP.Annotations[hyperv1.RequestServingNodeAdditionalSelectorAnnotation]
to the expected selector string/variable). Locate the assertion around
updatedHCP and replace the HaveKey check with a value equality check referencing
hyperv1.RequestServingNodeAdditionalSelectorAnnotation and the expected selector
used earlier in the test.
In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 689-692: When hcpErr is non-nil the code currently returns the
wrapped HCP error and drops the original pull-secret error (err); change this so
the controller still returns the original err for requeue while preserving or
surfacing hcpErr: log hcpErr (or aggregate it into the returned error while
keeping err as the primary error) instead of returning only hcpErr. Locate the
block using variables hcpErr and err (in the hostedcluster reconcile flow) and
replace the return so you either call the controller logger to record hcpErr or
return a combined error that wraps hcpErr but keeps err as the returned value
(e.g., log hcpErr then return err).
- Line 685: The call currently uses a hardcoded false
(shouldCheckForStaleCerts(hcluster, false)) which forces stale-cert checks on V2
clusters during a pull-secret error path; instead, on this error path skip the
stale-cert check entirely: detect the pull-secret-unavailable condition where
this line runs and do not call shouldCheckForStaleCerts or the subsequent
kasServingCertHashFromSecret/kasServingCertHashFromEndpoint work; otherwise,
when the pull secret is available, obtain the correct defaultToControlPlaneV2
value (derive it the same way the main reconcile does from the CPO image labels
or HCP annotations) and pass that into shouldCheckForStaleCerts so behavior
matches normal reconciliation.
In `@test/e2e/chaos_test.go`:
- Around line 500-504: Replace the weak corruption payload which may be treated
as a benign auth config by assigning a structurally-invalid but still
syntactically valid docker config JSON to pullSecret.Data[".dockerconfigjson"];
specifically, change the value written before calling mgtClient.Update so the
JSON uses incorrect structure/types for a docker config (for example an "auths"
field with the wrong value type) to reliably trigger the pull-secret failure
path in the chaos_test where pullSecret is updated via mgtClient.Update.
- Around line 509-526: The current assertion only checks ReleaseImage and can
pass without reconcileHostedControlPlane running; after corrupting the pull
secret, mutate a field on the source HostedCluster that is known to be
propagated to the HostedControlPlane (e.g. update
hostedCluster.Spec.<propagatedField>), then use the existing
e2eutil.EventuallyObject / predicate flow that reads the HostedControlPlane (h
:= &hyperv1.HostedControlPlane{} via mgtClient.Get) to assert that
h.Spec.<propagatedField> equals the new value, ensuring
reconcileHostedControlPlane actually applied the update.
---
Nitpick comments:
In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 668-693: The fallback error path duplicates the HCP-reconcile
logic and shadows the outer hcp; extract a helper method on the reconciler
(e.g., r.reconcileHostedControlPlaneObject(ctx, log, hcluster,
controlPlaneNamespace, defaultToControlPlaneV2)) that performs the autoscaler
and NTH checks (isAutoscalingNeeded, isAWSNodeTerminationHandlerNeeded), calls
createOrUpdate and reconcileHostedControlPlane with the same cert-renewal
closure (shouldCheckForStaleCerts, kasServingCertHashFromSecret,
kasServingCertHashFromEndpoint, kasHostAndPortFromHCP), and returns any errors;
replace the duplicated block with a call to that helper and ensure you assign to
the existing hcp variable (use = not :=) so you don’t shadow the outer variable
and so V2 flag and error aggregation remain consistent across both code paths.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: c1d7ffc1-6089-4045-b310-bed495c8ac4a
📒 Files selected for processing (3)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.gotest/e2e/chaos_test.go
|
/test e2e-aws |
f5ff5b8 to
adcbbe4
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (2)
309-310: Nit:defer mockCtrl.Finish()is redundant.
gomock.NewController(t)registers at.CleanupforFinish()since gomock v1.5.0, so the explicit defer is dead code. Other tests in this file have already dropped it (e.g.TestHasBeenAvailableat Line 110,TestValidateReleaseImageat Line 2355). Drop for consistency.mockCtrl := gomock.NewController(t) - defer mockCtrl.Finish() mockedProvider := releaseinfo.NewMockProviderWithOpenShiftImageRegistryOverrides(mockCtrl)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go` around lines 309 - 310, Remove the redundant explicit defer call to mockCtrl.Finish() after creating the gomock controller with mockCtrl := gomock.NewController(t); NewController already registers t.Cleanup to call Finish(), so delete the line "defer mockCtrl.Finish()" in the test (the mockCtrl variable and its NewController(t) call should remain).
309-311: Mock provider has no expectations — fine today, but brittle.
mockedProvideris constructed without anyEXPECT().Lookup(...).AnyTimes()(unlike every other test in this file). This works only as long as the missing-pull-secret path errors out before any call into the release provider. If a future change reorders validation or release-image lookup ahead ofGetPullSecretBytes, this test will start panicking with a gomock "unexpected call" rather than failing meaningfully.Adding a permissive default would make the test resilient:
Suggested hardening
mockedProvider := releaseinfo.NewMockProviderWithOpenShiftImageRegistryOverrides(mockCtrl) + mockedProvider.EXPECT(). + Lookup(gomock.Any(), gomock.Any(), gomock.Any()). + Return(testutils.InitReleaseImageOrDie("4.15.0"), nil).AnyTimes() + mockedProvider.EXPECT().GetRegistryOverrides().Return(nil).AnyTimes() + mockedProvider.EXPECT().GetOpenShiftImageRegistryOverrides().Return(nil).AnyTimes() + mockedProvider.EXPECT().GetMirroredReleaseImage().Return("").AnyTimes()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go` around lines 309 - 311, The test constructs mockedProvider via releaseinfo.NewMockProviderWithOpenShiftImageRegistryOverrides but sets no default expectations, which makes the test brittle; add a permissive default expectation on mockedProvider.Lookup (or the Lookup method used by release provider) using mockedProvider.EXPECT().Lookup(gomock.Any(), gomock.Any()).AnyTimes().Return(<sensible default return values>, nil) so any unexpected lookup calls won't cause gomock panics; place this right after the mockedProvider creation near the mockCtrl setup and ensure the returned values match the signature expected by code paths (so GetPullSecretBytes and other validation can run without triggering unexpected-call errors).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`:
- Around line 309-310: Remove the redundant explicit defer call to
mockCtrl.Finish() after creating the gomock controller with mockCtrl :=
gomock.NewController(t); NewController already registers t.Cleanup to call
Finish(), so delete the line "defer mockCtrl.Finish()" in the test (the mockCtrl
variable and its NewController(t) call should remain).
- Around line 309-311: The test constructs mockedProvider via
releaseinfo.NewMockProviderWithOpenShiftImageRegistryOverrides but sets no
default expectations, which makes the test brittle; add a permissive default
expectation on mockedProvider.Lookup (or the Lookup method used by release
provider) using mockedProvider.EXPECT().Lookup(gomock.Any(),
gomock.Any()).AnyTimes().Return(<sensible default return values>, nil) so any
unexpected lookup calls won't cause gomock panics; place this right after the
mockedProvider creation near the mockCtrl setup and ensure the returned values
match the signature expected by code paths (so GetPullSecretBytes and other
validation can run without triggering unexpected-call errors).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: d47c7540-abd2-44c5-ac0a-8d68c70d3c46
📒 Files selected for processing (3)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.gotest/e2e/chaos_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- test/e2e/chaos_test.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
|
/test e2e-aws |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
Test Resultse2e-aws
|
adcbbe4 to
831514a
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go (1)
681-687:⚠️ Potential issue | 🟠 MajorSkip stale-cert checks on the pull-secret fallback path.
Line 685 still hardcodes
falseintoshouldCheckForStaleCerts, so this recovery path can run cert-hash logic for already-available clusters during a pull-secret outage. That makes the fallback reconcile depend on unrelated secret/endpoint checks and can even stamp a restart annotation, which defeats the goal of “best-effort HCP propagation while the pull secret is broken.”🛠️ Suggested change
_, hcpErr := createOrUpdate(ctx, r.Client, hcp, func() error { - return reconcileHostedControlPlane(hcp, hcluster, isAutoscalingNeeded, isAWSNodeTerminationHandlerNeeded, - annotationsForCertRenewal(log, - hcp, - shouldCheckForStaleCerts(hcluster, false), - r.kasServingCertHashFromSecret(ctx, hcp), - r.kasServingCertHashFromEndpoint(kasHostAndPortFromHCP(hcp)))) + return reconcileHostedControlPlane( + hcp, + hcluster, + isAutoscalingNeeded, + isAWSNodeTerminationHandlerNeeded, + func() (map[string]string, error) { return nil, nil }, + ) })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go` around lines 681 - 687, The pull-secret fallback path is still calling shouldCheckForStaleCerts with a hardcoded false which causes cert-hash logic to run; update the call inside the createOrUpdate reconcile for the HostedControlPlane (the reconcileHostedControlPlane(...) invocation that builds annotationsForCertRenewal) to pass the flag that skips stale-cert checks (call shouldCheckForStaleCerts(hcluster, true) instead of false) so the fallback path does not perform kasServingCertHashFromSecret/kasServingCertHashFromEndpoint checks or stamp restart annotations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/chaos_test.go`:
- Around line 462-528: The test only mutates and asserts Spec.NodeSelector but
the regression covers the request-serving selector annotation; update the test
to also mutate hostedCluster.ObjectMeta.Annotations (set the
RequestServingNodeAdditionalSelector annotation key to a test value) right after
changing Spec.NodeSelector, then add an EventuallyObject check (similar to the
HCP NodeSelector check) that fetches the HostedControlPlane and verifies the
propagated RequestServingNodeAdditionalSelector (or the HCP field that holds
that selector) contains the test value; modify the predicate used in
e2eutil.EventuallyObject (or add a new call) to look for the annotation/field on
the HostedControlPlane and fail the test until propagation is observed.
- Around line 479-545: After capturing the original pull secret data (variable
pullSecret and originalData) and before mutating anything, register a t.Cleanup
closure that restores pullSecret.Data[".dockerconfigjson"] using originalData
and reverts the hostedCluster.Spec.NodeSelector change (capture a copy of the
original nodeSelector map or its value for key "ocpbugs-77268-test"), and have
the closure call mgtClient.Get/Update to restore both the secret and the
hostedCluster NodeSelector; this ensures rollback runs on any test failure—refer
to pullSecret, originalData, hostedCluster.Spec.NodeSelector, and the
NodeSelector key "ocpbugs-77268-test" to locate where to add the t.Cleanup.
---
Duplicate comments:
In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 681-687: The pull-secret fallback path is still calling
shouldCheckForStaleCerts with a hardcoded false which causes cert-hash logic to
run; update the call inside the createOrUpdate reconcile for the
HostedControlPlane (the reconcileHostedControlPlane(...) invocation that builds
annotationsForCertRenewal) to pass the flag that skips stale-cert checks (call
shouldCheckForStaleCerts(hcluster, true) instead of false) so the fallback path
does not perform kasServingCertHashFromSecret/kasServingCertHashFromEndpoint
checks or stamp restart annotations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: e11a95e2-637a-4ed4-89d6-541e854c5b50
📒 Files selected for processing (3)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.gotest/e2e/chaos_test.go
831514a to
804ff63
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (2)
test/e2e/chaos_test.go (2)
462-465:⚠️ Potential issue | 🟡 MinorComment says annotation is validated, but the test never asserts it.
Lines 462-465 state this covers
RequestServingNodeAdditionalSelector, but the body only mutates/assertsSpec.NodeSelector. This can stay green while annotation propagation regresses.Proposed test hardening
if hostedCluster.Spec.NodeSelector == nil { hostedCluster.Spec.NodeSelector = map[string]string{} } + if hostedCluster.Annotations == nil { + hostedCluster.Annotations = map[string]string{} + } hostedCluster.Spec.NodeSelector["ocpbugs-77268-test"] = "true" + hostedCluster.Annotations[hyperv1.RequestServingNodeAdditionalSelectorAnnotation] = `{"topology.kubernetes.io/zone":"us-east-1a"}` err = mgtClient.Update(ctx, hostedCluster) g.Expect(err).NotTo(HaveOccurred(), "failed to update hostedcluster NodeSelector") @@ func(h *hyperv1.HostedControlPlane) (done bool, reasons string, err error) { val, ok := h.Spec.NodeSelector["ocpbugs-77268-test"] if !ok || val != "true" { return false, "NodeSelector not yet propagated to HCP", nil } + if h.Annotations[hyperv1.RequestServingNodeAdditionalSelectorAnnotation] != hostedCluster.Annotations[hyperv1.RequestServingNodeAdditionalSelectorAnnotation] { + return false, "RequestServingNodeAdditionalSelector not yet propagated to HCP", nil + } return true, "NodeSelector propagated to HCP", nil },Also applies to: 495-528
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/chaos_test.go` around lines 462 - 465, The test TestPullSecretUnavailable mentions RequestServingNodeAdditionalSelector but never mutates or asserts it—update the test to mutate the HostedCluster's RequestServingNodeAdditionalSelector (in addition to Spec.NodeSelector) and then assert that the HostedControlPlane receives the expected annotation/selector value; ensure you change the same pattern in the other similar block (the duplicated assertion block) so both Spec.NodeSelector and RequestServingNodeAdditionalSelector are modified and verified for propagation to the HostedControlPlane.
479-545:⚠️ Potential issue | 🟠 MajorRegister rollback with
t.Cleanupbefore mutating resources.If any assertion fails before Line 530, cleanup won’t run and the test can leave the pull secret corrupted.
Suggested fix
originalData := pullSecret.Data[".dockerconfigjson"] g.Expect(originalData).NotTo(BeEmpty(), "pull secret should have .dockerconfigjson data") + originalNodeSelector := map[string]string{} + for k, v := range hostedCluster.Spec.NodeSelector { + originalNodeSelector[k] = v + } + + t.Cleanup(func() { + latestSecret := &corev1.Secret{} + if err := mgtClient.Get(ctx, crclient.ObjectKey{ + Namespace: hostedCluster.Namespace, + Name: hostedCluster.Spec.PullSecret.Name, + }, latestSecret); err == nil { + latestSecret.Data[".dockerconfigjson"] = originalData + _ = mgtClient.Update(ctx, latestSecret) + } + + latestHC := &hyperv1.HostedCluster{} + if err := mgtClient.Get(ctx, crclient.ObjectKeyFromObject(hostedCluster), latestHC); err == nil { + latestHC.Spec.NodeSelector = originalNodeSelector + _ = mgtClient.Update(ctx, latestHC) + } + })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/chaos_test.go` around lines 479 - 545, Register a rollback with t.Cleanup before you corrupt the pull secret so restoration always runs: capture originalData and the hostedCluster reference, then call t.Cleanup(func(){ fetch the latest pullSecret and hostedCluster via mgtClient using ctx, restore pullSecret.Data[".dockerconfigjson"]=originalData and update it, remove the test key from hostedCluster.Spec.NodeSelector and update it; log or ignore errors inside cleanup to avoid masking test failures }). Do this immediately after reading originalData and before mutating pullSecret or hostedCluster (references: pullSecret, originalData, hostedCluster, mgtClient, ctx, hostedCluster.Spec.NodeSelector, t.Cleanup).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test/e2e/chaos_test.go`:
- Around line 462-465: The test TestPullSecretUnavailable mentions
RequestServingNodeAdditionalSelector but never mutates or asserts it—update the
test to mutate the HostedCluster's RequestServingNodeAdditionalSelector (in
addition to Spec.NodeSelector) and then assert that the HostedControlPlane
receives the expected annotation/selector value; ensure you change the same
pattern in the other similar block (the duplicated assertion block) so both
Spec.NodeSelector and RequestServingNodeAdditionalSelector are modified and
verified for propagation to the HostedControlPlane.
- Around line 479-545: Register a rollback with t.Cleanup before you corrupt the
pull secret so restoration always runs: capture originalData and the
hostedCluster reference, then call t.Cleanup(func(){ fetch the latest pullSecret
and hostedCluster via mgtClient using ctx, restore
pullSecret.Data[".dockerconfigjson"]=originalData and update it, remove the test
key from hostedCluster.Spec.NodeSelector and update it; log or ignore errors
inside cleanup to avoid masking test failures }). Do this immediately after
reading originalData and before mutating pullSecret or hostedCluster
(references: pullSecret, originalData, hostedCluster, mgtClient, ctx,
hostedCluster.Spec.NodeSelector, t.Cleanup).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: fd265325-9904-41af-8a72-f07c0845e64d
📒 Files selected for processing (3)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.gotest/e2e/chaos_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
804ff63 to
508ff9e
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/chaos_test.go`:
- Around line 493-495: The test currently saves and later restores entire maps
via maps.Clone(hostedCluster.Spec.NodeSelector) and
maps.Clone(hostedCluster.Annotations), which can wipe other reconcilers'
updates; change this to capture only the test-owned keys (the node selector key
"ocpbugs-77268-test" in hostedCluster.Spec.NodeSelector and the annotation key
RequestServingNodeAdditionalSelectorAnnotation in hostedCluster.Annotations),
and in both the cleanup block and the explicit recovery path update the cluster
by setting or deleting only those individual keys (restore the saved value if
present, or delete the key if it was absent) instead of replacing the whole map;
locate uses of hostedCluster, maps.Clone, and the cleanup/recovery blocks and
modify them accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 46ca9f78-947c-4c96-a2b7-42876b92bf8d
📒 Files selected for processing (3)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.gotest/e2e/chaos_test.go
✅ Files skipped from review due to trivial changes (1)
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
508ff9e to
b3bb195
Compare
|
/retest |
|
/test e2e-aws |
2123994 to
058d1c8
Compare
|
/test e2e-aws |
|
/retest |
058d1c8 to
9ae4079
Compare
|
/test e2e-aws |
9ae4079 to
f981c59
Compare
|
/test e2e-aws |
cblecker
left a comment
There was a problem hiding this comment.
Review of the test changes — a few suggestions on the e2e cleanup pattern and unit test assertions.
| originalAnnotationVal, hadAnnotation := hostedCluster.Annotations[hyperv1.RequestServingNodeAdditionalSelectorAnnotation] | ||
|
|
||
| // Ensure pull secret and test-owned keys are restored even if the test fails mid-way | ||
| t.Cleanup(func() { |
There was a problem hiding this comment.
The t.Cleanup here and the manual restoration at the end of the test body (lines 616-640) both restore the same state. Do we need both? Would it work to consolidate into just t.Cleanup (adding the recovery EventuallyObject wait there too) to avoid doing the restoration twice on the happy path?
|
|
||
| // When the pull secret is missing, reconciliation should return an error for requeue | ||
| g.Expect(err).To(HaveOccurred(), "When pull secret is missing it should return an error") | ||
| g.Expect(err.Error()).To(ContainSubstring("pull secret"), "When pull secret is missing the error should mention pull secret") |
There was a problem hiding this comment.
nit: The controller wraps this as "failed to get pull secret: %w". A tighter assertion would avoid false positives if a different code path happens to mention "pull secret" in its error:
| g.Expect(err.Error()).To(ContainSubstring("pull secret"), "When pull secret is missing the error should mention pull secret") | |
| g.Expect(err.Error()).To(ContainSubstring("failed to get pull secret"), "When pull secret is missing the error should mention pull secret") |
| // autoscaler should be disabled. This validates the defaulting logic in the | ||
| // pull secret recovery path. | ||
| g.Expect(updatedHCP.Annotations).To(HaveKeyWithValue( | ||
| hyperv1.DisableClusterAutoscalerAnnotation, "true"), |
There was a problem hiding this comment.
This validates the happy path (no error, no NodePools → autoscaling not needed). But the recovery path also has error-defaulting logic that defaults isAutoscalingNeeded = true and isAWSNodeTerminationHandlerNeeded = true on error. Would it be worth adding a test case that exercises those defaults?
f981c59 to
cfdbf0f
Compare
|
/test e2e-aws |
|
/jira refresh |
|
@cblecker: This pull request references Jira Issue OCPBUGS-77268, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@cblecker: This pull request references Jira Issue OCPBUGS-77268, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (jiezhao@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/lgtm |
|
Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage. |
|
/pipeline required |
|
Scheduling tests matching the |
|
/test e2e-aws |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
Now I have the complete picture. Let me generate the final report. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryAll 41 failures are pre-existing flakes unrelated to PR #8352. The new test added by this PR ( Root CauseThe failures are caused by management cluster kube-apiserver instability during the test run, which is an infrastructure-level flake unrelated to PR #8352. Evidence of infrastructure instability:
EnsureNoCrashingPods failures (11 test suites): ValidateHostedCluster timeouts (12 test suites): PR #8352 is not the cause:
Recommendations
Evidence
|
When the pull secret is missing or invalid, the HostedCluster reconciler exits before reaching reconcileHostedControlPlane. This prevents RequestServingNodeAdditionalSelector and other spec fields from being propagated to the HCP, breaking HA scheduling for request-serving nodes. Log the error and call reconcileHostedControlPlane before returning so pod placement decisions continue during a pull secret outage. The error is still returned to trigger a requeue for full reconciliation once the pull secret is restored. JIRA: OCPBUGS-77268
cfdbf0f to
f1af3b1
Compare
|
New changes are detected. LGTM label has been removed. |
|
@rafael-azevedo: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What this PR does / why we need it:
When the pull secret is missing or invalid, the HostedCluster reconciler exits before reaching reconcileHostedControlPlane. This prevents RequestServingNodeAdditionalSelector and other spec fields from being propagated to the HCP, breaking HA scheduling for request-serving nodes.
Log the error and call reconcileHostedControlPlane before returning so pod placement decisions continue during a pull secret outage. The error is still returned to trigger a requeue for full reconciliation once the pull secret is restored.
JIRA: OCPBUGS-77268
Which issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/OCPBUGS-77268
Special notes for your reviewer:
Checklist:
Summary by CodeRabbit
Bug Fixes
Tests