OCPBUGS-84955: bootstrap serving certs at hypershift operator startup#8545

Open
clebs wants to merge 3 commits into openshift:main from clebs:bootstrap-certs
Member

@clebs clebs commented May 19, 2026

What this PR does / why we need it:

The webhook server requires TLS certs on disk before it can start listening. Previously, certs were only created via hypershift install manifests, so the operator would fail to start if the serving cert secret was not yet present, as is the case when running hypershift install render without the --render-sensitive flag.

This happens when the rendered manifests are pushed to a GitOps workflow and, therefore, secrets cannot be rendered.

By adding bootstrap cert generation at startup, the operator is self-sufficient: if the secret exists, the volume mount delivers certs normally; if it is missing or empty, certs are generated, persisted, and written to disk. The secret volume is now marked optional so the pod can start without it.
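
For readers unfamiliar with what "generating" the certs involves, here is a minimal standard-library sketch of issuing a self-signed CA and a serving certificate signed by it. It is not the code in this PR: the real logic lives in webhookcerts.GenerateInitialWebhookCerts, whose internals are not shown in this conversation. The helper names (issue, newCA, newServingCert, verifyServing), the choice of ECDSA P-256, the validity period, and the DNS name operator.hypershift.svc are all assumptions made for illustration. Private-key PEM encoding is omitted for brevity.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// issued bundles a certificate with its private key (hypothetical type).
type issued struct {
	CertPEM []byte
	Cert    *x509.Certificate
	Key     *ecdsa.PrivateKey
}

// issue creates a certificate from tmpl. With a nil signer the certificate is
// self-signed; otherwise it is signed by the signer's key.
func issue(tmpl *x509.Certificate, signer *issued) (*issued, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	parent, signKey := tmpl, key // self-signed by default
	if signer != nil {
		parent, signKey = signer.Cert, signer.Key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, signKey)
	if err != nil {
		return nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	return &issued{CertPEM: certPEM, Cert: cert, Key: key}, nil
}

// newCA issues a self-signed CA certificate.
func newCA(commonName string) (*issued, error) {
	return issue(&x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: commonName},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(365 * 24 * time.Hour),
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
		BasicConstraintsValid: true,
		IsCA:                  true,
	}, nil)
}

// newServingCert issues a server certificate for dnsNames, signed by ca.
func newServingCert(ca *issued, dnsNames []string) (*issued, error) {
	return issue(&x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "webhook-serving"},
		DNSNames:     dnsNames,
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(365 * 24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}, ca)
}

// verifyServing checks that servingPEM chains to caPEM and is valid for dnsName.
func verifyServing(caPEM, servingPEM []byte, dnsName string) error {
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		return errors.New("no CA certificate found in PEM")
	}
	block, _ := pem.Decode(servingPEM)
	if block == nil {
		return errors.New("no serving certificate found in PEM")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return err
	}
	_, err = cert.Verify(x509.VerifyOptions{Roots: roots, DNSName: dnsName})
	return err
}

func main() {
	ca, err := newCA("webhook-ca")
	if err != nil {
		panic(err)
	}
	serving, err := newServingCert(ca, []string{"operator.hypershift.svc"})
	if err != nil {
		panic(err)
	}
	fmt.Println("verify error:", verifyServing(ca.CertPEM, serving.CertPEM, "operator.hypershift.svc"))
}
```

The verification helper also shows why the CA and the serving cert have to be persisted as a matching pair: a serving cert checked against any other CA fails to verify.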

Which issue(s) this PR fixes:

Followup to OCPBUGS-84955

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

    • Operator can bootstrap webhook CA and serving certificates at startup when a certificate directory is configured; installer no longer embeds or propagates webhook CA/certs into rendered assets.
    • Webhook configurations no longer include an embedded CA bundle.
  • Bug Fixes

    • Webhook serving-certificate secret is now optional so the operator can start without pre-provisioned certs.
  • Tests

    • Added tests covering certificate bootstrapping, creation, regeneration, and existing-secret behaviors.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 19, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 19, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 19, 2026
@openshift-ci-robot

@clebs: This pull request references Jira Issue OCPBUGS-84955, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is MODIFIED instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai Bot commented May 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

The PR adds EnsureWebhookCerts to bootstrap webhook CA and serving TLS Secrets and invokes it at operator startup when a cert directory is configured. Installer-side generation and propagation of webhook CA bundles is removed: webhook asset structs drop CABundle and built webhook ClientConfig.CABundle values are nil. The manager-serving-cert volume is marked optional and installer tests are updated accordingly.

Sequence Diagram(s)

sequenceDiagram
  participant Operator as hypershift-operator (run)
  participant Ensure as webhookcerts.EnsureWebhookCerts
  participant KubeAPI as Kubernetes API (Secrets)
  Operator->>Ensure: call EnsureWebhookCerts(namespace, serviceName)
  Ensure->>KubeAPI: GET Secret ServingCertSecretName
  alt serving secret present and valid
    KubeAPI-->>Ensure: Secret with cert and key
    Ensure-->>Operator: return (no-op)
  else missing/invalid
    Ensure->>Ensure: GenerateInitialWebhookCerts()
    Ensure->>KubeAPI: CREATE/UPDATE CA Secret
    Ensure->>KubeAPI: CREATE/UPDATE serving TLS Secret
    KubeAPI-->>Ensure: created/updated Secrets
    Ensure-->>Operator: return (success)
  end
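
The flow in the diagram can be sketched in a few lines of Go. This is an illustration only: secretStore and memStore are hypothetical stand-ins for the Kubernetes client, generateFunc stands in for GenerateInitialWebhookCerts, and the function signature differs from the real EnsureWebhookCerts. The secret names and the tls.crt / tls.key keys are the ones mentioned elsewhere in this conversation; the ca.crt key in main is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	caSecretName      = "webhook-serving-ca"
	servingSecretName = "manager-serving-cert"
)

var errNotFound = errors.New("secret not found")

// secretStore is a hypothetical stand-in for the Kubernetes API client.
type secretStore interface {
	Get(name string) (map[string][]byte, error)
	Put(name string, data map[string][]byte) error
}

// memStore is an in-memory secretStore used for the example.
type memStore map[string]map[string][]byte

func (m memStore) Get(name string) (map[string][]byte, error) {
	data, ok := m[name]
	if !ok {
		return nil, errNotFound
	}
	return data, nil
}

func (m memStore) Put(name string, data map[string][]byte) error {
	m[name] = data
	return nil
}

// generateFunc returns the data for the CA secret and the serving secret.
type generateFunc func() (ca, serving map[string][]byte, err error)

// ensureWebhookCerts is a no-op when the serving secret already holds a cert
// and key; otherwise it generates and stores both secrets. It reports whether
// it bootstrapped new certs.
func ensureWebhookCerts(store secretStore, generate generateFunc) (bool, error) {
	serving, err := store.Get(servingSecretName)
	if err != nil && !errors.Is(err, errNotFound) {
		return false, fmt.Errorf("failed to get serving cert secret: %w", err)
	}
	if err == nil && len(serving["tls.crt"]) > 0 && len(serving["tls.key"]) > 0 {
		return false, nil // present and populated: nothing to do
	}
	ca, newServing, err := generate()
	if err != nil {
		return false, fmt.Errorf("failed to generate webhook certs: %w", err)
	}
	if err := store.Put(caSecretName, ca); err != nil {
		return false, fmt.Errorf("failed to store CA secret: %w", err)
	}
	if err := store.Put(servingSecretName, newServing); err != nil {
		return false, fmt.Errorf("failed to store serving cert secret: %w", err)
	}
	return true, nil
}

func main() {
	store := memStore{}
	generate := func() (map[string][]byte, map[string][]byte, error) {
		ca := map[string][]byte{"ca.crt": []byte("placeholder CA")}
		serving := map[string][]byte{"tls.crt": []byte("placeholder cert"), "tls.key": []byte("placeholder key")}
		return ca, serving, nil
	}
	for i := 0; i < 2; i++ {
		bootstrapped, err := ensureWebhookCerts(store, generate)
		fmt.Println("bootstrapped:", bootstrapped, "err:", err)
	}
}
```

The sketch ignores concurrency between replicas and always writes the CA and serving secrets as a freshly generated pair; how the real code handles an already-existing CA is discussed in the review comments further down.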

Suggested reviewers

  • muraee
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | TestEnsureWebhookCerts lacks assertion messages; check requires assertions with meaningful messages to diagnose failures, but none are present. | Add failure messages to all Expect assertions in TestEnsureWebhookCerts (e.g., g.Expect(err).ToNot(HaveOccurred(), "failed to ensure webhook certs")) to match install_test.go patterns. |
✅ Passed checks (10 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding bootstrap of serving certificates at hypershift operator startup, which is the primary objective across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the PR are stable, deterministic, and contain no dynamic content. They use clear, static descriptive strings without pod name suffixes, timestamps, UUIDs, node names, or variables. |
| Microshift Test Compatibility | ✅ Passed | This PR adds only standard Go unit tests (func Test*), not Ginkgo e2e tests. The check for Ginkgo e2e test compatibility is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR. The modified test files use Go's standard testing package (func TestXXX(t *testing.T)) for unit tests, not Ginkgo BDD syntax. Check not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR introduces no new scheduling constraints. Existing anti-affinity is preferred-only (topology-safe). Volume marked optional improves compatibility on SNO/Two-Node/HyperShift. |
| Ote Binary Stdout Contract | ✅ Passed | PR properly handles logging via zap (stderr by default) and error handling via fmt.Fprintf to os.Stderr in main(). New code in run() and EnsureWebhookCerts() uses structured logging, not stdout. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests added; PR contains only standard Go unit tests (testing.T pattern) without IPv4 assumptions or external connectivity requirements. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@openshift-ci openshift-ci Bot added do-not-merge/needs-area area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels May 19, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go`:
- Around line 402-410: The create+update path can race because when c.Create
returns AlreadyExists you then call c.Update with existingSecret that lacks the
current ResourceVersion; re-fetch the secret from the API server into
existingSecret (e.g., call c.Get(ctx, namespacedName, existingSecret)) inside
the apierrors.IsAlreadyExists branch, copy servingSecret.Data into the freshly
fetched existingSecret, and then call c.Update(ctx, existingSecret); if Update
returns a conflict, retry the get/copy/update a few times or return the conflict
so callers can retry—update the logic around c.Create,
apierrors.IsAlreadyExists, existingSecret, and c.Update accordingly.
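
The get/copy/update retry described above can be sketched against a toy in-memory store. Everything here is hypothetical (the store type, its error values, the createOrUpdate helper, and the integer resource version); it models optimistic concurrency in simplified form and does not reproduce the Kubernetes API server's exact update semantics or the repository's support/upsert helper.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errAlreadyExists = errors.New("already exists")
	errConflict      = errors.New("resource version conflict")
	errNotFound      = errors.New("not found")
)

// object is a toy resource carrying a version used for optimistic concurrency.
type object struct {
	ResourceVersion int
	Data            string
}

// store is a toy, single-goroutine stand-in for an API server.
type store struct{ objects map[string]object }

func (s *store) Create(name, data string) error {
	if _, ok := s.objects[name]; ok {
		return errAlreadyExists
	}
	s.objects[name] = object{ResourceVersion: 1, Data: data}
	return nil
}

func (s *store) Get(name string) (object, error) {
	obj, ok := s.objects[name]
	if !ok {
		return object{}, errNotFound
	}
	return obj, nil
}

// Update rejects writes that are not based on the current version.
func (s *store) Update(name string, obj object) error {
	current, ok := s.objects[name]
	if !ok {
		return errNotFound
	}
	if obj.ResourceVersion != current.ResourceVersion {
		return errConflict
	}
	obj.ResourceVersion++
	s.objects[name] = obj
	return nil
}

// createOrUpdate tries Create first. On AlreadyExists it re-fetches the
// object, copies the new data in, and updates, retrying on conflicts.
func createOrUpdate(s *store, name, data string, attempts int) error {
	err := s.Create(name, data)
	if err == nil || !errors.Is(err, errAlreadyExists) {
		return err
	}
	for i := 0; i < attempts; i++ {
		current, err := s.Get(name) // fresh copy with the current version
		if err != nil {
			return err
		}
		current.Data = data
		err = s.Update(name, current)
		if !errors.Is(err, errConflict) {
			return err // success (nil) or a non-retryable error
		}
	}
	return errConflict
}

func main() {
	s := &store{objects: map[string]object{}}
	fmt.Println(createOrUpdate(s, "manager-serving-cert", "v1", 3))
	fmt.Println(createOrUpdate(s, "manager-serving-cert", "v2", 3))
	fmt.Println(s.objects["manager-serving-cert"])
	// An update based on an outdated version is rejected by this toy store.
	fmt.Println(s.Update("manager-serving-cert", object{ResourceVersion: 1, Data: "stale"}))
}
```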

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d0f83117-6c21-45d7-98cf-e7eb54954d13

📥 Commits

Reviewing files that changed from the base of the PR and between cf2b91f and 0fea0fb.

📒 Files selected for processing (3)
  • cmd/install/assets/hypershift_operator.go
  • hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go
  • hypershift-operator/main.go

Comment thread hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go Outdated
@clebs
Member Author

clebs commented May 19, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 19, 2026
@openshift-ci-robot

@clebs: This pull request references Jira Issue OCPBUGS-84955, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@clebs
Member Author

clebs commented May 19, 2026

/cc @patjlm @muraee

/assign @enxebre

@openshift-ci openshift-ci Bot requested review from muraee and patjlm May 19, 2026 11:45
@codecov

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 56.45161% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.42%. Comparing base (36dfb1b) to head (e1993ba).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ontrollers/webhookcerts/webhookcerts_controller.go | 72.09% | 8 Missing and 4 partials ⚠️ |
| cmd/install/assets/hypershift_operator.go | 8.33% | 11 Missing ⚠️ |
| hypershift-operator/main.go | 0.00% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8545      +/-   ##
==========================================
+ Coverage   40.40%   40.42%   +0.01%     
==========================================
  Files         755      755              
  Lines       93235    93273      +38     
==========================================
+ Hits        37675    37706      +31     
- Misses      52858    52862       +4     
- Partials     2702     2705       +3     
| Files with missing lines | Coverage Δ |
| --- | --- |
| cmd/install/install.go | 62.25% <100.00%> (+0.47%) ⬆️ |
| hypershift-operator/main.go | 0.00% <0.00%> (ø) |
| cmd/install/assets/hypershift_operator.go | 48.10% <8.33%> (+<0.01%) ⬆️ |
| ...ontrollers/webhookcerts/webhookcerts_controller.go | 64.55% <72.09%> (+1.44%) ⬆️ |

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 34.45% <26.66%> (+0.01%) ⬆️ |
| cpo-hostedcontrolplane | 41.76% <ø> (ø) |
| cpo-other | 40.31% <ø> (ø) |
| hypershift-operator | 50.75% <65.95%> (+0.02%) ⬆️ |
| other | 31.54% <ø> (ø) |


return writeCertFiles(certDir, servingSecret.Data[corev1.TLSCertKey], servingSecret.Data[corev1.TLSPrivateKeyKey])
}

func writeCertFiles(certDir string, cert, key []byte) error {
Contributor

why do we need to write to disk as well?

Member Author

Because when the secrets don't exist yet, the volume mount has nothing to deliver. The pod started with Optional: true on the volume, so k8s mounts an empty dir. EnsureWebhookCerts creates the secrets via the API, but Kubernetes doesn't hot-reload secret volumes into an already-running container mid-startup. The webhook server needs certs on disk now to bind the TLS listener.

On the next pod restart, the secret exists and it works as it normally does, so EnsureWebhookCerts returns early (no-op path).

Contributor

@muraee muraee May 19, 2026

The controller-runtime certWatcher used by the webhook server should take care of reloading the certs when the secret exists or changes (it uses fsnotify).
Have you tested without this?
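
To make the reloading idea concrete, here is a standard-library-only sketch of serving certificates through a tls.Config.GetCertificate callback, so that a key pair appearing on disk later can be picked up without restarting the listener. It does not reproduce controller-runtime's certwatcher: the reloader type, the loadFunc type, and fromFiles are hypothetical names, and the trigger that calls Reload (fsnotify in the real watcher) is deliberately left out. The directory is the --cert-dir path quoted elsewhere in this conversation.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"sync"
)

// loadFunc returns the current key pair, or an error if it is not available.
type loadFunc func() (*tls.Certificate, error)

// reloader caches the most recently loaded certificate.
type reloader struct {
	mu   sync.RWMutex
	cert *tls.Certificate
	load loadFunc
}

// Reload calls load and, on success, swaps in the new certificate.
// On failure the previously cached certificate (if any) is kept.
func (r *reloader) Reload() error {
	cert, err := r.load()
	if err != nil {
		return err
	}
	r.mu.Lock()
	r.cert = cert
	r.mu.Unlock()
	return nil
}

// GetCertificate has the signature expected by tls.Config.GetCertificate.
func (r *reloader) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if r.cert == nil {
		return nil, errors.New("no serving certificate loaded yet")
	}
	return r.cert, nil
}

// fromFiles builds a loadFunc that reads a PEM key pair from disk.
func fromFiles(certPath, keyPath string) loadFunc {
	return func() (*tls.Certificate, error) {
		cert, err := tls.LoadX509KeyPair(certPath, keyPath)
		if err != nil {
			return nil, err
		}
		return &cert, nil
	}
}

func main() {
	dir := "/var/run/secrets/serving-cert"
	r := &reloader{load: fromFiles(dir+"/tls.crt", dir+"/tls.key")}
	if err := r.Reload(); err != nil {
		fmt.Println("certs not available yet:", err)
	}
	// The TLS config can be built before any certificate exists; handshakes
	// start succeeding once a later Reload finds the files.
	cfg := &tls.Config{GetCertificate: r.GetCertificate}
	_, err := cfg.GetCertificate(nil)
	fmt.Println("lookup error:", err)
}
```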

Member Author

I have not, I will test it out.

Member Author

@muraee I have removed the writing to disk from the bootstrapping as well as the bootstrapping in the CLI.
All installation paths work, PTAL!

@muraee
Contributor

muraee commented May 19, 2026

please also remove the code from the CLI generating those secrets.

// Generate self-managed webhook CA and serving cert when any webhook is enabled.
var webhookCABundle []byte
if opts.EnableDefaultingWebhook || opts.EnableConversionWebhook || opts.EnableValidatingWebhook || opts.EnableAuditLogPersistence {
	caSecret, servingSecret, caBundle, err := webhookcerts.GenerateInitialWebhookCerts(operatorNamespace.Name, assets.HypershiftOperatorName)
	if err != nil {
		return nil, nil, fmt.Errorf("failed to generate webhook certs: %w", err)
	}
	objects = append(objects, caSecret, servingSecret)
	webhookCABundle = caBundle
}

@clebs
Member Author

clebs commented May 19, 2026

@muraee I left the hypershift install path there to not disrupt the other use cases, but will remove it and run some tests.

@clebs clebs force-pushed the bootstrap-certs branch 2 times, most recently from 7da2d17 to d1b3a26 Compare May 20, 2026 10:10
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/install/assets/hypershift_operator.go (1)

856-865: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Use a writable volume for --cert-dir; Secret volumes are always read-only.

Marking manager-serving-cert as optional removes the hard dependency on the Secret existing at startup, but Kubernetes Secret-backed volumes remain read-only regardless of the optional flag. The operator cannot bootstrap-generate certs and write them to /var/run/secrets/serving-cert. Use a writable volume (e.g., emptyDir) for --cert-dir and mount the Secret separately if read-only cert access is needed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/install/assets/hypershift_operator.go` around lines 856 - 865, The
current code appends a Secret volume named "serving-cert" and sets --cert-dir to
/var/run/secrets/serving-cert, but Secret volumes are read-only so the operator
cannot write bootstrap certs there; change to create a writable volume (e.g., an
emptyDir) for the path passed to --cert-dir and, if the Secret must be provided
for initial certs, mount the Secret at a different read-only path (or copy from
the Secret mount into the emptyDir at startup). Update the logic that appends to
volumes (replace or add an emptyDir Volume with the same mount path) and keep
the args append of "--cert-dir=/var/run/secrets/serving-cert" pointing to the
writable mount, while mounting "manager-serving-cert" Secret elsewhere if
needed.
🧹 Nitpick comments (1)
hypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.go (1)

454-480: ⚡ Quick win

Add a case for “CA exists, serving secret invalid” to lock bootstrap behavior.

This is the highest-risk branch for cert consistency. Add a test ensuring serving cert regeneration uses/pairs correctly with the persisted CA.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.go`
around lines 454 - 480, Add a new test case to webhookcerts_controller_test.go
that sets up an existing CA secret (named CASecretName) with valid cert/key data
but an invalid/empty serving secret (ServingCertSecretName with empty Data or
missing TLS keys), then call EnsureWebhookCerts(t.Context(), cl, "hypershift",
"operator") and assert that: 1) the serving secret is regenerated
(updatedServingSecret.Data[corev1.TLSCertKey] and [corev1.TLSPrivateKeyKey] are
not empty), and 2) the CA secret remains the persisted CA
(updatedCASecret.Data[certs.CASignerCertMapKey] and [certs.CASignerKeyMapKey]
are unchanged from the original), ensuring EnsureWebhookCerts uses the
existing CA rather than re-bootstrapping it.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/install/install_test.go`:
- Around line 562-589: The subtest "When webhooks are enabled it should not
render webhook cert secrets" can pass vacuously because no sensitive inputs
(pull secret) are provided; update the test to reuse the temp pull-secret setup
used in the table-driven case and supply it to the Options passed to
RenderHyperShiftOperator (ensure RenderSensitive: true and include the pull
secret field used by Options), then after decoding documents assert that at
least one Secret exists whose name is not "webhook-serving-ca" or
"manager-serving-cert" (and fail if zero non-webhook Secrets are found) so the
test actually exercises sensitive-secret rendering.

In
`@hypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.go`:
- Around line 448-451: The test is asserting CA secret contents using TLS keys
(corev1.TLSCertKey/corev1.TLSPrivateKeyKey) which can be nil and hide
regressions; update the assertions on updatedCASecret (and compare to caSecret)
to use the CA signer data keys instead (e.g. "ca.crt" and "ca.key" or the
repository's CA signer constants if defined) so you compare the actual CA signer
cert and key rather than TLS keys.

In `@hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go`:
- Around line 390-406: The current flow calls GenerateInitialWebhookCerts to
produce caSecret and servingSecret and then proceeds even if c.Create(ctx,
caSecret) returns AlreadyExists, which risks persisting a serving cert generated
from a transient CA; change the logic so that when c.Create(ctx, caSecret)
returns an AlreadyExists error you do NOT apply the servingSecret created from
the transient CA. Instead, retrieve the persisted CA secret (e.g., get
existingSecret by name), regenerate the serving cert using the persisted CA (or
use the support/upsert helper to safely reconcile both secrets), and then upsert
the serving secret (use support/upsert to create or update servingSecret) so the
serving cert always matches the stored CA; update references around
GenerateInitialWebhookCerts, caSecret, servingSecret, existingSecret, c.Create
and c.Update accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3c7b830c-437f-4fea-a95f-fba702cfa9cc

📥 Commits

Reviewing files that changed from the base of the PR and between 0fea0fb and 7da2d17.

📒 Files selected for processing (6)
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/install.go
  • cmd/install/install_test.go
  • hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go
  • hypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.go
  • hypershift-operator/main.go

Comment thread cmd/install/install_test.go
Comment thread hypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.go Outdated
Comment thread hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go Outdated
@clebs clebs force-pushed the bootstrap-certs branch from d1b3a26 to dd1bc3d Compare May 20, 2026 11:26
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go (1)

375-376: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Re-fetch the serving Secret before the AlreadyExists update path.

EnsureWebhookCerts runs before leader election, and webhook-enabled installs default to two operator replicas. If one pod sees NotFound at Line 376 and another pod creates the Secret first, Line 428 reuses a zero-value existingSecret, so the Update goes out without the current resourceVersion and the loser can fail startup.

🔧 Minimal fix
 	if createErr := c.Create(ctx, servingSecret); createErr != nil {
 		if !apierrors.IsAlreadyExists(createErr) {
 			return fmt.Errorf("failed to create serving cert secret: %w", createErr)
 		}
+		existingSecret = &corev1.Secret{}
+		if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: ServingCertSecretName}, existingSecret); err != nil {
+			return fmt.Errorf("failed to get existing serving cert secret: %w", err)
+		}
+		existingSecret.Type = servingSecret.Type
 		existingSecret.Data = servingSecret.Data
 		if err := c.Update(ctx, existingSecret); err != nil {
 			return fmt.Errorf("failed to update serving cert secret: %w", err)
 		}
 	}

As per coding guidelines, “Use support/upsert/ for safe resource creation and updates”.

Also applies to: 424-430

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go`
around lines 375 - 376, EnsureWebhookCerts currently reads existingSecret once
and may reuse a zero-value existingSecret in the AlreadyExists update path;
re-fetch the serving Secret (use c.Get for ServingCertSecretName into
existingSecret) immediately before performing the Update in the AlreadyExists
branch so you have the latest resourceVersion, or switch to the support/upsert
helper to perform a safe create-or-update for ServingCertSecretName; update the
code paths that call c.Get/Update (references: EnsureWebhookCerts,
existingSecret, ServingCertSecretName, c.Get, Update) to always read the current
Secret before updating.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/install/install.go`:
- Around line 820-823: The install-time apply is dropping caBundle from rendered
CRD/webhook manifests (in setupCRDs) which can clear the live bundle when
applying; change setupCRDs (and the similar logic around the other render/apply
block at 949-957) so that when not in render-only/template mode it preserves the
live caBundle by reading the existing resource(s) before apply and copying the
existing .webhooks[*].clientConfig.caBundle (and any CRD conversion webhook
caBundle) into the manifest being applied, or skip removing caBundle entirely
for non-render installs; ensure this logic runs only for real applies (not
render-only) and reference setupCRDs and the apply/render code paths to
implement the fetch-and-merge behavior.

In
`@hypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.go`:
- Around line 454-480: Add a test case that covers the branch where the
persisted CA already exists but the serving cert is missing or empty: create a
pre-seeded CA secret (name CASecretName) with cert/key data
(certs.CASignerCertMapKey and certs.CASignerKeyMapKey), but either omit or
create an empty serving secret (name ServingCertSecretName), call
EnsureWebhookCerts with the fake client, and assert the function does not error
and that the regenerated serving secret contains TLS cert and key
(corev1.TLSCertKey, corev1.TLSPrivateKeyKey) whose chain/signature corresponds
to the existing CA; also assert the CA secret was not overwritten (its cert/key
remain unchanged) to exercise the Create(...CA...) -> AlreadyExists path in
EnsureWebhookCerts.
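
The fetch-and-merge behavior suggested above for cmd/install/install.go could look roughly like the following. This is a sketch of the suggestion only, not code from the PR: the webhook struct is a hypothetical, trimmed-down stand-in for a webhook entry with a clientConfig.caBundle, and preserveCABundles is an invented helper name.

```go
package main

import "fmt"

// webhook is a hypothetical, trimmed-down stand-in for a webhook entry.
type webhook struct {
	Name     string
	CABundle []byte
}

// preserveCABundles returns a copy of desired in which every webhook that has
// no caBundle takes the caBundle of the live webhook with the same name, if
// one exists. The desired slice itself is not modified.
func preserveCABundles(desired, live []webhook) []webhook {
	liveByName := make(map[string][]byte, len(live))
	for _, w := range live {
		liveByName[w.Name] = w.CABundle
	}
	merged := make([]webhook, len(desired))
	copy(merged, desired)
	for i := range merged {
		if len(merged[i].CABundle) > 0 {
			continue // an explicitly rendered bundle wins
		}
		if bundle, ok := liveByName[merged[i].Name]; ok {
			merged[i].CABundle = bundle
		}
	}
	return merged
}

func main() {
	desired := []webhook{{Name: "defaulting"}, {Name: "validating"}}
	live := []webhook{{Name: "defaulting", CABundle: []byte("live-ca")}}
	for _, w := range preserveCABundles(desired, live) {
		fmt.Printf("%s: %q\n", w.Name, w.CABundle)
	}
}
```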

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 13eb008a-01d6-4178-bea4-654aea103a10

📥 Commits

Reviewing files that changed from the base of the PR and between d1b3a26 and dd1bc3d.

📒 Files selected for processing (6)
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/install.go
  • cmd/install/install_test.go
  • hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go
  • hypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.go
  • hypershift-operator/main.go

Comment thread cmd/install/install.go
@clebs clebs force-pushed the bootstrap-certs branch from dd1bc3d to fe45965 May 20, 2026 12:15
@enxebre
Member

enxebre commented May 21, 2026

/lgtm

@clebs clebs marked this pull request as ready for review May 21, 2026 08:17
@openshift-ci openshift-ci Bot added the lgtm label and removed the do-not-merge/work-in-progress label May 21, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

clebs added 3 commits May 21, 2026 11:42
The webhook server requires TLS certs on disk before it can start
listening. Previously, certs were only created via `hypershift install`
manifests, meaning the operator would fail to start if the serving cert
secret was not yet present, as is the case when running `hypershift
install render` without the `--render-sensitive` flag.

This is used in a scenario where the rendered manifests are pushed to a
GitOps workflow and, therefore, secrets cannot be rendered.

By adding bootstrap cert generation at startup, the operator is self-sufficient:
if the secret exists, the volume mount delivers certs normally; if it is
missing or empty, certs are generated, persisted, and written to disk.
The secret volume is now marked optional so the pod can start without it.

Signed-off-by: Borja Clemente <bclement@redhat.com>

Test the new certs secret bootstrapping logic in the
hypershift-operator.

Signed-off-by: Borja Clemente <bclement@redhat.com>

Cert secret bootstrapping now lives in the hypershift_operator and the
CLI no longer needs to create the secrets.

Signed-off-by: Borja Clemente <bclement@redhat.com>
@clebs clebs force-pushed the bootstrap-certs branch from 5409d1f to e1993ba May 21, 2026 09:42
@openshift-ci openshift-ci Bot removed the lgtm label May 21, 2026
@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2057376357519724544 | Cost: $1.58 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@enxebre
Member

enxebre commented May 21, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm label May 21, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@cwbotbot

cwbotbot commented May 21, 2026

Test Results

e2e-aws

e2e-aks

@clebs
Member Author

clebs commented May 21, 2026

/retest

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2057513893957210112 | Cost: $2.21 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@clebs
Member Author

clebs commented May 21, 2026

/test e2e-aws-4-22 e2e-aks-4-22 e2e-aws

@clebs
Member Author

clebs commented May 22, 2026

/test e2e-aws

3 similar comments
@clebs
Member Author

clebs commented May 22, 2026

/test e2e-aws

@clebs
Member Author

clebs commented May 23, 2026

/test e2e-aws

@clebs
Member Author

clebs commented May 23, 2026

/test e2e-aws

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2058320635280494592 | Cost: $4.21 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-ci
Contributor

openshift-ci Bot commented May 24, 2026

@clebs: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws | e1993ba | link | true | /test e2e-aws |

Full PR test history. Your PR dashboard.


@hypershift-jira-solve-ci

I now have all the evidence needed. Here is the complete analysis:

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
  • Build ID: 2058320635280494592
  • Target: e2e-aws
  • PR: openshift/hypershift#8545, OCPBUGS-84955: bootstrap serving certs at hypershift operator startup
  • Test Results: 592 tests, 2 failures, 25 skipped (565 passed)

Test Failure Analysis

Error

TestCreateClusterRequestServingIsolation/ValidateHostedCluster (600.00s):
  Failed to wait for kubeconfig to be published for HostedCluster
  e2e-clusters-qmgnz/request-serving-isolation-v6xzh in 10m0s: context deadline exceeded

  KubeAPIServerAvailable=False: NotFound(Kube APIServer deployment not found)
  AWSEndpointAvailable=False: AWSError(cannot list security groups: operation error EC2:
    DescribeSecurityGroups, get identity: get credentials: failed to refresh cached
    credentials, failed to retrieve jwt from provide source, unable to read file at
    /var/run/secrets/openshift/serviceaccount/token: open
    /var/run/secrets/openshift/serviceaccount/token: no such file or directory)
  Degraded=True: UnavailableReplicas(capi-provider deployment has 1 unavailable replicas)
  Available=False: KubeconfigWaitingForCreate(Waiting for hosted control plane kubeconfig
    to be created)

Summary

The TestCreateClusterRequestServingIsolation test — which uniquely runs on a pre-existing shared management cluster (c5a50820c6-mgmt) rather than a freshly-provisioned local one — failed because the hosted cluster's control plane never became available within the 10-minute timeout. The kube-apiserver deployment was never created, the AWS endpoint service controller could not authenticate due to a missing projected service account token file (/var/run/secrets/openshift/serviceaccount/token), and the capi-provider pod was stuck pending. All 565 other tests (including 11 other TestCreateCluster* variants) passed, confirming the failure is isolated to the shared management cluster environment. This failure is unrelated to PR #8545, which modifies webhook certificate bootstrapping in the hypershift-operator — a different binary and deployment from the control-plane-operator where the failure occurred.

Root Cause

The failure chain on the shared management cluster (c5a50820c6-mgmt) is:

  1. Missing service account token projection: The control-plane-operator pod in the HCP namespace (e2e-clusters-qmgnz-request-serving-isolation-v6xzh) could not read the projected service account token at /var/run/secrets/openshift/serviceaccount/token. This token is needed for AWS IRSA (IAM Roles for Service Accounts) authentication. Without it, the awsendpointservice controller repeatedly failed (6+ occurrences logged) when calling EC2:DescribeSecurityGroups, preventing AWSEndpointAvailable from becoming True and blocking AWSDefaultSecurityGroupCreated.

  2. kube-apiserver never created: The kube-apiserver ControlPlaneComponent remained stuck in WaitingForDependencies: etcd despite etcd achieving quorum (EtcdAvailable=True: QuorumAvailable). The CPO continuously reconciled the kube-apiserver component but never progressed past dependency checking — the KAS deployment was never rendered or applied. At dump time, only 4 deployments existed in the HCP namespace (capi-provider, cluster-api, control-plane-operator, router) with no kube-apiserver deployment.

  3. Circular dependency with capi-provider: The capi-provider pod was stuck in Pending/PodInitializing because its availability-prober init container was waiting for the kube-apiserver (which never existed), creating a circular deadlock.

Why this is unrelated to PR #8545: The PR changes webhook cert bootstrapping in cmd/install/ and hypershift-operator/controllers/webhookcerts/ — these affect the hypershift-operator deployment, not the control-plane-operator where the failure occurs. The failure is in the CPO's AWS IRSA authentication and CPC dependency resolution on the shared management cluster. The 565 passing tests confirm the hypershift-operator (with the PR changes) functions correctly.

The most likely root cause is a transient issue with the shared management cluster's OIDC/token projection configuration, which would not affect the freshly-provisioned management clusters used by other tests. This is a known flaky test pattern for TestCreateClusterRequestServingIsolation due to its unique dependency on external management cluster infrastructure.

Recommendations
  1. Retry the CI job — The failure is transient and isolated to the shared management cluster. A /retest should succeed since 565/567 tests passed.
  2. Investigate the shared management cluster's OIDC/token projection — The missing /var/run/secrets/openshift/serviceaccount/token file suggests the pod's projected service account token volume was not properly configured or populated on c5a50820c6-mgmt. This could be a stale OIDC provider or expired signing key on the shared cluster.
  3. Investigate the kube-apiserver CPC dependency stall — The CPC reported WaitingForDependencies: etcd even after etcd was available (QuorumAvailable). This suggests a stale dependency evaluation or race condition in the CPC controller's dependency resolution that may be more likely to manifest on shared management clusters with higher load.
  4. Check historical flakiness — Compare recent runs of pull-ci-openshift-hypershift-main-e2e-aws to confirm whether TestCreateClusterRequestServingIsolation fails intermittently on the shared management cluster.

Evidence

| Evidence | Detail |
| --- | --- |
| Failed tests | TestCreateClusterRequestServingIsolation (995.73s), TestCreateClusterRequestServingIsolation/ValidateHostedCluster (600.00s / 10m timeout) |
| Passed tests | 565 of 567 tests passed, including TestCreateCluster, TestCreateClusterPrivate, TestCreateClusterCustomConfig, TestCreateClusterProxy, TestKarpenter, TestUpgradeControlPlane, and 9 other variants |
| Failing cluster | e2e-clusters-qmgnz/request-serving-isolation-v6xzh on shared management cluster c5a50820c6-mgmt |
| KubeAPIServerAvailable | False: NotFound(Kube APIServer deployment not found); KAS deployment never created |
| AWSEndpointAvailable | False: AWSError(cannot list security groups: ...unable to read file at /var/run/secrets/openshift/serviceaccount/token: no such file or directory) |
| Degraded | True: UnavailableReplicas(capi-provider deployment has 1 unavailable replicas); init container stuck waiting for KAS |
| EtcdAvailable | True: QuorumAvailable; etcd was healthy, confirming the dependency stall is the kube-apiserver CPC bug |
| Deployments in HCP namespace | Only 4: capi-provider, cluster-api, control-plane-operator, router (no kube-apiserver) |
| Token error count | 6 occurrences in build log, all from TestCreateClusterRequestServingIsolation only |
| PR scope | Changes cmd/install/, hypershift-operator/controllers/webhookcerts/, hypershift-operator/main.go (webhook cert bootstrapping in the hypershift-operator, not the control-plane-operator) |
| Unique test property | Only test using shared pre-existing management cluster via --e2e.management-cluster-name=c5a50820c6-mgmt |


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/cli: Indicates the PR includes changes for CLI
  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.


5 participants