Skip to content

Conversation

@tbuskey
Copy link
Contributor

@tbuskey tbuskey commented Nov 6, 2025

Pre PR of the whole mess to be redone properly. Think of it as a heads up.

This installs trustee onto the cluster and configures it.
Helped by cursor so lots of iterations.

sandboxed-containers-operator-trustee-install-commands.sh can run against an existing cluster to install/config trustee. It assumes sandboxed-containers-operator-env-cm-commands.sh in prow configured things for coco.

It should create a script that has the encrypted INITDATA for peer-pods-cm and will patch it. So it could be used on another cluster.

Changes sandboxed-containers-operator-create-prowjob-commands.sh to allow it to create openshift-sandboxed-containers-operator-devel__downstream-release.yaml and openshift-sandboxed-containers-operator-devel__downstream-candidate.yaml. We should use the script to update those files and not edit them directly.

sandboxed-containers-operator-create-prowjob-commands.sh needs more of the options for sandboxed-containers-operator-trustee-install-commands.sh added.

Readme files haven't been reviewed yet

@openshift-ci openshift-ci bot requested review from c3d and jensfr November 6, 2025 19:46
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tbuskey

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 6, 2025
@tbuskey
Copy link
Contributor Author

tbuskey commented Nov 6, 2025

/assign @beraldoleal
/assign @ldoktor
/assign @wainersm
/assign @vvoronko

@tbuskey
Copy link
Contributor Author

tbuskey commented Nov 6, 2025

/pj-rehearse periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release419-azure-ipi-peerpods

@openshift-ci-robot
Copy link
Contributor

@tbuskey: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@tbuskey
Copy link
Contributor Author

tbuskey commented Nov 10, 2025

/pj-rehearse periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release419-azure-ipi-peerpods

@openshift-ci-robot
Copy link
Contributor

@tbuskey: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@tbuskey
Copy link
Contributor Author

tbuskey commented Nov 10, 2025

/pj-rehearse periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate419-azure-ipi-peerpods

@openshift-ci-robot
Copy link
Contributor

@tbuskey: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@tbuskey
Copy link
Contributor Author

tbuskey commented Nov 10, 2025

/pj-rehearse periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release419-azure-ipi-peerpods

@openshift-ci-robot
Copy link
Contributor

@tbuskey: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@tbuskey
Copy link
Contributor Author

tbuskey commented Nov 10, 2025

/pj-rehearse periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate419-azure-ipi-peerpods

@openshift-ci-robot
Copy link
Contributor

@tbuskey: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

subfolder="trustee-fbc/"
# Trustee Catalog Configuration - get latest or use provided
if [[ -z "${TRUSTEE_CATALOG_TAG:-}" ]]; then
TRUSTEE_CATALOG_TAG=$(get_latest_catalog_tag "https://quay.io/api/v1/repository/redhat-user-workloads/ose-osc-tenant/trustee-test-fbc")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the regexp changed from "^trustee-fbc-${OCP_VER}-on-push-.*-build-image-index$" to "^[0-9]+\.[0-9]+(\.[0-9]+)?-[0-9]+$", is this fine? (it'd be easier to review if you made incremental changes, this doesn't seem related to the current PR still it's hard to extract what is and what isn't)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trustee's catalog is changed to match the format that OSC uses. So, same regexp for both now.


## Troubleshooting

If the step fails:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this section, we should add it to other steps as well :-)

echo "Waiting for subscription '${subscription_name}' to be ready (state: AtLatestKnown)..."
local subscription_ready=false

for i in $(seq 1 "$max_attempts"); do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer looping over a specific deadline than number of iterations, it seems more predictable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me too.
There is a deadline though. It sleeps T sec every time so the deadline is T * $max_attempts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with those deadlines is they are very rough and depend on whether things in between don't hang/fail/someone changes them, etc. (it should be at least sleep_seconds * max_attempts, but who knows what else adds and where someone forgets to put something, what interrupts the sleep and so on)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw I take this back as I recently learned the hard way that SECONDS is not monotonic. So this seems fine :-)


if [ "$subscription_ready" = false ]; then
echo "Warning: Subscription '${subscription_name}' did not reach AtLatestKnown state after $((max_attempts * sleep_seconds)) seconds"
echo "Please check the subscription status with: oc get subscription ${subscription_name} -n ${namespace} -o yaml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to print the oc get subscription as part of the script before exiting.


if [ -z "$csv_name" ]; then
echo "Warning: Could not get installedCSV from subscription '${subscription_name}'"
echo "Please check the subscription status with: oc get subscription ${subscription_name} -n ${namespace} -o yaml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same here, if we know something might help, we should IMO print it and have it as part of the logs.

for i in $(seq 1 "$max_attempts"); do
# Check if CSV (ClusterServiceVersion) is in Succeeded phase with InstallSucceeded reason
local csv_status=$(oc get csv "${csv_name}" -n "${namespace}" -o jsonpath='{.status.phase}{.status.reason}' 2>/dev/null || echo "")
if [[ "$csv_status" == "SucceededInstallSucceeded" ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check the reason? What if it was installed before and we only updated to the latest?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if you decide to keep the check, would you mind splitting them at least by space?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same check we have in the test automation.
We needed to check the reason and phase.
The jsonpath mashes them together.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but it's possible to put there a space in between, something like csv_status=$(oc get csv "${csv_name}" -n "${namespace}" -o jsonpath='{.status.phase} {.status.reason}' (haven't tested it now, but used it before, either like this or with extra quotation)

if [ "$installed" = false ]; then
echo "Warning: CSV '${csv_name}' did not finish after $((max_attempts * sleep_seconds)) seconds"
echo "Please check the subscription status with: oc get subscription ${subscription_name} -n ${namespace}"
echo "And check the CSV status with: oc get csv ${csv_name} -n ${namespace} -o yaml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please attempt to get&print the useful debug info before exiting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree!

# Create OperatorGroup if it doesn't exist
if ! resource_exists "operatorgroup" "trustee-operator-group" "${operator_namespace}"; then
echo "Creating OperatorGroup 'trustee-operator-group'..."
cat > trustee-operatorgroup.yaml << EOF
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we likely to use this file later? If not than I'd suggest using cat | oc apply -f - << EOF instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... everywhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.
It can be useful for debugging, but we delete later & it should be in the cluster if we really needed it

# Create the secret with public key
oc create secret generic kbs-auth-public-key --from-file=publicKey -n trustee-operator-system

echo "Created kbs-auth-public-key secret"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the key files later? It'd be good practice to delete them after creating the secret otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do get deleted at the end, but deleting them ASAP is probably better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it was production code, I'd insist on shred directly afterwards. Since it's CI we should at least use rm and ideally here

default allow = false
EOF
# Add allow rules if provided
if [ ${#allow_rules[@]} -gt 0 ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the if is needed here, if the allow_rules is empty, it will simply iterate through nothing and proceed.


echo "Creating attestation token secret..."
if resource_exists "secret" "attestation-token"; then
echo "Secret 'attestation-token' already exists"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many places where if exists; echo we're using existing.... The question is, will it work? What I mean is for some tokens we can get the existing secrets, but for some we rely on our generated values. If these exist in before, will the following steps be able to get to the trustee? (not talking about this particular steps, but about all the other where we accept existing value, my main concern is whether we should allow that or fail the workflow in such case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to rethink that too

"ghcr.io/confidential-containers/test-container-image-rs|sigstoreSigned|kbs:///default/cosign-public-key/test"

if resource_exists "secret" "security-policy"; then
echo "Secret 'security-policy' already exists"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example here, do we want to proceed when this exists? Do we want to use it or overwrite it with our expected secret? Will the testing work with any pre-existing setting?

else
echo "Warning: kbs-https-certificate secret not found, using placeholder certificate"
KBS_CERT="-----BEGIN CERTIFICATE-----
MIICertificatePlaceholder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this cert gona work? Is it usable?

fi

# Convert initdata.toml to base64 for INITDATA
INITDATA_STRING=$(cat initdata.toml | gzip| base64 -w0 )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than magically named files created by called functions I'd prefer calling a function and redirect it to file which I'm going to use there:

    create_initdata_config "${TRUSTEE_URL}" "${KBS_CERT}" > initdata.toml
...
INITDATA_STRING=$(cat initdata.toml | gzip| base64 -w0 )

What do you think? (I'm afraid people might forget where this came from and this seems more explicit to me, anyway not a strong opinion)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like magic names either. Its always useful to be able to define it

echo ""
echo "Next steps:"
echo "1. Trustee operator subscription is automatically handled by this script"
echo " - To use a different catalog: TRUSTEE_CATALOG_SOURCE_NAME='my-catalog' ./trustee_configure.sh"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does the trustee_configure.sh comes from? I'm missing some context here (probably obvious but I haven't deployed trustee yet)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created the standalone script in a different directory and tested. Then moved & converted it into a prow step.


# trustee install
# Installs and configures the Trustee Operator for Confidential Containers
# Can run standalone or as part of sandboxed-containers-operator-pre chain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please link the documentation based on which this file is generated? Either here or in the steps documentation so it's easier found and verified (also to find more info in case things malfunction). I think it'd be nice to include the actual version so people know what it's based on and eventually we might update it once in a while.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled in the htmls from https://docs.redhat.com/en/documentation/openshift_sandboxed_containers/1.10/ (which is trustee 0.4.2) and had cursor use it to generate the script. Then I iterated w/ cursor to add things to open up for trustee.

Trustee 1.0 changes things & there is an internal docs PR with the changes. So I need to revisit the script.

I'm definitely going to break out the trustee bits from this PR!

Copy link
Contributor

@ldoktor ldoktor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, what a beast. Except of the obvious splitting the unrelated changes I left a few comments here and there. It'd be nice to include the documentation link to know what it's based on, then it'd be nice to print debug output in case of failures so we get at least something to start with and a few little things here and there.

I have not executed this yet so functionally-wise I can't tell much, hopefully tomorrow I'll get to that (is it ready, is it suppose to work? Is kubernetes enough or do I need full OCP?)

@openshift-ci-robot
Copy link
Contributor

[REHEARSALNOTIFIER]
@tbuskey: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate419-aws-ipi-peerpods N/A periodic Periodic changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release419-azure-ipi-peerpods N/A periodic Periodic changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release419-aws-ipi-coco N/A periodic Periodic changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release419-azure-ipi-coco N/A periodic Periodic changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate-aws-ipi-peerpods N/A periodic Registry content changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate419-azure-ipi-coco N/A periodic Periodic changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate-aws-ipi-coco N/A periodic Ci-operator config changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release419-aws-ipi-peerpods N/A periodic Periodic changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release-azure-ipi-coco N/A periodic Ci-operator config changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate-azure-ipi-kata N/A periodic Registry content changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release-azure-ipi-kata N/A periodic Registry content changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release-azure-ipi-peerpods N/A periodic Registry content changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate-azure-ipi-coco N/A periodic Ci-operator config changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release-aws-ipi-coco N/A periodic Ci-operator config changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release419-azure-ipi-kata N/A periodic Periodic changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate419-aws-ipi-coco N/A periodic Periodic changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate-azure-ipi-peerpods N/A periodic Registry content changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate419-azure-ipi-kata N/A periodic Periodic changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release-aws-ipi-peerpods N/A periodic Registry content changed
periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate419-azure-ipi-peerpods N/A periodic Periodic changed
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 11, 2025

@tbuskey: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/rehearse/periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate419-azure-ipi-peerpods d2d4a73 link unknown /pj-rehearse periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-candidate419-azure-ipi-peerpods
ci/rehearse/periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release419-azure-ipi-peerpods d2d4a73 link unknown /pj-rehearse periodic-ci-openshift-sandboxed-containers-operator-devel-downstream-release419-azure-ipi-peerpods
ci/prow/step-registry-shellcheck 3c1a6df link true /test step-registry-shellcheck

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tbuskey tbuskey closed this Nov 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants