[WIP] MCO-1669: add BootImageSkewEnforcement API #2357

djoshy · 2025-06-05T16:50:28Z

Based on discussions from openshift/enhancements#1761:
Workflow for pre-release, skew enforcement is not active

flowchart TD
    A[Operator sync loop] --> B{Is the skew API spec/status set?}   
    B --> |No|C{Can the MCO manage boot images for this cluster?}
    B --> |Yes|E[Done]   
    C --> |No|D[Set Upgradeable=False to force cluster admin opinion]
    C --> |Yes|F[Set skew API status to Automatic]
    D --> E
    F --> E
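
For reviewers who prefer code to diagrams, the pre-release flow above can be read as one small decision function. This is only a sketch under assumptions: `clusterState`, `action` and `preReleaseAction` are hypothetical names made up for illustration and are not part of the MCO or of this PR.

```go
package main

import "fmt"

// clusterState is a hypothetical summary of what the operator sync loop observes.
type clusterState struct {
	SkewAPISet          bool // the skew API spec/status is already populated
	CanManageBootImages bool // the MCO can manage boot images for this cluster
}

// action is the hypothetical result of one pass through the flowchart.
type action string

const (
	actionDone                action = "Done"
	actionSetStatusAutomatic  action = "SetStatusAutomatic"
	actionSetUpgradeableFalse action = "SetUpgradeableFalse"
)

// preReleaseAction mirrors the branches B and C of the flowchart.
func preReleaseAction(s clusterState) action {
	switch {
	case s.SkewAPISet:
		// B: Yes. Nothing to do.
		return actionDone
	case s.CanManageBootImages:
		// C: Yes. Default the status to Automatic.
		return actionSetStatusAutomatic
	default:
		// C: No. Force the cluster admin to state an opinion.
		return actionSetUpgradeableFalse
	}
}

func main() {
	fmt.Println(preReleaseAction(clusterState{CanManageBootImages: true}))
}
```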

Workflow for release n, skew enforcement is active

flowchart TD
    A[Sync loop for skew mechanism. Triggered on: skew API knob updates, boot image update conditions] --> V{Determine skew API mode: Automatic, Manual or None?}   
    V --> |None|L[Raise a low level Prometheus alert to indicate scaling risk]
    V --> |Manual|I 
    V --> |Automatic|E{Is the boot image controller disabled or in an error mode?}
    E --> |No| H[Wait until boot image controller is not progressing]
    E --> |Yes| G[Throw an error to cluster admin]
    G --> K
    H --> I[Is the current skew compliant against the limit defined in the release image?]   
    I --> |Yes| Done
    I --> |No| K[MCO sets Upgradeable=False] 
    K --> Done
    L --> Done
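
The release n flow can be sketched the same way. Again, this is a minimal illustration under assumptions, not the operator's code: `skewInput`, `skewOutcome` and `evaluateSkew` are hypothetical names, and the `UnknownMode` outcome is an addition that does not appear in the flowchart.

```go
package main

import "fmt"

type skewMode string

const (
	modeAutomatic skewMode = "Automatic"
	modeManual    skewMode = "Manual"
	modeNone      skewMode = "None"
)

// skewInput is a hypothetical snapshot of the state the skew sync loop reads.
type skewInput struct {
	Mode                        skewMode
	ControllerDisabledOrErrored bool // boot image controller disabled or in an error mode
	ControllerProgressing       bool // boot image controller still reconciling
	SkewCompliant               bool // current skew is within the release image limit
}

type skewOutcome string

const (
	outcomeDone             skewOutcome = "Done"
	outcomeAlert            skewOutcome = "RaiseLowLevelAlert"
	outcomeWait             skewOutcome = "WaitForBootImageController"
	outcomeErrorAndBlock    skewOutcome = "ReportErrorAndSetUpgradeableFalse"
	outcomeUpgradeableFalse skewOutcome = "SetUpgradeableFalse"
	outcomeUnknownMode      skewOutcome = "UnknownMode" // assumption: not in the flowchart
)

// evaluateSkew follows the flowchart nodes V, E, H, I and K.
func evaluateSkew(in skewInput) skewOutcome {
	switch in.Mode {
	case modeNone:
		// V: None. Only raise a low level alert.
		return outcomeAlert
	case modeAutomatic:
		if in.ControllerDisabledOrErrored {
			// E: Yes. Error to the admin (G), then Upgradeable=False (K).
			return outcomeErrorAndBlock
		}
		if in.ControllerProgressing {
			// H: wait until the controller has settled.
			return outcomeWait
		}
	case modeManual:
		// V: Manual. Go straight to the compliance check (I).
	default:
		return outcomeUnknownMode
	}
	// I: compare the current skew against the limit.
	if in.SkewCompliant {
		return outcomeDone
	}
	return outcomeUpgradeableFalse
}

func main() {
	fmt.Println(evaluateSkew(skewInput{Mode: modeManual, SkewCompliant: false}))
}
```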

openshift-ci-robot · 2025-06-05T16:50:32Z

@djoshy: This pull request references MCO-1669 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

WIP boot image enforcement API, based on discussions from openshift/enhancements#1761

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2025-06-05T16:50:33Z

Hello @djoshy! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

openshift-ci · 2025-06-05T16:51:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: djoshy
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

operator/v1/types_machineconfiguration.go

djoshy · 2025-07-14T20:21:59Z

Thanks for the questions & review (sorry it took a while!). This should be ready for another look. Happy to hop on a call if that is easier.

Update: Did another push to fix up some tests.

everettraven

Leaving another handful of comments.

I'm also happy to hop on a call if you think it would be beneficial.

operator/v1/types_machineconfiguration.go

everettraven · 2025-07-15T18:22:21Z

operator/v1/types_machineconfiguration.go

+	// clusterBootImage describes the current boot image of the cluster. This will be used to enforce the skew limit.
+	// This value will be compared against the cluster's skew limit to determine skew compliance.
+	// Required when mode is set to "Automatic" or "Manual" and forbidden otherwise.
+	// +optional
+	ClusterBootImage *ClusterBootImage `json:"clusterBootImage,omitempty"`


Usually, discriminated union members will follow the name of their mode counterpart. i.e:

mode: Automatic
automatic: ...

or

mode: Manual
manual: ...

Another thing I'm curious about now that I've got a bit more context - why do you want to require the clusterBootImage when set to Automatic?

Presumably, if the MCO is able to determine the cluster boot image by itself should it just do it and perform the skew handling automatically?

If a user were to explicitly set Automatic, I imagine they are wanting to have MCO handle all of that and that they likely don't have the cluster boot image information on hand. Whereas if they set Manual they are explicitly stating they want to manually manage that information and I would expect them to have it on hand.

This is a good question. The determination of the boot image version isn't straightforward and varies wildly per platform. There currently isn't a single source of truth for the admin or the controllers to use in the cluster, so I thought this would be a good way to represent that information in the API. Perhaps for the Automatic case, clusterBootImage makes more sense as a status-only field, but for Manual we could have it in both Spec and Status?

Updated with this Spec/Status shape, but Automatic only specifies a version in the status. So I'm envisioning something like the following examples.

On a cluster that defaults into Automatic mode (no admin opinion):

spec: ..
status:
  bootImageSkewEnforcementStatus:
    mode: Automatic
    automatic:
      ocpVersion: "4.18.2"
      rhcosVersion: "9.6.20250523-1"

On a cluster that defaults into Manual mode (no admin opinion):

spec: ..
status:
  bootImageSkewEnforcementStatus:
    mode: Manual
    manual:
      ocpVersion: "4.18.2"

On a cluster that an admin explicitly sets to Manual, and performs updates:

spec:
  bootImageSkewEnforcement:
    mode: Manual
    manual:
      ocpVersion: "4.18.2"
      rhcosVersion: "9.6.20250523-1"
status:
  bootImageSkewEnforcementStatus:
    mode: Manual
    manual:
      ocpVersion: "4.18.2"
      rhcosVersion: "9.6.20250523-1"

On a cluster that an admin disables this feature:

spec:
  bootImageSkewEnforcement:
    mode: None
status:
  bootImageSkewEnforcementStatus:
    mode: None

On a cluster that an admin explicitly sets to Automatic:

spec:
  bootImageSkewEnforcement:
    mode: Automatic
status:
  bootImageSkewEnforcementStatus:
    mode: Automatic
    automatic:
      ocpVersion: "4.18.2"
      rhcosVersion: "9.6.20250523-1"

For this last case, I'm not entirely convinced if it needs to be supported. The user having the power to go to "Manual" and "None" via an explicit value makes sense.

Hmm, I guess a workflow to consider for this would be a user going from Manual/None to Automatic mode; would deleting spec.bootImageSkewEnforcement be good UX in that case? If the MCO is able to automatically determine that the cluster can perform skew management in a hands-off fashion, it would default the status to Automatic (if the spec is empty). Or would it be better to have an explicit Automatic setting in the spec?
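
To make the status shape in the examples above concrete, here is a rough Go sketch of a discriminated union where the member name follows the mode. The type and field names (`skewEnforcementStatus`, `bootImageVersions`, `validate`) are hypothetical and chosen only for this illustration; the real types live in operator/v1/types_machineconfiguration.go and may differ, and in the CRD the union rules would be enforced by schema validation rather than by a Go method.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type skewMode string

const (
	modeAutomatic skewMode = "Automatic"
	modeManual    skewMode = "Manual"
	modeNone      skewMode = "None"
)

// bootImageVersions is a hypothetical holder for the two version fields
// used in the YAML examples.
type bootImageVersions struct {
	OCPVersion   string `json:"ocpVersion,omitempty"`
	RHCOSVersion string `json:"rhcosVersion,omitempty"`
}

// skewEnforcementStatus is a hypothetical status union: mode is the
// discriminator and each member is named after its mode.
type skewEnforcementStatus struct {
	Mode      skewMode           `json:"mode"`
	Automatic *bootImageVersions `json:"automatic,omitempty"`
	Manual    *bootImageVersions `json:"manual,omitempty"`
}

// validate checks that exactly the member matching the mode is set.
func (s skewEnforcementStatus) validate() error {
	switch s.Mode {
	case modeAutomatic:
		if s.Automatic == nil || s.Manual != nil {
			return errors.New("mode Automatic requires automatic and forbids manual")
		}
	case modeManual:
		if s.Manual == nil || s.Automatic != nil {
			return errors.New("mode Manual requires manual and forbids automatic")
		}
	case modeNone:
		if s.Automatic != nil || s.Manual != nil {
			return errors.New("mode None forbids automatic and manual")
		}
	default:
		return fmt.Errorf("unknown mode %q", s.Mode)
	}
	return nil
}

func main() {
	st := skewEnforcementStatus{
		Mode:      modeAutomatic,
		Automatic: &bootImageVersions{OCPVersion: "4.18.2", RHCOSVersion: "9.6.20250523-1"},
	}
	if err := st.validate(); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	out, err := json.MarshalIndent(st, "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```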

operator/v1/types_machineconfiguration.go

everettraven

A few minor comments, but other than these I think this looks good

operator/v1/types_machineconfiguration.go

everettraven · 2025-07-22T12:24:34Z

operator/v1/types_machineconfiguration.go

+	// The default for mode is Automatic for clusters that support automatic boot image updates and
+	// Manual for clusters that do not support automatic boot image updates.


How do I know whether or not my cluster supports automatic boot image updates?

On the cluster, you'll be able to see this by checking the MachineConfiguration CR. When spec.managedBootImages is undefined, we interpret that as no opinion from the admin, and status.managedBootImagesStatus will reflect the default boot image configuration (platform dependent, as I mentioned earlier). managedBootImagesStatus may:

contain a MachineManager set to "All". This is what I refer to as automatic/opted-in.

contain a MachineManager set to "None". This is a platform that does support boot image updates, but not by default. The admin has to opt in by defining a machineManager in spec.managedBootImages. We may move this to default in a later release once we have gained enough confidence in the platform.

be undefined. This means a platform that we have not yet explored for boot image updates. The admin is prevented from adding to spec.managedBootImages for these cases via a validating admission policy.

Other than that, we hope to socialize new platforms we add support/default for via documentation, KB articles and such.
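
As a rough illustration of the three cases described above, the check could look something like the following. This is a simplified sketch: it looks at a single selection value rather than the real managedBootImagesStatus structure, and `classifySelection` and the `bootImageSupport` constants are hypothetical names. Any value other than "All" or "None" is reported as `Other`, which is an assumption of this sketch and is not covered by the description above.

```go
package main

import "fmt"

type bootImageSupport string

const (
	supportOptedIn        bootImageSupport = "OptedIn"        // "All": automatic/opted-in
	supportOptInAvailable bootImageSupport = "OptInAvailable" // "None": supported, not default
	supportUnexplored     bootImageSupport = "Unexplored"     // undefined: platform not yet explored
	supportOther          bootImageSupport = "Other"          // assumption: anything else
)

// classifySelection interprets a hypothetical, simplified view of
// status.managedBootImagesStatus: nil means the status is undefined.
func classifySelection(selection *string) bootImageSupport {
	if selection == nil {
		return supportUnexplored
	}
	switch *selection {
	case "All":
		return supportOptedIn
	case "None":
		return supportOptInAvailable
	default:
		return supportOther
	}
}

func main() {
	all := "All"
	fmt.Println(classifySelection(&all))
	fmt.Println(classifySelection(nil))
}
```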

I'm not sure I understand the UX fully here, so it would be helpful if you can walk me through this again. At some point a user will upgrade to a version with this skew enforcement API in place. Let's break them down into a few categories:

I'm on cloud and I just want automatic to be on -> in this case, assuming they haven't fiddled with the default-on bootimage management fields, they should not need to do anything. The spec will be empty and the status will be set by the MCO to automatic (since we detect they are in a managed case, right?)

I'm on cloud but with custom bootimages -> in this case, I guess the MCO doesn't know what to do with the existing image since it's not in any stream, so we default to None or just have no status? What happens now, do we alert the user and/or prevent scaling?

I'm on-prem with manual bootimage -> in this case, I would not have been able to set these fields until they were turned on, so I would need to upgrade first, then set manual here, and until that happens, the status is empty?

No dumb questions, these are great! Before I answer, I want to explain how I think the MCO would detect the current cluster boot image (I'm open to other ideas/strategies here!):

If boot image updates are enabled and all machine resources have been successfully reconciled, we could use the current cluster version and the RHCOS version from the coreos-bootimages configmap to determine the current boot image.

If boot image updates are only partially enabled or completely disabled, and machine resources can't be inspected to determine the boot image version, we would have to assume the boot image to be the cluster's installation version from the version history. For most platforms, we will very likely have to use the latter method.
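
A minimal sketch of that two-path detection, assuming the relevant facts have already been gathered into a plain struct. All names here (`detectInput`, `detectedBootImage`, `detectBootImage`, the `Source` strings) are hypothetical; how the MCO actually reads the configmap and version history is not shown.

```go
package main

import "fmt"

// detectInput is a hypothetical bundle of the facts the two strategies need.
type detectInput struct {
	UpdatesFullyEnabled   bool   // boot image updates enabled for all machine resources
	AllReconciled         bool   // every machine resource reconciled successfully
	ClusterVersion        string // current cluster version
	ConfigMapRHCOSVersion string // RHCOS version from the coreos-bootimages configmap
	InstallVersion        string // installation version from the version history
}

// detectedBootImage is a hypothetical result; Source records which path was used.
type detectedBootImage struct {
	OCPVersion   string
	RHCOSVersion string
	Source       string
}

func detectBootImage(in detectInput) detectedBootImage {
	if in.UpdatesFullyEnabled && in.AllReconciled {
		// First path: trust the current release's boot image data.
		return detectedBootImage{
			OCPVersion:   in.ClusterVersion,
			RHCOSVersion: in.ConfigMapRHCOSVersion,
			Source:       "reconciled",
		}
	}
	// Second path: fall back to the installation version; no RHCOS
	// version is known in this case.
	return detectedBootImage{OCPVersion: in.InstallVersion, Source: "install-version"}
}

func main() {
	got := detectBootImage(detectInput{
		ClusterVersion: "4.18.2",
		InstallVersion: "4.12.0",
	})
	fmt.Printf("%+v\n", got)
}
```

The fallback path returning only an OCP version lines up with the earlier example where a cluster that defaults into Manual mode reports ocpVersion without rhcosVersion.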

Now to your questions:

Yes, this is correct.

For custom boot images in the cloud platforms, the MCO currently (silently) skips over updating them. We have a couple of paths here:

Once we've covered all the marketplace/manageable scenarios for cloud platforms, we can add a degrade/error when the MCO encounters a custom boot image during an update. I expect most of these custom boot image users to disable boot image updates to fix the error. When we do enable skew enforcement (SE) in this case, the cluster would immediately be considered out of skew, as the current boot image can't be estimated. If they leave the error in place and do not disable boot image updates, SE should also consider a cluster in that error mode to be out of skew.

If we choose not to have an error mode for the custom boot image case, we will still want to add a method to indicate to the MCO that machine resources have been skipped over and not reconciled. With that in place, when skew enforcement is enabled, the MCO would be able to determine that the cluster is out of skew.

In both of these paths, the admin would be expected to manually switch the SE spec to Manual (and indicate their boot image) or to None, depending on their scaling needs. Since this case involves a cloud env that likely does support automatic boot image updates, the SE status would initially be set to Automatic by the MCO, along with, as stated earlier, the detected cluster boot image.

For any case where the MCO cannot automatically manage boot images, the MCO will default to Manual mode in the SE status, along with the cluster's boot image. The admin would be expected to manually switch the SE spec to Manual (and indicate their boot image) or to None, depending on their scaling needs.

Ok, so we will try to set some value in the status ourselves based on our interpretation of the cluster, and error if we cannot. The user can set a spec as they wish to override it (so in the second example, if I used some unfound bootimage, I could still manually set spec to automatic to allow the MCO to go ahead and ignore the error, is that correct?)

For Automatic clusters, most cases would need no intervention; except if the admin has manually set a custom boot image. I don't think this is very common(cc @yuqi-zhang in case I'm wrong here)

Based on our discussions, I think my understanding is that if the user has set a custom bootimage, our mechanisms will not find it in the streams (regular, managed (ARO/ROSA), marketplace (OPP, OKE, OCP)) so we would end up defaulting to manual and require user intervention, right? Or are you saying that all cloud platforms would default to automatic, and if the user somehow has set their own value, we'd instead just error and let the user sort it out?

There are a few things we discussed in this thread, but to recap some of the main open points:

spec vs. status and defaulting - we will only default to "automatic" or "manual" if the user doesn't input, and the contentious point is that by setting it to "manual" by default, we'd need immediate user intervention

the mechanism of alerting the user of actions/interventions - whether we set upgradeable to false, or some other mechanism

Is that accurate?

and the contentious point is that by setting it to "manual" by default, we'd need immediate user intervention

I think the contentious point is that this seems like a setting that admins need to be explicitly aware of, and need to make a decision on, if we cannot automatically determine the state this should be put into. Otherwise, they may not be able to upgrade and/or scale as they are used to (is my understanding of this correct?).

What are the criteria by which the MCO will determine whether a cluster is automatic or manual? Was there a decision tree in the EP that covers this?

Platform - Some platforms are never automatic?

Whether the existing boot images are known? Someone using a custom boot image doesn't get updates?

???

Do I need a managedBootImages configuration for the cluster to be automatic?

What would happen if an admin set this to Automatic on a cluster that didn't support automatic updates? Is that theoretically possible?

Thinking on the update flow.

If we can determine that the cluster is in a suitable condition to be adopted for automatic updates, I think defaulting status to automatic makes sense.

If we cannot determine that the cluster is suitable for automatic updates, there's a mention in the thread that we will just use the version from the cluster image. How does that work? How can we be confident that this image is correct? Maybe I'm missing what you mean there?

If we cannot automatically ascertain the version figures, how would we populate the manual spec field with either the RHCOS or OCP versions? Wouldn't we need the admin to make an explicit choice?

I've updated the description workflows based on our discussion, PTAL to confirm we're all on the same page. I think we also settled on "Automatic" not being settable from the spec side, but I wanted to double check that before updating the API.

everettraven · 2025-07-22T12:25:52Z

operator/v1/tests/machineconfigurations.operator.openshift.io/BootImageSkewEnforcement.yaml

Add some tests for the status field as well?

Done, PTAL when you have a sec!

yuqi-zhang

A couple of maybe dumb questions:

yuqi-zhang · 2025-07-28T21:22:23Z

operator/v1/types_machineconfiguration.go

+	// The default for mode is Automatic for clusters that support automatic boot image updates and
+	// Manual for clusters that do not support automatic boot image updates.


I'm not sure I understand the UX fully here, so it would be helpful if you can walk me through this again. At some point a user will upgrade to a version with this skew enforcement API in place. Let's break them down into a few categories:

I'm on cloud and I just want automatic to be on -> in this case, assuming they haven't fiddled with the default-on bootimage management fields, they should not need to do anything. The spec will be empty and the status will be set by the MCO to automatic (since we detect they are in a managed case, right?)

I'm on cloud but with custom bootimages -> in this case, I guess the MCO doesn't know what to do with the existing image since it's not in any stream, so we default to None or just have no status? What happens now, do we alert the user and/or prevent scaling?

I'm on-prem with manual bootimage -> in this case, I would not have been able to set these fields until they were turned on, so I would need to upgrade first, then set manual here, and until that happens, the status is empty?

yuqi-zhang · 2025-07-28T21:25:51Z

operator/v1/types_machineconfiguration.go

+	// +kubebuilder:validation:MaxLength:=21
+	// +kubebuilder:validation:MinLength:=14
+	// +optional
+	RHCOSVersion string `json:"rhcosVersion,omitempty"`


Just curious, what scenarios do you expect the user to explicitly set this? The one I thought of was: I'm not sure what OCP version my RHCOS corresponds to, or I have some type of custom bootimage.

In that case, wouldn't it be more helpful to have one of OCPVersion or RHCOSVersion be required, and if one is set, you don't have to set the other? But you can set both if you want more strict version checking? (is that supported via API validation?)

I think this is a valid use case, and it should be possible as a validation at the parent level. I'll try updating it to do this.
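
To spell out the rule being proposed (at least one of the two versions must be set, both allowed), here is a plain Go sketch of the check. In the CRD itself this would more likely be expressed as a validation rule on the parent struct, for example through a `+kubebuilder:validation:XValidation` marker, rather than as Go code. `manualVersions` and `validateManual` are hypothetical names; the 14 to 21 length bounds are taken from the MinLength/MaxLength markers in the diff above.

```go
package main

import (
	"errors"
	"fmt"
)

// manualVersions is a hypothetical stand-in for the Manual union member.
type manualVersions struct {
	OCPVersion   string
	RHCOSVersion string
}

// validateManual enforces "at least one of ocpVersion or rhcosVersion" and,
// when rhcosVersion is present, the length bounds from the diff.
func validateManual(v manualVersions) error {
	if v.OCPVersion == "" && v.RHCOSVersion == "" {
		return errors.New("at least one of ocpVersion or rhcosVersion must be set")
	}
	if v.RHCOSVersion != "" {
		// len counts bytes; for the ASCII version strings used here that
		// matches the character count.
		if n := len(v.RHCOSVersion); n < 14 || n > 21 {
			return fmt.Errorf("rhcosVersion must be 14 to 21 characters long, got %d", n)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateManual(manualVersions{OCPVersion: "4.18.2"}))
	fmt.Println(validateManual(manualVersions{}))
}
```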

JoelSpeed · 2025-08-06T14:23:32Z

operator/v1/types_machineconfiguration.go

+	// The default for mode is Automatic for clusters that support automatic boot image updates and
+	// Manual for clusters that do not support automatic boot image updates.


What are the criteria by which the MCO will determine whether a cluster is automatic or manual? Was there a decision tree in the EP that covers this?

Platform - Some platforms are never automatic?

Whether the existing boot images are known? Someone using a custom boot image doesn't get updates?

???

Do I need a managedBootImages configuration for the cluster to be automatic?

What would happen if an admin set this to Automatic on a cluster that didn't support automatic updates? Is that theoretically possible?

JoelSpeed · 2025-08-06T14:26:54Z

operator/v1/types_machineconfiguration.go

+	// None means that the MCO will no longer monitor the boot image skew. This may affect
+	// the cluster's ability to scale.


I think we can expand on this a little, probably useful to explain that this means the cluster has no way to understand the compatibility between X and Y, where X and Y are the boot image and the ignition/pivot?

Ack, will do!

JoelSpeed · 2025-08-06T14:32:27Z

operator/v1/types_machineconfiguration.go

+	// The default for mode is Automatic for clusters that support automatic boot image updates and
+	// Manual for clusters that do not support automatic boot image updates.


Thinking on the update flow.

If we can determine that the cluster is in a suitable condition to be adopted for automatic updates, I think defaulting status to automatic makes sense.

If we cannot determine that the cluster is suitable for automatic updates, there's a mention in the thread that we will just use the version from the cluster image. How does that work? How can we be confident that this image is correct? Maybe I'm missing what you mean there?

If we cannot automatically ascertain the version figures, how would we populate the manual spec field with either the RHCOS or OCP versions? Wouldn't we need the admin to make an explicit choice?

JoelSpeed · 2025-08-06T14:35:03Z

operator/v1/types_machineconfiguration.go

@@ -55,8 +53,123 @@ type MachineConfigurationSpec struct {
 	// has no effect on cluster upgrades which will still incur node disruption where required.
 	// +optional
 	NodeDisruptionPolicy NodeDisruptionPolicyConfig `json:"nodeDisruptionPolicy"`
+
+	// bootImageSkewEnforcement is an optional field that can be used to configure how version skew is


Is there a relationship between this new field and the managed boot images that we need to enforce?

Hmm, the only case I can think of is enforcing that boot image updates are enabled when the mode is Automatic; but if we are making Automatic a status-only enum, I'm not sure it's necessary.

In platforms where we do not support boot image updates via the MCO, VAPs are in place to prevent setting the managed boot images field based on the infra object.

Thinking a bit more...Manual/None while having ManagedBootImages set to update all machine resources might be a bit strange to do, perhaps we can guard against that?

I think it would be strange yes, are there use cases though? In particular I could imagine a None being set with automatic, we shouldn't need image skew enforcement if we are on automatic as the auto will fix it for us, and therefore someone may want to turn the warnings off

So I sketched out a table to figure out the supported combinations:

Boot Images	Skew Enforcement	Result
All	Auto	Good
All	Manual	Error
All	None	Good
All	Empty	Good
Partial	Auto	Error
Partial	Manual	Error?
Partial	None	Good
Partial	Empty	Good
None	Auto	Error
None	Manual	Good
None	None	Good
None	Empty	Good
Empty	Auto	Error
Empty	Manual	Good
Empty	None	Good
Empty	Empty	Good

(This only applies to clusters that support boot image updates via the MCO, the other platforms would not permit editing of the ManagedBootImages field via the VAP)
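
For discussion purposes, the table can be written down as a lookup so that each combination is explicit. This is only a transcription of the sketch above, not a proposed implementation; `combo`, `combinationResults` and `lookupCombination` are hypothetical names, and the undecided Partial/Manual row is kept as "Error?".

```go
package main

import "fmt"

// combo pairs a ManagedBootImages setting with a skew enforcement mode,
// using the shorthand labels from the table.
type combo struct {
	BootImages string // All, Partial, None or Empty
	Skew       string // Auto, Manual, None or Empty
}

// combinationResults transcribes the table row by row.
var combinationResults = map[combo]string{
	{"All", "Auto"}: "Good", {"All", "Manual"}: "Error",
	{"All", "None"}: "Good", {"All", "Empty"}: "Good",
	{"Partial", "Auto"}: "Error", {"Partial", "Manual"}: "Error?",
	{"Partial", "None"}: "Good", {"Partial", "Empty"}: "Good",
	{"None", "Auto"}: "Error", {"None", "Manual"}: "Good",
	{"None", "None"}: "Good", {"None", "Empty"}: "Good",
	{"Empty", "Auto"}: "Error", {"Empty", "Manual"}: "Good",
	{"Empty", "None"}: "Good", {"Empty", "Empty"}: "Good",
}

// lookupCombination returns the table result and whether the pair is listed.
func lookupCombination(bootImages, skew string) (string, bool) {
	r, ok := combinationResults[combo{BootImages: bootImages, Skew: skew}]
	return r, ok
}

func main() {
	r, ok := lookupCombination("Partial", "Auto")
	fmt.Println(r, ok)
}
```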

In particular I could imagine a None being set with automatic, we shouldn't need image skew enforcement if we are on automatic as the auto will fix it for us

Hmm, are you suggesting that Automatic should imply that the boot image controller disregards the values in the ManagedBootImages knob and behaves as if it should update all resources? 🤔

openshift-ci-robot · 2025-08-07T15:07:32Z

@djoshy: This pull request references MCO-1669 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

Based on discussions from openshift/enhancements#1761
Proposed workflow:

flowchart TD
   A[Sync loop for skew mechanism. Triggered on: skew API knob updates, boot image update conditions] --> U{Is Skew API knob defined for this cluster?}    
   U --> |Yes| V{Determine skew API mode: Automatic, Manual or None?}
   V --> |None|J
   V --> |Manual|I 
   V --> |Automatic|E
   U --> |No| B{Can the MCO manage boot images for this cluster?}
   B --> |Yes| C[Skew API knob set to automatic]
   B --> |No| D[Skew API knob set to manual, cluster boot image estimated from cluster defaults]
   C --> E{Is the boot image controller disabled or in an error mode?}
   E --> |No| H[Wait until boot image controller is not progressing]
   E --> |Yes| G[Throw an error to cluster admin]
   G --> Done
   D --> I
   H --> I[Is skew compliant against the current release image?]   
   I --> |No| K[Set CO upgradeable=false] 
   I --> |Yes| Done
   K --> J[Disable scaling via MCS rejects & RHCOS templates]
   J --> Done


openshift-ci-robot · 2025-08-07T15:13:22Z

@djoshy: This pull request references MCO-1669 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

Based on discussions from openshift/enhancements#1761
Proposed workflow:

flowchart TD
   A[Sync loop for skew mechanism. Triggered on: skew API knob updates, boot image update conditions] --> U{Is Skew API knob defined for this cluster?}    
   U --> |Yes| V{Determine skew API mode: Automatic, Manual or None?}
   V --> |None|J
   V --> |Manual|I 
   V --> |Automatic|E
   U --> |No| B{Can the MCO manage boot images for this cluster?}
   B --> |Yes| C[Skew API knob set to automatic]
   B --> |No| D[Skew API knob set to manual, cluster boot image estimated from cluster version]
   C --> E{Is the boot image controller disabled or in an error mode?}
   E --> |No| H[Wait until boot image controller is not progressing]
   E --> |Yes| G[Throw an error to cluster admin]
   G --> Done
   D --> I
   H --> I[Is skew compliant against the current release image?]   
   I --> |No| K[Set CO upgradeable=false] 
   I --> |Yes| Done
   K --> J[Disable scaling via MCS rejects & RHCOS templates]
   J --> Done


openshift-ci-robot · 2025-08-07T18:57:50Z

@djoshy: This pull request references MCO-1669 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

Based on discussions from openshift/enhancements#1761:
Workflow for release (n-1)

flowchart TD
   A[Operator sync loop] --> B{Is the skew API spec/status set?}   
   B --> |No|C{Can the MCO manage boot images for this cluster?}
   B --> |Yes|E[Done]   
   C --> |No|D[Set Upgradeable=False to warn cluster admin]
   C --> |Yes|F[Set skew API status to Automatic]
   D --> E
   F --> E

Workflow for release n

flowchart TD
   A[Sync loop for skew mechanism. Triggered on: skew API knob updates, boot image update conditions] --> V{Determine skew API mode: Automatic, Manual or None?}   
   V --> |None|L[Raise a low level Prometheus alert to indicate scaling risk]
   V --> |Manual|I 
   V --> |Automatic|E{Is the boot image controller disabled or in an error mode?}
   E --> |No| H[Wait until boot image controller is not progressing]
   E --> |Yes| G[Throw an error to cluster admin]
   G --> K
   H --> I[Is the current skew compliant against the limit defined in the release image?]   
   I --> |Yes| Done
   I --> |No| K[MCO degrades the cluster] 
   K --> Done
   L --> Done


djoshy · 2025-08-15T19:11:25Z

Last push includes:

removed Automatic enum value from the spec
updated the go doc to account for our recent discussions about Manual mode, and how the user is required to set a value to allow upgrades for certain cases.
accounted for new linter rules

It seems to be failing NoNewRequiredFields in verify, which must be a recent change. Is it okay to not have a required tag for a union discriminator like Mode?

openshift-ci · 2025-08-15T21:54:53Z

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-serial-techpreview-2of2	`8181cc5`	link	true	`/test e2e-aws-serial-techpreview-2of2`
ci/prow/e2e-aws-serial-techpreview-1of2	`8181cc5`	link	true	`/test e2e-aws-serial-techpreview-1of2`
ci/prow/e2e-aws-ovn-hypershift-conformance	`8181cc5`	link	true	`/test e2e-aws-ovn-hypershift-conformance`
ci/prow/e2e-aws-ovn	`8181cc5`	link	true	`/test e2e-aws-ovn`
ci/prow/e2e-aws-ovn-hypershift	`8181cc5`	link	true	`/test e2e-aws-ovn-hypershift`
ci/prow/verify-crd-schema	`8181cc5`	link	true	`/test verify-crd-schema`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

everettraven · 2025-08-20T19:55:10Z

@djoshy The NoNewRequiredFields isn't new from what I recall, but it is actually a bit buggy and picks up required children fields in a new optional parent field - which is OK. That appears to be what is happening in this case so we can override the check when we are ready for this to merge (overriding it now won't do any good as other things will likely merge in the meantime and re-trigger testing).

everettraven

Took another look at this and from what I recall this aligns with what was discussed synchronously. I'm going to circle back around to see if there are any outstanding comments left from Joel's previous review while I was out, but otherwise this direction looks good to me.

djoshy · 2025-08-20T20:09:43Z

Sounds good, thanks! In the last API office hours, Joel had asked me to take a shot at figuring out cross validations between the ManagedBootImages & SkewEnforcement fields. I've drafted a validation table and linked it in the API office hours doc so we can talk through the scenarios in the next meeting. It did get a teeny bit complicated when you take spec/status into account 😅

everettraven · 2025-08-21T17:55:44Z

Sounds good! I took a look at that validation table, but let's use the next office hours to discuss it and what each state means for an end user before moving forward with this as is. Thanks for letting me know about that table!

openshift-ci-robot added the jira/valid-reference label Jun 5, 2025

openshift-ci bot added the do-not-merge/work-in-progress and size/XXL labels Jun 5, 2025

openshift-ci bot requested review from deads2k and everettraven June 5, 2025 16:51

djoshy mentioned this pull request Jun 5, 2025

MCO-1504: Update bootimage management enhancement openshift/enhancements#1761 (Merged)

everettraven reviewed Jun 17, 2025

djoshy force-pushed the skew-enforcement branch from eee6809 to 54938cf July 14, 2025 20:21

djoshy force-pushed the skew-enforcement branch from 54938cf to 0344f1e July 15, 2025 14:58

everettraven reviewed Jul 15, 2025

djoshy force-pushed the skew-enforcement branch from 0344f1e to 5816e99 July 17, 2025 18:46

everettraven reviewed Jul 22, 2025

djoshy force-pushed the skew-enforcement branch from 5816e99 to ac57873 July 23, 2025 20:26

yuqi-zhang reviewed Jul 28, 2025

djoshy force-pushed the skew-enforcement branch from ac57873 to 66a9ee8 July 30, 2025 13:04

JoelSpeed reviewed Aug 6, 2025

openshift-merge-robot added the needs-rebase label Aug 11, 2025

djoshy force-pushed the skew-enforcement branch from 66a9ee8 to ced817a August 15, 2025 18:02

openshift-merge-robot removed the needs-rebase label Aug 15, 2025

machine_config: add BootImageSkewEnforcement API

8181cc5

djoshy force-pushed the skew-enforcement branch from ced817a to 8181cc5 August 15, 2025 19:06

everettraven reviewed Aug 20, 2025

		// The default for mode is Automatic for clusters that support automatic boot image updates and
		// Manual for clusters that do not support automatic boot image updates.

		// None means that the MCO will no longer monitor the boot image skew. This may affect
		// the cluster's ability to scale.
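
The defaulting rule in the comments above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the type name, constant names, and the `DefaultMode` helper are all hypothetical, and how the MCO actually determines whether a cluster supports automatic boot image updates is not shown here.

```go
package main

// SkewEnforcementMode is a hypothetical enum for the three modes
// mentioned in the comments: Automatic, Manual and None.
type SkewEnforcementMode string

const (
	ModeAutomatic SkewEnforcementMode = "Automatic"
	ModeManual    SkewEnforcementMode = "Manual"
	// ModeNone means the boot image skew is no longer monitored,
	// which may affect the cluster's ability to scale.
	ModeNone SkewEnforcementMode = "None"
)

// DefaultMode returns the default described in the comments: Automatic
// when the cluster supports automatic boot image updates, Manual otherwise.
// The boolean input is an assumption made for this sketch.
func DefaultMode(supportsAutomaticUpdates bool) SkewEnforcementMode {
	if supportsAutomaticUpdates {
		return ModeAutomatic
	}
	return ModeManual
}
```

Note that None is never chosen by default in this sketch; it would only apply when an admin selects it explicitly.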

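| Boot Images | Skew Enforcement | Result |
| --- | --- | --- |
| All | Auto | Good |
| All | Manual | Error |
| All | None | Good |
| All | Empty | Good |
| Partial | Auto | Error |
| Partial | Manual | Error? |
| Partial | None | Good |
| Partial | Empty | Good |
| None | Auto | Error |
| None | Manual | Good |
| None | None | Good |
| None | Empty | Good |
| Empty | Auto | Error |
| Empty | Manual | Good |
| Empty | None | Good |
| Empty | Empty | Good |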
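
One possible way to encode the table above as a cross-validation check is sketched below. All names are hypothetical and chosen only to mirror the table's column values; this is not the validation implemented in this PR. The "Error?" entry for Partial/Manual is still an open question in the table, and the sketch treats it as an error purely as an assumption.

```go
package main

import "fmt"

// BootImages and SkewMode are hypothetical stand-ins for the two
// columns of the validation table.
type BootImages string
type SkewMode string

const (
	BootAll     BootImages = "All"
	BootPartial BootImages = "Partial"
	BootNone    BootImages = "None"
	BootEmpty   BootImages = "Empty"

	SkewAuto   SkewMode = "Auto"
	SkewManual SkewMode = "Manual"
	SkewNone   SkewMode = "None"
	SkewEmpty  SkewMode = "Empty"
)

// ValidateCombination returns an error for the combinations the table
// marks as "Error" and nil for the ones marked "Good".
func ValidateCombination(b BootImages, s SkewMode) error {
	switch s {
	case SkewAuto:
		// Auto is only "Good" when all boot images are managed.
		if b != BootAll {
			return fmt.Errorf("skew enforcement %q requires boot images %q, got %q", s, BootAll, b)
		}
	case SkewManual:
		// All/Manual is "Error". Partial/Manual is "Error?" in the table;
		// rejecting it here is an assumption of this sketch.
		if b == BootAll || b == BootPartial {
			return fmt.Errorf("skew enforcement %q is not allowed with boot images %q", s, b)
		}
	}
	// None and Empty skew enforcement are "Good" for every row.
	return nil
}
```

Read this way, the table reduces to two rules: Auto needs fully managed boot images, and Manual conflicts with managed boot images.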
