djoshy
Contributor

@djoshy djoshy commented Jun 5, 2025

Based on discussions from openshift/enhancements#1761:
Workflow for pre-release, skew enforcement is not active

flowchart TD
    A[Operator sync loop] --> B{Is the skew API spec/status set?}   
    B --> |No|C{Can the MCO manage boot images for this cluster?}
    B --> |Yes|E[Done]   
    C --> |No|D[Set Upgradeable=False to force cluster admin opinion]
    C --> |Yes|F[Set skew API status to Automatic]
    D --> E
    F --> E
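
Read as code, the pre-release flow above might look like the following minimal Go sketch. The `SkewMode` type, `syncResult`, `preReleaseSync`, and its inputs are hypothetical names chosen for illustration; they are not taken from the MCO code base.

```go
package main

import "fmt"

// SkewMode is a hypothetical stand-in for the skew enforcement mode enum.
type SkewMode string

const (
	ModeUnset     SkewMode = ""
	ModeAutomatic SkewMode = "Automatic"
)

// syncResult is a hypothetical summary of what one operator sync decided.
type syncResult struct {
	StatusMode  SkewMode // mode to record in status (unchanged if already set)
	Upgradeable bool     // false means this check blocks upgrades
}

// preReleaseSync follows the flowchart: leave an existing opinion alone,
// default to Automatic when the MCO can manage boot images, and otherwise
// set Upgradeable=False to force an admin opinion.
func preReleaseSync(specMode, statusMode SkewMode, canManageBootImages bool) syncResult {
	if specMode != ModeUnset || statusMode != ModeUnset {
		return syncResult{StatusMode: statusMode, Upgradeable: true}
	}
	if canManageBootImages {
		return syncResult{StatusMode: ModeAutomatic, Upgradeable: true}
	}
	return syncResult{StatusMode: ModeUnset, Upgradeable: false}
}

func main() {
	fmt.Printf("%+v\n", preReleaseSync(ModeUnset, ModeUnset, true))
	fmt.Printf("%+v\n", preReleaseSync(ModeUnset, ModeUnset, false))
}
```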

Workflow for release n, skew enforcement is active

flowchart TD
    A[Sync loop for skew mechanism. Triggered on: skew API knob updates, boot image update conditions] --> V{Determine skew API mode: Automatic, Manual or None?}   
    V --> |None|L[Raise a low level Prometheus alert to indicate scaling risk]
    V --> |Manual|I 
    V --> |Automatic|E{Is the boot image controller disabled or in an error mode?}
    E --> |No| H[Wait until boot image controller is not progressing]
    E --> |Yes| G[Throw an error to cluster admin]
    G --> K
    H --> I[Is the current skew compliant against the limit defined in the release image?]   
    I --> |Yes| Done
    I --> |No| K[MCO sets Upgradeable=False] 
    K --> Done
    L --> Done
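
The enforcement flow above can be sketched as a single decision function. This is only an illustration under stated assumptions: `Action`, `clusterState`, and `enforceSkew` are hypothetical names, and the handling of an unknown mode is not covered by the flowchart.

```go
package main

import "fmt"

// SkewMode is a hypothetical stand-in for the skew enforcement mode enum.
type SkewMode string

const (
	ModeAutomatic SkewMode = "Automatic"
	ModeManual    SkewMode = "Manual"
	ModeNone      SkewMode = "None"
)

// Action is a hypothetical label for the outcome of one sync pass.
type Action string

const (
	ActionDone                Action = "Done"
	ActionRaiseAlert          Action = "RaiseAlert"
	ActionWait                Action = "WaitForBootImageController"
	ActionErrorAndBlock       Action = "ErrorAndSetUpgradeableFalse"
	ActionSetUpgradeableFalse Action = "SetUpgradeableFalse"
)

// clusterState holds the (hypothetical) inputs the flowchart branches on.
type clusterState struct {
	ControllerDisabledOrErrored bool
	ControllerProgressing       bool
	SkewCompliant               bool
}

// enforceSkew mirrors the flowchart branch by branch.
func enforceSkew(mode SkewMode, s clusterState) Action {
	switch mode {
	case ModeNone:
		return ActionRaiseAlert
	case ModeAutomatic:
		if s.ControllerDisabledOrErrored {
			return ActionErrorAndBlock // G --> K in the flowchart
		}
		if s.ControllerProgressing {
			return ActionWait
		}
	case ModeManual:
		// Manual goes straight to the compliance check below.
	default:
		// Assumption: an unknown mode is ignored; the flowchart does not cover it.
		return ActionDone
	}
	if s.SkewCompliant {
		return ActionDone
	}
	return ActionSetUpgradeableFalse
}

func main() {
	fmt.Println(enforceSkew(ModeManual, clusterState{SkewCompliant: false}))
	fmt.Println(enforceSkew(ModeAutomatic, clusterState{ControllerProgressing: true}))
}
```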

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 5, 2025
openshift-ci-robot commented Jun 5, 2025

@djoshy: This pull request references MCO-1669 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

WIP boot image enforcement API, based on discussions from openshift/enhancements#1761

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot commented Jun 5, 2025

Hello @djoshy! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 5, 2025
@openshift-ci openshift-ci bot requested review from deads2k and everettraven June 5, 2025 16:51

openshift-ci bot commented Jun 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: djoshy
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@djoshy djoshy force-pushed the skew-enforcement branch from eee6809 to 54938cf Compare July 14, 2025 20:21

djoshy commented Jul 14, 2025

Thanks for the questions & review (sorry it took a while!); this should be ready for another look. Happy to hop on a call if that is easier.

Update: Did another push to fix up some tests.

@djoshy djoshy force-pushed the skew-enforcement branch from 54938cf to 0344f1e Compare July 15, 2025 14:58
@everettraven everettraven left a comment

Leaving another handful of comments.

I'm also happy to hop on a call if you think it would be beneficial.

Comment on lines 81 to 85
// clusterBootImage describes the current boot image of the cluster. This will be used to enforce the skew limit.
// This value will be compared against the cluster's skew limit to determine skew compliance.
// Required when mode is set to "Automatic" or "Manual" and forbidden otherwise.
// +optional
ClusterBootImage *ClusterBootImage `json:"clusterBootImage,omitempty"`
Contributor

Usually, discriminated union members will follow the name of their mode counterpart. i.e:

mode: Automatic
automatic:
  ...

or

mode: Manual
manual:
  ...

Another thing I'm curious about now that I've got a bit more context - why do you want to require the clusterBootImage when set to Automatic?

Presumably, if the MCO is able to determine the cluster boot image by itself, shouldn't it just do so and perform the skew handling automatically?

If a user were to explicitly set Automatic, I imagine they are wanting to have MCO handle all of that and that they likely don't have the cluster boot image information on hand. Whereas if they set Manual they are explicitly stating they want to manually manage that information and I would expect them to have it on hand.
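
To make the naming convention concrete, here is a minimal Go sketch of a discriminated union whose members follow their mode names. The type names, the shared `ClusterBootImage` payload, and the `Validate` helper are hypothetical, and the sketch deliberately leaves open whether a member is required for a given mode; it only rejects a member set under the wrong mode.

```go
package main

import (
	"errors"
	"fmt"
)

// SkewMode is a hypothetical discriminator type.
type SkewMode string

const (
	ModeAutomatic SkewMode = "Automatic"
	ModeManual    SkewMode = "Manual"
	ModeNone      SkewMode = "None"
)

// ClusterBootImage is a hypothetical payload shared by both members.
type ClusterBootImage struct {
	OCPVersion   string `json:"ocpVersion,omitempty"`
	RHCOSVersion string `json:"rhcosVersion,omitempty"`
}

// BootImageSkewEnforcement is a hypothetical union: each member is named
// after the mode value that selects it.
type BootImageSkewEnforcement struct {
	Mode      SkewMode          `json:"mode"`
	Automatic *ClusterBootImage `json:"automatic,omitempty"`
	Manual    *ClusterBootImage `json:"manual,omitempty"`
}

// Validate rejects a member that is set while a different mode is selected.
func (b BootImageSkewEnforcement) Validate() error {
	if b.Automatic != nil && b.Mode != ModeAutomatic {
		return errors.New("automatic may only be set when mode is Automatic")
	}
	if b.Manual != nil && b.Mode != ModeManual {
		return errors.New("manual may only be set when mode is Manual")
	}
	return nil
}

func main() {
	ok := BootImageSkewEnforcement{Mode: ModeManual, Manual: &ClusterBootImage{OCPVersion: "4.18.2"}}
	bad := BootImageSkewEnforcement{Mode: ModeNone, Manual: &ClusterBootImage{OCPVersion: "4.18.2"}}
	fmt.Println(ok.Validate(), bad.Validate())
}
```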

Contributor Author

This is a good question. The determination of boot image version isn't straightforward and varies wildly per platform. There currently isn't a single source of truth in the cluster for the admin or the controllers to use. So I thought this would be a good way to represent that information in the API. Perhaps for the Automatic case, clusterBootImage makes more sense as a Status-only field, but for Manual we could have it in both Spec and Status?

Contributor Author

Updated with this Spec/Status shape, but Automatic only specifies a version in the status. So I'm envisioning something like the following examples.

On a cluster that defaults into Automatic mode (no admin opinion):

        spec:
        ..
        status:
          bootImageSkewEnforcementStatus:
            mode: Automatic
            automatic:
              ocpVersion: "4.18.2"
              rhcosVersion: "9.6.20250523-1"

On a cluster that defaults into Manual mode (no admin opinion):

        spec:
        ..
        status:
          bootImageSkewEnforcementStatus:
            mode: Manual
            manual:
              ocpVersion: "4.18.2"

On a cluster that an admin explicitly sets to Manual, and performs updates:

        spec:
          bootImageSkewEnforcement:
            mode: Manual
            manual:
              ocpVersion: "4.18.2"
              rhcosVersion: "9.6.20250523-1"
        status:
          bootImageSkewEnforcementStatus:
            mode: Manual
            manual:
              ocpVersion: "4.18.2"
              rhcosVersion: "9.6.20250523-1"

On a cluster that an admin disables this feature:

        spec:
          bootImageSkewEnforcement:
            mode: None
        status:
          bootImageSkewEnforcementStatus:
            mode: None

On a cluster that an admin explicitly sets to Automatic:

        spec:
          bootImageSkewEnforcement:
            mode: Automatic
        status:
          bootImageSkewEnforcementStatus:
            mode: Automatic
            automatic:
              ocpVersion: "4.18.2"
              rhcosVersion: "9.6.20250523-1"

For this last case, I'm not entirely convinced that it needs to be supported. The user having the power to go to "Manual" and "None" via an explicit value makes sense.

Hmm, I guess a workflow to consider for this would be a user going from Manual/None to Automatic mode; would deleting spec.bootImageSkewEnforcement be good UX in that case? If the MCO is able to automatically determine that the cluster can perform skew management in a hands-off fashion, it would default the status to Automatic (if spec is empty). Or would it be better to have an explicit Automatic setting in the spec?
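
The defaulting implied by the examples in this comment can be sketched as follows. It is a minimal illustration of the proposal as it stands at this point in the thread; `effectiveMode` and the `canManageBootImages` input are hypothetical names.

```go
package main

import "fmt"

// SkewMode is a hypothetical stand-in for the skew enforcement mode enum.
type SkewMode string

const (
	ModeUnset     SkewMode = ""
	ModeAutomatic SkewMode = "Automatic"
	ModeManual    SkewMode = "Manual"
	ModeNone      SkewMode = "None"
)

// effectiveMode returns the mode the status would report: an explicit spec
// value wins; an empty spec defaults to Automatic when the MCO can manage
// boot images for the cluster and to Manual otherwise.
func effectiveMode(specMode SkewMode, canManageBootImages bool) SkewMode {
	if specMode != ModeUnset {
		return specMode
	}
	if canManageBootImages {
		return ModeAutomatic
	}
	return ModeManual
}

func main() {
	fmt.Println(effectiveMode(ModeUnset, true))  // no admin opinion, manageable
	fmt.Println(effectiveMode(ModeUnset, false)) // no admin opinion, not manageable
	fmt.Println(effectiveMode(ModeNone, true))   // admin disabled the feature
}
```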

@djoshy djoshy force-pushed the skew-enforcement branch from 0344f1e to 5816e99 Compare July 17, 2025 18:46
@everettraven everettraven left a comment

A few minor comments, but other than these I think this looks good

Comment on lines 61 to 62
// The default for mode is Automatic for clusters that support automatic boot image updates and
// Manual for clusters that do not support automatic boot image updates.
Contributor

How do I know whether or not my cluster supports automatic boot image updates?

Contributor Author

On the cluster, you'll be able to see this by checking the MachineConfiguration CR. When spec.managedBootImages is undefined, we interpret that as no opinion from the admin, and status.managedBootImagesStatus will reflect the default boot image configuration (platform dependent, as I mentioned earlier). managedBootImagesStatus may:

  • contain a MachineManager set to "All". This is what I refer to as automatic/opted-in.
  • contain a MachineManager set to "None". This is a platform that does support boot image updates, but not by default. The admin has to opt in by defining a machineManager in spec.managedBootImages. We may make this the default in a later release once we have gained enough confidence in the platform.
  • be undefined. This means a platform that we have not yet explored for boot image updates. The admin is prevented from adding to spec.managedBootImages in these cases via a validating admission policy.

Other than that, we hope to socialize new platforms we add support/default for via documentation, KB articles and such.
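
As a rough illustration of the three cases above, the check could be expressed like this. The types are simplified, hypothetical stand-ins (a single optional selection value rather than the real status structure), and the `Unknown` result for any other value is an assumption, since the comment only describes "All", "None", and undefined.

```go
package main

import "fmt"

// SelectionMode is a hypothetical, simplified view of a MachineManager's
// selection in status.managedBootImagesStatus.
type SelectionMode string

const (
	SelectAll  SelectionMode = "All"
	SelectNone SelectionMode = "None"
)

// BootImageSupport is a hypothetical label for the three cases.
type BootImageSupport string

const (
	SupportOptedIn        BootImageSupport = "OptedIn"        // automatic/opted-in
	SupportOptInAvailable BootImageSupport = "OptInAvailable" // supported, not default
	SupportUnexplored     BootImageSupport = "Unexplored"     // status undefined
	SupportUnknown        BootImageSupport = "Unknown"        // not described in the comment
)

// classifySupport maps the status selection to one of the cases; a nil
// pointer models an undefined managedBootImagesStatus.
func classifySupport(selection *SelectionMode) BootImageSupport {
	if selection == nil {
		return SupportUnexplored
	}
	switch *selection {
	case SelectAll:
		return SupportOptedIn
	case SelectNone:
		return SupportOptInAvailable
	default:
		return SupportUnknown
	}
}

func main() {
	all := SelectAll
	fmt.Println(classifySupport(&all), classifySupport(nil))
}
```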

Contributor

I'm not sure I understand the UX fully here, so it would be helpful if you can walk me through this again. At some point a user will upgrade to a version with this skew enforcement API in place. Let's break them down into a few categories:

  1. I'm on cloud and I just want automatic to be on -> in this case, assuming they haven't fiddled with the default-on bootimage management fields, they should not need to do anything. The spec will be empty and the status will be set by the MCO to automatic (since we detect they are in a managed case, right?)
  2. I'm on cloud but with custom bootimages -> in this case, I guess the MCO doesn't know what to do with the existing image since it's not in any stream, so we default to None or just have no status? What happens now, do we alert the user and/or prevent scaling?
  3. I'm on-prem with manual bootimage -> in this case, I would not have been able to set these fields until they were turned on, so I would need to upgrade first, then set manual here, and until that happens, the status is empty?

Contributor Author

@djoshy djoshy Jul 29, 2025

No dumb questions, these are great! Before I answer, I want to explain how I think the MCO would detect the current cluster boot image (I'm open to other ideas/strategies here!):

  • If boot image updates are enabled and all machine resources have been successfully reconciled, we could use the current cluster version and the RHCOS version from the coreos-bootimages configmap to determine the current boot image.
  • If boot image updates are only partially enabled or completely disabled, and machine resources can't be inspected to determine the boot image version, we would have to assume the boot image to be the cluster's installation version from the version history. For most platforms, we will very likely have to use the latter method.

Now to your questions:

  1. Yes, this is correct.

  2. For custom boot images in the cloud platforms, the MCO currently (silently) skips over updating them. We have a couple of paths here:

    • Once we've covered all the marketplace/manageable scenarios for cloud platforms, we can add a degrade/error when the MCO encounters a custom boot image during an update. I expect most of these custom boot image users to disable boot image updates to fix the error. When we do enable skew enforcement (SE) in this case, the cluster would immediately be considered out of skew, as the current boot image can't be estimated. If they leave the error in place and do not disable boot image updates, SE should also consider a cluster in that error mode to be out of skew.
    • If we choose not to have an error mode for the custom boot image case, we will still want to add a method to indicate to the MCO that machine resources have been skipped over and not reconciled. With that in place, when skew enforcement is enabled, the MCO would be able to determine that the cluster is out of skew.

    In both of these paths, the admin would be expected to manually switch the SE spec to Manual (and indicate their boot image) or to None, depending on their scaling needs. Since this case involves a cloud environment that likely does support automatic boot image updates, the SE status would initially be set to Automatic by the MCO, along with the detected cluster boot image, as stated earlier.

  3. For any case where the MCO cannot automatically manage boot images, the MCO will default to Manual mode in the SE status, along with the cluster's boot image. The admin would be expected to manually switch the SE spec to Manual (and indicate their boot image) or to None, depending on their scaling needs.
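
The two detection strategies described at the start of this comment could be sketched like this. All names here (`detectionInput`, `bootImageEstimate`, `estimateBootImage`, and the `Source` labels) are hypothetical, and the inputs are assumed to have been gathered elsewhere.

```go
package main

import "fmt"

// detectionInput holds hypothetical, pre-gathered facts about the cluster.
type detectionInput struct {
	UpdatesEnabledForAll  bool   // boot image updates enabled for all machine resources
	AllReconciled         bool   // every machine resource reconciled successfully
	CurrentOCPVersion     string // current cluster version
	ConfigMapRHCOSVersion string // RHCOS version from the coreos-bootimages configmap
	InstallOCPVersion     string // installation version from the version history
}

// bootImageEstimate is a hypothetical result type.
type bootImageEstimate struct {
	OCPVersion   string
	RHCOSVersion string // left empty when it cannot be determined
	Source       string
}

// estimateBootImage prefers the reconciled state and otherwise falls back to
// the cluster's installation version.
func estimateBootImage(in detectionInput) bootImageEstimate {
	if in.UpdatesEnabledForAll && in.AllReconciled {
		return bootImageEstimate{
			OCPVersion:   in.CurrentOCPVersion,
			RHCOSVersion: in.ConfigMapRHCOSVersion,
			Source:       "reconciled",
		}
	}
	return bootImageEstimate{OCPVersion: in.InstallOCPVersion, Source: "install-version"}
}

func main() {
	in := detectionInput{
		UpdatesEnabledForAll:  true,
		AllReconciled:         true,
		CurrentOCPVersion:     "4.18.2",
		ConfigMapRHCOSVersion: "9.6.20250523-1",
		InstallOCPVersion:     "4.16.0",
	}
	fmt.Printf("%+v\n", estimateBootImage(in))
	in.AllReconciled = false
	fmt.Printf("%+v\n", estimateBootImage(in))
}
```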

Contributor

Ok, so we will try to set some value in the status ourselves based on our interpretation of the cluster, and error if we cannot. The user can set a spec as they wish to override it (so in the second example, if I used some unfound bootimage, I could still manually set spec to automatic to allow the MCO to go ahead and ignore the error, is that correct?)

Contributor

For Automatic clusters, most cases would need no intervention; except if the admin has manually set a custom boot image. I don't think this is very common(cc @yuqi-zhang in case I'm wrong here)

Based on our discussions, I think my understanding is that if the user has set a custom bootimage, our mechanisms will not find it in the streams (regular, managed (ARO/ROSA), marketplace (OPP, OKE, OCP)) so we would end up defaulting to manual and require user intervention, right? Or are you saying that all cloud platforms would default to automatic, and if the user somehow has set their own value, we'd instead just error and let the user sort it out?

There's a few things we discussed in this thread, but to recap some of the main open points:

  1. spec vs. status and defaulting - we will only default to "automatic" or "manual" if the user doesn't input, and the contentious point is that by setting it to "manual" by default, we'd need immediate user intervention
  2. the mechanism of alerting the user of actions/interventions - whether we set upgradeable to false, or some other mechanism

Is that accurate?

Contributor

and the contentious point is that by setting it to "manual" by default, we'd need immediate user intervention

I think the contentious point is that this seems like a setting that admins need to be explicitly aware of, and need to make a decision on, if we cannot automatically determine the state this should be put into. Otherwise, they may not be able to upgrade and/or scale as they are used to (is my understanding of this correct?).

Contributor

What are the criteria by which the MCO will determine whether a cluster is automatic or manual? Was there a decision tree in the EP that covers this?

  • Platform - Some platforms are never automatic?
  • Whether the existing boot images are known? Someone using a custom boot image doesn't get updates?
  • ???

Do I need a managedBootImages configuration for the cluster to be automatic?

What would happen if an admin set this to Automatic on a cluster that didn't support automatic updates? Is that theoretically possible?

Contributor

Thinking on the update flow.

If we can determine that the cluster is in a suitable condition to be adopted for automatic updates, I think defaulting status to automatic makes sense.

If we cannot determine that the cluster is suitable for automatic updates, there's a mention in the thread that we will just use the version from the cluster image. How does that work? How can we be confident that this image is correct? Maybe I'm missing what you mean there?

If we cannot automatically ascertain the version figures, how would we populate the manual spec field with either the RHCOS or OCP versions? Wouldn't we need the admin to make an explicit choice?

Contributor Author

I've updated the description workflows based on our discussion, PTAL to confirm we're all on the same page. I think we also settled on "Automatic" not being settable from the spec side, but I wanted to double check that before updating the API.

Contributor

Add some tests for the status field as well?

Contributor Author

Done, PTAL when you have a sec!

@djoshy djoshy force-pushed the skew-enforcement branch from 5816e99 to ac57873 Compare July 23, 2025 20:26
@yuqi-zhang yuqi-zhang left a comment

A couple of maybe dumb questions:

// +kubebuilder:validation:MaxLength:=21
// +kubebuilder:validation:MinLength:=14
// +optional
RHCOSVersion string `json:"rhcosVersion,omitempty"`
Contributor

Just curious, what scenarios do you expect the user to explicitly set this? The one I thought of was: I'm not sure what OCP version my RHCOS corresponds to, or I have some type of custom bootimage.

In that case, wouldn't it be more helpful to have one of OCPVersion or RHCOSVersion be required, and if one is set, you don't have to set the other? But you can set both if you want more strict version checking? (is that supported via API validation?)

Contributor Author

I think this is a valid use case, and it should be possible as a validation at the parent level. I'll try updating it to do this.
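
The intended semantics of such a parent-level rule can be illustrated with a plain Go check. This is only a sketch: in the API itself the rule would presumably be expressed as a validation on the parent struct rather than as a Go method, and `ManualBootImage` and `Validate` are hypothetical names. The length bounds are the ones from the markers quoted above.

```go
package main

import (
	"errors"
	"fmt"
)

// Length bounds taken from the MinLength/MaxLength markers on rhcosVersion.
const (
	rhcosMinLen = 14
	rhcosMaxLen = 21
)

// ManualBootImage is a hypothetical parent type holding both version fields.
type ManualBootImage struct {
	OCPVersion   string
	RHCOSVersion string
}

// Validate requires at least one of the two versions and checks the length
// of rhcosVersion only when it is set.
func (m ManualBootImage) Validate() error {
	if m.OCPVersion == "" && m.RHCOSVersion == "" {
		return errors.New("at least one of ocpVersion or rhcosVersion must be set")
	}
	if n := len(m.RHCOSVersion); n != 0 && (n < rhcosMinLen || n > rhcosMaxLen) {
		return fmt.Errorf("rhcosVersion must be %d to %d characters long, got %d", rhcosMinLen, rhcosMaxLen, n)
	}
	return nil
}

func main() {
	fmt.Println(ManualBootImage{OCPVersion: "4.18.2"}.Validate())
	fmt.Println(ManualBootImage{RHCOSVersion: "9.6.20250523-1"}.Validate())
	fmt.Println(ManualBootImage{}.Validate())
}
```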

@djoshy djoshy force-pushed the skew-enforcement branch from ac57873 to 66a9ee8 Compare July 30, 2025 13:04
Comment on lines +79 to +90
// None means that the MCO will no longer monitor the boot image skew. This may affect
// the cluster's ability to scale.
Contributor

I think we can expand on this a little, probably useful to explain that this means the cluster has no way to understand the compatibility between X and Y, where X and Y are the boot image and the ignition/pivot?

Contributor Author

Ack, will do!

@@ -55,8 +53,123 @@ type MachineConfigurationSpec struct {
// has no effect on cluster upgrades which will still incur node disruption where required.
// +optional
NodeDisruptionPolicy NodeDisruptionPolicyConfig `json:"nodeDisruptionPolicy"`

// bootImageSkewEnforcement is an optional field that can be used to configure how version skew is
Contributor

Is there a relationship between this new field and the managed boot images that we need to enforce?

Contributor Author

Hmm, the only case I can think of is enforcing that boot image updates are enabled when the mode is Automatic; but if we are making Automatic a status-only enum, I'm not sure it's necessary.

In platforms where we do not support boot image updates via the MCO, VAPs are in place to prevent setting the managed boot images field based on the infra object.

Contributor Author

Thinking a bit more...Manual/None while having ManagedBootImages set to update all machine resources might be a bit strange to do, perhaps we can guard against that?

Contributor

I think it would be strange yes, are there use cases though? In particular I could imagine a None being set with automatic, we shouldn't need image skew enforcement if we are on automatic as the auto will fix it for us, and therefore someone may want to turn the warnings off

Contributor Author

So I sketched out a table to figure out the supported combinations:

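| Boot Images | Skew Enforcement | Result |
|-------------|------------------|--------|
| All         | Auto             | Good   |
| All         | Manual           | Error  |
| All         | None             | Good   |
| All         | Empty            | Good   |
| Partial     | Auto             | Error  |
| Partial     | Manual           | Error? |
| Partial     | None             | Good   |
| Partial     | Empty            | Good   |
| None        | Auto             | Error  |
| None        | Manual           | Good   |
| None        | None             | Good   |
| None        | Empty            | Good   |
| Empty       | Auto             | Error  |
| Empty       | Manual           | Good   |
| Empty       | None             | Good   |
| Empty       | Empty            | Good   |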

(This only applies to clusters that support boot image updates via the MCO, the other platforms would not permit editing of the ManagedBootImages field via the VAP)
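
For reference, the table can be encoded directly as a lookup. This is a hypothetical sketch: the `combo` type and `comboAllowed` function are illustrative names, and the Partial/Manual entry is treated as an error even though the table marks it as tentative ("Error?").

```go
package main

import "fmt"

// combo pairs a ManagedBootImages setting with a skew enforcement mode,
// using the labels from the table.
type combo struct {
	BootImages string
	Skew       string
}

// allowed transcribes the table: true means "Good", false means "Error".
var allowed = map[combo]bool{
	{"All", "Auto"}: true, {"All", "Manual"}: false, {"All", "None"}: true, {"All", "Empty"}: true,
	{"Partial", "Auto"}: false, {"Partial", "Manual"}: false, // tentative in the table ("Error?")
	{"Partial", "None"}: true, {"Partial", "Empty"}: true,
	{"None", "Auto"}: false, {"None", "Manual"}: true, {"None", "None"}: true, {"None", "Empty"}: true,
	{"Empty", "Auto"}: false, {"Empty", "Manual"}: true, {"Empty", "None"}: true, {"Empty", "Empty"}: true,
}

// comboAllowed reports whether the combination is "Good"; known is false for
// combinations that do not appear in the table.
func comboAllowed(bootImages, skew string) (ok bool, known bool) {
	ok, known = allowed[combo{bootImages, skew}]
	return ok, known
}

func main() {
	fmt.Println(comboAllowed("All", "Auto"))
	fmt.Println(comboAllowed("None", "Auto"))
}
```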

In particular I could imagine a None being set with automatic, we shouldn't need image skew enforcement if we are on automatic as the auto will fix it for us

Hmm, are you suggesting that Automatic should imply that the boot image controller disregards the values in the ManagedBootImages knob and behaves as if it should update all resources? 🤔

openshift-ci-robot commented Aug 7, 2025

@djoshy: This pull request references MCO-1669 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

Based on discussions from openshift/enhancements#1761
Proposed workflow:

flowchart TD
   A[Sync loop for skew mechanism. Triggered on: skew API knob updates, boot image update conditions] --> U{Is Skew API knob defined for this cluster?}    
   U --> |Yes| V{Determine skew API mode: Automatic, Manual or None?}
   V --> |None|J
   V --> |Manual|I 
   V --> |Automatic|E
   U --> |No| B{Can the MCO manage boot images for this cluster?}
   B --> |Yes| C[Skew API knob set to automatic]
   B --> |No| D[Skew API knob set to manual, cluster boot image estimated from cluster defaults]
   C --> E{Is the boot image controller disabled or in an error mode?}
   E --> |No| H[Wait until boot image controller is not progressing]
   E --> |Yes| G[Throw an error to cluster admin]
   G --> Done
   D --> I
   H --> I[Is skew compliant against the current release image?]   
   I --> |No| K[Set CO upgradeable=false] 
   I --> |Yes| Done
   K --> J[Disable scaling via MCS rejects & RHCOS templates]
   J --> Done

openshift-ci-robot commented Aug 7, 2025

@djoshy: This pull request references MCO-1669 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

Based on discussions from openshift/enhancements#1761
Proposed workflow:

flowchart TD
   A[Sync loop for skew mechanism. Triggered on: skew API knob updates, boot image update conditions] --> U{Is Skew API knob defined for this cluster?}    
   U --> |Yes| V{Determine skew API mode: Automatic, Manual or None?}
   V --> |None|J
   V --> |Manual|I 
   V --> |Automatic|E
   U --> |No| B{Can the MCO manage boot images for this cluster?}
   B --> |Yes| C[Skew API knob set to automatic]
   B --> |No| D[Skew API knob set to manual, cluster boot image estimated from cluster version]
   C --> E{Is the boot image controller disabled or in an error mode?}
   E --> |No| H[Wait until boot image controller is not progressing]
   E --> |Yes| G[Throw an error to cluster admin]
   G --> Done
   D --> I
   H --> I[Is skew compliant against the current release image?]   
   I --> |No| K[Set CO upgradeable=false] 
   I --> |Yes| Done
   K --> J[Disable scaling via MCS rejects & RHCOS templates]
   J --> Done

openshift-ci-robot commented Aug 7, 2025

@djoshy: This pull request references MCO-1669 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

Based on discussions from openshift/enhancements#1761:
Workflow for release (n-1)

flowchart TD
   A[Operator sync loop] --> B{Is the skew API spec/status set?}   
   B --> |No|C{Can the MCO manage boot images for this cluster?}
   B --> |Yes|E[Done]   
   C --> |No|D[Set Upgradeable=False to warn cluster admin]
   C --> |Yes|F[Set skew API status to Automatic]
   D --> E
   F --> E

Workflow for release n

flowchart TD
   A[Sync loop for skew mechanism. Triggered on: skew API knob updates, boot image update conditions] --> V{Determine skew API mode: Automatic, Manual or None?}   
   V --> |None|L[Raise a low level Prometheus alert to indicate scaling risk]
   V --> |Manual|I 
   V --> |Automatic|E{Is the boot image controller disabled or in an error mode?}
   E --> |No| H[Wait until boot image controller is not progressing]
   E --> |Yes| G[Throw an error to cluster admin]
   G --> K
   H --> I[Is the current skew compliant against the limit defined in the release image?]   
   I --> |Yes| Done
   I --> |No| K[MCO degrades the cluster] 
   K --> Done
   L --> Done
Loading

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Aug 11, 2025

@djoshy: This pull request references MCO-1669 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

Based on discussions from openshift/enhancements#1761:
Workflow for pre-release skew enforcement not active

flowchart TD
   A[Operator sync loop] --> B{Is the skew API spec/status set?}   
   B --> |No|C{Can the MCO manage boot images for this cluster?}
   B --> |Yes|E[Done]   
   C --> |No|D[Set Upgradeable=False to force cluster admin opinion]
   C --> |Yes|F[Set skew API status to Automatic]
   D --> E
   F --> E
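
The pre-release flow above can be sketched in Go roughly as follows. This is only an illustration of the decision logic in the flowchart, not the operator's actual code: `SkewMode`, `clusterState`, `syncResult`, and `preReleaseSync` are hypothetical names, and treating "no action" as `Upgradeable: true` is an assumption made for the sketch.

```go
package main

import "fmt"

// SkewMode is a hypothetical stand-in for the skew enforcement mode.
type SkewMode string

const (
	ModeUnset     SkewMode = ""
	ModeAutomatic SkewMode = "Automatic"
	ModeManual    SkewMode = "Manual"
)

// clusterState holds the two inputs the flowchart asks about (hypothetical type).
type clusterState struct {
	Mode                SkewMode // mode already present in the skew API spec/status, if any
	CanManageBootImages bool     // whether the MCO can manage boot images for this cluster
}

// syncResult is what one pass of the sync loop decides (hypothetical type).
type syncResult struct {
	StatusMode  SkewMode
	Upgradeable bool
}

// preReleaseSync mirrors the flowchart: leave an existing opinion alone,
// default to Automatic when the MCO can manage boot images, and otherwise
// block upgrades until the cluster admin sets a value.
func preReleaseSync(c clusterState) syncResult {
	if c.Mode != ModeUnset {
		return syncResult{StatusMode: c.Mode, Upgradeable: true}
	}
	if c.CanManageBootImages {
		return syncResult{StatusMode: ModeAutomatic, Upgradeable: true}
	}
	return syncResult{StatusMode: ModeUnset, Upgradeable: false}
}

func main() {
	cases := []clusterState{
		{Mode: ModeManual},
		{CanManageBootImages: true},
		{CanManageBootImages: false},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %+v\n", c, preReleaseSync(c))
	}
}
```

The only branch that blocks upgrades is the one where no mode is set and the MCO cannot manage boot images, which matches the "force cluster admin opinion" node.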

Workflow for release n, skew enforcement is active

flowchart TD
   A[Sync loop for skew mechanism. Triggered on: skew API knob updates, boot image update conditions] --> V{Determine skew API mode: Automatic, Manual or None?}   
   V --> |None|L[Raise a low level Prometheus alert to indicate scaling risk]
   V --> |Manual|I 
   V --> |Automatic|E{Is the boot image controller disabled or in an error mode?}
   E --> |No| H[Wait until boot image controller is not progressing]
   E --> |Yes| G[Throw an error to cluster admin]
   G --> K
   H --> I[Is the current skew compliant against the limit defined in the release image?]   
   I --> |Yes| Done
   I --> |No| K[MCO sets Upgradeable=False] 
   K --> Done
   L --> Done
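
As a rough illustration, the release n flow above could be expressed in Go along these lines. All identifiers here (`SkewMode`, `Action`, `skewInputs`, `enforceSkew`) are hypothetical and chosen for the sketch; the real controller would read these inputs from cluster state rather than from a struct.

```go
package main

import (
	"errors"
	"fmt"
)

// SkewMode is a hypothetical stand-in for the skew API mode.
type SkewMode string

const (
	ModeNone      SkewMode = "None"
	ModeManual    SkewMode = "Manual"
	ModeAutomatic SkewMode = "Automatic"
)

// Action names the outcome of one sync pass (hypothetical type).
type Action string

const (
	ActionNone         Action = "none"
	ActionAlert        Action = "raise low-level Prometheus alert"
	ActionWait         Action = "wait for boot image controller"
	ActionBlockUpgrade Action = "set Upgradeable=False"
)

// skewInputs bundles the questions asked in the flowchart (hypothetical type).
type skewInputs struct {
	Mode                        SkewMode
	ControllerDisabledOrErrored bool
	ControllerProgressing       bool
	SkewCompliant               bool
}

// enforceSkew walks the flowchart top to bottom and returns the resulting action.
func enforceSkew(in skewInputs) (Action, error) {
	switch in.Mode {
	case ModeNone:
		// No enforcement: only warn about the scaling risk.
		return ActionAlert, nil
	case ModeManual:
		// Manual mode skips the controller checks and goes to the skew check.
	case ModeAutomatic:
		if in.ControllerDisabledOrErrored {
			return ActionBlockUpgrade, errors.New("boot image controller is disabled or in an error state")
		}
		if in.ControllerProgressing {
			return ActionWait, nil
		}
	default:
		return ActionNone, fmt.Errorf("unknown skew mode %q", in.Mode)
	}
	if in.SkewCompliant {
		return ActionNone, nil
	}
	return ActionBlockUpgrade, nil
}

func main() {
	action, err := enforceSkew(skewInputs{Mode: ModeManual, SkewCompliant: false})
	fmt.Println(action, err)
}
```

Returning an error alongside `ActionBlockUpgrade` models the "Throw an error to cluster admin" node leading into the `Upgradeable=False` node; whether the real implementation surfaces both at once is not stated in the thread.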


@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 11, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 15, 2025
@djoshy
Contributor Author

djoshy commented Aug 15, 2025

Last push includes:

  • removed Automatic enum value from the spec
  • updated the go doc to account for our recent discussions about Manual mode, and how the user is required to set a value to allow upgrades for certain cases.
  • accounted for new linter rules

It seems to be failing NoNewRequiredFields in verify, which must be a recent change. Is it okay to not have a required tag for a union discriminator like Mode?

Contributor

openshift-ci bot commented Aug 15, 2025

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-serial-techpreview-2of2 8181cc5 link true /test e2e-aws-serial-techpreview-2of2
ci/prow/e2e-aws-serial-techpreview-1of2 8181cc5 link true /test e2e-aws-serial-techpreview-1of2
ci/prow/e2e-aws-ovn-hypershift-conformance 8181cc5 link true /test e2e-aws-ovn-hypershift-conformance
ci/prow/e2e-aws-ovn 8181cc5 link true /test e2e-aws-ovn
ci/prow/e2e-aws-ovn-hypershift 8181cc5 link true /test e2e-aws-ovn-hypershift
ci/prow/verify-crd-schema 8181cc5 link true /test verify-crd-schema

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@everettraven
Contributor

> Last push includes:
>
>   • removed Automatic enum value from the spec
>   • updated the go doc to account for our recent discussions about Manual mode, and how the user is required to set a value to allow upgrades for certain cases.
>   • accounted for new linter rules
>
> It seems to be failing NoNewRequiredFields in verify, which must be a recent change. Is it okay to not have a required tag for a union discriminator like Mode?

@djoshy The NoNewRequiredFields check isn't new from what I recall, but it is a bit buggy: it flags required child fields inside a new optional parent field, which is OK. That appears to be what is happening here, so we can override the check when we are ready for this to merge. Overriding it now won't do any good, as other things will likely merge in the meantime and re-trigger testing.
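
For illustration, the shape being described (a required discriminator inside a new optional parent) might look like the Go sketch below. The type and field names are hypothetical and are not taken from the PR diff; the comment markers follow the usual kubebuilder and openshift/api conventions, and `validate` is only a plain-Go stand-in for what schema validation would enforce.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode is a hypothetical discriminator type.
type Mode string

const (
	ModeManual Mode = "Manual"
	ModeNone   Mode = "None"
)

// SkewEnforcement is a hypothetical union; Mode is its discriminator.
// +union
type SkewEnforcement struct {
	// mode selects the active member of the union.
	// +unionDiscriminator
	// +kubebuilder:validation:Enum=Manual;None
	// +required
	Mode Mode `json:"mode"`
}

// Spec is a hypothetical parent type with the new optional field.
type Spec struct {
	// skewEnforcement is optional, so existing objects remain valid without it.
	// +optional
	SkewEnforcement *SkewEnforcement `json:"skewEnforcement,omitempty"`
}

// validate shows why a required child of an optional parent is safe:
// the requirement only applies once the parent is actually set.
func validate(s Spec) error {
	if s.SkewEnforcement == nil {
		return nil
	}
	switch s.SkewEnforcement.Mode {
	case ModeManual, ModeNone:
		return nil
	case "":
		return errors.New("skewEnforcement.mode is required when skewEnforcement is set")
	default:
		return fmt.Errorf("unsupported mode %q", s.SkewEnforcement.Mode)
	}
}

func main() {
	fmt.Println(validate(Spec{}))
	fmt.Println(validate(Spec{SkewEnforcement: &SkewEnforcement{}}))
	fmt.Println(validate(Spec{SkewEnforcement: &SkewEnforcement{Mode: ModeManual}}))
}
```

Because the parent pointer is optional, objects created before the field existed still pass, which is why a required discriminator in this position does not break compatibility.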

Contributor

@everettraven everettraven left a comment

Took another look at this and from what I recall this aligns with what was discussed synchronously. I'm going to circle back around to see if there are any outstanding comments left from Joel's previous review while I was out, but otherwise this direction looks good to me.

@djoshy
Contributor Author

djoshy commented Aug 20, 2025

> Took another look at this and from what I recall this aligns with what was discussed synchronously. I'm going to circle back around to see if there are any outstanding comments left from Joel's previous review while I was out, but otherwise this direction looks good to me.

Sounds good, thanks! In the last API office hours, Joel had asked me to take a shot at figuring out cross validations between the ManagedBootImages & SkewEnforcement fields. I've drafted a validation table and linked it in the API office hours doc so we can talk through the scenarios in the next meeting. It did get a teeny bit complicated when you take spec/status into account 😅

@everettraven
Contributor

> > Took another look at this and from what I recall this aligns with what was discussed synchronously. I'm going to circle back around to see if there are any outstanding comments left from Joel's previous review while I was out, but otherwise this direction looks good to me.
>
> Sounds good, thanks! In the last API office hours, Joel had asked me to take a shot at figuring out cross validations between the ManagedBootImages & SkewEnforcement fields. I've drafted a validation table and linked it in the API office hours doc so we can talk through the scenarios in the next meeting. It did get a teeny bit complicated when you take spec/status into account 😅

Sounds good! I took a look at that validation table, but let's use the next office hours to discuss it and what each state means for an end user before moving forward with this as is. Thanks for letting me know about that table!

Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
6 participants