MCO-1877: MCO-1879: MCO-1882: MCO-1884: Implement boot image skew enforcement MVP #5428

Merged
openshift-merge-bot[bot] merged 6 commits into openshift:main from djoshy:implement-skew-enforcement
Feb 5, 2026

djoshy commented Nov 19, 2025

This PR integrates the boot image skew enforcement API introduced in openshift/api#2357. This involves the following changes:

  • The operator now populates the bootImageSkewEnforcementStatus field in the MachineConfiguration object based on spec.bootImageSkewEnforcement, platform defaults and cluster version.
  • The boot image controller will now update the current boot image value in bootImageSkewEnforcementStatus on a successful boot image update. Note that this requires skew enforcement to be in Automatic mode and all machinesets to be opted in for boot image updates.
  • The operator sets Upgradeable=False when it detects the cluster is out of skew, determined by comparing the boot image values in bootImageSkewEnforcementStatus against the MCO's hardcoded skew limits. Before performing this check, the operator first verifies that the controller is neither in an error state nor currently performing boot image updates. If the controller is in an error state, the operator sets Upgradeable=False and propagates that error instead of proceeding with the skew check. If the controller is mid-update, the operator defers the skew check until later; this is to avoid race conditions.
  • Some unit tests have been added to sync_test.go and status_test.go to verify the above mechanisms.
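
The order of checks described in the third bullet can be sketched as follows. This is a minimal Python illustration, not the MCO's Go implementation: `ControllerState`, `Decision`, `decide_upgradeable`, and the reason strings `BootImageControllerError` and `Deferred` are hypothetical names, and mapping "error state" and "mid-update" onto the BootImageUpdateDegraded and BootImageUpdateProgressing conditions is an assumption. Only `ClusterBootImageSkewError` and `AsExpected` appear in this PR description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControllerState:
    """Hypothetical summary of the boot image controller's conditions."""
    degraded_message: Optional[str] = None  # assumed: set when the controller reports an error
    progressing: bool = False               # assumed: True while boot image updates are in flight


@dataclass
class Decision:
    upgradeable: Optional[bool]  # None means "defer and re-check on a later sync"
    reason: str
    message: str = ""


def decide_upgradeable(ctrl: ControllerState, out_of_skew: bool) -> Decision:
    # 1. A controller error takes priority: propagate it instead of running the skew check.
    if ctrl.degraded_message:
        return Decision(False, "BootImageControllerError", ctrl.degraded_message)  # hypothetical reason
    # 2. Mid-update: the recorded boot image version may be about to change, so defer.
    if ctrl.progressing:
        return Decision(None, "Deferred")  # hypothetical reason
    # 3. Only now compare the recorded version against the skew limits.
    if out_of_skew:
        return Decision(False, "ClusterBootImageSkewError")
    return Decision(True, "AsExpected")
```

Returning a "defer" result rather than True or False in step 2 reflects the race the description mentions: the skew check would otherwise read a version that the controller is about to overwrite.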

Verifying API behavior

This verification will have to be done based on the platform. If the platform:

  • supports boot image updates and has them on by default (AWS and GCP at the time of writing), i.e. status.managedBootImagesStatus is set to All when spec.managedBootImages is empty: the skew enforcement status will be set to Automatic, with a boot image version estimated from the cluster version. The boot image controller will then perform a sync that updates the boot images (if required) and, after all resources have been successfully updated, updates the boot image value stored in the skew enforcement status. The value set will be the OCP releaseVersion described by the coreos-bootimages configmap. Here's an example:
  spec:
    logLevel: Normal
    managementState: Managed
    operatorLogLevel: Normal
  status:
    bootImageSkewEnforcementStatus:
      automatic:
        ocpVersion: 4.21.0
      mode: Automatic
    conditions:
    - lastTransitionTime: "2025-11-19T22:06:06Z"
      message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
        | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
      reason: BootImageConfigMapAdded
      status: "False"
      type: BootImageUpdateProgressing
    - lastTransitionTime: "2025-11-19T22:06:07Z"
      message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
        0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
      reason: BootImageConfigMapAdded
      status: "False"
      type: BootImageUpdateDegraded
    managedBootImagesStatus:
      machineManagers:
      - apiGroup: machine.openshift.io
        resource: machinesets
        selection:
          mode: All
  • supports boot image updates, but does not have them on by default (vSphere and Azure at the time of writing), i.e. status.managedBootImagesStatus is set to None when spec.managedBootImages is empty: the skew enforcement status will be set to Manual, with a boot image version estimated from the cluster version. The object would look like this:
  spec:
    logLevel: Normal
    managementState: Managed
    operatorLogLevel: Normal
  status:
    bootImageSkewEnforcementStatus:
      manual:
        mode: OCPVersion
        ocpVersion: 4.21.0
      mode: Manual
    conditions:
    - lastTransitionTime: "2025-11-19T22:06:06Z"
      message: Reconciled 0 of 0 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
        | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
      reason: BootImageConfigMapAdded
      status: "False"
      type: BootImageUpdateProgressing
    - lastTransitionTime: "2025-11-19T22:06:07Z"
      message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
        0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
      reason: BootImageConfigMapAdded
      status: "False"
      type: BootImageUpdateDegraded
    managedBootImagesStatus:
      machineManagers:
      - apiGroup: machine.openshift.io
        resource: machinesets
        selection:
          mode: None

The admin can choose to opt in to boot image updates in this case (set spec.managedBootImages to All), and the operator should automatically switch the skew enforcement status to Automatic, with the appropriate boot image version. The object would then look like this:

  spec:
    logLevel: Normal
    managementState: Managed
    operatorLogLevel: Normal
    managedBootImages:
      machineManagers:
      - apiGroup: machine.openshift.io
        resource: machinesets
        selection:
          mode: All
  status:
    bootImageSkewEnforcementStatus:
      automatic:
        ocpVersion: 4.21.0
      mode: Automatic
    conditions:
    - lastTransitionTime: "2025-11-19T22:06:06Z"
      message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
        | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
      reason: BootImageConfigMapAdded
      status: "False"
      type: BootImageUpdateProgressing
    - lastTransitionTime: "2025-11-19T22:06:07Z"
      message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
        0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
      reason: BootImageConfigMapAdded
      status: "False"
      type: BootImageUpdateDegraded
    managedBootImagesStatus:
      machineManagers:
      - apiGroup: machine.openshift.io
        resource: machinesets
        selection:
          mode: All
  • does not support boot image updates (all other platforms at the time of writing), i.e. status.managedBootImagesStatus is empty and spec.managedBootImages cannot be set by the admin: the skew enforcement status will be set to Manual, with a boot image version estimated from the cluster version. The object would look like this:
  spec:
    logLevel: Normal
    managementState: Managed
    operatorLogLevel: Normal
  status:
    bootImageSkewEnforcementStatus:
      manual:
        mode: OCPVersion
        ocpVersion: 4.21.0
      mode: Manual

In this case, the admin is expected to manually perform boot image updates and then add a spec field like so:

spec:
  bootImageSkewEnforcement:
    mode: Manual
    manual:
      mode: OCPVersion
      ocpVersion: 4.21.2

The operator should then update the status to include this:

spec:
  bootImageSkewEnforcement:
    mode: Manual
    manual:
      mode: OCPVersion
      ocpVersion: 4.21.2
status:
  bootImageSkewEnforcementStatus:
    mode: Manual
    manual:
      mode: OCPVersion
      ocpVersion: 4.21.2

The above snippet applies if the admin has chosen to record the OCPVersion. In Manual mode, the admin can also choose to store the RHCOSVersion, like so:

spec:
  bootImageSkewEnforcement:
    mode: Manual
    manual:
      mode: RHCOSVersion
      rhcosVersion: 9.0.20251023-0
status:
  bootImageSkewEnforcementStatus:
    mode: Manual
    manual:
      mode: RHCOSVersion
      rhcosVersion: 9.0.20251023-0

Note that only one of RHCOSVersion or OCPVersion is permitted in Manual mode.
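
The one-of rule can be illustrated with a small check. This is a Python sketch of the rule only; the real enforcement is done by the API validations, and `validate_manual` is a hypothetical helper name.

```python
from typing import Optional


def validate_manual(mode: str,
                    ocp_version: Optional[str] = None,
                    rhcos_version: Optional[str] = None) -> None:
    """Raise ValueError unless exactly the version field matching `mode` is set."""
    if mode == "OCPVersion":
        if not ocp_version or rhcos_version:
            raise ValueError("OCPVersion mode requires ocpVersion and forbids rhcosVersion")
    elif mode == "RHCOSVersion":
        if not rhcos_version or ocp_version:
            raise ValueError("RHCOSVersion mode requires rhcosVersion and forbids ocpVersion")
    else:
        raise ValueError(f"unknown manual mode: {mode!r}")
```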

The admin can also choose to disable skew enforcement altogether by setting it to None mode in the spec:

spec:
  bootImageSkewEnforcement:
    mode: None
status:
  bootImageSkewEnforcementStatus:
    mode: None
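
Taken together, the platform rules in this section amount to a small decision function. The following Python sketch is one way to read them, not the operator's actual code: `resolve_skew_status` is a hypothetical name, the assumption that a spec value always takes precedence over the platform default is mine, and the cluster version is used directly where the description says the version is "estimated".

```python
import copy
from typing import Optional


def resolve_skew_status(spec: Optional[dict],
                        managed_mode: Optional[str],
                        cluster_version: str) -> dict:
    """Compute bootImageSkewEnforcementStatus.

    spec:         contents of spec.bootImageSkewEnforcement, or None when unset.
    managed_mode: machineset selection mode from status.managedBootImagesStatus
                  ("All" or "None"), or None when the platform has no support.
    """
    if spec is not None:
        if spec["mode"] == "Automatic":
            raise ValueError("Automatic mode is not permitted in spec")
        # Manual and None are mirrored from spec into status (assumed precedence).
        return copy.deepcopy(spec)
    if managed_mode == "All":
        # Boot image updates are on: the controller keeps this value current later.
        return {"mode": "Automatic", "automatic": {"ocpVersion": cluster_version}}
    # Updates are off or unsupported: fall back to Manual with an estimated version.
    return {"mode": "Manual",
            "manual": {"mode": "OCPVersion", "ocpVersion": cluster_version}}
```

For example, `resolve_skew_status(None, "All", "4.21.0")` yields the Automatic status shown for AWS and GCP, while passing `"None"` or `None` as the second argument yields the Manual default.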

Verifying upgrade block

Upgrades will be blocked when the cluster is determined to be out of skew. This mechanism works the same way in Manual and Automatic mode, although it is likely easier to verify in Manual mode. The current thresholds for a skew violation are set to the point when OCP first moved to RHEL 9, which corresponds to RHEL version 9.2 and OCP version 4.13.0. The operator will perform semver comparisons of these thresholds against the boot image versions stored in bootImageSkewEnforcementStatus and set Upgradeable=False if necessary. To verify this, first set the mode to Manual with an out-of-skew boot image version, like so:

  spec:
    bootImageSkewEnforcement:
      manual:
        mode: RHCOSVersion
        rhcosVersion: 9.0.20251023-0
      mode: Manual

Now, examine the conditions field of the machine-config ClusterOperator (CO) object; it should indicate an issue preventing upgrades, like so:

$ oc get co machine-config -o yaml
...
  - lastTransitionTime: "2025-11-20T15:15:12Z"
    message: 'Upgrades have been disabled because the cluster is using RHCOS boot
      image version 9.0.20251023-0(RHEL version: 9.0), which is below the minimum
      required RHEL version 9.2. To enable upgrades, please update your boot images
      following the documentation at [TODO: insert link], or disable boot image skew
      enforcement at [TODO: insert link]'
    reason: ClusterBootImageSkewError
    status: "False"
    type: Upgradeable

Next, set the boot image to one within the skew limits:

  spec:
    bootImageSkewEnforcement:
      manual:
        mode: RHCOSVersion
        rhcosVersion: 9.2.20251023-0
      mode: Manual

Then, the Upgradeable condition should be restored to True:

  - lastTransitionTime: "2025-11-20T15:19:25Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
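
To check the condition from a script rather than by eye, something like the following could be used. It is a small Python sketch that assumes the ClusterOperator object was fetched as JSON (for instance with `oc get co machine-config -o json`); `upgradeable_condition` is a hypothetical helper name.

```python
import json


def upgradeable_condition(clusteroperator_json: str) -> dict:
    """Return the Upgradeable condition from a ClusterOperator JSON document."""
    obj = json.loads(clusteroperator_json)
    for cond in obj.get("status", {}).get("conditions", []):
        if cond.get("type") == "Upgradeable":
            return cond
    raise LookupError("no Upgradeable condition found")
```

The returned dictionary carries the same `status`, `reason`, and `message` fields shown in the snippets above, so a verification script can assert on `reason` being `ClusterBootImageSkewError` or `AsExpected`.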

This set of steps can be repeated with the OCPVersion specified too. This comparison should only take place in Automatic and Manual mode; however, as Automatic is only permitted on the status side, I don't think there is an easy way to test that (other than the unit tests I've included).

In None mode, this version check should not take place.

Some caveats to note about Automatic mode:

  1. The admin is not permitted to use Automatic mode within the spec. This is an intentional choice because only the MCO can always determine for itself whether a platform is eligible for automatic skew enforcement.
  2. In Automatic mode, API validations will prevent changing the boot image configuration to a setting other than All. To change the boot image configuration, the admin is first expected to go to Manual skew enforcement mode and then attempt to change the boot image configuration of the cluster.
  3. In Automatic mode, if any machinesets are skipped for boot image updates (for example, a marketplace or an unknown boot image was detected in any of the machinesets), the boot image controller will not update the boot image value stored in bootImageSkewEnforcementStatus. This is because the cluster cannot be considered up to date on boot images if even one of the machine resources is out of skew.
  4. In Automatic mode, the operator will only populate the OCPVersion. This is because each platform may not have the same RHCOS version of the boot image (for example, across marketplace streams) in a given release, and it would involve a lot of per-platform piping to correctly track the RHCOS version per machineset within the boot image controller. I did not deem this to be worth the effort, but am open to implementing that later if the need arises.
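
Caveat 3 boils down to an all-or-nothing rule, which the following Python sketch tries to capture. The names `MachineSetResult` and `version_to_record` are hypothetical, and treating an empty machineset list as "all updated" is an assumption that the description does not confirm.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MachineSetResult:
    name: str
    reconciled: bool       # boot image is at the target version after the sync
    skipped: bool = False  # e.g. marketplace or unknown boot image detected


def version_to_record(mode: str,
                      results: List[MachineSetResult],
                      release_version: str) -> Optional[str]:
    """Return the ocpVersion to store in status, or None to leave status untouched."""
    if mode != "Automatic":
        return None  # the controller only records versions in Automatic mode
    if any(r.skipped or not r.reconciled for r in results):
        return None  # one stale machineset means the cluster is not up to date
    return release_version
```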

openshift-ci-robot added the jira/valid-reference label Nov 19, 2025

openshift-ci-robot commented Nov 19, 2025

@djoshy: This pull request references MCO-1877 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

openshift-ci bot added the do-not-merge/work-in-progress label Nov 19, 2025

openshift-ci bot commented Nov 19, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot added the approved label Nov 19, 2025
openshift-ci-robot commented Nov 20, 2025

@djoshy: This pull request references MCO-1877 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Details

In response to this:

This PR integrates the boot image skew enforcement API introduced in openshift/api#2357. This involves the following changes:

  • The operator now populates the bootImageSkewEnforcementStatus field in the MachineConfiguration object based on spec.bootImageSkewEnforcement, platform defaults and cluster version.
  • The boot image controller will now update the current boot image value in bootImageSkewEnforcementStatus on a successful boot image update. Note that this requires the skew enforcement to be set to Automatic mode, and all machinesets to be opted in for boot image updates.
  • The operator will set Upgradeable=False if the cluster is detected to be out of skew. This is done by comparing the boot image values referenced in the bootImageSkewEnforcementStatus field against the MCO's hardcoded skew limits.
  • Some unit tests have been added to sync_test.go and status_test.go to verify the above mechanisms.

Verifying API behavior

This verification will have to be done based on the platform. If the platform:

  • supports boot image updates and has them on by default (AWS and GCP at the time of writing), i.e. status.managedBootImagesStatus is set to All if spec.managedBootImages is empty: the skew enforcement status will be set to Automatic, with a boot image version estimated from the cluster version. The boot image controller will then perform a sync, which updates the boot image (if required); after all resources have been successfully updated, it updates the boot image value stored in the skew enforcement status. The value set will be the OCP releaseVersion described by the coreos-bootimages configmap. Here's an example:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • supports boot image updates, but does not have them on by default (vSphere and Azure at the time of writing), i.e. status.managedBootImagesStatus is set to None if spec.managedBootImages is empty: the skew enforcement status will be set to Manual, with a boot image version estimated from the cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 0 of 0 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: None

The admin can choose to opt in to boot image updates in this case (by setting spec.managedBootImages to All), and the operator should automatically switch the skew enforcement status to Automatic, with the appropriate boot image version. The object would then look like this:

 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
   managedBootImages:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • does not support boot image updates (all other platforms at the time of writing), i.e. status.managedBootImagesStatus is empty and spec.managedBootImages cannot be set by the admin: the skew enforcement status will be set to Manual, with a boot image version estimated from the cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual

In this case, the admin is expected to manually perform boot image updates and then add a spec field like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2

The operator should then update the status to include this:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2
status:
 bootImageSkewEnforcementStatus:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2

The above snippet applies if the admin has chosen to record the OCPVersion. In Manual mode, the admin can also choose to store the RHCOSVersion, like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0
status:
 bootImageSkewEnforcementStatus:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0

Note that only one of RHCOSVersion or OCPVersion is permitted in Manual mode.
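
To make the "only one of" rule concrete, here is a small Go sketch of the check. The struct and function names are hypothetical stand-ins for the manual block shown above; the real validation lives in the API definition (openshift/api#2357) and may be expressed differently.

```go
package main

import (
	"errors"
	"fmt"
)

// manualConfig is a hypothetical stand-in for the "manual" block in the
// snippets above; it is not the actual API type.
type manualConfig struct {
	Mode         string // "OCPVersion" or "RHCOSVersion"
	OCPVersion   string
	RHCOSVersion string
}

// validateManual enforces that exactly the version field selected by Mode
// is set, and that the other one is left empty.
func validateManual(c manualConfig) error {
	switch c.Mode {
	case "OCPVersion":
		if c.OCPVersion == "" || c.RHCOSVersion != "" {
			return errors.New("mode OCPVersion requires ocpVersion and forbids rhcosVersion")
		}
	case "RHCOSVersion":
		if c.RHCOSVersion == "" || c.OCPVersion != "" {
			return errors.New("mode RHCOSVersion requires rhcosVersion and forbids ocpVersion")
		}
	default:
		return fmt.Errorf("unknown manual mode %q", c.Mode)
	}
	return nil
}

func main() {
	ok := manualConfig{Mode: "RHCOSVersion", RHCOSVersion: "9.2.20251023-0"}
	both := manualConfig{Mode: "OCPVersion", OCPVersion: "4.21.2", RHCOSVersion: "9.2.20251023-0"}
	fmt.Println("single version:", validateManual(ok))
	fmt.Println("both versions:", validateManual(both))
}
```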

The admin can also choose to disable skew enforcement altogether by setting the mode to None in the spec:

spec:
 bootImageSkewEnforcement:
   mode: None
status:
 bootImageSkewEnforcementStatus:
   mode: None
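
The platform cases above can be summarised in a short Go sketch. This is only an illustration of the described behavior, with hypothetical function names and plain strings instead of the real API types. It assumes that an explicit spec.bootImageSkewEnforcement mode (Manual or None) is mirrored into status, and that Automatic is chosen only when the machineset selection resolves to All.

```go
package main

import "fmt"

// effectiveManagedMode returns the machineset selection mode expected in
// status.managedBootImagesStatus. platformDefault is "All", "None", or ""
// when the platform does not support boot image updates.
func effectiveManagedMode(platformDefault, specMode string) string {
	if platformDefault == "" {
		return "" // unsupported: spec.managedBootImages cannot be set
	}
	if specMode != "" {
		return specMode
	}
	return platformDefault
}

// skewEnforcementMode returns the mode expected in
// bootImageSkewEnforcementStatus. specSkewMode is the admin's
// spec.bootImageSkewEnforcement mode: "", "Manual" or "None".
func skewEnforcementMode(specSkewMode, managedMode string) string {
	if specSkewMode != "" {
		return specSkewMode // the admin's explicit choice wins
	}
	if managedMode == "All" {
		return "Automatic"
	}
	return "Manual"
}

func main() {
	cases := []struct{ name, platformDefault, specManaged, specSkew string }{
		{"on by default", "All", "", ""},
		{"opt-in, untouched", "None", "", ""},
		{"opt-in, admin opted in", "None", "All", ""},
		{"unsupported", "", "", ""},
		{"admin disabled enforcement", "All", "", "None"},
	}
	for _, c := range cases {
		managed := effectiveManagedMode(c.platformDefault, c.specManaged)
		fmt.Printf("%-28s -> %s\n", c.name, skewEnforcementMode(c.specSkew, managed))
	}
}
```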

Verifying upgrade block

Upgrades will be blocked when the cluster is determined to be out of skew. This mechanism works the same way in Manual and Automatic mode, although it is likely easier to verify in Manual mode. The current thresholds for a skew violation are set to the point when OCP first moved to RHEL 9, which corresponds to RHEL version 9.2 and OCP version 4.13.0. The operator will perform semver comparisons of the boot image versions stored in bootImageSkewEnforcementStatus against these thresholds and set Upgradeable=False if necessary. To verify this, first set the mode to Manual with an out-of-skew boot image version, like so:

 spec:
   bootImageSkewEnforcement:
     manual:
       mode: RHCOSVersion
       rhcosVersion: 9.0.20251023-0
     mode: Manual

Now, examine the conditions field of the ClusterOperator (CO) object named machine-config; it should indicate an issue preventing upgrades, like so:

 - lastTransitionTime: "2025-11-20T15:15:12Z"
   message: 'Upgrades have been disabled because the cluster is using RHCOS boot
     image version 9.0.20251023-0(RHEL version: 9.0), which is below the minimum
     required RHEL version 9.2. To enable upgrades, please update your boot images
     following the documentation at [TODO: insert link], or disable boot image skew
     enforcement at [TODO: insert link]'
   reason: ClusterBootImageSkewError
   status: "False"
   type: Upgradeable

Next, set the boot image to one within the skew limits:

 spec:
   bootImageSkewEnforcement:
     manual:
       mode: RHCOSVersion
       rhcosVersion: 9.2.20251023-0
     mode: Manual

Then, the Upgradeable condition should be restored to True:

 - lastTransitionTime: "2025-11-20T15:19:25Z"
   reason: AsExpected
   status: "True"
   type: Upgradeable

This set of steps can be repeated with the OCPVersion specified too. This comparison should only take place in Automatic and Manual mode; however, as Automatic is only permitted on the status side, I don't think there is an easy way to test that (other than the unit tests I've included).

In None mode, this version check should not take place.
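
For the OCPVersion variant of the check, a minimal Go sketch might look like the following. It is an assumption-laden illustration, not the operator's code: the function names are hypothetical, only plain major.minor.patch versions are handled (no pre-release suffixes), and the 4.13.0 threshold is the one quoted above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minOCPVersion is the OCP threshold quoted in the description above.
const minOCPVersion = "4.13.0"

// parseVersion parses a plain "major.minor.patch" version string.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("expected major.minor.patch, got %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad component %q in %q", p, v)
		}
		out[i] = n
	}
	return out, nil
}

// upgradeable reports whether the OCP version skew check allows upgrades.
// In None mode the version is not inspected at all.
func upgradeable(mode, ocpVersion string) (bool, error) {
	if mode == "None" {
		return true, nil
	}
	got, err := parseVersion(ocpVersion)
	if err != nil {
		return false, err
	}
	floor, err := parseVersion(minOCPVersion)
	if err != nil {
		return false, err
	}
	// Compare component by component, most significant first.
	for i := range got {
		if got[i] != floor[i] {
			return got[i] > floor[i], nil
		}
	}
	return true, nil // equal to the threshold counts as within skew
}

func main() {
	for _, v := range []string{"4.12.9", "4.13.0", "4.21.0"} {
		ok, err := upgradeable("Manual", v)
		fmt.Println(v, "upgradeable:", ok, "error:", err)
	}
}
```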

Some caveats to note about Automatic mode:

  1. The admin is not permitted to use Automatic mode within the spec. This is an intentional choice, because only the MCO will always be able to determine for itself whether a platform is eligible for automatic skew enforcement.
  2. In Automatic mode, API validations will prevent changing the boot image configuration to a setting other than All. To change the boot image configuration, the admin is expected to first switch to Manual skew enforcement mode and then change the boot image configuration of the cluster.
  3. In Automatic mode, if any machinesets are skipped for boot image updates (for example, a marketplace or an unknown boot image was detected in any of the machinesets), the boot image controller will not update the boot image value stored in bootImageSkewEnforcementStatus. This is because the cluster cannot be considered up to date on boot images if even one of the machine resources is out of skew.
  4. In Automatic mode, the operator will only populate the OCPVersion. This is because each platform may not have the same RHCOS version of the boot image (for example, across marketplace streams) in a given release, and it would involve a lot of per-platform piping to correctly track the RHCOS version per machineset within the boot image controller. I did not deem this to be worth the effort, but am open to implementing that later if the need arises.
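
Caveat 3 amounts to an all-or-nothing gate on the status update. A hedged Go sketch of that gate follows; syncResult and shouldRecordBootImage are hypothetical names invented for illustration, and the counters loosely mirror the "Reconciled N of M" messages shown in the conditions above.

```go
package main

import "fmt"

// syncResult is a hypothetical summary of one boot image controller sync.
type syncResult struct {
	Total      int // machine resources considered
	Reconciled int // resources successfully reconciled
	Skipped    int // resources skipped (e.g. marketplace or unknown boot image)
}

// shouldRecordBootImage reports whether the controller may write the new
// boot image version into the skew enforcement status. Only Automatic mode
// qualifies, and only when nothing was skipped or left unreconciled.
func shouldRecordBootImage(mode string, r syncResult) bool {
	if mode != "Automatic" {
		return false
	}
	return r.Skipped == 0 && r.Reconciled == r.Total
}

func main() {
	fmt.Println(shouldRecordBootImage("Automatic", syncResult{Total: 3, Reconciled: 3}))
	fmt.Println(shouldRecordBootImage("Automatic", syncResult{Total: 3, Reconciled: 2, Skipped: 1}))
	fmt.Println(shouldRecordBootImage("Manual", syncResult{Total: 3, Reconciled: 3}))
}
```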

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Nov 20, 2025

@djoshy: This pull request references MCO-1877 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Details

In response to this:

This PR integrates the boot image skew enforcement API introduced in openshift/api#2357. This involves the following changes:

  • The operator now populates the bootImageSkewEnforcementStatus field in the MachineConfiguration object based on spec.bootImageSkewEnforcement, platform defaults and cluster version.
  • The boot image controller will now update the current boot image value in bootImageSkewEnforcementStatus on a successful boot image update. Note that this requires the skew enforcement to be set to Automatic mode, and all machinesets to be opt-ed in for boot image updates.
  • The operator will set Upgradeable=False if the cluster is to be detected to be out of skew. This is done by comparing the boot image values referenced in the bootImageSkewEnforcementStatus field against the MCO's hardcoded skew limits.
  • Some unit tests have been added to sync_test.go and status_test.go to verify the above mechanisms.

Verifying API behavior

This verification will have to be done based on the platform. If the platform:

  • supports boot image updates and it is on by default(AWS and GCP at the time of writing), i.e. status.managedBootImagesStatus is set to All if spec.managedBootImages is empty. Then, skew enforcement status will be set to Automatic, with a boot image version estimated from cluster version. Then, the boot image controller will perform a sync which will update the boot image(if required) and after all resources have been successfully updated, it will update the boot image value stored in the skew enforcement status. The value set will be the OCP releaseVersion described by the coreos-bootimages configmap. Here's an example:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • supports boot image updates, but is not on by default(vsphere and Azure at the time of writing) i.e. status.managedBootImagesStatus is set to None if spec.managedBootImages is empty. Then, skew enforcement status will be set to Manual, with a boot image version estimated from cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 0 of 0 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: None

The admin can choose to opt-in for boot image updates in this case(set spec.ManagedBootImages to All), and the operator should automatically switch the skew enforcement status to Automatic, with the appropriate boot image version. This would mean the object would finally look like this:

 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
   managedBootImages:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • does not support boot image updates(all other platforms at the time of writing) i.e. status.managedBootImagesStatus is empty and spec.managedBootImages cannot be set by the admin. Then, skew enforcement status will be set to Manual, with a boot image version estimated from cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual

In this case, the admin is expected to manually perform boot image updates and then add a spec field like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2

The operator should then update the status to include this:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2
status:
 bootImageSkewEnforcementStatus:
     mode: OCPVersion
     ocpVersion: 4.21.2

The above snippet is if an admin had chosen to record the OCPVersion. In manual mode, the admin can also choose to to store the RHCOSVersion, like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0
status:
 bootImageSkewEnforcementStatus:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0

Note that only one of RHCOSVersion or OCPVersion is permitted in Manual mode.

The admin can also choose to disable skew enforcement altogether by setting it None mode in spec.

spec:
 bootImageSkewEnforcement:
   mode: None
status:
 bootImageSkewEnforcementStatus:
   mode: None

Verifying upgrade block

Upgrades will be blocked when the cluster is to determined out of skew. This mechanism works the same way in manual and automatic mode, although it is likely easier to verify in manual mode. The current thresholds for a skew violation is set to when OCP first moved to RHEL9, which corresponds to RHEL version 9.2 and OCP version 4.13.0. The operator will perform semver comparisons of these thresholds against the boot image versions stored in bootImageSkewEnforcementStatus and set Upgradeable=False if necessary. To verify this, first set the mode to Manual with an out of skew boot image version like so:

 spec:
   bootImageSkewEnforcement:
     manual:
  mode: RHCOSVersion
       rhcosVersion: 9.0.20251023-0
     mode: Manual

Now, examine the CO object named machine-config's conditions field, it should show indicate an issue preventing upgrades like so:

 - lastTransitionTime: "2025-11-20T15:15:12Z"
   message: 'Upgrades have been disabled because the cluster is using RHCOS boot
     image version 9.0.20251023-0(RHEL version: 9.0), which is below the minimum
     required RHEL version 9.2. To enable upgrades, please update your boot images
     following the documentation at [TODO: insert link], or disable boot image skew
     enforcement at [TODO: insert link]'
   reason: ClusterBootImageSkewError
   status: "False"
   type: Upgradeable

Next, set the boot image to one within the skew limits:

 spec:
   bootImageSkewEnforcement:
     manual:
  mode: RHCOSVersion
       rhcosVersion: 9.2.20251023-0
     mode: Manual

Then, the Upgradeable condition should be restored back to True

 - lastTransitionTime: "2025-11-20T15:19:25Z"
   reason: AsExpected
   status: "True"
   type: Upgradeable

This set of steps can be repeated with the OCPVersion specified too. This comparison should only take place in Automatic and Manual mode. However, as Automatic is only permitted on the status side, I don't think there is an easy way to test that (other than the units I've included).

In None mode, this version check should not take place.
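
The threshold comparison described above can be sketched roughly as follows. This is a minimal illustration using only the Go standard library, not the MCO's actual code: the helper names parseNumeric and belowLimit are hypothetical, and the assumption that only the first two numeric fields of an RHCOS version (the RHEL major and minor) take part in the comparison is inferred from the "(RHEL version: 9.0)" text in the condition message.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Thresholds quoted in this PR description.
const (
	rhcosLimit = "9.2"
	ocpLimit   = "4.13.0"
)

// parseNumeric (hypothetical helper) reads at most n leading dot-separated
// integer fields from v. Fields beyond n, such as the RHCOS build timestamp,
// are ignored; missing fields default to 0. Pre-release suffixes are not
// handled and produce an error.
func parseNumeric(v string, n int) ([]int, error) {
	parts := strings.Split(v, ".")
	if len(parts) > n {
		parts = parts[:n]
	}
	out := make([]int, n)
	for i, p := range parts {
		num, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("invalid version %q: %w", v, err)
		}
		out[i] = num
	}
	return out, nil
}

// belowLimit reports whether version sorts strictly before limit when both
// are compared on their first n numeric fields.
func belowLimit(version, limit string, n int) (bool, error) {
	v, err := parseNumeric(version, n)
	if err != nil {
		return false, err
	}
	l, err := parseNumeric(limit, n)
	if err != nil {
		return false, err
	}
	for i := 0; i < n; i++ {
		if v[i] != l[i] {
			return v[i] < l[i], nil
		}
	}
	return false, nil // equal to the limit counts as within skew
}

func main() {
	for _, v := range []string{"9.0.20251023-0", "9.2.20251023-0"} {
		below, err := belowLimit(v, rhcosLimit, 2)
		fmt.Println("RHCOS", v, "below limit:", below, err)
	}
	below, err := belowLimit("4.12.84", ocpLimit, 3)
	fmt.Println("OCP 4.12.84 below limit:", below, err)
}
```

With the values used in the steps above, 9.0.20251023-0 falls below the RHEL 9.2 threshold and 9.2.20251023-0 does not, which matches the Upgradeable conditions shown.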

Some caveats to note about Automatic mode:

  1. The admin is not permitted to use Automatic mode within the spec. This is an intentional choice because only the MCO can always determine for itself whether a platform is eligible for automatic skew enforcement.
  2. In Automatic mode, API validations will prevent changing the boot image configuration to a setting other than All. To change the boot image configuration, the admin is first expected to go to Manual skew enforcement mode and then attempt to change the boot image configuration of the cluster.
  3. In Automatic mode, if any machinesets are skipped for boot image updates (for example, a marketplace or an unknown boot image was detected in any of the machinesets), the boot image controller will not update the boot image value stored in bootImageSkewEnforcementStatus. This is because the cluster cannot be considered up to date on boot images if even one of the machine resources is out of skew.
  4. In Automatic mode, the operator will only populate the OCPVersion. This is because each platform may not have the same RHCOS version of the boot image(for example, across marketplace streams) in a given release, and it would involve a lot of per-platform piping to correctly track the RHCOS version per machineset within the boot image controller. I did not deem this to be worth the effort, but am open to implementing that later if the need arises.
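
Caveat 3 above amounts to an all-or-nothing rule, which can be sketched as follows. This is only an illustration under stated assumptions: the machineSetResult type and the canRecordBootImageVersion function are hypothetical names, not the controller's real types, and how an empty list of machinesets is treated is a guess.

```go
package main

import "fmt"

// machineSetResult is a hypothetical summary of one machineset after a
// boot image sync.
type machineSetResult struct {
	Name    string
	Skipped bool // for example, a marketplace or unknown boot image
	Failed  bool
}

// canRecordBootImageVersion returns true only when every machineset was
// reconciled. Otherwise it also returns the names that blocked the update.
// Assumption: an empty list is treated as "nothing blocked".
func canRecordBootImageVersion(results []machineSetResult) (bool, []string) {
	var blocked []string
	for _, r := range results {
		if r.Skipped || r.Failed {
			blocked = append(blocked, r.Name)
		}
	}
	return len(blocked) == 0, blocked
}

func main() {
	results := []machineSetResult{
		{Name: "worker-a"},
		{Name: "worker-b", Skipped: true},
	}
	ok, blocked := canRecordBootImageVersion(results)
	fmt.Println("record new version:", ok, "blocked by:", blocked)
}
```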

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 20, 2025

@djoshy: This pull request references MCO-1877 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Details

In response to this:

This PR integrates the boot image skew enforcement API introduced in openshift/api#2357. This involves the following changes:

  • The operator now populates the bootImageSkewEnforcementStatus field in the MachineConfiguration object based on spec.bootImageSkewEnforcement, platform defaults and cluster version.
  • The boot image controller will now update the current boot image value in bootImageSkewEnforcementStatus on a successful boot image update. Note that this requires the skew enforcement to be set to Automatic mode, and all machinesets to be opt-ed in for boot image updates.
  • The operator will set Upgradeable=False if the cluster is detected to be out of skew. This is done by comparing the boot image values referenced in the bootImageSkewEnforcementStatus field against the MCO's hardcoded skew limits.
  • Some unit tests have been added to sync_test.go and status_test.go to verify the above mechanisms.

Verifying API behavior

This verification will have to be done based on the platform. If the platform:

  • supports boot image updates and it is on by default(AWS and GCP at the time of writing), i.e. status.managedBootImagesStatus is set to All if spec.managedBootImages is empty. Then, skew enforcement status will be set to Automatic, with a boot image version estimated from cluster version. Then, the boot image controller will perform a sync which will update the boot image(if required) and after all resources have been successfully updated, it will update the boot image value stored in the skew enforcement status. The value set will be the OCP releaseVersion described by the coreos-bootimages configmap. Here's an example:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • supports boot image updates, but is not on by default(vsphere and Azure at the time of writing) i.e. status.managedBootImagesStatus is set to None if spec.managedBootImages is empty. Then, skew enforcement status will be set to Manual, with a boot image version estimated from cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 0 of 0 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: None

The admin can choose to opt-in for boot image updates in this case(set spec.ManagedBootImages to All), and the operator should automatically switch the skew enforcement status to Automatic, with the appropriate boot image version. This would mean the object would finally look like this:

 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
   managedBootImages:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
 status:
   bootImageSkewEnforcementStatus:
     automatic:
       ocpVersion: 4.21.0
     mode: Automatic
   conditions:
   - lastTransitionTime: "2025-11-19T22:06:06Z"
     message: Reconciled 3 of 3 MAPI MachineSets | Reconciled 0 of 0 ControlPlaneMachineSets
       | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateProgressing
   - lastTransitionTime: "2025-11-19T22:06:07Z"
     message: 0 Degraded MAPI MachineSets | 0 Degraded ControlPlaneMachineSets |
       0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
     reason: BootImageConfigMapAdded
     status: "False"
     type: BootImageUpdateDegraded
   managedBootImagesStatus:
     machineManagers:
     - apiGroup: machine.openshift.io
       resource: machinesets
       selection:
         mode: All
  • does not support boot image updates(all other platforms at the time of writing) i.e. status.managedBootImagesStatus is empty and spec.managedBootImages cannot be set by the admin. Then, skew enforcement status will be set to Manual, with a boot image version estimated from cluster version. The object would now look like this:
 spec:
   logLevel: Normal
   managementState: Managed
   operatorLogLevel: Normal
 status:
   bootImageSkewEnforcementStatus:
     manual:
       mode: OCPVersion
       ocpVersion: 4.21.0
     mode: Manual

In this case, the admin is expected to manually perform boot image updates and then add a spec field like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2

The operator should then update the status to include this:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2
status:
 bootImageSkewEnforcementStatus:
   mode: Manual
   manual:
     mode: OCPVersion
     ocpVersion: 4.21.2

The above snippet applies if an admin has chosen to record the OCPVersion. In Manual mode, the admin can also choose to store the RHCOSVersion, like so:

spec:
 bootImageSkewEnforcement:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0
status:
 bootImageSkewEnforcementStatus:
   mode: Manual
   manual:
     mode: RHCOSVersion
     rhcosVersion: 9.0.20251023-0

Note that only one of RHCOSVersion or OCPVersion is permitted in Manual mode.

The admin can also choose to disable skew enforcement altogether by setting the mode to None in the spec:

spec:
 bootImageSkewEnforcement:
   mode: None
status:
 bootImageSkewEnforcementStatus:
   mode: None
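
The rules in this section for how the status mode follows from the spec and the platform defaults can be summarized in a small sketch. It is a simplification under stated assumptions: the skewConfig type and deriveStatus function are hypothetical, the boot image configuration is reduced to a single boolean, and the estimated OCP version is passed in rather than computed.

```go
package main

import "fmt"

// skewConfig is a simplified, hypothetical stand-in for the
// bootImageSkewEnforcement spec and status shapes shown in the YAML above.
type skewConfig struct {
	Mode       string // "", "Automatic", "Manual" or "None"
	VersionKey string // "OCPVersion" or "RHCOSVersion"; empty for None
	Version    string
}

// deriveStatus follows the behavior described in this section:
//   - an explicit Manual or None spec is reflected in the status as-is;
//   - with an empty spec, the status is Automatic when boot image updates
//     are enabled for all machinesets, and Manual otherwise;
//   - in both defaulted cases the version is an OCP version estimated from
//     the cluster version.
func deriveStatus(spec skewConfig, managedBootImagesAll bool, estimatedOCP string) skewConfig {
	switch spec.Mode {
	case "Manual", "None":
		return spec
	}
	if managedBootImagesAll {
		return skewConfig{Mode: "Automatic", VersionKey: "OCPVersion", Version: estimatedOCP}
	}
	return skewConfig{Mode: "Manual", VersionKey: "OCPVersion", Version: estimatedOCP}
}

func main() {
	// Platform with boot image updates on by default.
	fmt.Println(deriveStatus(skewConfig{}, true, "4.21.0"))
	// Platform without default boot image updates.
	fmt.Println(deriveStatus(skewConfig{}, false, "4.21.0"))
	// Admin recorded a manual update.
	manual := skewConfig{Mode: "Manual", VersionKey: "OCPVersion", Version: "4.21.2"}
	fmt.Println(deriveStatus(manual, false, "4.21.0"))
}
```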

Verifying upgrade block

Upgrades will be blocked when the cluster is determined to be out of skew. This mechanism works the same way in Manual and Automatic mode, although it is likely easier to verify in Manual mode. The current thresholds for a skew violation are set to the point at which OCP first moved to RHEL 9, which corresponds to RHEL version 9.2 and OCP version 4.13.0. The operator will perform semver comparisons of these thresholds against the boot image versions stored in bootImageSkewEnforcementStatus and set Upgradeable=False if necessary. To verify this, first set the mode to Manual with an out-of-skew boot image version like so:

 spec:
   bootImageSkewEnforcement:
     manual:
       mode: RHCOSVersion
       rhcosVersion: 9.0.20251023-0
     mode: Manual

Now, examine the machine-config CO object's conditions field; it should indicate an issue preventing upgrades, like so:

$ oc get co machine-config -o yaml
...
 - lastTransitionTime: "2025-11-20T15:15:12Z"
   message: 'Upgrades have been disabled because the cluster is using RHCOS boot
     image version 9.0.20251023-0(RHEL version: 9.0), which is below the minimum
     required RHEL version 9.2. To enable upgrades, please update your boot images
     following the documentation at [TODO: insert link], or disable boot image skew
     enforcement at [TODO: insert link]'
   reason: ClusterBootImageSkewError
   status: "False"
   type: Upgradeable

Next, set the boot image to one within the skew limits:

 spec:
   bootImageSkewEnforcement:
     manual:
       mode: RHCOSVersion
       rhcosVersion: 9.2.20251023-0
     mode: Manual

Then, the Upgradeable condition should be restored to True:

 - lastTransitionTime: "2025-11-20T15:19:25Z"
   reason: AsExpected
   status: "True"
   type: Upgradeable

This set of steps can be repeated with the OCPVersion specified too. This comparison should only take place in Automatic and Manual mode. However, as Automatic is only permitted on the status side, I don't think there is an easy way to test that (other than the units I've included).

In None mode, this version check should not take place.

Some caveats to note about Automatic mode:

  1. The admin is not permitted to use Automatic mode within the spec. This is an intentional choice because only the MCO can always determine for itself whether a platform is eligible for automatic skew enforcement.
  2. In Automatic mode, API validations will prevent changing the boot image configuration to a setting other than All. To change the boot image configuration, the admin is first expected to go to Manual skew enforcement mode and then attempt to change the boot image configuration of the cluster.
  3. In Automatic mode, if any machinesets are skipped for boot image updates (for example, a marketplace or an unknown boot image was detected in any of the machinesets), the boot image controller will not update the boot image value stored in bootImageSkewEnforcementStatus. This is because the cluster cannot be considered up to date on boot images if even one of the machine resources is out of skew.
  4. In Automatic mode, the operator will only populate the OCPVersion. This is because each platform may not have the same RHCOS version of the boot image(for example, across marketplace streams) in a given release, and it would involve a lot of per-platform piping to correctly track the RHCOS version per machineset within the boot image controller. I did not deem this to be worth the effort, but am open to implementing that later if the need arises.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@djoshy djoshy force-pushed the implement-skew-enforcement branch from dc9203e to 7b578ab Compare November 20, 2025 16:25
@djoshy djoshy marked this pull request as ready for review November 20, 2025 20:55
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 20, 2025
@djoshy
Contributor Author

djoshy commented Nov 21, 2025

/retest-required

@djoshy djoshy force-pushed the implement-skew-enforcement branch from 7b578ab to dddd5c7 Compare November 21, 2025 16:06
@djoshy
Contributor Author

djoshy commented Nov 25, 2025

/retest-required

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 29, 2025
@djoshy djoshy force-pushed the implement-skew-enforcement branch from dddd5c7 to a9597e7 Compare December 1, 2025 13:31
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 1, 2025
@djoshy djoshy force-pushed the implement-skew-enforcement branch from a9597e7 to be70c0c Compare December 2, 2025 14:29
@djoshy djoshy force-pushed the implement-skew-enforcement branch from be70c0c to cbe0fbf Compare December 9, 2025 21:46
@djoshy djoshy force-pushed the implement-skew-enforcement branch 2 times, most recently from ad978bf to a803b27 Compare January 2, 2026 16:08
@djoshy
Contributor Author

djoshy commented Jan 2, 2026

Re-rebased to fix all the build issues, should be ready for a pass now 😄

Comment on lines +160 to +162
// Note: Update units in status_test.go when the following are bumped
RHCOSVersionBootImageSkewLimit = "9.2"
OCPVersionBootImageSkewLimit = "4.13.0"
Member

How will we remember to bump these?

Contributor Author

I envision these being updated when the RHEL major is being bumped, so perhaps it'd be a card within the "new" RHEL migration epic. Although, I could see it being a faster cadence if there are some RHEL bugs that can't be fixed easily. Thoughts, @yuqi-zhang?

Contributor

Let's confer with the RHCOS team on the exact cadence and definition. For TP I think it's fine to have it hard coded.

Contributor Author

sounds good, thanks!

Member

Could we just make it so that the skew is N-1 of the latest supported RHEL/RHCOS for the given stream?

i.e. if latest for this stream is 9.8 based then we'd support the bootimage being set to 9.6, but not 9.4?

It would kind of be nice if this could be dynamically updated (i.e based on set rules similar to what I described above) and then we'd always know, rather than relying on it being hardcoded here.

Contributor

I think it'd be nice to have a smaller skew to restrict our test matrix, but one of our main concerns is that it would be too aggressive for customers with non-automatically managed environments. We'd essentially be going from (you can use any bootimage) to (you have to manually update your bootimage and bootimage reference every 3-4 y-streams). So we thought we would start with a more relaxed skew and tighten based on technical concerns.

Happy to discuss more in detail in a call sometime.

Member

Yeah. Might be worth discussing how often we think is too often (in terms of time) and work backwards from there. i.e. I think having the customer do something once a year as part of maintenance wouldn't be a crazy ask.

Contributor Author

opened https://issues.redhat.com/browse/MCO-2104 to track, @yuqi-zhang mentioned bringing this to CoreOS cabal, so will try to bring that to the next one I can join!

This commit adds unit tests for the new Upgradeable guards added in the
previous commit.
This commit ensures that the boot image controller state is acceptable before checking the skew. This check is only done in Automatic mode.
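
As a rough sketch of the guard this commit describes, the decision order might look like the following. The names (outcome, controllerState, evaluateUpgradeable) are hypothetical and the result only models the skew-related contribution to Upgradeable; the real operator works on the BootImageUpdateDegraded and BootImageUpdateProgressing conditions and may differ in detail.

```go
package main

import "fmt"

type outcome string

const (
	upgradeable    outcome = "Upgradeable=True"
	notUpgradeable outcome = "Upgradeable=False"
	deferred       outcome = "skew check deferred"
)

// controllerState is a hypothetical summary of the boot image controller's
// Degraded and Progressing conditions.
type controllerState struct {
	Degraded    bool
	Progressing bool
}

// evaluateUpgradeable models only the skew-related decision:
//   - None mode: no version check at all;
//   - Automatic mode: a degraded controller blocks upgrades and an in-flight
//     update defers the check, before any version comparison happens;
//   - otherwise the result depends on whether the recorded boot image
//     version is below the skew limit.
func evaluateUpgradeable(mode string, state controllerState, belowLimit bool) outcome {
	if mode == "None" {
		return upgradeable
	}
	if mode == "Automatic" {
		if state.Degraded {
			return notUpgradeable
		}
		if state.Progressing {
			return deferred
		}
	}
	if belowLimit {
		return notUpgradeable
	}
	return upgradeable
}

func main() {
	fmt.Println(evaluateUpgradeable("Automatic", controllerState{Progressing: true}, true))
	fmt.Println(evaluateUpgradeable("Automatic", controllerState{Degraded: true}, false))
	fmt.Println(evaluateUpgradeable("Manual", controllerState{Degraded: true}, false))
	fmt.Println(evaluateUpgradeable("Manual", controllerState{}, true))
}
```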
@djoshy djoshy force-pushed the implement-skew-enforcement branch from 88ee5d1 to 2277bea Compare January 30, 2026 19:22
@sergiordlr
Contributor

Verified using IPI on AWS, GCP, Azure and Vsphere

Automatic skew was configured by MCO in AWS and GCP and Manual skew was configured by MCO in Azure and Vsphere.

We tested that the right version was used by the skew process by scaling down the CVO and manually editing the history in the clusterversion resource. MCO is correctly reporting the oldest version in clusterversion.status.history in the skew version, and it is correctly updating the value to the latest version when the bootimage cycle is successfully executed.
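
To illustrate what "the oldest version in clusterversion.status.history" means here, a minimal sketch follows. The historyEntry type is a hypothetical, trimmed-down view of a history item, the sample data is made up, and picking the entry with the earliest start time is an assumption about how "oldest" is determined rather than a description of the MCO's code.

```go
package main

import (
	"fmt"
	"time"
)

// historyEntry is a hypothetical, trimmed-down view of one item in
// clusterversion.status.history.
type historyEntry struct {
	Version     string
	StartedTime time.Time
}

// oldestVersion returns the version of the entry with the earliest start
// time, and false if the history is empty.
func oldestVersion(history []historyEntry) (string, bool) {
	if len(history) == 0 {
		return "", false
	}
	oldest := history[0]
	for _, h := range history[1:] {
		if h.StartedTime.Before(oldest.StartedTime) {
			oldest = h
		}
	}
	return oldest.Version, true
}

// sampleHistory returns made-up data, listed newest first.
func sampleHistory() []historyEntry {
	return []historyEntry{
		{Version: "4.22.0", StartedTime: time.Date(2026, time.February, 1, 0, 0, 0, 0, time.UTC)},
		{Version: "4.13.5", StartedTime: time.Date(2024, time.March, 1, 0, 0, 0, 0, time.UTC)},
		{Version: "4.12.84", StartedTime: time.Date(2023, time.January, 10, 0, 0, 0, 0, time.UTC)},
	}
}

func main() {
	v, ok := oldestVersion(sampleHistory())
	fmt.Println("initial skew version estimate:", v, ok)
}
```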

In #5547 we can see the automation for the tests that were executed to verify this PR (apart from manually hacking the history in clusterversion).

There is a pending test: upgrading a 4.12 cluster up to 4.22. We are working on it, nevertheless it will take some time and should not block this PR. If any problem is found in this test it can be reported as an issue after merging the code.

/verified by @sergiordlr

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 2, 2026
@openshift-ci-robot
Contributor

@sergiordlr: This PR has been marked as verified by @sergiordlr.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@isabella-janssen
Member

/lgtm

@djoshy
Contributor Author

djoshy commented Feb 2, 2026

/hold

/payload 4.22 nightly blocking

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2026
@openshift-ci
Contributor

openshift-ci bot commented Feb 2, 2026

@djoshy: trigger 14 job(s) of type blocking for the nightly release of OCP 4.22

  • periodic-ci-openshift-release-master-ci-4.22-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-master-ci-4.22-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-serial-1of2
  • periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-serial-2of2
  • periodic-ci-openshift-release-master-ci-4.22-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.22-e2e-aws-ovn-techpreview-serial-1of3
  • periodic-ci-openshift-release-master-ci-4.22-e2e-aws-ovn-techpreview-serial-2of3
  • periodic-ci-openshift-release-master-ci-4.22-e2e-aws-ovn-techpreview-serial-3of3
  • periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips-no-nat-instance
  • periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ipi-ovn-ipv4
  • periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/05988970-005c-11f1-89a5-3edfa57d9fb2-0

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 2, 2026
@djoshy
Contributor Author

djoshy commented Feb 2, 2026

/test all

@djoshy
Contributor Author

djoshy commented Feb 2, 2026

/payload-job periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-azure-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-azure-mco-disruptive-techpreview-2of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive-techpreview-2of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-2of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-vsphere-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-vsphere-mco-disruptive-techpreview-2of2

@openshift-ci
Contributor

openshift-ci bot commented Feb 2, 2026

@djoshy: trigger 8 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-azure-mco-disruptive-techpreview-1of2
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-azure-mco-disruptive-techpreview-2of2
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive-techpreview-1of2
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive-techpreview-2of2
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-1of2
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-2of2
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-vsphere-mco-disruptive-techpreview-1of2
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-vsphere-mco-disruptive-techpreview-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/500a0af0-005d-11f1-92e1-26cef23b55b1-0

@sergiordlr
Contributor

We verified the skew using IPI on AWS, upgrading from 4.12 to 4.22 and enabling techpreview.

The skew version was configured to:

  status:
    bootImageSkewEnforcementStatus:
      automatic:
        ocpVersion: 4.12.84
      mode: Automatic

We see that the cluster is not upgradeable

$ oc adm upgrade
...
Upgradeable=False
...
  * Cluster operator machine-config should not be upgraded between minor or major versions: ClusterBootImageSkewError: Upgrades have been disabled because the cluster is using OCP boot image version 4.12.84, which is below the minimum required version 4.13.0. To enable upgrades, please update your boot images following the documentation at [TODO: insert link], or disable boot image skew enforcement at [TODO: insert link]

The problem was that the AMI for 4.20 was recently updated and was not included in the MCO's AMI list, so the controller showed this error:

I0203 13:57:23.632096       1 platform_helpers.go:187] current AMI ami-0e0850e74100f0f31 is unknown, skipping update of MachineSet ci-op-l72znhyb-35499-fvns9-worker-us-east-1f
I0203 13:57:23.632118       1 ms_helpers.go:193] No patching required for MAPI machineset ci-op-l72znhyb-35499-fvns9-worker-us-east-1f

Since the bootimage loop could not properly update the images, the version was not updated.

We manually updated the AMI in the machinesets so that they use the latest 4.20 AMI known by the MCO. Once we did that, the update cycle was successfully executed and the skew version reported the right value:

  status:
    bootImageSkewEnforcementStatus:
      automatic:
        ocpVersion: 4.22.0
      mode: Automatic

After reporting the new version the cluster stopped reporting that it was not upgradeable because of the versions skew (it is still not upgradeable because it is techpreview, but that's expected).

@djoshy
Contributor Author

djoshy commented Feb 3, 2026

Trying some metal jobs, these have historically failed(unrelated to this work), but still would be interesting to see the results if the tests actually run:

/payload-job periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-dualstack-mco-disruptive-techpreview periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv6-mco-disruptive-techpreview periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Feb 3, 2026

@djoshy: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-dualstack-mco-disruptive-techpreview
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv6-mco-disruptive-techpreview
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4a07eb50-0115-11f1-88dd-905896631fcd-0

@djoshy
Contributor Author

djoshy commented Feb 4, 2026

/payload-job periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-dualstack-mco-disruptive-techpreview periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv6-mco-disruptive-techpreview periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Feb 4, 2026

@djoshy: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-dualstack-mco-disruptive-techpreview
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv6-mco-disruptive-techpreview
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ecb7d4b0-01cd-11f1-89c2-87ce54c957ad-0

Contributor

@yuqi-zhang yuqi-zhang left a comment

/lgtm

I think we've covered most of the main concerns, and we can iterate on some details (e.g. skew limits) as followups since this is still behind TP

@openshift-ci
Contributor

openshift-ci bot commented Feb 4, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, isabella-janssen, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [djoshy,isabella-janssen,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@djoshy
Contributor Author

djoshy commented Feb 4, 2026

/test all

@djoshy
Contributor Author

djoshy commented Feb 4, 2026

/unhold

metal runs look good, no new failures

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 4, 2026
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD c9188a4 and 2 for PR HEAD 2277bea in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 067395e and 1 for PR HEAD 2277bea in total

@openshift-ci
Contributor

openshift-ci bot commented Feb 5, 2026

@djoshy: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit aa8d7e3 into openshift:main Feb 5, 2026
15 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

7 participants