MCO-1877: MCO-1879: MCO-1882: MCO-1884: Implement boot image skew enforcement MVP#5428
@djoshy: This pull request references MCO-1877 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
dc9203e to 7b578ab
|
/retest-required |
7b578ab to dddd5c7
|
/retest-required |
dddd5c7 to a9597e7
a9597e7 to be70c0c
be70c0c to cbe0fbf
ad978bf to a803b27
|
Re-rebased to fix all the build issues, should be ready for a pass now 😄 |
// Note: Update units in status_test.go when the following are bumped
RHCOSVersionBootImageSkewLimit = "9.2"
OCPVersionBootImageSkewLimit = "4.13.0"
How will we remember to bump these?
I envision these being updated when the RHEL major is being bumped, so perhaps it'd be a card within the "new" RHEL migration epic. Although, I could see it being a faster cadence if there are some RHEL bugs that can't be fixed easily. Thoughts, @yuqi-zhang?
Let's confer with the RHCOS team on the exact cadence and definition. For TP I think it's fine to have it hard coded.
Could we just make it so that the skew is N-1 of the latest supported RHEL/RHCOS for the given stream?
I.e., if the latest for this stream is 9.8-based, then we'd support the bootimage being set to 9.6, but not 9.4?
It would be nice if this could be dynamically updated (i.e., based on set rules similar to what I described above); then we'd always know, rather than relying on it being hardcoded here.
I think it'd be nice to have a smaller skew to restrict our test matrix, but one of our main concerns is that it would be too aggressive for customers with non-automatically managed environments. We'd essentially be going from (you can use any bootimage) to (you have to manually update your bootimage and bootimage reference every 3-4 y-streams). So we thought we would start with a more relaxed skew and tighten based on technical concerns.
Happy to discuss more in detail in a call sometime.
Yeah. Might be worth discussing how often we think is too often (in terms of time) and work backwards from there. i.e. I think having the customer do something once a year as part of maintenance wouldn't be a crazy ask.
opened https://issues.redhat.com/browse/MCO-2104 to track, @yuqi-zhang mentioned bringing this to CoreOS cabal, so will try to bring that to the next one I can join!
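For reference while MCO-2104 is open, the "N-1" rule floated earlier in this thread could be sketched roughly as follows. This is only an illustration and not code from this PR: the `version` type, `skewFloor` and `withinSkew` are hypothetical names, versions are reduced to a major.minor pair, and the list of supported versions is assumed to be supplied by the caller.

```go
package main

import (
	"fmt"
	"sort"
)

// version is a hypothetical, simplified major.minor pair.
// Real RHCOS versions carry more components than this.
type version struct {
	major, minor int
}

// less reports whether v is older than o.
func (v version) less(o version) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	return v.minor < o.minor
}

// skewFloor returns the oldest version still allowed under a
// "latest minus n supported releases" policy. ok is false when the
// input is unusable (empty list or negative n).
func skewFloor(supported []version, n int) (floor version, ok bool) {
	if len(supported) == 0 || n < 0 {
		return version{}, false
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]version(nil), supported...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].less(sorted[j]) })
	idx := len(sorted) - 1 - n
	if idx < 0 {
		idx = 0 // fewer than n+1 releases: everything supported is allowed
	}
	return sorted[idx], true
}

// withinSkew reports whether boot is at or above the computed floor.
func withinSkew(boot version, supported []version, n int) bool {
	floor, ok := skewFloor(supported, n)
	return ok && !boot.less(floor)
}

func main() {
	// Assumed stream contents, taken from the 9.8/9.6/9.4 example above.
	supported := []version{{9, 4}, {9, 6}, {9, 8}}
	fmt.Println(withinSkew(version{9, 6}, supported, 1)) // true
	fmt.Println(withinSkew(version{9, 4}, supported, 1)) // false
}
```

With the hardcoded limits in this PR the floor is a constant instead; a rule like this would only replace how that constant is derived.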
This commit adds unit tests for the new Upgradeable guards added in the previous commit.
This commit ensures that the boot image controller state is acceptable before checking the skew. This check is only done in Automatic mode.
88ee5d1 to 2277bea
|
Verified using IPI on AWS, GCP, Azure and vSphere. Automatic skew was configured by MCO in AWS and GCP, and Manual skew was configured by MCO in Azure and vSphere. We tested that the right version was used by the skew process by scaling down the CVO and manually editing the history in the clusterversion resource. MCO correctly reports the oldest version in clusterversion.status.history as the skew version, and it correctly updates the value to the latest version when the bootimage cycle is successfully executed. In #5547 we can see the automation for the tests that were executed to verify this PR (apart from manually hacking the history in clusterversion). There is a pending test: upgrading a 4.12 cluster up to 4.22. We are working on it; nevertheless, it will take some time and should not block this PR. If any problem is found in this test, it can be reported as an issue after merging the code. /verified by @sergiordlr |
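A rough sketch of the "oldest version in clusterversion.status.history" lookup that this verification exercised is shown below. It is not the MCO's actual code: `historyEntry` and `oldestVersion` are hypothetical stand-ins, and the sketch relies on ClusterVersion history being ordered newest first, so the oldest entry is the last element. Whether the operator also filters on entry state is not covered here.

```go
package main

import "fmt"

// historyEntry is a hypothetical, trimmed-down stand-in for one entry of
// clusterversion.status.history; only the fields used here are kept.
type historyEntry struct {
	Version string
	State   string
}

// oldestVersion returns the version of the oldest history entry.
// The history list is ordered by recency with the newest update first,
// so the oldest entry sits at the end of the slice.
func oldestVersion(history []historyEntry) (string, bool) {
	if len(history) == 0 {
		return "", false // history may be empty during cluster startup
	}
	return history[len(history)-1].Version, true
}

func main() {
	// Illustrative values only, not taken from a real cluster.
	history := []historyEntry{
		{Version: "4.22.0", State: "Completed"},
		{Version: "4.21.0", State: "Completed"},
		{Version: "4.12.0", State: "Completed"},
	}
	v, ok := oldestVersion(history)
	fmt.Println(v, ok)
}
```

Editing the tail of the history list by hand, as described above, is what changes the value such a lookup would return.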
|
@sergiordlr: This PR has been marked as verified by |
|
/lgtm |
|
/hold /payload 4.22 nightly blocking |
|
@djoshy: trigger 14 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/05988970-005c-11f1-89a5-3edfa57d9fb2-0 |
|
/test all |
|
/payload-job periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-azure-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-azure-mco-disruptive-techpreview-2of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive-techpreview-2of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-2of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-vsphere-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-vsphere-mco-disruptive-techpreview-2of2 |
|
@djoshy: trigger 8 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/500a0af0-005d-11f1-92e1-26cef23b55b1-0 |
|
We verified the skew using IPI on AWS, upgrading from 4.12 to 4.22 and enabling techpreview. The skew version was configured to: We see that the cluster is not upgradeable. The problem was that the AMI for 4.20 was recently updated and it was not included in the MCO AMI list, hence the controller showed this error: Since the bootimage loop could not properly update the images, the version was not updated. We manually updated the AMI in the machinesets so that they use the last 4.20 AMI known by MCO; once we did that, the update cycle was successfully executed and the skew version reported the right value. After reporting the new version, the cluster stopped reporting that it was not upgradeable because of the version skew (it is still not upgradeable because it is techpreview, but that's expected). |
|
Trying some metal jobs. These have historically failed (unrelated to this work), but it would still be interesting to see the results if the tests actually run: /payload-job periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-dualstack-mco-disruptive-techpreview periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv6-mco-disruptive-techpreview periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-techpreview |
|
@djoshy: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4a07eb50-0115-11f1-88dd-905896631fcd-0 |
|
/payload-job periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-dualstack-mco-disruptive-techpreview periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv6-mco-disruptive-techpreview periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-techpreview |
|
@djoshy: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ecb7d4b0-01cd-11f1-89c2-87ce54c957ad-0 |
yuqi-zhang left a comment:
/lgtm
I think we've covered most of the main concerns, and we can iterate on some details (e.g. skew limits) as followups since this is still behind TP
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: djoshy, isabella-janssen, yuqi-zhang |
|
/test all |
|
/unhold metal runs look good, no new failures |
|
@djoshy: all tests passed! |
This PR integrates the boot image skew enforcement API introduced in openshift/api#2357. This involves the following changes:

- `bootImageSkewEnforcementStatus` field in the `MachineConfiguration` object based on `spec.bootImageSkewEnforcement`, platform defaults and cluster version.
- `bootImageSkewEnforcementStatus` on a successful boot image update. Note that this requires the skew enforcement to be set to `Automatic` mode, and all machinesets to be opted in for boot image updates.
- `sync_test.go` and `status_test.go` to verify the above mechanisms.

Verifying API behavior
This verification will have to be done based on the platform. If the platform:

- `status.managedBootImagesStatus` is set to `All` if `spec.managedBootImages` is empty. Then, skew enforcement status will be set to `Automatic`, with a boot image version estimated from the cluster version. The boot image controller will then perform a sync, which will update the boot image (if required). After all resources have been successfully updated, it will update the boot image value stored in the skew enforcement status. The value set will be the OCP `releaseVersion` described by the `coreos-bootimages` configmap. Here's an example:
- `status.managedBootImagesStatus` is set to `None` if `spec.managedBootImages` is empty. Then, skew enforcement status will be set to `Manual`, with a boot image version estimated from the cluster version. The object would now look like this:
  The admin can choose to opt in for boot image updates in this case (set `spec.managedBootImages` to `All`), and the operator should automatically switch the skew enforcement status to `Automatic`, with the appropriate boot image version. This would mean the object would finally look like this:
- `status.managedBootImagesStatus` is empty and `spec.managedBootImages` cannot be set by the admin. Then, skew enforcement status will be set to `Manual`, with a boot image version estimated from the cluster version. The object would now look like this:
  In this case, the admin is expected to manually perform boot image updates and then add a spec field like so:
  The operator should then update the status to include this:
  The above snippet is if an admin had chosen to record the `OCPVersion`. In manual mode, the admin can also choose to store the `RHCOSVersion`, like so:
  Note that only one of RHCOSVersion or OCPVersion is permitted in `Manual` mode.

The admin can also choose to disable skew enforcement altogether by setting it to `None` mode in spec.

Verifying upgrade block
Upgrades will be blocked when the cluster is determined to be out of skew. This mechanism works the same way in manual and automatic mode, although it is likely easier to verify in manual mode. The current threshold for a skew violation is set to when OCP first moved to RHEL 9, which corresponds to RHEL version 9.2 and OCP version 4.13.0. The operator will perform semver comparisons of these thresholds against the boot image versions stored in `bootImageSkewEnforcementStatus` and set `Upgradeable=False` if necessary. To verify this, first set the mode to Manual with an out of skew boot image version like so:

Now, examine the `machine-config` CO object's conditions field; it should indicate an issue preventing upgrades like so:

Next, set the boot image to one within the skew limits:

Then, the `Upgradeable` condition should be restored back to `True`.

This set of steps can be repeated with the OCPVersion specified too. This comparison should only take place in `Automatic` and `Manual` mode; however, as `Automatic` is only permitted on the status side, I don't think there is an easy way to test that (other than the units I've included).

In `None` mode, this version check should not take place.

Some caveats to note about `Automatic` mode:

- `Automatic` mode within the spec. This is an intentional choice because only the MCO will always be able to self-determine if a platform is eligible for automatic skew enforcement.
- `Automatic` mode, API validations will prevent changing the boot image configuration to a setting other than `All`. To change the boot image configuration, the admin is first expected to go to `Manual` skew enforcement mode and then attempt to change the boot image configuration of the cluster.
- `Automatic` mode, if any machinesets are skipped for boot image updates (for example, a marketplace or an unknown boot image was detected in any of the machinesets), the boot image controller will not update the boot image value stored in `bootImageSkewEnforcementStatus`. This is because the cluster cannot be considered up to date on boot image if even one of the machine resources is out of skew.
- `Automatic` mode, the operator will only populate the OCPVersion. This is because each platform may not have the same RHCOS version of the boot image (for example, across marketplace streams) in a given release, and it would involve a lot of per-platform piping to correctly track the RHCOS version per machineset within the boot image controller. I did not deem this to be worth the effort, but am open to implementing that later if the need arises.
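To make the threshold comparison from "Verifying upgrade block" concrete, here is a minimal, self-contained sketch. It is not the operator's implementation: `compareDotted`, `component` and `upgradeable` are hypothetical helpers, the two constants mirror the limits in this PR but use local names, and a plain dotted-integer comparison stands in for a real semver library. It only handles purely numeric dotted versions, so real RHCOS version strings with suffixes would need extra parsing.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Limits mirroring the thresholds described above (local names are assumptions).
const (
	rhcosVersionBootImageSkewLimit = "9.2"
	ocpVersionBootImageSkewLimit   = "4.13.0"
)

// component returns the i-th numeric component, or 0 if it is missing,
// so that "9.2" and "9.2.0" compare as equal.
func component(parts []string, i int) (int, error) {
	if i >= len(parts) {
		return 0, nil
	}
	return strconv.Atoi(parts[i])
}

// compareDotted compares two dot-separated numeric versions.
// It returns -1, 0 or 1 when a is older than, equal to, or newer than b.
func compareDotted(a, b string) (int, error) {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	n := len(as)
	if len(bs) > n {
		n = len(bs)
	}
	for i := 0; i < n; i++ {
		x, err := component(as, i)
		if err != nil {
			return 0, err
		}
		y, err := component(bs, i)
		if err != nil {
			return 0, err
		}
		if x < y {
			return -1, nil
		}
		if x > y {
			return 1, nil
		}
	}
	return 0, nil
}

// upgradeable reports whether the recorded boot image version is at or
// above the given limit, i.e. whether the skew check would pass.
func upgradeable(bootImageVersion, limit string) (bool, error) {
	c, err := compareDotted(bootImageVersion, limit)
	if err != nil {
		return false, fmt.Errorf("comparing %q against %q: %w", bootImageVersion, limit, err)
	}
	return c >= 0, nil
}

func main() {
	for _, v := range []string{"4.12.0", "4.13.0", "4.20.3"} {
		ok, err := upgradeable(v, ocpVersionBootImageSkewLimit)
		fmt.Println("OCP", v, ok, err)
	}
	ok, err := upgradeable("9.1", rhcosVersionBootImageSkewLimit)
	fmt.Println("RHCOS 9.1", ok, err)
}
```

In `None` mode a check like this would simply not be called, which matches the behavior described above.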