[WIP] MCO-1669: add BootImageSkewEnforcement API #2357
@djoshy: This pull request references MCO-1669 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Hello @djoshy! Some important instructions when contributing to openshift/api:
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: djoshy The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thanks for the questions & review (sorry it took a while!). This should be ready for another look. Happy to hop on a call if that is easier. Update: Did another push to fix up some tests.
Leaving another handful of comments.
I'm also happy to hop on a call if you think it would be beneficial.
// clusterBootImage describes the current boot image of the cluster. This will be used to enforce the skew limit.
// This value will be compared against the cluster's skew limit to determine skew compliance.
// Required when mode is set to "Automatic" or "Manual" and forbidden otherwise.
// +optional
ClusterBootImage *ClusterBootImage `json:"clusterBootImage,omitempty"`
Usually, discriminated union members will follow the name of their mode counterpart, i.e.:
mode: Automatic
automatic:
  ...
or
mode: Manual
manual:
  ...
Another thing I'm curious about now that I've got a bit more context: why do you want to require the clusterBootImage when set to Automatic?
Presumably, if the MCO is able to determine the cluster boot image by itself, should it just do it and perform the skew handling automatically?
If a user were to explicitly set Automatic, I imagine they are wanting to have MCO handle all of that and that they likely don't have the cluster boot image information on hand. Whereas if they set Manual, they are explicitly stating they want to manually manage that information and I would expect them to have it on hand.
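To make the naming convention concrete, here is a minimal Go sketch of a discriminated union whose members are named after their mode values. The type and field names (`SkewEnforcement`, `BootImageVersions`, `Validate`) are assumptions made for illustration and are not the types in this PR. Whether `automatic` should be required in Automatic mode is the open question above, so the sketch leaves it optional.

```go
package main

import (
	"errors"
	"fmt"
)

// SkewEnforcementMode is a hypothetical discriminator type for this sketch.
type SkewEnforcementMode string

const (
	ModeAutomatic SkewEnforcementMode = "Automatic"
	ModeManual    SkewEnforcementMode = "Manual"
	ModeNone      SkewEnforcementMode = "None"
)

// BootImageVersions is a hypothetical payload shared by both union members.
type BootImageVersions struct {
	OCPVersion   string `json:"ocpVersion,omitempty"`
	RHCOSVersion string `json:"rhcosVersion,omitempty"`
}

// SkewEnforcement sketches a discriminated union: each member field is
// named after the mode value it belongs to.
type SkewEnforcement struct {
	Mode      SkewEnforcementMode `json:"mode"`
	Automatic *BootImageVersions  `json:"automatic,omitempty"`
	Manual    *BootImageVersions  `json:"manual,omitempty"`
}

// Validate rejects members that do not match the selected mode.
func (s SkewEnforcement) Validate() error {
	switch s.Mode {
	case ModeAutomatic:
		if s.Manual != nil {
			return errors.New("manual must not be set when mode is Automatic")
		}
	case ModeManual:
		if s.Manual == nil {
			return errors.New("manual is required when mode is Manual")
		}
		if s.Automatic != nil {
			return errors.New("automatic must not be set when mode is Manual")
		}
	case ModeNone:
		if s.Automatic != nil || s.Manual != nil {
			return errors.New("no member may be set when mode is None")
		}
	default:
		return fmt.Errorf("unknown mode %q", s.Mode)
	}
	return nil
}

func main() {
	ok := SkewEnforcement{Mode: ModeManual, Manual: &BootImageVersions{OCPVersion: "4.18.2"}}
	bad := SkewEnforcement{Mode: ModeNone, Manual: &BootImageVersions{OCPVersion: "4.18.2"}}
	fmt.Println(ok.Validate())
	fmt.Println(bad.Validate())
}
```

In the real API this kind of rule would more likely be expressed as CEL validation markers on the type than as a Go method; the method form is used here only to keep the example runnable.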
This is a good question. The determination of boot image version isn't straightforward and varies wildly per platform. There currently isn't a single source of truth for the admin or the controllers to use in the cluster. So I thought this would be a good way to represent that information in the API. Perhaps for the Automatic case, clusterBootImage makes more sense as a Status-only field, but for Manual we could have it in Spec and Status?
Updated with this Spec/Status shape, but Automatic only specifies a version in the status. So I'm envisioning something like the following examples.
On a cluster that defaults into Automatic mode (no admin opinion):
spec:
  ..
status:
  bootImageSkewEnforcementStatus:
    mode: Automatic
    automatic:
      ocpVersion: "4.18.2"
      rhcosVersion: "9.6.20250523-1"
On a cluster that defaults into Manual mode (no admin opinion):
spec:
  ..
status:
  bootImageSkewEnforcementStatus:
    mode: Manual
    manual:
      ocpVersion: "4.18.2"
On a cluster that an admin explicitly sets to Manual, and performs updates:
spec:
  bootImageSkewEnforcement:
    mode: Manual
    manual:
      ocpVersion: "4.18.2"
      rhcosVersion: "9.6.20250523-1"
status:
  bootImageSkewEnforcementStatus:
    mode: Manual
    manual:
      ocpVersion: "4.18.2"
      rhcosVersion: "9.6.20250523-1"
On a cluster where an admin disables this feature:
spec:
  bootImageSkewEnforcement:
    mode: None
status:
  bootImageSkewEnforcementStatus:
    mode: None
On a cluster that an admin explicitly sets to Automatic:
spec:
  bootImageSkewEnforcement:
    mode: Automatic
status:
  bootImageSkewEnforcementStatus:
    mode: Automatic
    automatic:
      ocpVersion: "4.18.2"
      rhcosVersion: "9.6.20250523-1"
For this last case, I'm not entirely convinced it needs to be supported. The user having the power to go to "Manual" and "None" via an explicit value makes sense.
Hmm, I guess a workflow to consider for this would be a user going from Manual/None to Automatic mode; would deleting the spec.bootImageSkewEnforcement be good UX in that case? If the MCO is able to automatically determine that the cluster is able to perform skew management in a hands-off fashion, it would default the status to Automatic (if spec is empty). Or would it be better to have an explicit Automatic setting in the spec?
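The spec-to-status behavior in the examples above can be sketched roughly as follows. This is only an illustration of the defaulting rule as described in this thread; `resolveStatusMode` and the `supportsAutomatic` flag are hypothetical names, and how the MCO actually decides support is discussed further down.

```go
package main

import "fmt"

// Mode is a hypothetical stand-in for the skew enforcement mode enum.
type Mode string

const (
	Automatic Mode = "Automatic"
	Manual    Mode = "Manual"
	None      Mode = "None"
)

// resolveStatusMode returns the mode reported in status.
// specMode is nil when the admin has expressed no opinion in spec.
// supportsAutomatic is an assumed input saying whether the cluster
// can have its boot images updated automatically.
func resolveStatusMode(specMode *Mode, supportsAutomatic bool) Mode {
	if specMode != nil {
		// An explicit admin choice is reflected as-is.
		return *specMode
	}
	if supportsAutomatic {
		return Automatic
	}
	return Manual
}

func main() {
	none := None
	fmt.Println(resolveStatusMode(nil, true))    // empty spec, automatic-capable cluster
	fmt.Println(resolveStatusMode(nil, false))   // empty spec, no automatic support
	fmt.Println(resolveStatusMode(&none, true))  // admin disabled the feature
}
```

Under this rule, deleting spec.bootImageSkewEnforcement on an automatic-capable cluster would bring the status back to Automatic without needing an explicit Automatic value in spec.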
A few minor comments, but other than these I think this looks good
// The default for mode is Automatic for clusters that support automatic boot image updates and
// Manual for clusters that do not support automatic boot image updates.
How do I know whether or not my cluster supports automatic boot image updates?
So on cluster, you'll be able to see this by checking the MachineConfiguration CR. When spec.managedBootImages is undefined, we interpret that as no opinion from the admin, and status.managedBootImagesStatus will reflect the default boot image configuration (platform dependent, as I mentioned earlier). managedBootImagesStatus may:
- contain a MachineManager set to "All". This is what I refer to as automatic/opted-in.
- contain a MachineManager set to "None". This is a platform that does support boot image updates, but not by default. The admin has to opt in by defining a machineManager in spec.managedBootImages. We may move this to default in a later release once we have gained enough confidence in the platform.
- be undefined. This means a platform that we have not yet explored for boot image updates. The admin is prevented from adding to spec.managedBootImages for these cases via a validating admission policy.
Other than that, we hope to socialize new platforms we add support/default for via documentation, KB articles and such.
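As a rough illustration of the three cases above, the default selection in status could be mapped to a support level like this. The function and constant names are hypothetical, and the selection is reduced to a single optional string for brevity; the real status holds a list of machine managers.

```go
package main

import "fmt"

// Support is a hypothetical label for how a platform treats boot image updates.
type Support string

const (
	OptedIn     Support = "automatic (opted in by default)"
	OptInNeeded Support = "supported, admin opt-in required"
	Unsupported Support = "not yet supported"
	Unknown     Support = "unrecognized selection"
)

// classifySupport interprets the default machine manager selection found in
// status.managedBootImagesStatus when spec.managedBootImages is undefined.
// A nil selection means the status field is undefined.
func classifySupport(defaultSelection *string) Support {
	if defaultSelection == nil {
		return Unsupported
	}
	switch *defaultSelection {
	case "All":
		return OptedIn
	case "None":
		return OptInNeeded
	default:
		return Unknown
	}
}

func main() {
	all, none := "All", "None"
	fmt.Println(classifySupport(&all))
	fmt.Println(classifySupport(&none))
	fmt.Println(classifySupport(nil))
}
```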
I'm not sure I understand the UX fully here, so it would be helpful if you can walk me through this again. At some point a user will upgrade to a version with this skew enforcement API in place. Let's break them down into a few categories:
- I'm on cloud and I just want automatic to be on -> in this case, assuming they haven't fiddled with the default-on bootimage management fields, they should not need to do anything. The spec will be empty and the status will be set by the MCO to automatic (since we detect they are in a managed case, right?)
- I'm on cloud but with custom bootimages -> in this case, I guess the MCO doesn't know what to do with the existing image since it's not in any stream, so we default to None or just have no status? What happens now, do we alert the user and/or prevent scaling?
- I'm on-prem with manual bootimage -> in this case, I would not have been able to set these fields until they were turned on, so I would need to upgrade first, then set manual here, and until that happens, the status is empty?
No dumb questions, these are great! Before I answer, I want to explain how I think the MCO would detect the current cluster boot image (I'm open to other ideas/strategies here!):
- If boot image updates are enabled and all machine resources have been successfully reconciled, we could use the current cluster version and the RHCOS version from the coreos-bootimages configmap to determine the current boot image.
- If boot image updates are only partially enabled or completely disabled, and machine resources can't be inspected to determine the boot image version, we would have to assume the boot image to be the cluster's installation version from the version history. For most platforms, we will very likely have to use the latter method.
Now to your questions:
- Yes, this is correct.
- For custom boot images in the cloud platforms, the MCO currently (silently) skips over updating them. We have a couple of paths here:
  - Once we've covered all the marketplace/manageable scenarios for cloud platforms, we can add a degrade/error when the MCO encounters a custom boot image during an update. I expect most of these custom boot image users to disable boot image updates to fix the error. When we do enable skew enforcement (SE) in this case, the cluster would immediately be considered out of skew, as the current boot image can't be estimated. If they leave the error in place and do not disable boot image updates, SE should also consider a cluster in that error mode to be out of skew.
  - If we choose to not have an error mode for the custom boot image case, we will still want to add a method to indicate to the MCO that machine resources have been skipped over and not reconciled. With that in place, when skew enforcement is enabled, the MCO would be able to determine that the cluster is out of skew.
  In both of these paths, the admin would be expected to manually switch the SE spec to Manual (and indicate their boot image) or to None, depending on their scaling needs. Since this case involves a cloud env that does likely support automatic boot image updates, the SE status would initially be set to Automatic by the MCO, along with, as stated earlier, the detected cluster boot image.
- For any case where the MCO cannot automatically manage boot images, the MCO will default to Manual mode in the SE status, along with the cluster's boot image. The admin would be expected to manually switch the SE spec to Manual (and indicate their boot image) or to None, depending on their scaling needs.
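The two detection strategies described at the top of this comment could be sketched as below. All names here (`ClusterFacts`, `estimateBootImage`, and the field names) are hypothetical, and the inputs are assumed to have been gathered elsewhere (cluster version, the coreos-bootimages configmap, and the version history).

```go
package main

import "fmt"

// ClusterFacts is a hypothetical bundle of inputs the MCO would need to collect.
type ClusterFacts struct {
	UpdatesEnabledForAll   bool   // boot image updates enabled for all machine resources
	AllResourcesReconciled bool   // every machine resource reconciled successfully
	CurrentOCPVersion      string // current cluster version
	ConfigMapRHCOSVersion  string // RHCOS version from the coreos-bootimages configmap
	InstallOCPVersion      string // installation version from the version history
}

// BootImage is the estimated cluster boot image.
type BootImage struct {
	OCPVersion   string
	RHCOSVersion string
}

// estimateBootImage applies the two strategies from the discussion:
// trust the current versions when everything is managed and reconciled,
// otherwise fall back to the installation version.
func estimateBootImage(f ClusterFacts) BootImage {
	if f.UpdatesEnabledForAll && f.AllResourcesReconciled {
		return BootImage{OCPVersion: f.CurrentOCPVersion, RHCOSVersion: f.ConfigMapRHCOSVersion}
	}
	// The RHCOS version is left empty because it cannot be inspected here.
	return BootImage{OCPVersion: f.InstallOCPVersion}
}

func main() {
	managed := ClusterFacts{
		UpdatesEnabledForAll:   true,
		AllResourcesReconciled: true,
		CurrentOCPVersion:      "4.18.2",
		ConfigMapRHCOSVersion:  "9.6.20250523-1",
		InstallOCPVersion:      "4.16.0",
	}
	unmanaged := managed
	unmanaged.UpdatesEnabledForAll = false
	fmt.Printf("%+v\n", estimateBootImage(managed))
	fmt.Printf("%+v\n", estimateBootImage(unmanaged))
}
```

The version strings in `main` are sample values only; "4.16.0" is an invented installation version for the example.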
Ok, so we will try to set some value in the status ourselves based on our interpretation of the cluster, and error if we cannot. The user can set a spec as they wish to override it (so in the second example, if I used some unfound bootimage, I could still manually set spec to automatic to allow the MCO to go ahead and ignore the error, is that correct?)
For Automatic clusters, most cases would need no intervention; except if the admin has manually set a custom boot image. I don't think this is very common(cc @yuqi-zhang in case I'm wrong here)
Based on our discussions, I think my understanding is that if the user has set a custom bootimage, our mechanisms will not find it in the streams (regular, managed (ARO/ROSA), marketplace (OPP, OKE, OCP)) so we would end up defaulting to manual and require user intervention, right? Or are you saying that all cloud platforms would default to automatic, and if the user somehow has set their own value, we'd instead just error and let the user sort it out?
There's a few things we discussed in this thread, but to recap some of the main open points:
- spec vs. status and defaulting - we will only default to "automatic" or "manual" if the user doesn't input, and the contentious point is that by setting it to "manual" by default, we'd need immediate user intervention
- the mechanism of alerting the user of actions/interventions - whether we set upgradeable to false, or some other mechanism
Is that accurate?
and the contentious point is that by setting it to "manual" by default, we'd need immediate user intervention
I think the contentious point is that this seems like a setting that admins need to be explicitly aware of, and need to make a decision on, if we cannot automatically determine the state this should be put into. Otherwise, they may not be able to upgrade and/or scale as they are used to (is my understanding of this correct?).
What are the criteria by which the MCO will determine whether a cluster is automatic or manual? Was there a decision tree in the EP that covers this?
- Platform - Some platforms are never automatic?
- Whether the existing boot images are known? Someone using a custom boot image doesn't get updates?
- ???
Do I need a managedBootImages configuration for the cluster to be automatic?
What would happen if an admin set this to Automatic on a cluster that didn't support automatic updates? Is that theoretically possible?
Thinking on the update flow.
If we can determine that the cluster is in a suitable condition to be adopted for automatic updates, I think defaulting status to automatic makes sense.
If we cannot determine that the cluster is suitable for automatic updates, there's a mention in the thread that we will just use the version from the cluster image. How does that work? How can we be confident that this image is correct? Maybe I'm missing what you mean there?
If we cannot automatically ascertain the version figures, how would we populate the manual spec field with either the RHCOS or OCP versions? Wouldn't we need the admin to make an explicit choice?
I've updated the description workflows based on our discussion, PTAL to confirm we're all on the same page. I think we also settled on "Automatic" not being settable from the spec side, but I wanted to double check that before updating the API.
Add some tests for the status field as well?
Done, PTAL when you have a sec!
A couple of maybe dumb questions:
// +kubebuilder:validation:MaxLength:=21
// +kubebuilder:validation:MinLength:=14
// +optional
RHCOSVersion string `json:"rhcosVersion,omitempty"`
Just curious, what scenarios do you expect the user to explicitly set this? The one I thought of was: I'm not sure what OCP version my RHCOS corresponds to, or I have some type of custom bootimage.
In that case, wouldn't it be more helpful to have one of OCPVersion or RHCOSVersion be required, and if one is set, you don't have to set the other? But you can set both if you want more strict version checking? (is that supported via API validation?)
I think this is a valid use case, and should be possible as a validation at the parent level. I'll try updating it to do this.
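A parent-level "at least one of" rule could look roughly like the sketch below. `validateManual` is a hypothetical helper written in plain Go for illustration; in the API itself this would more likely be a CEL rule on the parent struct. The 14 to 21 character bounds mirror the MinLength/MaxLength markers on RHCOSVersion shown above.

```go
package main

import (
	"errors"
	"fmt"
)

// validateManual requires at least one of the two versions and applies the
// length bounds from the kubebuilder markers to rhcosVersion when it is set.
func validateManual(ocpVersion, rhcosVersion string) error {
	if ocpVersion == "" && rhcosVersion == "" {
		return errors.New("at least one of ocpVersion or rhcosVersion must be set")
	}
	if rhcosVersion != "" && (len(rhcosVersion) < 14 || len(rhcosVersion) > 21) {
		return fmt.Errorf("rhcosVersion must be 14 to 21 characters, got %d", len(rhcosVersion))
	}
	return nil
}

func main() {
	fmt.Println(validateManual("4.18.2", ""))
	fmt.Println(validateManual("", "9.6.20250523-1"))
	fmt.Println(validateManual("", ""))
}
```

Setting both fields passes as well, which matches the idea that an admin can supply both for stricter checking.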
// None means that the MCO will no longer monitor the boot image skew. This may affect
// the cluster's ability to scale.
I think we can expand on this a little, probably useful to explain that this means the cluster has no way to understand the compatibility between X and Y, where X and Y are the boot image and the ignition/pivot?
Ack, will do!
@@ -55,8 +53,123 @@ type MachineConfigurationSpec struct {
// has no effect on cluster upgrades which will still incur node disruption where required.
// +optional
NodeDisruptionPolicy NodeDisruptionPolicyConfig `json:"nodeDisruptionPolicy"`

// bootImageSkewEnforcement is an optional field that can be used to configure how version skew is
Is there a relationship between this new field and the managed boot images that we need to enforce?
Hmm, the only case I can think of is enforcing that boot image updates are enabled when the mode is Automatic; but if we are making Automatic a status-only enum, I'm not sure it's necessary.
In platforms where we do not support boot image updates via the MCO, VAPs are in place to prevent setting the managed boot images field based on the infra object.
Thinking a bit more... Manual/None while having ManagedBootImages set to update all machine resources might be a bit strange to do; perhaps we can guard against that?
I think it would be strange, yes. Are there use cases though? In particular I could imagine a None being set with automatic: we shouldn't need image skew enforcement if we are on automatic, as the auto will fix it for us, and therefore someone may want to turn the warnings off.
So I sketched out a table to figure out the supported combinations:
| Boot Images | Skew Enforcement | Result |
|---|---|---|
| All | Auto | Good |
| All | Manual | Error |
| All | None | Good |
| All | Empty | Good |
| Partial | Auto | Error |
| Partial | Manual | Error? |
| Partial | None | Good |
| Partial | Empty | Good |
| None | Auto | Error |
| None | Manual | Good |
| None | None | Good |
| None | Empty | Good |
| Empty | Auto | Error |
| Empty | Manual | Good |
| Empty | None | Good |
| Empty | Empty | Good |
(This only applies to clusters that support boot image updates via the MCO; the other platforms would not permit editing of the ManagedBootImages field via the VAP.)
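For reference, the combinations in this table can be encoded as a simple lookup, which is one way a cross-validation could be prototyped. This is a sketch only: `combo`, `supported`, and `lookupResult` are hypothetical names, and the "Error?" entry is kept verbatim because that case is still an open question.

```go
package main

import "fmt"

// combo pairs a ManagedBootImages setting with a skew enforcement setting.
type combo struct {
	BootImages string
	Skew       string
}

// supported mirrors the table row for row.
var supported = map[combo]string{
	{"All", "Auto"}: "Good", {"All", "Manual"}: "Error",
	{"All", "None"}: "Good", {"All", "Empty"}: "Good",
	{"Partial", "Auto"}: "Error", {"Partial", "Manual"}: "Error?",
	{"Partial", "None"}: "Good", {"Partial", "Empty"}: "Good",
	{"None", "Auto"}: "Error", {"None", "Manual"}: "Good",
	{"None", "None"}: "Good", {"None", "Empty"}: "Good",
	{"Empty", "Auto"}: "Error", {"Empty", "Manual"}: "Good",
	{"Empty", "None"}: "Good", {"Empty", "Empty"}: "Good",
}

// lookupResult returns the table result for a combination and whether the
// combination appears in the table at all.
func lookupResult(bootImages, skew string) (string, bool) {
	r, ok := supported[combo{bootImages, skew}]
	return r, ok
}

func main() {
	for _, c := range []combo{{"All", "Manual"}, {"None", "Manual"}, {"Partial", "Manual"}} {
		r, _ := lookupResult(c.BootImages, c.Skew)
		fmt.Printf("%s + %s -> %s\n", c.BootImages, c.Skew, r)
	}
}
```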
In particular I could imagine a None being set with automatic, we shouldn't need image skew enforcement if we are on automatic as the auto will fix it for us
Hmm, are you suggesting that Automatic should imply that the boot image controller disregards the values in the ManagedBootImages knob and behaves as if it should update all resources? 🤔
66a9ee8 to ced817a
ced817a to 8181cc5
Last push includes:
It seems to be failing |
@djoshy: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
@djoshy The |
Took another look at this and from what I recall this aligns with what was discussed synchronously. I'm going to circle back around to see if there are any outstanding comments left from Joel's previous review while I was out, but otherwise this direction looks good to me.
Sounds good, thanks! In the last API office hours, Joel had asked me to take a shot at figuring out cross validations between the ManagedBootImages & SkewEnforcement fields. I've drafted a validation table and linked it in the API office hours doc so we can talk through the scenarios in the next meeting. It did get a teeny bit complicated when you take spec/status into account 😅
Sounds good! I took a look at that validation table, but let's take the next office hours to discuss that table and what each state means for an end user before moving forward with this as is. Thanks for letting me know about that table!
Based on discussions from openshift/enhancements#1761:
Workflow for pre-release skew enforcement not active
Workflow for release n, skew enforcement is active