
Add docs for the new kops reconcile cluster command #17191

Open

wants to merge 4 commits into base: master
Conversation

rifelpet
Member

@rifelpet rifelpet commented Jan 9, 2025

/hold for feedback

A few open questions:

  1. Do we update all the rolling-update cluster docs references to reconcile cluster?
  2. Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes? (with no --instance-group* filtering)
  3. Do we add a new permalink that the error message links to?
  4. Do we mention the new update cluster --reconcile flag? When would a user use it instead of kops reconcile cluster ?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rifelpet. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from hakman January 9, 2025 02:39
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 9, 2025
@stl-victor-sudakov

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

@rifelpet
Member Author

rifelpet commented Jan 9, 2025

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

Because updating both the cluster's control plane launch templates (or other cloud provider equivalents) and node launch templates at the same time will cause new nodes to fail to join the cluster until all control plane instances have been upgraded. So if Cluster Autoscaler or Karpenter scale up nodes before or during the control plane rolling-update, they will fail to join and workloads will be stuck in Pending. This is almost certainly not what the user wants and is why we're introducing the new command.

@rifelpet
Member Author

rifelpet commented Jan 9, 2025

We could allow the user to bypass the error if they know what they're doing. For example, on clusters that don't use Cluster Autoscaler or Karpenter.

@stl-victor-sudakov

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

Because updating both the cluster's control plane launch templates (or other cloud provider equivalents) and node launch templates at the same time will cause new nodes to fail to join the cluster until all control plane instances have been upgraded. So if Cluster Autoscaler or Karpenter scale up nodes before or during the control plane rolling-update, they will fail to join and workloads will be stuck in Pending. This is almost certainly not what the user wants and is why we're introducing the new command.

Hold on, hasn't this sequence:

kops upgrade cluster $NAME --yes
kops update cluster $NAME --yes
kops rolling-update cluster $NAME --yes

always been the standard upgrade sequence? These steps are even documented in https://kops.sigs.k8s.io/operations/updates_and_upgrades/#automated-update. And now it is an error?

@rifelpet
Member Author

rifelpet commented Jan 9, 2025

Now it may cause node failures during the k8s 1.31 upgrade, yes. Hence the bold release note being added in this PR and my proposal to prevent users from making this mistake by returning a (skippable) error.

I'll update that docs page to note this change as well.

@stl-victor-sudakov

Now it may cause node failures during the k8s 1.31 upgrade, yes. Hence the bold release note being added in this PR and my proposal to prevent users from making this mistake by returning a (skippable) error.

I'll update that docs page to note this change as well.

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

@rifelpet
Member Author

rifelpet commented Jan 9, 2025

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

I updated this PR to link to the k/k issue that goes into more detail: kubernetes/kubernetes#127316

@stl-victor-sudakov

stl-victor-sudakov commented Jan 10, 2025

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

I updated this PR to link to the k/k issue that goes into more detail: kubernetes/kubernetes#127316

Oh, what a long read! Maybe #16907 would be shorter and more to the point; it is also mentioned within the longer post.

However I think I understand the innovation now. The new reconcile command does "update --yes && rolling-update --yes" on CP nodes first, and then does the same on worker nodes, thus enforcing that the CP is fully updated first. Is this correct?

@danports
Contributor

The new reconcile command does "update --yes && rolling-update --yes" on CP nodes first, and then does the same on worker nodes, thus enforcing that the CP is fully updated first. Is this correct?

Yes, that's correct.
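As a sketch of that ordering (the role-filtered commands match the sequence shown in the PR's docs; the stub `kops` shell function below just echoes its arguments, so the sequence can be illustrated without a real cluster):

```shell
# Stub standing in for the real kops binary, so the ordering can be shown
# without touching a real cluster (illustrative only).
kops() { echo "kops $*"; }

# 1. Apply and roll out the new configuration to the control plane first.
kops update cluster --instance-group-roles=control-plane,apiserver --yes
kops rolling-update cluster --instance-group-roles=control-plane,apiserver --yes

# 2. Only after the control plane is fully updated, do the same for the nodes.
kops update cluster --yes
kops rolling-update cluster --yes
```

This is roughly what `kops reconcile cluster --yes` automates in a single command.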

docs/tutorial/upgrading-kubernetes.md (two outdated review threads, resolved)

- Upgrading kubernetes is similar to changing the image on an InstanceGroup, except that the kubernetes version is
+ Upgrading kubernetes is similar to changing the image on an InstanceGroup, the kubernetes version is
Contributor


I think the language here was clearer with "except that".

@k8s-ci-robot
Contributor

@danports: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@danports
Contributor

  1. Do we update all the rolling-update cluster docs references to reconcile cluster?

Yes, eventually, but I don't think it needs to happen right away, since reconcile cluster probably needs more time to mature and should support additional options that update cluster/rolling-update cluster already support. See #17146 for instance.

  2. Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes? (with no --instance-group* filtering)

That would be a smart idea, though perhaps it should only be a warning if the cluster doesn't have CAS/Karpenter enabled.

  3. Do we add a new permalink that the error message links to?

👍 More context for error messages is always good.

  4. Do we mention the new update cluster --reconcile flag? When would a user use it instead of kops reconcile cluster ?

I am confused about this myself. Based on the commit history I think maybe @justinsb just added the --reconcile flag as an initial step before adding the full reconcile cluster command, so maybe we don't need --reconcile anymore?

@justinsb
Member

I am confused about this myself. Based on the commit history I think maybe @justinsb just added the --reconcile flag as an initial step before adding the full reconcile cluster command, so maybe we don't need --reconcile anymore?

Good point, we can probably just get rid of this now as I don't think we've shipped it in any releases. I don't think there's any reason to keep it? All kops update --reconcile does is "redirect" to the full reconcile command.

@justinsb
Member

Summarizing what I think we've agreed in office hours discussions:

Do we update all the rolling-update cluster docs references to reconcile cluster?

We think it's too early to "hide" rolling-update cluster, but we should maybe add pointers at key places suggesting that users try "reconcile cluster"

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes? (with no --instance-group* filtering)

We think no, but we might add a warning. The argument for "no" is that returning an error breaks users (and only users with rapidly autoscaling clusters are likely to be affected by the underlying problem). The argument for "yes" is that it might be better to break users and get them to consider whether they should switch to "reconcile".

I think we decided to push users to "reconcile" more in kOps 1.32, we're going to treat "reconcile" as a preview for now (kOps 1.31).

Do we add a new permalink that the error message links to?

I don't think we discussed in office hours, but sounds like a good idea in 1.32

Do we mention the new update cluster --reconcile flag? When would a user use it instead of kops reconcile cluster ?

I don't think we discussed in office hours, but sounds like we should remove it.

@stl-victor-sudakov

stl-victor-sudakov commented Jan 20, 2025

If you remove the new update cluster --reconcile flag, will the procedure to update a cluster always consist of two steps? First you run update --yes, and then reconcile? Would it not be more convenient to have just one command that updates and reconciles if necessary? On the other hand, the requirement to always run two steps may be more consistent.

UPD: Or, on the other hand, if reconcile already implies an update --yes step, then the new update cluster --reconcile flag is really unnecessary. (All the above is said from the point of view of a (slightly) confused kOps user.)

@stl-victor-sudakov

Right, kops reconcile cluster --yes is a one-stop command that can be used instead of kops update cluster and kops rolling-update cluster. You can still use those commands, but kops reconcile cluster makes it easy to obey the more complicated kubernetes upgrade

Hold on, if I change something via kops edit cluster (a security group or some other cloud property that does not require a rolling update of nodes), should I use the old update command or the new reconcile command?

@k8s-ci-robot
Contributor

@rifelpet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kops-e2e-gce-cni-calico | ae625a6 | link | true | `/test pull-kops-e2e-gce-cni-calico` |
| pull-kops-e2e-gce-cni-kindnet | ae625a6 | link | true | `/test pull-kops-e2e-gce-cni-kindnet` |
| pull-kops-e2e-k8s-aws-amazonvpc-u2404 | 0b75448 | link | true | `/test pull-kops-e2e-k8s-aws-amazonvpc-u2404` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@justinsb
Member

Or, on the other hand, if reconcile already implies an update --yes step, then the the new update cluster --reconcile flag is really unnecessary.

That's right, reconcile == update + rolling-update. It's done by instance group to comply with the "update control plane first" rules. (Technically it's done by instance group role). And yes, I think the --reconcile flag on update is unnecessary and I'll send a PR :-)

We always complied with the "control plane first" rules; the edge case is that kops update bumps the ASG versions (on AWS), while normally that new configuration is only rolled out by kops rolling-update cluster. If the node ASG auto-scales in the meantime, new nodes come up with the new version even though the control plane has not yet been updated. That was tolerated previously, but kube 1.31 now makes it an error, which causes problems in rapidly autoscaling clusters.
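The failure mode can be sketched as a version comparison (a hypothetical illustration, not actual kubelet or kOps code; the concrete versions and the check below are assumptions for the example): a freshly autoscaled node running a kubelet newer than the not-yet-upgraded API server fails to join.

```shell
# Hypothetical scenario: update has bumped the ASG config, the control
# plane has not rolled yet, and the ASG scales up a new node.
apiserver_version="1.30"   # control plane not yet rolled
kubelet_version="1.31"     # freshly autoscaled node with the new config

# Pick the newer of the two versions with a version-aware sort.
newest="$(printf '%s\n%s\n' "$apiserver_version" "$kubelet_version" | sort -V | tail -n1)"

if [ "$kubelet_version" = "$newest" ] && [ "$kubelet_version" != "$apiserver_version" ]; then
  echo "node would fail to join: kubelet $kubelet_version is newer than apiserver $apiserver_version"
fi
```

Running the control-plane rollout to completion first removes the window in which this condition can occur.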

(all the above is being said from the point of view of a (slightly) confused kOps user).

No worries, we're all trying to figure out what makes the most sense and how best to explain it :-)

@justinsb
Member

I think this PR looks pretty good. I agree that "except that" was probably clearer, but I propose we merge this PR and iterate on the docs.

2. `kops rolling-update cluster --instance-group-roles=control-plane,apiserver --yes`
3. `kops update cluster --yes`
4. `kops rolling-update cluster --yes`
5. `kops update cluster --prune --yes`
Contributor


Suggested change
5. `kops update cluster --prune --yes`
5. `kops update cluster --prune --yes`

My bad, we need an extra line break in here to fix the list formatting.

Labels: area/documentation, cncf-cla: yes, do-not-merge/hold, size/M