
Add docs for the new kops reconcile cluster command #17191

Open

wants to merge 4 commits into base: master
Conversation

rifelpet
Member

@rifelpet rifelpet commented Jan 9, 2025

/hold for feedback

A few open questions:

  1. Do we update all the rolling-update cluster docs references to reconcile cluster?
  2. Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes? (with no --instance-group* filtering)
  3. Do we add a new permalink that the error message links to?
  4. Do we mention the new update cluster --reconcile flag? When would a user use it instead of kops reconcile cluster ?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rifelpet. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from hakman January 9, 2025 02:39
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 9, 2025
@stl-victor-sudakov

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

@rifelpet
Member Author

rifelpet commented Jan 9, 2025

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

Because updating both the cluster's control plane launch templates (or other cloud provider equivalents) and node launch templates at the same time will cause new nodes to fail to join the cluster until all control plane instances have been upgraded. So if Cluster Autoscaler or Karpenter scale up nodes before or during the control plane rolling-update, they will fail to join and workloads will be stuck in Pending. This is almost certainly not what the user wants and is why we're introducing the new command.

@rifelpet
Member Author

rifelpet commented Jan 9, 2025

We could allow the user to bypass the error if they know what they're doing. For example, on clusters that don't use Cluster Autoscaler or Karpenter.

@stl-victor-sudakov

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

Because updating both the cluster's control plane launch templates (or other cloud provider equivalents) and node launch templates at the same time will cause new nodes to fail to join the cluster until all control plane instances have been upgraded. So if Cluster Autoscaler or Karpenter scale up nodes before or during the control plane rolling-update, they will fail to join and workloads will be stuck in Pending. This is almost certainly not what the user wants and is why we're introducing the new command.

Hold on, hasn't this sequence:

kops upgrade cluster $NAME --yes
kops update cluster $NAME --yes
kops rolling-update cluster $NAME --yes

always been the standard upgrade sequence? These steps are even documented in https://kops.sigs.k8s.io/operations/updates_and_upgrades/#automated-update. And now it is an error?

@rifelpet
Member Author

rifelpet commented Jan 9, 2025

Now it may cause node failures during the k8s 1.31 upgrade, yes. Hence the bold release note being added in this PR and my proposal to prevent users from making this mistake by returning a (skippable) error.

I'll update that docs page to note this change as well.

@stl-victor-sudakov

Now it may cause node failures during the k8s 1.31 upgrade, yes. Hence the bold release note being added in this PR and my proposal to prevent users from making this mistake by returning a (skippable) error.

I'll update that docs page to note this change as well.

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

@rifelpet
Member Author

rifelpet commented Jan 9, 2025

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

I updated this PR to link to the k/k issue that goes into more detail: kubernetes/kubernetes#127316

@stl-victor-sudakov

stl-victor-sudakov commented Jan 10, 2025

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

I updated this PR to link to the k/k issue that goes into more detail: kubernetes/kubernetes#127316

Oh, what a long read! Maybe #16907 would be shorter and more to the point; it is also mentioned within the longer post.

However I think I understand the innovation now. The new reconcile command does "update --yes && rolling-update --yes" on CP nodes first, and then does the same on worker nodes, thus enforcing that the CP is fully updated first. Is this correct?

@danports
Contributor

The new reconcile command does "update --yes && rolling-update --yes" on CP nodes first, and then does the same on worker nodes, thus enforcing that the CP is fully updated first. Is this correct?

Yes, that's correct.
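As a sketch of that ordering (the role-filtered commands match the sequence shown in the PR's docs; the stub `kops` shell function below just echoes its arguments, so the sequence can be illustrated without a real cluster):

```shell
# Stub standing in for the real kops binary, so the ordering can be shown
# without touching a real cluster (illustrative only).
kops() { echo "kops $*"; }

# 1. Apply and roll out the new configuration to the control plane first.
kops update cluster --instance-group-roles=control-plane,apiserver --yes
kops rolling-update cluster --instance-group-roles=control-plane,apiserver --yes

# 2. Only after the control plane is fully updated, do the same for the nodes.
kops update cluster --yes
kops rolling-update cluster --yes
```

This is roughly what `kops reconcile cluster --yes` automates in a single command.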

docs/tutorial/upgrading-kubernetes.md (two outdated review threads, resolved)

- Upgrading kubernetes is similar to changing the image on an InstanceGroup, except that the kubernetes version is
+ Upgrading kubernetes is similar to changing the image on an InstanceGroup, the kubernetes version is
Contributor


I think the language here was clearer with "except that".

@k8s-ci-robot
Contributor

@danports: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@danports
Contributor

  1. Do we update all the rolling-update cluster docs references to reconcile cluster?

Yes, eventually, but I don't think it needs to happen right away, since reconcile cluster probably needs more time to mature and should support additional options that update cluster/rolling-update cluster already support. See #17146 for instance.

  2. Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes? (with no --instance-group* filtering)

That would be a smart idea, though perhaps it should only be a warning if the cluster doesn't have CAS/Karpenter enabled.

  3. Do we add a new permalink that the error message links to?

👍 More context for error messages is always good.

  4. Do we mention the new update cluster --reconcile flag? When would a user use it instead of kops reconcile cluster ?

I am confused about this myself. Based on the commit history I think maybe @justinsb just added the --reconcile flag as an initial step before adding the full reconcile cluster command, so maybe we don't need --reconcile anymore?

@justinsb
Member

I am confused about this myself. Based on the commit history I think maybe @justinsb just added the --reconcile flag as an initial step before adding the full reconcile cluster command, so maybe we don't need --reconcile anymore?

Good point, we can probably just get rid of this now as I don't think we've shipped it in any releases. I don't think there's any reason to keep it? All kops update --reconcile does is "redirect" to the full reconcile command.

@justinsb
Member

Summarizing what I think we've agreed in office hours discussions:

Do we update all the rolling-update cluster docs references to reconcile cluster?

We think it's too early to "hide" rolling-update cluster, but we should maybe add pointers at key places suggesting that users try "reconcile cluster"

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes? (with no --instance-group* filtering)

We think no, but we might add a warning. The argument for "no" is that returning an error breaks users (and only users with rapidly autoscaling clusters are likely to be affected by the underlying problem). The argument for "yes" is that it might be better to break users and get them to consider whether they should switch to "reconcile".

I think we decided to push users to "reconcile" more in kOps 1.32, we're going to treat "reconcile" as a preview for now (kOps 1.31).

Do we add a new permalink that the error message links to?

I don't think we discussed in office hours, but sounds like a good idea in 1.32

Do we mention the new update cluster --reconcile flag? When would a user use it instead of kops reconcile cluster ?

I don't think we discussed in office hours, but sounds like we should remove it.

@stl-victor-sudakov

stl-victor-sudakov commented Jan 20, 2025

If you remove the new update cluster --reconcile flag, will the procedure to update a cluster always consist of two steps? First you run update --yes, and then reconcile? Would it not be more convenient to have just one command that updates and reconciles if necessary? On the other hand, the requirement to always run two steps may be more consistent.

UPD: Or, on the other hand, if reconcile already implies an update --yes step, then the new update cluster --reconcile flag is really unnecessary. (All the above is said from the point of view of a (slightly) confused kOps user.)

@stl-victor-sudakov

Right, kops reconcile cluster --yes is a one-stop command that can be used instead of kops update cluster and kops rolling-update cluster. You can still use those commands, but kops reconcile cluster makes it easy to obey the more complicated kubernetes upgrade

Hold on, if I change something via kops edit cluster (a security group or some other cloud property that does not require a rolling update of nodes), should I use the old update command or the new reconcile command?

@k8s-ci-robot
Contributor

@rifelpet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kops-e2e-gce-cni-calico | ae625a6 | link | true | `/test pull-kops-e2e-gce-cni-calico` |
| pull-kops-e2e-gce-cni-kindnet | ae625a6 | link | true | `/test pull-kops-e2e-gce-cni-kindnet` |
| pull-kops-e2e-k8s-aws-amazonvpc-u2404 | 0b75448 | link | true | `/test pull-kops-e2e-k8s-aws-amazonvpc-u2404` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@justinsb
Member

Or, on the other hand, if reconcile already implies an update --yes step, then the the new update cluster --reconcile flag is really unnecessary.

That's right, reconcile == update + rolling-update. It's done by instance group to comply with the "update control plane first" rules. (Technically it's done by instance group role). And yes, I think the --reconcile flag on update is unnecessary and I'll send a PR :-)

We always complied with the "control plane first" rules; the edge case is that kops update bumps the ASG versions (on AWS), while normally that new configuration is only rolled out by kops rolling-update cluster. If the node ASG auto-scales in the meantime, new nodes come up with the new version even though the control plane has not yet been updated. That was tolerated previously, but kube 1.31 now makes it an error, which causes problems in rapidly autoscaling clusters.
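The failure mode can be sketched as a version comparison (a hypothetical illustration, not actual kubelet or kOps code; the concrete versions and the check below are assumptions for the example): a freshly autoscaled node running a kubelet newer than the not-yet-upgraded API server fails to join.

```shell
# Hypothetical scenario: update has bumped the ASG config, the control
# plane has not rolled yet, and the ASG scales up a new node.
apiserver_version="1.30"   # control plane not yet rolled
kubelet_version="1.31"     # freshly autoscaled node with the new config

# Pick the newer of the two versions with a version-aware sort.
newest="$(printf '%s\n%s\n' "$apiserver_version" "$kubelet_version" | sort -V | tail -n1)"

if [ "$kubelet_version" = "$newest" ] && [ "$kubelet_version" != "$apiserver_version" ]; then
  echo "node would fail to join: kubelet $kubelet_version is newer than apiserver $apiserver_version"
fi
```

Running the control-plane rollout to completion first removes the window in which this condition can occur.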

(all the above is being said from the point of view of a (slightly) confused kOps user).

No worries, we're all trying to figure out what makes the most sense and how best to explain it :-)

@justinsb
Member

I think this PR looks pretty good. I agree that "except that" was probably clearer, but I propose we merge this PR and iterate on the docs.

2. `kops rolling-update cluster --instance-group-roles=control-plane,apiserver --yes`
3. `kops update cluster --yes`
4. `kops rolling-update cluster --yes`
5. `kops update cluster --prune --yes`
Contributor


Suggested change
5. `kops update cluster --prune --yes`
5. `kops update cluster --prune --yes`

My bad, we need an extra line break in here to fix the list formatting.

Labels: area/documentation, cncf-cla: yes, do-not-merge/hold, size/M