Fix issue with rolling update when some existing replicas are unhealthy #488

Open
wants to merge 1 commit into base: main

Conversation

@pierewoj commented Apr 7, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it

The bug is as follows:

  • assume replica readiness is [NOT_READY, READY]
  • an update is triggered with MaxUnavailable=1
  • the update starts with replica-1, causing complete downtime, while the desired behavior would be to wait for replica-0 to become ready

Replica-0 might be unavailable for a variety of reasons, such as hardware issues, node patching, etc.
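
To make the failure concrete, here is a minimal, self-contained sketch of the arithmetic (not the controller's actual code): a partition derived from maxUnavailable alone picks replica-1 for the update even though replica-0 is already the only unavailable one.

```go
package main

import "fmt"

// Minimal sketch of the failure mode described above (not the controller's
// real code). StatefulSet partition semantics: ordinals >= partition receive
// the new revision.
func main() {
	replicas := 2
	maxUnavailable := 1
	ready := []bool{false, true} // replica-0 NOT_READY, replica-1 READY

	// A partition derived from maxUnavailable alone ignores current readiness.
	naivePartition := replicas - maxUnavailable // = 1, so replica-1 gets updated

	// Count the replicas that are still serving while ordinal 1 is restarting.
	stillReady := 0
	for ordinal, isReady := range ready {
		if ordinal < naivePartition && isReady {
			stillReady++
		}
	}
	fmt.Println("partition:", naivePartition, "ready during update:", stillReady) // partition: 1 ready during update: 0
}
```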

The fix is to reduce the partition less aggressively, taking into account pods that are not ready.

Note that the fix was tested on a cluster with the MaxUnavailableStatefulSet feature disabled (I don't have access to a cluster with this feature enabled; the issue may be less impactful there, but the fix is arguably still beneficial, since LWS considers the readiness of leader and workers, not only the leader).

Which issue(s) this PR fixes

I have not created a dedicated issue for this on GitHub.

Special notes for your reviewer

I also refactored iterateReplicas a bit into separate functions: a) reading the replica state, and b) deriving from that state the values used to compute the rolling update parameters.

Does this PR introduce a user-facing change?

LWS users will see updates progress less aggressively when not all replicas are healthy at the time an update is triggered.

Testing:
* integration tests
@k8s-ci-robot added the kind/bug label on Apr 7, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pierewoj
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes label on Apr 7, 2025
@k8s-ci-robot
Contributor

Welcome @pierewoj!

It looks like this is your first PR to kubernetes-sigs/lws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/lws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test label on Apr 7, 2025
@k8s-ci-robot
Contributor

Hi @pierewoj. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/L label (100-499 lines changed, ignoring generated files) on Apr 7, 2025

netlify bot commented Apr 7, 2025

Deploy Preview for kubernetes-sigs-lws canceled.

🔨 Latest commit: 358eff5
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-lws/deploys/67f383847229500008e82e6c

@yankay
Member

yankay commented Apr 7, 2025

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Apr 7, 2025
@Edwinhr716
Contributor

/assign @kerthcet since he implemented the original logic

Also, #490 changes the rolling update logic a bit, so this might need a rebase once that is merged.

@congcongke
Contributor

@pierewoj
I didn't get the key point.

  • the update is rolling, so new replicas will be created, right?
  • after a new replica is ready, the old one will be updated.
  • if the old replica never becomes ready, the burst replicas will be kept around.

@pierewoj
Author

pierewoj commented Apr 8, 2025

Consider maxSurge=0, replicas=2, maxUnavail=1, and a cluster with MaxUnavailableStatefulSet disabled. If replica readiness is [NOT_READY, READY], then the rolling update would bring down the 2nd replica and cause complete unavailability.

@congcongke
Contributor

Got it.

The rolling update will be blocked if the actual count of unavailable replicas is greater than or equal to maxUnavailable.

@pierewoj
Author

pierewoj commented Apr 9, 2025

Yes:

  • the current behavior would be an availability drop while replica-1 is updated
  • with the fix, the rolling update would wait for replica-0 to become ready (for example, to finish loading the LLM) before proceeding to update replica-1; the gating rule is sketched below
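
For clarity, here is that gating rule as a tiny sketch. The function name and signature are invented for illustration and are not taken from the lws codebase.

```go
package main

import "fmt"

// Illustrative only: another ready replica may be disrupted only while the
// number of already-unavailable replicas is below the maxUnavailable budget.
func canDisruptAnotherReplica(unavailableReplicas, maxUnavailable int32) bool {
	return unavailableReplicas < maxUnavailable
}

func main() {
	// replicas=[NOT_READY, READY], maxUnavailable=1: one replica is already
	// unavailable, so the update must not take down the ready one yet.
	fmt.Println(canDisruptAnotherReplica(1, 1)) // false -> wait for replica-0
}
```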

@Edwinhr716
Contributor

Overall I agree with this change. If I understand correctly, the main issue is that our core logic doesn't differentiate between a replica not being ready and a replica not being updated (it essentially assumes that if the replica is not updated, it must be running) here:

if !(podTemplateHash == revisionKey && podutils.PodRunningAndReady(sortedPods[index])) {

To simplify it, could we add an extra count for readyReplicas and return three ints here instead?

func (r *LeaderWorkerSetReconciler) iterateReplicas(ctx context.Context, lws *leaderworkerset.LeaderWorkerSet, stsReplicas int32, revisionKey string) (int32, int32, error) {

@pierewoj
Author

pierewoj commented Apr 11, 2025

This was actually my original implementation, but I think it is insufficient. Consider the following scenario:

  • replicas=[NOT_READY, NOT_READY], maxUnavail=1, maxSurge=0
  • the existing algorithm on lws/main will compute partition=1 (which is actually OK)
  • the proposed simplified algorithm would produce a count that moves the partition to partition=2 because the replicas are not ready, and the update would get stuck
  • this is not desired: the replicas are not ready anyway, so it is safe to update them
  • this is why the proposed algorithm in the PR has the for loop at L633 (a simplified illustration of the idea follows below)
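
To illustrate the idea, here is a deliberately simplified sketch (not the PR's actual code): replicas that are already not ready consume the unavailability budget up front, but a not-ready replica can still be pulled into the update without losing availability, so the partition keeps moving instead of getting stuck. The sketch only looks at readiness of the old replicas and ignores maxSurge and revision hashes.

```go
package main

import "fmt"

// partitionSketch returns a StatefulSet-style partition (ordinals >= partition
// receive the new revision) given the readiness of the existing replicas.
func partitionSketch(ready []bool, maxUnavailable int) int {
	replicas := len(ready)

	// Replicas that are already not ready consume the unavailability budget
	// up front: they are not serving traffic either way.
	budget := maxUnavailable
	for _, r := range ready {
		if !r {
			budget--
		}
	}
	if budget < 0 {
		budget = 0
	}

	partition := replicas
	// Walk from the highest ordinal downwards. A not-ready replica can always
	// be pulled into the update (no availability is lost); a ready replica is
	// pulled in only while budget remains.
	for i := replicas - 1; i >= 0; i-- {
		if !ready[i] {
			partition = i
			continue
		}
		if budget == 0 {
			break
		}
		budget--
		partition = i
	}
	return partition
}

func main() {
	fmt.Println(partitionSketch([]bool{false, true}, 1))  // 2: wait for replica-0 to recover
	fmt.Println(partitionSketch([]bool{true, true}, 1))   // 1: update replica-1
	fmt.Println(partitionSketch([]bool{false, false}, 1)) // 0: nothing to protect, keep moving
}
```

With this sketch, [NOT_READY, READY] holds the partition at 2 until replica-0 recovers, [READY, READY] lowers it to 1, and [NOT_READY, NOT_READY] does not get stuck; the exact values the PR computes may differ.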

@Edwinhr716
Contributor

We can keep the proposed algorithm to calculate maxUnavailable, but instead of passing the states struct, just pass the readyReplicas value. The current implementation requires iterating through the replicas four different times; it would be better to iterate through them just twice and keep the core logic of calculating readyReplicas, continousReady, and unreadyReplicas in iterateReplicas.
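
A rough sketch of what that single counting pass could look like; the replicaState type, its fields, and countReplicas are invented for illustration and do not match the actual lws code.

```go
package main

import "fmt"

// Invented for this sketch: the per-replica facts the counting pass needs.
type replicaState struct {
	updated bool // leader pod's template hash matches the target revision
	ready   bool // leader and all workers are running and ready
}

// countReplicas walks the replicas once, from the highest ordinal down, and
// returns the three counts discussed above: total ready, the run of
// updated-and-ready replicas at the tail (the part a StatefulSet partition
// has already moved past), and total not ready.
func countReplicas(states []replicaState) (readyReplicas, continuousReady, unreadyReplicas int32) {
	tail := true
	for i := len(states) - 1; i >= 0; i-- {
		s := states[i]
		if s.ready {
			readyReplicas++
		} else {
			unreadyReplicas++
		}
		if tail && s.updated && s.ready {
			continuousReady++
		} else {
			tail = false
		}
	}
	return readyReplicas, continuousReady, unreadyReplicas
}

func main() {
	states := []replicaState{
		{updated: false, ready: false}, // replica-0: old revision, not ready
		{updated: true, ready: true},   // replica-1: already updated and ready
	}
	fmt.Println(countReplicas(states)) // 1 1 1
}
```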
