Fix issue with rolling update when some existing replicas are unhealthy #488

Open
wants to merge 1 commit into base: main

Conversation

@pierewoj commented Apr 7, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it

The bug is as follows:

  • assume replica readiness is [NOT_READY, READY]
  • an update is triggered with MaxUnavailable=1
  • the update starts with replica-1, causing complete downtime, while the desired behavior would be to wait for replica-0 to become ready

Replica-0 might be unavailable for a variety of reasons, such as hardware issues, node patching, etc.
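
To make the failure concrete, here is a minimal, self-contained sketch of the arithmetic (not the controller's actual code): a partition derived from maxUnavailable alone picks replica-1 for the update even though replica-0 is already the only unavailable one.

```go
package main

import "fmt"

// Minimal sketch of the failure mode described above (not the controller's
// real code). StatefulSet partition semantics: ordinals >= partition receive
// the new revision.
func main() {
	replicas := 2
	maxUnavailable := 1
	ready := []bool{false, true} // replica-0 NOT_READY, replica-1 READY

	// A partition derived from maxUnavailable alone ignores current readiness.
	naivePartition := replicas - maxUnavailable // = 1, so replica-1 gets updated

	// Count the replicas that are still serving while ordinal 1 is restarting.
	stillReady := 0
	for ordinal, isReady := range ready {
		if ordinal < naivePartition && isReady {
			stillReady++
		}
	}
	fmt.Println("partition:", naivePartition, "ready during update:", stillReady) // partition: 1 ready during update: 0
}
```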

The fix is to reduce the partition less aggressively, taking into account pods that are not ready.

Note that the fix was tested on a cluster with the MaxUnavailableStatefulSet feature disabled (I don't have access to a cluster with this feature enabled; the issue may be less impactful there, but the fix is arguably still beneficial, since LWS considers the readiness of leader and workers, not only the leader).

Which issue(s) this PR fixes

I have not created a dedicated issue for this on GitHub.

Special notes for your reviewer

I also refactored iterateReplicas a bit into separate functions: a) reading the replica state, and b) deriving from that state the values used to compute the rolling update parameters.

Does this PR introduce a user-facing change?

LWS users will see updates progress less aggressively when not all replicas are healthy at the time an update is triggered.

Testing:
* integration tests
@k8s-ci-robot added the kind/bug label on Apr 7, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pierewoj
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes label on Apr 7, 2025
@k8s-ci-robot
Contributor

Welcome @pierewoj!

It looks like this is your first PR to kubernetes-sigs/lws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/lws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test label on Apr 7, 2025
@k8s-ci-robot
Contributor

Hi @pierewoj. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/L label (100-499 lines changed, ignoring generated files) on Apr 7, 2025

netlify bot commented Apr 7, 2025

Deploy Preview for kubernetes-sigs-lws canceled.

🔨 Latest commit: 358eff5
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-lws/deploys/67f383847229500008e82e6c

@yankay
Member

yankay commented Apr 7, 2025

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Apr 7, 2025
@Edwinhr716
Contributor

/assign @kerthcet since he implemented the original logic

Also, #490 changes the rolling update logic a bit, so this might need a rebase once that is merged.

@congcongke
Contributor

@pierewoj
I didn't get the key point.

  • the update is rolling, so new replicas will be created, right?
  • after a new replica is ready, the old one will be updated.
  • if the old replica never becomes ready, the burst replicas will be kept around.

@pierewoj
Author

pierewoj commented Apr 8, 2025

Consider maxSurge=0, replicas=2, maxUnavail=1, and a cluster with MaxUnavailableStatefulSet disabled. If replica readiness is [NOT_READY, READY], then the rolling update would bring down the 2nd replica and cause complete unavailability.

@congcongke
Contributor

Got it.

The rolling update will be blocked if the actual count of unavailable replicas is greater than or equal to maxUnavailable.

@pierewoj
Author

pierewoj commented Apr 9, 2025

Yes:

  • the current behavior would be an availability drop while replica-1 is updated
  • with the fix, the rolling update would wait for replica-0 to become ready (for example, to finish loading the LLM) before proceeding to update replica-1; the gating rule is sketched below
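
For clarity, here is that gating rule as a tiny sketch. The function name and signature are invented for illustration and are not taken from the lws codebase.

```go
package main

import "fmt"

// Illustrative only: another ready replica may be disrupted only while the
// number of already-unavailable replicas is below the maxUnavailable budget.
func canDisruptAnotherReplica(unavailableReplicas, maxUnavailable int32) bool {
	return unavailableReplicas < maxUnavailable
}

func main() {
	// replicas=[NOT_READY, READY], maxUnavailable=1: one replica is already
	// unavailable, so the update must not take down the ready one yet.
	fmt.Println(canDisruptAnotherReplica(1, 1)) // false -> wait for replica-0
}
```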

@Edwinhr716
Contributor

Overall I agree with this change. If I understand correctly, the main issue is that our core logic doesn't differentiate between a replica not being ready and a replica not being updated (it essentially assumes that if the replica is not updated, it must be running) here:

if !(podTemplateHash == revisionKey && podutils.PodRunningAndReady(sortedPods[index])) {

To simplify it, could we add an extra count for readyReplicas and return three ints here instead?

func (r *LeaderWorkerSetReconciler) iterateReplicas(ctx context.Context, lws *leaderworkerset.LeaderWorkerSet, stsReplicas int32, revisionKey string) (int32, int32, error) {

@pierewoj
Author

pierewoj commented Apr 11, 2025

This was actually my original implementation, but I think it is insufficient. Consider the following scenario:

  • replicas=[NOT_READY, NOT_READY], maxUnavail=1, maxSurge=0
  • the existing algorithm on lws/main will compute partition=1 (which is actually OK)
  • the proposed simplified algorithm would produce a count that moves the partition to partition=2 because the replicas are not ready, and the update would get stuck
  • this is not desired: the replicas are not ready anyway, so it is safe to update them
  • this is why the proposed algorithm in the PR has the for loop at L633 (a simplified illustration of the idea follows below)
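
To illustrate the idea, here is a deliberately simplified sketch (not the PR's actual code): replicas that are already not ready consume the unavailability budget up front, but a not-ready replica can still be pulled into the update without losing availability, so the partition keeps moving instead of getting stuck. The sketch only looks at readiness of the old replicas and ignores maxSurge and revision hashes.

```go
package main

import "fmt"

// partitionSketch returns a StatefulSet-style partition (ordinals >= partition
// receive the new revision) given the readiness of the existing replicas.
func partitionSketch(ready []bool, maxUnavailable int) int {
	replicas := len(ready)

	// Replicas that are already not ready consume the unavailability budget
	// up front: they are not serving traffic either way.
	budget := maxUnavailable
	for _, r := range ready {
		if !r {
			budget--
		}
	}
	if budget < 0 {
		budget = 0
	}

	partition := replicas
	// Walk from the highest ordinal downwards. A not-ready replica can always
	// be pulled into the update (no availability is lost); a ready replica is
	// pulled in only while budget remains.
	for i := replicas - 1; i >= 0; i-- {
		if !ready[i] {
			partition = i
			continue
		}
		if budget == 0 {
			break
		}
		budget--
		partition = i
	}
	return partition
}

func main() {
	fmt.Println(partitionSketch([]bool{false, true}, 1))  // 2: wait for replica-0 to recover
	fmt.Println(partitionSketch([]bool{true, true}, 1))   // 1: update replica-1
	fmt.Println(partitionSketch([]bool{false, false}, 1)) // 0: nothing to protect, keep moving
}
```

With this sketch, [NOT_READY, READY] holds the partition at 2 until replica-0 recovers, [READY, READY] lowers it to 1, and [NOT_READY, NOT_READY] does not get stuck; the exact values the PR computes may differ.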

@Edwinhr716
Contributor

We can keep the proposed algorithm to calculate maxUnavailable, but instead of passing the states struct, just pass the readyReplicas value. The current implementation requires iterating through the replicas four different times; it would be better to iterate through them just twice and keep the core logic of calculating readyReplicas, continousReady, and unreadyReplicas in iterateReplicas.
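
A rough sketch of what that single counting pass could look like; the replicaState type, its fields, and countReplicas are invented for illustration and do not match the actual lws code.

```go
package main

import "fmt"

// Invented for this sketch: the per-replica facts the counting pass needs.
type replicaState struct {
	updated bool // leader pod's template hash matches the target revision
	ready   bool // leader and all workers are running and ready
}

// countReplicas walks the replicas once, from the highest ordinal down, and
// returns the three counts discussed above: total ready, the run of
// updated-and-ready replicas at the tail (the part a StatefulSet partition
// has already moved past), and total not ready.
func countReplicas(states []replicaState) (readyReplicas, continuousReady, unreadyReplicas int32) {
	tail := true
	for i := len(states) - 1; i >= 0; i-- {
		s := states[i]
		if s.ready {
			readyReplicas++
		} else {
			unreadyReplicas++
		}
		if tail && s.updated && s.ready {
			continuousReady++
		} else {
			tail = false
		}
	}
	return readyReplicas, continuousReady, unreadyReplicas
}

func main() {
	states := []replicaState{
		{updated: false, ready: false}, // replica-0: old revision, not ready
		{updated: true, ready: true},   // replica-1: already updated and ready
	}
	fmt.Println(countReplicas(states)) // 1 1 1
}
```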
