Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shard-manager: only allow alive pods to register #146

Merged

Conversation

yoohaemin
Copy link
Contributor

@yoohaemin yoohaemin commented Oct 16, 2024

Currently, a pod that is not healthy can still be registered to shard-manager, but will be deregistered immediately following a healthcheck. This is not desirable as it can cause rapid rebalancing of many shards, and in the worst case, service disruption when the api pods do not get the unassignment notification.

This PR aims to prevent this by not allowing assignment of shards when the shard manager does not find the pod healthy.

Background:

As discussed here: https://discord.com/channels/629491597070827530/1017210544295522325/1296046997601652799

On staging env, we had the problem where shard-manager was looking at a reused ip address. I added the label selector on shard-manager and deployed to prod. The app seemingly behaved normally, but the label selector was wrong, so it was not getting any pods.
What it actually did was: because there were no pods, each app got all shard assigned and then (after shard-manager couldn't find the pod on k8s) immediately unassigned all shards, but the app didn't see that unassign for some reason. Each of the pods thought they held all shards, and it caused some data corruption.
I think I can contribute: on registration, shard-manager checks if the pod is alive first, and refuse to assign any shard in case it doesn't see the pod. In this case, the label mistake will have blown up on staging first, so the bug wouldn't have sneaked into prod.

Copy link
Collaborator

@ghostdogpr ghostdogpr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@ghostdogpr ghostdogpr merged commit ac770c8 into devsisters:series/2.x Oct 17, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants