feat: add node health status for CLUSTER SLOTS and SHARDS #4767
auto config = GetShardInfos(cntx);
if (config) {
  // we need to remove hidden replicas
  auto shards_info = config->Unwrap();
We make so many unnecessary copies of a relatively large data structure. We copy it here by value and then again on line 246.
And it's not only here: going over the cluster code, it seems we copy by value a lot for no good reason (when const& is perfectly fine on those accessors).
Yes, we waste some resources in the cluster code, but it's not important right now.
Well, there is no reason to copy by value, and it's an easy fix, so...
The config is constant and shouldn't be changed.
I'm not objecting to that. Return const& to avoid copies, then 😄
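A minimal C++ sketch of the accessor change suggested in this thread. It assumes hypothetical ClusterConfig, ClusterShardInfo and ClusterNodeInfo types (not Dragonfly's actual definitions): because the config is immutable, the accessor can return a const reference, and a caller that really needs to modify the data makes one explicit copy.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the cluster types discussed in the review;
// the real Dragonfly definitions may differ.
struct ClusterNodeInfo {
  std::string id;
};

struct ClusterShardInfo {
  std::vector<ClusterNodeInfo> replicas;
};

using ClusterShardInfos = std::vector<ClusterShardInfo>;

class ClusterConfig {
 public:
  explicit ClusterConfig(ClusterShardInfos shards) : shards_(std::move(shards)) {}

  // The config is immutable, so handing out a const reference is safe
  // and avoids copying the whole shard vector on every call.
  const ClusterShardInfos& GetShardInfos() const {
    return shards_;
  }

 private:
  ClusterShardInfos shards_;
};

// A caller that must modify the data (e.g. to drop replicas) makes
// exactly one explicit copy instead of relying on copies by value.
ClusterShardInfos CopyForFiltering(const ClusterConfig& config) {
  ClusterShardInfos copy = config.GetShardInfos();
  return copy;
}
```

Read-only paths such as reply generation can then iterate over GetShardInfos() directly, and only the filtering path pays for a copy.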
src/server/cluster/cluster_family.cc
Outdated
slot_ranges += shard.slot_ranges.Size();
auto new_end = std::remove_if(shard.replicas.begin(), shard.replicas.end(), [](const auto& r) {
  return r.health == NodeHealth::HIDDEN || r.health == NodeHealth::FAIL;
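The diff excerpt above stops after the predicate. A sketch of how such a filter is typically completed with the erase-remove idiom, assuming hypothetical ReplicaInfo and NodeHealth types (only HIDDEN, FAIL and ONLINE appear in the diff; LOADING is taken from the discussion below):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical health states; not necessarily Dragonfly's actual enum.
enum class NodeHealth { ONLINE, LOADING, FAIL, HIDDEN };

struct ReplicaInfo {
  std::string id;
  NodeHealth health;
};

// Erase-remove idiom: std::remove_if shifts the kept elements to the front
// (preserving their order) and returns the new logical end; erase() then
// drops the leftover tail from the vector.
void DropHiddenAndFailed(std::vector<ReplicaInfo>& replicas) {
  auto new_end = std::remove_if(replicas.begin(), replicas.end(), [](const ReplicaInfo& r) {
    return r.health == NodeHealth::HIDDEN || r.health == NodeHealth::FAIL;
  });
  replicas.erase(new_end, replicas.end());
}
```

With C++20, std::erase_if(replicas, pred) performs both steps in a single call.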
I don't think we should include LOADING replicas in CLUSTER SLOTS, should we?
We can't have clients connecting to LOADING replicas, as they won't be reachable (so the request will fail).
It's just a status for compatibility with Redis.
Though the original goal of adding this node health state was to keep clients from connecting to replicas that are still syncing with the master (which aren't reachable in Dragonfly Cloud), and that won't be fixed if we include LOADING replicas in CLUSTER SLOTS, right?
Regarding "it's just a status to be compatible with redis": should we then mark those replicas as HIDDEN while they aren't yet synced with the master? (In that case we'll never use the LOADING state.)
You control this info from the config. So if you decide that the LOADING state isn't needed, you can send HIDDEN.
@BorysTheDev I think that when a replica is in the LOADING state, some cluster client commands should return the node and others should not. So I believe this logic belongs in Dragonfly, not in the cluster manager.
If the client uses the CLUSTER SHARDS command, it should see the LOADING state and know not to redirect traffic to that replica.
If the client uses the CLUSTER SLOTS command, it should not see the replica at all while it's in the LOADING state.
Therefore the fix belongs here: don't expose LOADING replicas.
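One way to express the rule proposed in this comment, as a hypothetical sketch rather than the PR's actual code: CLUSTER SHARDS keeps LOADING replicas and reports their health, while CLUSTER SLOTS, whose reply has no health field, omits every replica that is not ONLINE. Treating FAIL replicas the same way as LOADING ones is an assumption here, and all type and function names are illustrative.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical types; the real Dragonfly structures may differ.
enum class NodeHealth { ONLINE, LOADING, FAIL, HIDDEN };

struct ReplicaInfo {
  std::string id;
  NodeHealth health;
};

enum class ClusterCommand { SLOTS, SHARDS };

// HIDDEN replicas are never exposed.
// CLUSTER SHARDS reports a health field, so clients can see LOADING (and,
// by assumption, FAIL) replicas and decide not to route to them.
// CLUSTER SLOTS carries no health information, so only ONLINE replicas
// are listed there.
bool IsVisible(ClusterCommand cmd, NodeHealth health) {
  if (health == NodeHealth::HIDDEN)
    return false;
  if (cmd == ClusterCommand::SLOTS)
    return health == NodeHealth::ONLINE;
  return true;
}

std::vector<ReplicaInfo> VisibleReplicas(ClusterCommand cmd,
                                         const std::vector<ReplicaInfo>& replicas) {
  std::vector<ReplicaInfo> out;
  for (const ReplicaInfo& r : replicas) {
    if (IsVisible(cmd, r.health))
      out.push_back(r);
  }
  return out;
}
```

Keeping the decision in one predicate makes the per-command difference explicit and keeps it inside Dragonfly rather than in the cluster manager.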
case NodeHealth::ONLINE:
  return "online";
case NodeHealth::HIDDEN:
  DCHECK(false);  // shouldn't be used
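A sketch of the complete conversion for the switch above, assuming a hypothetical NodeHealth enum with ONLINE, LOADING, FAIL and HIDDEN values. The strings "online", "failed" and "loading" are the values Redis documents for the health field of CLUSTER SHARDS; plain assert stands in for DCHECK to keep the example free of dependencies.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical enum; not necessarily Dragonfly's actual definition.
enum class NodeHealth { ONLINE, LOADING, FAIL, HIDDEN };

// Maps a health state to the string reported in CLUSTER SHARDS replies.
// HIDDEN nodes are expected to be filtered out before the reply is built,
// so reaching that case indicates a bug.
std::string_view NodeHealthToString(NodeHealth health) {
  switch (health) {
    case NodeHealth::ONLINE:
      return "online";
    case NodeHealth::LOADING:
      return "loading";
    case NodeHealth::FAIL:
      return "failed";
    case NodeHealth::HIDDEN:
      assert(false && "HIDDEN nodes must not be reported");
      break;
  }
  return "";
}
```

In release builds (NDEBUG) the assert compiles away and the function falls through to an empty string, which mirrors how a DCHECK-only guard behaves.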
Why?
Because we shouldn't show it; I've added it for consistency.
3ac98fd to 2a7533b
fixes: #4741