feat: add node health status for CLUSTER SLOTS and SHARDS #4767

Merged
merged 3 commits into from
Mar 17, 2025

Conversation

BorysTheDev
Contributor

fixes: #4741

auto config = GetShardInfos(cntx);
if (config) {
// we need to remove hidden replicas
auto shards_info = config->Unwrap();
Contributor

We do so many unnecessary copies on a relatively large data structure. We copy it here by value and then another time on line 246.

And it's not only here. I was going over the cluster code and we seem to copy by value a lot for no good reason (when const& is perfectly fine on those accessors).

Contributor Author

Yes, we waste some resources in the cluster code, but it is not important right now.

Contributor

Well, there is no reason to copy by value and it's an easy fix so...

Contributor Author

config is constant and shouldn't be changed

Contributor

I am not objecting to that. Return const& to avoid copies, then 😄
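
The accessor suggestion above could look roughly like the following. This is a minimal sketch under stated assumptions, not Dragonfly's code: `Node`, `ShardInfo`, `Config`, `Shards`, `ShardsCopy`, and `CountReplicas` are hypothetical names used only to contrast returning by `const&` with returning by value.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the cluster types under discussion.
struct Node {
  std::string id;
};

struct ShardInfo {
  Node master;
  std::vector<Node> replicas;
};

class Config {
 public:
  explicit Config(std::vector<ShardInfo> shards) : shards_(std::move(shards)) {}

  // Returns a reference: no copy is made and callers cannot modify the config.
  const std::vector<ShardInfo>& Shards() const { return shards_; }

  // Returns by value: every call copies the whole vector.
  std::vector<ShardInfo> ShardsCopy() const { return shards_; }

 private:
  std::vector<ShardInfo> shards_;
};

// A read-only traversal works directly on the reference.
std::size_t CountReplicas(const Config& config) {
  std::size_t total = 0;
  for (const ShardInfo& shard : config.Shards()) {
    total += shard.replicas.size();
  }
  return total;
}
```

A caller that has to drop entries, as the diff here does, would still need its own mutable copy, but it could make that copy once and explicitly at the point where the filtering happens.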

slot_ranges += shard.slot_ranges.Size();
auto new_end = std::remove_if(shard.replicas.begin(), shard.replicas.end(), [](const auto& r) {
return r.health == NodeHealth::HIDDEN || r.health == NodeHealth::FAIL;
Contributor

I don't think we should include LOADING in CLUSTER SLOTS?

We can't have clients connecting to LOADING replicas as they won't be reachable (so the request will fail)

Contributor Author

It's just a status to be compatible with Redis.

Contributor

@andydunstall Mar 16, 2025

Though the original goal for adding this node health state was to avoid clients connecting to replicas syncing with the master (that aren't reachable in Dragonfly cloud) - which won't be fixed if we include LOADING replicas in CLUSTER SLOTS?

> it's just a status to be compatible with redis

So should we mark those replicas as hidden then when they aren't yet synced with the master? (In which case we'll never use the loading state)

Contributor Author

@BorysTheDev Mar 16, 2025

You control this info from the config. So if you decide that the loading state isn't needed, you can send hidden.

Collaborator

@BorysTheDev I think that when the replica is in loading state there are some cluster client commands which should return the node and there are other commands which should not return it. So I believe this logic should be in dragonfly and not in cluster manager

Collaborator

If the client is using the CLUSTER SHARDS command, it should see the loading state and know not to redirect traffic to that replica.
If the client is using the CLUSTER SLOTS command, it should not see the replica if it's in the loading state.
Therefore the fix should be here: do not expose loading replicas.
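
The per-command visibility rule proposed above can be sketched as two predicates over the health state. This is only an illustration under assumptions: the `Replica` struct, the helper names, and the order of the enum values are hypothetical, and which states CLUSTER SHARDS itself drops is assumed here to be HIDDEN only.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Health states named in this thread; the declaration order is an assumption.
enum class NodeHealth { FAIL, LOADING, ONLINE, HIDDEN };

// Hypothetical replica record for the sketch.
struct Replica {
  std::string id;
  NodeHealth health;
};

// CLUSTER SHARDS reports a health field, so (by assumption) only HIDDEN is dropped.
bool VisibleInShards(NodeHealth h) { return h != NodeHealth::HIDDEN; }

// CLUSTER SLOTS carries no health field, so only replicas that can serve
// traffic are listed: HIDDEN, FAIL, and LOADING are all filtered out.
bool VisibleInSlots(NodeHealth h) { return h == NodeHealth::ONLINE; }

// Copies only the replicas that the given predicate accepts.
template <typename Pred>
std::vector<Replica> FilterReplicas(const std::vector<Replica>& replicas, Pred visible) {
  std::vector<Replica> out;
  std::copy_if(replicas.begin(), replicas.end(), std::back_inserter(out),
               [&](const Replica& r) { return visible(r.health); });
  return out;
}
```

With this split, a client reading CLUSTER SHARDS can see a loading replica and skip it, while a client reading CLUSTER SLOTS never learns about it.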

case NodeHealth::ONLINE:
return "online";
case NodeHealth::HIDDEN:
DCHECK(false); // shouldn't be used
Collaborator

why?

Contributor Author

Because we shouldn't show it. I've added the case for consistency.
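
For context, a health-to-string helper with a case that should never be reached might look like the sketch below. It is an assumption-laden illustration rather than the merged code: the function name, the enum order, and the strings for FAIL and LOADING are guesses, and the standard `assert` stands in for `DCHECK`.

```cpp
#include <cassert>
#include <string_view>

// Health states named in this thread; the declaration order is an assumption.
enum class NodeHealth { FAIL, LOADING, ONLINE, HIDDEN };

// Hypothetical helper mapping a health state to the string shown to clients.
std::string_view NodeHealthToString(NodeHealth health) {
  switch (health) {
    case NodeHealth::FAIL:
      return "fail";  // assumed spelling
    case NodeHealth::LOADING:
      return "loading";  // assumed spelling
    case NodeHealth::ONLINE:
      return "online";
    case NodeHealth::HIDDEN:
      // Hidden replicas are filtered out before formatting, so reaching this
      // case signals a bug. Listing it keeps the switch exhaustive.
      assert(false && "HIDDEN nodes should never be printed");
      return "hidden";
  }
  return "unknown";  // unreachable for valid enum values
}
```

Keeping the HIDDEN case explicit means the compiler can still warn when a new enum value is added without a matching case.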

@BorysTheDev force-pushed the feat_add_health_status_to_cluster_shard_cmd branch from 3ac98fd to 2a7533b, March 16, 2025 19:01
@BorysTheDev requested a review from adiholden, March 17, 2025 06:26
@BorysTheDev merged commit 151e40e into main, Mar 17, 2025
10 checks passed
@BorysTheDev deleted the feat_add_health_status_to_cluster_shard_cmd branch, March 17, 2025 10:01
Development

Successfully merging this pull request may close these issues.

support valkey compatible behaviour for both cluster shards/nodes
4 participants