fix: replica of self #6097
Conversation
Signed-off-by: Kostas Kyrimis <[email protected]>
romange left a comment:
Why would a replica point to itself?
romange left a comment:
I need to understand the flow better. I do not like the fix, but I first need to understand the context before suggesting something different.
I wrote this on the issue. The hypothesis is that somehow they end up calling replicaof on the same node. I replicated both from 1.24 to 1.25 and vice versa, attempted a takeover, and added a new replica, all without any issues. I was going over the code and discovered the regression. Then I looked at the logs from the issue, and it is the exact same error I saw locally; that's when it clicked.
I am happy for suggestions. It's important not to prematurely switch the state when we call Start(), though. If there is anything simpler than this, I am all ears :)
tests/dragonfly/replication_test.py (Outdated)
```python
@dfly_args({"port": 7000})
async def test_replica_of_self(async_client):
    with pytest.raises(redis.exceptions.ResponseError):
        await async_client.execute_command("replicaof localhost 6379")
```
I do not understand the reproduction scenario. You try to replicate from a non-existing master, yet the test is called replica_of_self.
I understand the regression. This is how I would fix it.
Unless I am missing something in your proposal, what you write is not true. Why do you think any of these will fail if we don't change the state? We don't want to change the state from active to loading because that's premature, it complicates things, and avoiding it was one of the goals of replicaof v2. You want to reject Greet or Ping (just like it's done now) -- sure; my question is how, because we can't rely on the server state anymore.
Signed-off-by: Kostas Kyrimis <[email protected]>
src/server/replica.cc (Outdated)
```diff
- if (!CheckRespIsSimpleReply("OK")) {
-   LOG(WARNING) << "Bad REPLCONF CLIENT-ID response";
- }
+ PC_RETURN_ON_BAD_RESPONSE(CheckRespIsSimpleReply("OK"));
```
We log an error, "Can't connect to myself", however we don't return it back to the client who sent the replicaof command. We could do that if we replaced std::error_code with a custom category, which we could translate into a custom error message. Not worth it IMO.
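For context, a minimal sketch of the custom-category approach mentioned above; the names ReplicaErrc and ReplicaErrorCategory are hypothetical and not part of the Dragonfly codebase:

```cpp
#include <string>
#include <system_error>

// Hypothetical error enum for replica-side failures.
enum class ReplicaErrc { kConnectToSelf = 1 };

// Custom category translating the enum into a human-readable message.
class ReplicaErrorCategory : public std::error_category {
 public:
  const char* name() const noexcept override { return "replica"; }
  std::string message(int ev) const override {
    switch (static_cast<ReplicaErrc>(ev)) {
      case ReplicaErrc::kConnectToSelf:
        return "Can't connect to myself";
      default:
        return "unknown replica error";
    }
  }
};

inline const std::error_category& replica_category() {
  static ReplicaErrorCategory instance;
  return instance;
}

// Enables: std::error_code ec = ReplicaErrc::kConnectToSelf;
inline std::error_code make_error_code(ReplicaErrc e) {
  return {static_cast<int>(e), replica_category()};
}

namespace std {
template <> struct is_error_code_enum<ReplicaErrc> : true_type {};
}
```

With a category like this, the error code carries its own message, so the replicaof handler could forward it to the client verbatim instead of only logging it.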
@romange plz take a look, much cleaner now 👍
tests/dragonfly/replication_test.py (Outdated)
```python
pass


@dfly_args({"port": 7000})
```
nit: no need to specify the port. You can get it via async_client.connection_pool.connection_kwargs["port"].
👀 👀
src/server/server_family.cc (Outdated)
```cpp
  info->id = arg;
}
// If we tried to replicate from ourself reply with an error
if (arg == master_replid_) {
```
hmm, why here? Why not in the replica, inside HandleCapaDflyResp? One of the reasons that is preferable is that during upgrades the (old) master does not have this fix, so we would still get the infinite loop. HandleCapaDflyResp is on the replica side, so you propagate the good behaviour naturally.
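A minimal sketch of the replica-side guard suggested here, assuming the master's replication id has already been parsed out of the handshake response; the function and parameter names are illustrative, not Dragonfly's actual API:

```cpp
#include <string>
#include <system_error>

// Sketch only: in the real handshake the master's replication id arrives
// in the response to REPLCONF CAPA dragonfly; here it is passed in directly.
std::error_code CheckNotReplicaOfSelf(const std::string& master_repl_id,
                                      const std::string& my_repl_id) {
  if (master_repl_id == my_repl_id) {
    // We connected to ourselves: abort the handshake before any state
    // transition, so the REPLICAOF command fails cleanly.
    return std::make_error_code(std::errc::connection_refused);
  }
  return {};  // not self, continue the handshake
}
```

Because the comparison runs on the replica side, even a master from an older version that lacks the fix is handled correctly, which is the upgrade argument made above.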
Signed-off-by: Kostas Kyrimis <[email protected]>
Signed-off-by: Kostas Kyrimis <[email protected]>
The new replicaof algorithm does not prematurely change the state of dragonfly to loading. When replicaof points to self, the node will try to connect to itself and succeed, change the state to loading, and get stuck trying to call replconf on itself (which fails because the node is now !master). The fix is to check whether we connected to ourselves and, in that case, not start replication at all.
Fixes #6091
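An illustrative sketch of the control flow described above, with hypothetical stub names standing in for Dragonfly's actual handshake steps:

```cpp
#include <system_error>

// Hypothetical stubs standing in for the real handshake steps.
std::error_code ConnectAndAuth();         // TCP connect + auth
std::error_code GreetAndCheckIdentity();  // rejects "replica of self"
enum class GlobalState { ACTIVE, LOADING };
void SwitchState(GlobalState state);

// The state only flips to LOADING after the handshake proves the
// target is not this node itself.
std::error_code StartReplication() {
  if (auto ec = ConnectAndAuth(); ec)
    return ec;  // note: connecting to ourselves can still succeed here
  if (auto ec = GreetAndCheckIdentity(); ec)
    return ec;  // replica-of-self is rejected before any state change
  SwitchState(GlobalState::LOADING);
  return {};
}
```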