
Conversation

@kostasrim
Contributor

The new replicaof algorithm does not prematurely change Dragonfly's state to loading. As a result, when replicaof points to the node itself, it will try to connect to itself and succeed, change the state to loading, and then get stuck trying to call REPLCONF on itself (which fails because the node is now !master).

The fix is to check whether we got connected to the same node and, if so, simply not start replication at all (a sketch of the idea follows below).
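
To make that idea concrete, here is a minimal sketch of such a check. This is not the actual Dragonfly code; all of the types and helpers below are invented for illustration. The only assumption taken from the description above is that, after a successful connect and handshake, we can tell that the remote side is ourselves and abort before any server state is touched.

#include <iostream>
#include <string>

// Hypothetical stand-ins for the real replication machinery; names are invented.
struct HandshakeInfo {
  std::string master_replid;  // replication id reported by the remote side
};

struct Node {
  std::string my_replid;

  // Pretend to connect and run the initial handshake against host:port.
  HandshakeInfo ConnectAndGreet(const std::string& host, int port) {
    (void)host;
    (void)port;
    // In the bug scenario host:port is ourselves, so the reported id is our own.
    return HandshakeInfo{my_replid};
  }

  bool ReplicaOf(const std::string& host, int port) {
    HandshakeInfo info = ConnectAndGreet(host, port);
    if (info.master_replid == my_replid) {
      // Connected to ourselves: bail out *before* switching master -> replica.
      std::cout << "ERR can not replicate from myself\n";
      return false;
    }
    // ... only now flip the state and launch the replication flows ...
    return true;
  }
};

int main() {
  Node node{"1f1f1f1f1f1f1f1f"};
  node.ReplicaOf("localhost", 6379);  // rejected cleanly, state untouched
}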

Fixes #6091

Signed-off-by: Kostas Kyrimis <[email protected]>
@kostasrim kostasrim self-assigned this Nov 21, 2025
@kostasrim kostasrim requested a review from romange November 21, 2025 13:01
Collaborator

@romange romange left a comment

Why would a replica point to itself?

Collaborator

@romange romange left a comment

I need to understand the flow better. I do not like the fix, but I first need to understand the context before suggesting something different.

@kostasrim
Contributor Author

Why would a replica point to itself?

I wrote this on the issue. The hypothesis is that somehow they end up calling replicaof on the same node. I replicated both from 1.24 to 1.25 and vice versa, attempted a takeover, and added a new replica, all without any issues. Then I was going over the code and discovered the regression. After that I looked at the logs from the issue:

W20251119 12 main_service.cc:1672]  REPLCONF listening-port 6379 failed with reason: Replicating a replica is unsupported

And it is the exact same error I saw locally; then it clicked.

I need to understand the flow better.

They call replicaof self_host self_port. replica_->Start() passes without an error, so now the node is connected to itself. We switch the state from master to replica. Then we start initializing the flows, which call REPLCONF and get back "REPLCONF listening-port 6379 failed with reason: Replicating a replica is unsupported", and this gets stuck in a loop (sketched below).
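
To show why this hangs rather than simply erroring out, here is a rough, purely illustrative sketch of the retry structure (the real MainReplicationFiber is different; everything below is invented). The point is that the REPLCONF step is retried on error, and since the error here is permanent (we are talking to ourselves, which is now a replica), nothing ever breaks the cycle.

#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Hypothetical: returns an error text on failure, an empty string on success.
std::string SendReplconf() {
  // The peer is ourselves and we are already in replica state,
  // so REPLCONF is rejected every single time.
  return "Replicating a replica is unsupported";
}

int main() {
  int attempts = 0;
  while (true) {
    std::string err = SendReplconf();
    if (err.empty()) break;  // would continue to full sync on success
    std::cout << "REPLCONF failed: " << err << ", retrying...\n";
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    // In the real scenario nothing bounds this loop; capped here only so the sketch terminates.
    if (++attempts == 5) return 1;
  }
}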

I do not like the fix, but I first need to understand the context before suggesting something different.

I am happy to take suggestions. It's important not to prematurely switch the state when we call Start(), though. If there is anything simpler than this, I am all ears :)

@dfly_args({"port": 7000})
async def test_replica_of_self(async_client):
    with pytest.raises(redis.exceptions.ResponseError):
        await async_client.execute_command("replicaof localhost 6379")
Collaborator

I do not understand the reproduction scenario. You try to replicate from a non-existing master, yet the test is called replica_of_self.

@romange
Collaborator

romange commented Nov 23, 2025

I understand the regression. This is how I would fix it.

  1. Simple way: ReplicaOfInternalV2 calls Start, which in turn calls Greet(). If Greet fails with "Replicating a replica is unsupported", stop the flow and do not launch the async flow that goes into the infinite cycle of retrying. Revert the flow the same way you would when trying to connect to a non-existing host:port.

  2. Improvement: the error would still be confusing, as we would report that we are trying to connect to a replica. Upon receiving that error, run the "info replication" command from the replica, extract master_replid, and if it's the same as Replica::id_, output the precise error "can not connect to myself" (a rough sketch of this follows below).
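
A rough sketch of what option 2 could look like on the replica side. Only the idea itself (parse master_replid out of the INFO replication reply and compare it with our own id) comes from the comment above; the function names and the way the reply is obtained are placeholders.

#include <iostream>
#include <optional>
#include <string>

// Hypothetical: extract the master_replid field from an "INFO replication" reply.
std::optional<std::string> ParseMasterReplid(const std::string& info_reply) {
  const std::string key = "master_replid:";
  size_t pos = info_reply.find(key);
  if (pos == std::string::npos) return std::nullopt;
  size_t start = pos + key.size();
  size_t end = info_reply.find_first_of("\r\n", start);
  return info_reply.substr(start, end - start);
}

// Hypothetical: turn the confusing handshake error into a precise one.
std::string ExplainGreetFailure(const std::string& info_reply, const std::string& my_id) {
  std::optional<std::string> replid = ParseMasterReplid(info_reply);
  if (replid && *replid == my_id) return "can not connect to myself";
  return "Replicating a replica is unsupported";  // keep the original error otherwise
}

int main() {
  std::string info = "# Replication\r\nrole:slave\r\nmaster_replid:abc123\r\n";
  std::cout << ExplainGreetFailure(info, "abc123") << "\n";  // prints the precise error
}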

@kostasrim
Contributor Author

kostasrim commented Nov 24, 2025

I understand the regression. This is how I would fix it.

  1. Simple way: ReplicaOfInternalV2 calls Start, which in turn calls Greet(). If Greet fails with "Replicating a replica is unsupported", stop the flow and do not launch the async flow that goes into the infinite cycle of retrying. Revert the flow the same way you would when trying to connect to a non-existing host:port.

Unless I am missing something in your proposal, what you write is not true. Greet will succeed, ping will succeed, replica_->Start() will also succeed, and we will launch the MainReplicationFiber.

Why do you think any of these will fail if we don't change the state of the node from active to loading? (Before, we would get rejected because Dragonfly is in the loading state, and report that the replicaof command failed.)

We don't want to change the state from active to loading because that's premature and it complicates things; avoiding it was one of the goals of replicaof v2.

You want to reject Greet or Ping (just like it's done now) -- sure; my question is how, because we can't rely on server state anymore.

  if (!CheckRespIsSimpleReply("OK")) {
    LOG(WARNING) << "Bad REPLCONF CLIENT-ID response";
  }
  PC_RETURN_ON_BAD_RESPONSE(CheckRespIsSimpleReply("OK"));
Contributor Author

We log an error "Can't connect to myself"; however, we don't return that back to the client who sent the replicaof command. We could do this if we replaced the plain std::error_code with a custom error category from which we can translate/provide a custom error message. Not worth it IMO (a rough sketch of what that would involve is below).
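
For reference, this is roughly what the custom-category route would involve. The enum and category names below are invented; only std::error_category/std::error_code themselves are standard. It shows how the error could carry a message that the command layer can forward to the client verbatim.

#include <iostream>
#include <string>
#include <system_error>

// Hypothetical error enum and category for replication handshake failures.
enum class ReplicaError { kSelfReplication = 1 };

class ReplicaErrorCategory : public std::error_category {
 public:
  const char* name() const noexcept override { return "replica"; }
  std::string message(int ev) const override {
    switch (static_cast<ReplicaError>(ev)) {
      case ReplicaError::kSelfReplication:
        return "can't connect to myself";
      default:
        return "unknown replication error";
    }
  }
};

const std::error_category& replica_category() {
  static ReplicaErrorCategory cat;
  return cat;
}

std::error_code make_error_code(ReplicaError e) {
  return {static_cast<int>(e), replica_category()};
}

namespace std {
template <> struct is_error_code_enum<ReplicaError> : true_type {};
}  // namespace std

int main() {
  std::error_code ec = ReplicaError::kSelfReplication;
  // The command layer could now forward ec.message() to the client verbatim.
  std::cout << ec.message() << "\n";
}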

@kostasrim kostasrim requested a review from romange November 24, 2025 16:24
@kostasrim
Contributor Author

@romange plz take a look, much cleaner now 👍

pass


@dfly_args({"port": 7000})
Collaborator

nit: no need to specify the port; you can get it via async_client.connection_pool.connection_kwargs["port"]

Contributor Author

👀 👀

    info->id = arg;
  }
  // If we tried to replicate from ourselves, reply with an error
  if (arg == master_replid_) {
Collaborator

Hmm, why here? Why not on the replica side, inside HandleCapaDflyResp?
One reason that would be preferable: during updates, the (old) master does not have this fix, so we would still have the infinite loop. HandleCapaDflyResp is on the replica side, so you propagate the good behaviour naturally (a sketch of that variant is below).
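
To illustrate the suggestion, here is a sketch of a replica-side variant. The struct, field, and method signatures are invented (this is not the real HandleCapaDflyResp); the only assumption is that during the handshake the replica learns the master's replication id and can compare it with its own id_ before any flows are started.

#include <iostream>
#include <string>
#include <system_error>

// Hypothetical sketch of a replica-side handshake handler.
struct HandshakeReply {
  std::string master_replid;  // id reported by the master during the capa exchange
};

class Replica {
 public:
  explicit Replica(std::string id) : id_(std::move(id)) {}

  std::error_code HandleCapaResp(const HandshakeReply& reply) {
    if (reply.master_replid == id_) {
      // We are talking to ourselves: abort the handshake here, on the replica
      // side, so the fix works even against an old master without the check.
      std::cerr << "can't replicate from myself\n";
      return std::make_error_code(std::errc::operation_not_supported);
    }
    master_replid_ = reply.master_replid;
    return {};
  }

 private:
  std::string id_;
  std::string master_replid_;
};

int main() {
  Replica r("abc123");
  std::error_code ec = r.HandleCapaResp(HandshakeReply{"abc123"});
  std::cout << (ec ? "handshake aborted" : "handshake ok") << "\n";
}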

Signed-off-by: Kostas Kyrimis <[email protected]>
Signed-off-by: Kostas Kyrimis <[email protected]>
@kostasrim kostasrim requested a review from romange November 24, 2025 17:58
@kostasrim kostasrim merged commit 3f09687 into main Nov 25, 2025
10 checks passed
@kostasrim kostasrim deleted the kpr32 branch November 25, 2025 08:01


Development

Successfully merging this pull request may close these issues.

Replication Errors After Updating to 1.35.0
