
Conversation

@kostasrim
Contributor

The new replicaof algorithm does not prematurely change Dragonfly's state to loading. As a result, when replicaof points to the node itself, it will try to connect to itself and succeed, change the state to loading, and then get stuck trying to call REPLCONF on itself (which fails because the node is now !master).

The fix is to check whether we got connected to the same node and, if so, simply not start replication at all (a sketch of the idea follows below).
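
To make that idea concrete, here is a minimal sketch of such a check. This is not the actual Dragonfly code; all of the types and helpers below are invented for illustration. The only assumption taken from the description above is that, after a successful connect and handshake, we can tell that the remote side is ourselves and abort before any server state is touched.

#include <iostream>
#include <string>

// Hypothetical stand-ins for the real replication machinery; names are invented.
struct HandshakeInfo {
  std::string master_replid;  // replication id reported by the remote side
};

struct Node {
  std::string my_replid;

  // Pretend to connect and run the initial handshake against host:port.
  HandshakeInfo ConnectAndGreet(const std::string& host, int port) {
    (void)host;
    (void)port;
    // In the bug scenario host:port is ourselves, so the reported id is our own.
    return HandshakeInfo{my_replid};
  }

  bool ReplicaOf(const std::string& host, int port) {
    HandshakeInfo info = ConnectAndGreet(host, port);
    if (info.master_replid == my_replid) {
      // Connected to ourselves: bail out *before* switching master -> replica.
      std::cout << "ERR can not replicate from myself\n";
      return false;
    }
    // ... only now flip the state and launch the replication flows ...
    return true;
  }
};

int main() {
  Node node{"1f1f1f1f1f1f1f1f"};
  node.ReplicaOf("localhost", 6379);  // rejected cleanly, state untouched
}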

Fixes #6091

Signed-off-by: Kostas Kyrimis <[email protected]>
@kostasrim kostasrim self-assigned this Nov 21, 2025
@kostasrim kostasrim requested a review from romange November 21, 2025 13:01
Collaborator

@romange romange left a comment

Why would a replica point to itself?

Collaborator

@romange romange left a comment

I need to understand the flow better. I do not like the fix, but I first need to understand the context before suggesting something different.

@kostasrim
Contributor Author

Why would a replica point to itself?

I wrote this on the issue. The hypothesis is that somehow they end up calling replicaof on the same node. I replicated both from 1.24 to 1.25 and vice versa, attempted a takeover, and added a new replica, all without any issues. Then I was going over the code and discovered the regression. After that I looked at the logs from the issue:

W20251119 12 main_service.cc:1672]  REPLCONF listening-port 6379 failed with reason: Replicating a replica is unsupported

And it is the exact same error I saw locally; then it clicked.

I need to understand the flow better.

They call replicaof self_host self_port. replica_->Start() passes without an error, so now the node is connected to itself. We switch the state from master to replica. Then we start initializing the flows, which call REPLCONF and get back "REPLCONF listening-port 6379 failed with reason: Replicating a replica is unsupported", and this gets stuck in a loop (sketched below).
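
To show why this hangs rather than simply erroring out, here is a rough, purely illustrative sketch of the retry structure (the real MainReplicationFiber is different; everything below is invented). The point is that the REPLCONF step is retried on error, and since the error here is permanent (we are talking to ourselves, which is now a replica), nothing ever breaks the cycle.

#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Hypothetical: returns an error text on failure, an empty string on success.
std::string SendReplconf() {
  // The peer is ourselves and we are already in replica state,
  // so REPLCONF is rejected every single time.
  return "Replicating a replica is unsupported";
}

int main() {
  int attempts = 0;
  while (true) {
    std::string err = SendReplconf();
    if (err.empty()) break;  // would continue to full sync on success
    std::cout << "REPLCONF failed: " << err << ", retrying...\n";
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    // In the real scenario nothing bounds this loop; capped here only so the sketch terminates.
    if (++attempts == 5) return 1;
  }
}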

I do not like the fix, but I first need to understand the context before suggesting something different.

I am happy to take suggestions. It's important not to prematurely switch the state when we call Start(), though. If there is anything simpler than this, I am all ears :)

@dfly_args({"port": 7000})
async def test_replica_of_self(async_client):
    with pytest.raises(redis.exceptions.ResponseError):
        await async_client.execute_command("replicaof localhost 6379")
Collaborator

I do not understand the reproduction scenario. You try to replicate from a non-existing master, yet the test is called replica_of_self.

@romange
Collaborator

romange commented Nov 23, 2025

I understand the regression. This is how I would fix it.

  1. Simple way: ReplicaOfInternalV2 calls Start, which in turn calls Greet(). If Greet fails with "Replicating a replica is unsupported", stop the flow and do not launch the async flow that goes into the infinite cycle of retrying. Revert the flow the same way you would when trying to connect to a non-existing host:port.

  2. Improvement: the error would still be confusing, as we would report that we are trying to connect to a replica. Upon receiving that error, run the "info replication" command from the replica, extract master_replid, and if it's the same as Replica::id_, output the precise error "can not connect to myself" (a rough sketch of this follows below).
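
A rough sketch of what option 2 could look like on the replica side. Only the idea itself (parse master_replid out of the INFO replication reply and compare it with our own id) comes from the comment above; the function names and the way the reply is obtained are placeholders.

#include <iostream>
#include <optional>
#include <string>

// Hypothetical: extract the master_replid field from an "INFO replication" reply.
std::optional<std::string> ParseMasterReplid(const std::string& info_reply) {
  const std::string key = "master_replid:";
  size_t pos = info_reply.find(key);
  if (pos == std::string::npos) return std::nullopt;
  size_t start = pos + key.size();
  size_t end = info_reply.find_first_of("\r\n", start);
  return info_reply.substr(start, end - start);
}

// Hypothetical: turn the confusing handshake error into a precise one.
std::string ExplainGreetFailure(const std::string& info_reply, const std::string& my_id) {
  std::optional<std::string> replid = ParseMasterReplid(info_reply);
  if (replid && *replid == my_id) return "can not connect to myself";
  return "Replicating a replica is unsupported";  // keep the original error otherwise
}

int main() {
  std::string info = "# Replication\r\nrole:slave\r\nmaster_replid:abc123\r\n";
  std::cout << ExplainGreetFailure(info, "abc123") << "\n";  // prints the precise error
}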

@kostasrim
Contributor Author

kostasrim commented Nov 24, 2025

I understand the regression. This is how I would fix it.

  1. Simple way: ReplicaOfInternalV2 calls Start, which in turn calls Greet(). If Greet fails with "Replicating a replica is unsupported", stop the flow and do not launch the async flow that goes into the infinite cycle of retrying. Revert the flow the same way you would when trying to connect to a non-existing host:port.

Unless I am missing something in your proposal, what you write is not true. Greet will succeed, ping will succeed, replica_->Start() will also succeed, and we will launch the MainReplicationFiber.

Why do you think any of these will fail if we don't change the state of the node from active to loading? (Before, we would get rejected because Dragonfly is in the loading state, and report that the replicaof command failed.)

We don't want to change the state from active to loading because that's premature and it complicates things; avoiding it was one of the goals of replicaof v2.

You want to reject Greet or Ping (just like it's done now) -- sure; my question is how, because we can't rely on server state anymore.

  if (!CheckRespIsSimpleReply("OK")) {
    LOG(WARNING) << "Bad REPLCONF CLIENT-ID response";
  }
  PC_RETURN_ON_BAD_RESPONSE(CheckRespIsSimpleReply("OK"));
Contributor Author

We log an error "Can't connect to myself"; however, we don't return that back to the client who sent the replicaof command. We could do this if we replaced the plain std::error_code with a custom error category from which we can translate/provide a custom error message. Not worth it IMO (a rough sketch of what that would involve is below).
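
For reference, this is roughly what the custom-category route would involve. The enum and category names below are invented; only std::error_category/std::error_code themselves are standard. It shows how the error could carry a message that the command layer can forward to the client verbatim.

#include <iostream>
#include <string>
#include <system_error>

// Hypothetical error enum and category for replication handshake failures.
enum class ReplicaError { kSelfReplication = 1 };

class ReplicaErrorCategory : public std::error_category {
 public:
  const char* name() const noexcept override { return "replica"; }
  std::string message(int ev) const override {
    switch (static_cast<ReplicaError>(ev)) {
      case ReplicaError::kSelfReplication:
        return "can't connect to myself";
      default:
        return "unknown replication error";
    }
  }
};

const std::error_category& replica_category() {
  static ReplicaErrorCategory cat;
  return cat;
}

std::error_code make_error_code(ReplicaError e) {
  return {static_cast<int>(e), replica_category()};
}

namespace std {
template <> struct is_error_code_enum<ReplicaError> : true_type {};
}  // namespace std

int main() {
  std::error_code ec = ReplicaError::kSelfReplication;
  // The command layer could now forward ec.message() to the client verbatim.
  std::cout << ec.message() << "\n";
}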

@kostasrim kostasrim requested a review from romange November 24, 2025 16:24
@kostasrim
Contributor Author

@romange plz take a look, much cleaner now 👍

pass


@dfly_args({"port": 7000})
Collaborator

nit: no need to specify the port; you can get it via async_client.connection_pool.connection_kwargs["port"]

Contributor Author

👀 👀

    info->id = arg;
  }
  // If we tried to replicate from ourselves, reply with an error
  if (arg == master_replid_) {
Collaborator

Hmm, why here? Why not on the replica side, inside HandleCapaDflyResp?
One reason that would be preferable: during updates, the (old) master does not have this fix, so we would still have the infinite loop. HandleCapaDflyResp is on the replica side, so you propagate the good behaviour naturally (a sketch of that variant is below).
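
To illustrate the suggestion, here is a sketch of a replica-side variant. The struct, field, and method signatures are invented (this is not the real HandleCapaDflyResp); the only assumption is that during the handshake the replica learns the master's replication id and can compare it with its own id_ before any flows are started.

#include <iostream>
#include <string>
#include <system_error>

// Hypothetical sketch of a replica-side handshake handler.
struct HandshakeReply {
  std::string master_replid;  // id reported by the master during the capa exchange
};

class Replica {
 public:
  explicit Replica(std::string id) : id_(std::move(id)) {}

  std::error_code HandleCapaResp(const HandshakeReply& reply) {
    if (reply.master_replid == id_) {
      // We are talking to ourselves: abort the handshake here, on the replica
      // side, so the fix works even against an old master without the check.
      std::cerr << "can't replicate from myself\n";
      return std::make_error_code(std::errc::operation_not_supported);
    }
    master_replid_ = reply.master_replid;
    return {};
  }

 private:
  std::string id_;
  std::string master_replid_;
};

int main() {
  Replica r("abc123");
  std::error_code ec = r.HandleCapaResp(HandshakeReply{"abc123"});
  std::cout << (ec ? "handshake aborted" : "handshake ok") << "\n";
}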

Signed-off-by: Kostas Kyrimis <[email protected]>
Signed-off-by: Kostas Kyrimis <[email protected]>
@kostasrim kostasrim requested a review from romange November 24, 2025 17:58
@kostasrim kostasrim merged commit 3f09687 into main Nov 25, 2025
10 checks passed
@kostasrim kostasrim deleted the kpr32 branch November 25, 2025 08:01


Development

Successfully merging this pull request may close these issues.

Replication Errors After Updating to 1.35.0
