Commit 0253ae2

replicaset: make locating master faster
When the old master hangs, we first make a call to it to check whether it is still alive. After that, we contact every replica in the replicaset, the old master included, and wait for their responses. As a result, we end up calling the dead master twice, which delays master discovery by 3 * MASTER_SEARCH_TIMEOUT seconds: two timeouts in `locate_master()` while waiting for responses from the dead master, and one more before the master search fiber wakes up.

We can reduce the time required to discover the new master by skipping the dead old master when iterating over the replicas. This limits the delay on the router to at most 2 * MASTER_SEARCH_TIMEOUT per master search iteration, provided only one node is down.

Closes #549

NO_DOC=bugfix
1 parent 717b9e9 commit 0253ae2
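
The timeout accounting in the commit message can be checked with a toy model. This is only a sketch in plain Lua, not vshard code: the function name `timeouts_spent_on_dead_master`, the replica ids and the value `5` for `MASTER_SEARCH_TIMEOUT` are assumptions made for illustration, and the model covers only the single-hung-node case the message describes.

```lua
-- Hypothetical stand-in for vconst.MASTER_SEARCH_TIMEOUT; the real value
-- is defined by vshard and is not shown in this commit.
local MASTER_SEARCH_TIMEOUT = 5

-- Toy model: count how many full timeouts are lost to a hung old master
-- during one master search iteration. `skip_old_master` switches between
-- the behaviour before (false) and after (true) the patch.
local function timeouts_spent_on_dead_master(replicas, old_master_id,
                                             skip_old_master)
    -- The initial liveness call to the old master times out.
    local timeouts = 1
    for replica_id in pairs(replicas) do
        if replica_id == old_master_id and not skip_old_master then
            -- Before the patch the hung node was asked a second time.
            timeouts = timeouts + 1
        end
    end
    -- One more timeout passes before the master search fiber wakes up.
    return timeouts + 1
end

local replicas = {a = true, b = true, c = true}
local before = timeouts_spent_on_dead_master(replicas, 'a', false)
local after = timeouts_spent_on_dead_master(replicas, 'a', true)
print(before * MASTER_SEARCH_TIMEOUT, after * MASTER_SEARCH_TIMEOUT)
```

Under these assumptions the model gives three timeouts before the patch and two after it, matching the 3 * MASTER_SEARCH_TIMEOUT and 2 * MASTER_SEARCH_TIMEOUT figures above.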

2 files changed: +9 -0 lines changed

test/replicaset-luatest/replicaset_3_test.lua

Lines changed: 2 additions & 0 deletions
@@ -559,7 +559,9 @@ test_group.test_locate_master_when_old_master_hangs = function(g)
     g.replica_1_a:freeze()

     -- Replicaset is able to locate a new one.
+    local start_ts = fiber.clock()
     local is_done, is_nop = rs:locate_master()
+    t.assert_lt(fiber.clock() - start_ts, 2 * vconst.MASTER_SEARCH_TIMEOUT)
     t.assert(is_done)
     t.assert_not(is_nop)
     t.assert_equals(rs.master.uuid, g.replica_1_b:instance_uuid())

vshard/replicaset.lua

Lines changed: 7 additions & 0 deletions
@@ -1163,7 +1163,13 @@ local function replicaset_locate_master(replicaset)
     local deadline = fiber_clock() + timeout
     local async_opts = {is_async = true, timeout = timeout}
     local replicaset_id = replicaset.id
+    local old_master_id = replicaset.master and replicaset.master.id
     for replica_id, replica in pairs(replicaset.replicas) do
+        if replica_id == old_master_id then
+            -- No need to wait for master one more time, we have just tried to
+            -- check it and it didn't respond.
+            goto next_replica
+        end
         replicaset_connect_to_replica(replicaset, replica)
         ok, err = replica:check_is_connected()
         if ok then
@@ -1176,6 +1182,7 @@ local function replicaset_locate_master(replicaset)
         elseif err ~= nil then
             last_err = err
         end
+        ::next_replica::
     end
     local master_id
     for replica_id, f in pairs(futures) do
