Labels
bug, critical, customer, storage
Description
Currently, M.this_replicaset.master and M.is_master on the storages may become inconsistent after an instance stops being a master. Because of that, services on the new master cannot find the master: the old one reports itself as a non-master and as the master at the same time.
While in this state, the old master also doesn't respond to the router's ping correctly, since it sends the state of the upstream to itself, which is missing. Consequently, from the routers' point of view it stays in the unknown state.
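For illustration, here is a minimal sketch (not part of the reproducer below) of how the inconsistent state can be observed from the old master's console. It assumes the same internals the reproducer reaches through ivshard.storage: the vshard.storage.internal table and the 'info' service call.

```lua
-- Hypothetical check on the old master after it has been made read-only
-- and has already noticed that it is not a master anymore.
local vstorage = require('vshard.storage')
local internal = vstorage.internal

-- Simulate the router's ping the same way the reproducer does.
vstorage._call('info', {timeout = 10, with_health = true})

-- The instance itself already knows it is not a master...
assert(internal.is_master == false)
-- ...but its replicaset object still points at this very instance as the
-- master, which is the inconsistency described above.
assert(internal.this_replicaset.master ~= nil)
assert(internal.this_replicaset.master.id == box.info.uuid)
```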
However, the problem should resolve on its own after ~40 seconds (REPLICA_NOACTIVITY_TIMEOUT + MASTER_SEARCH_IDLE_INTERVAL * 2), since by then both the connection and the cached master will be dropped.
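For reference, a rough sketch of where the ~40 seconds comes from, assuming both constants are exposed in vshard.consts (the reproducer below refers to the first one as ivconst.REPLICA_NOACTIVITY_TIMEOUT; the recovery mechanics are only paraphrased here):

```lua
-- Back-of-the-envelope estimate of the self-recovery delay mentioned above:
-- the unused connection to the old master is dropped after
-- REPLICA_NOACTIVITY_TIMEOUT, and the master search may need up to two more
-- of its idle intervals to forget the cached master.
local consts = require('vshard.consts')
local recovery_delay = consts.REPLICA_NOACTIVITY_TIMEOUT +
                       consts.MASTER_SEARCH_IDLE_INTERVAL * 2
print(('expected self-recovery delay: ~%s seconds'):format(recovery_delay))
```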
Reproducer:
diff --git a/test/storage-luatest/auto_master_2_test.lua b/test/storage-luatest/auto_master_2_test.lua
index 1309fce..76da2c7 100644
--- a/test/storage-luatest/auto_master_2_test.lua
+++ b/test/storage-luatest/auto_master_2_test.lua
@@ -105,3 +105,57 @@ test_group.test_turn_off_and_on = function(g)
         ilt.assert(ivshard.storage.internal.this_replicaset.is_master_auto)
     end)
 end
+
+test_group.test_master_search_in_services = function(g)
+    vtest.cluster_rebalancer_enable(g)
+    g.replica_1_b:exec(function(uuid)
+        -- Simulate router's ping, which forces the node to search for master.
+        ivshard.storage._call('info', {timeout = 10, with_health = true})
+        local internal = ivshard.storage.internal
+        ilt.assert_not_equals(internal.this_replicaset.master, nil)
+        ilt.assert_equals(internal.this_replicaset.master.id, uuid)
+    end, {g.replica_1_a:instance_uuid()})
+
+    g.replica_1_a:exec(function()
+        local internal = ivshard.storage.internal
+        ilt.assert_not_equals(internal.this_replicaset.master, nil)
+        ilt.assert_equals(internal.this_replicaset.master.id, box.info.uuid)
+        box.cfg{read_only = true}
+        ilt.helpers.retrying({}, function()
+            internal.instance_watch_fiber:wakeup()
+            ilt.assert_equals(internal.is_master, false)
+        end)
+        -- The replica_1_a is in the inconsistent state: it is not a master,
+        -- but the replicaset thinks it is.
+        ilt.assert_not_equals(internal.this_replicaset.master, nil)
+        ilt.assert_equals(internal.this_replicaset.master.id, box.info.uuid)
+
+        -- Test that the router's ping doesn't fix the issue.
+        ivshard.storage._call('info', {timeout = 10, with_health = true})
+        local internal = ivshard.storage.internal
+        ilt.assert_not_equals(internal.this_replicaset.master, nil)
+        ilt.assert_equals(internal.this_replicaset.master.id, box.info.uuid)
+
+        -- But this problem persists only until the master is dropped.
+
+        -- ivconst.REPLICA_NOACTIVITY_TIMEOUT = 1
+        -- ilt.helpers.retrying({timeout = 10}, function()
+        --     ivshard.storage._call('info', {timeout = 10, with_health = true})
+        --     ilt.assert_equals(internal.this_replicaset.master, nil)
+        -- end)
+    end)
+
+    g.replica_1_b:exec(function()
+        local internal = ivshard.storage.internal
+        box.cfg{read_only = false}
+        ilt.helpers.retrying({}, function()
+            internal.instance_watch_fiber:wakeup()
+            ilt.assert_equals(internal.is_master, true)
+        end)
+        ilt.helpers.retrying({}, function()
+            internal.rebalancer_fiber:wakeup()
+            ilt.assert(false)
+        end)
+    end)
+    vtest.cluster_rebalancer_disable(g)
+end

Logs on the replica_1_b (its rebalancer is repeatedly woken up by the last retrying loop and warns about the old master):
2025-10-23 11:10:43.069 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.069 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.070 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.071 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.071 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.072 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.072 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.072 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.073 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master