Changing master with auto master may cause log spam and non-working internal services on the new master #604

@Serpentian

Description

Currently, M.this_replicaset.master and M.is_master on the storages may become inconsistent after an instance stops being a master. As a result, the services on the new master cannot find the master, since the old one reports itself as non-master and master at the same time.
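
A minimal sketch of the invariant that breaks here (a hypothetical helper, not vshard source code), assuming M is the storage's internal state table from the description: M.is_master and M.this_replicaset.master must agree on who the master is.

-- Hypothetical illustration only: the consistency that is violated after demotion.
local function master_state_is_consistent(M)
    local master = M.this_replicaset.master
    if M.is_master then
        -- The instance considers itself master, so the replicaset object
        -- must point at this very instance.
        return master ~= nil and master.id == box.info.uuid
    end
    -- The instance is not master, so the replicaset object must either have
    -- no known master or point at some other instance. In the state described
    -- above, master still points at self, so this check fails.
    return master == nil or master.id ~= box.info.uuid
end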

While in this state, the old master doesn't respond to the router's ping correctly: it reports the state of the upstream to itself, which is missing, and consequently appears to be in the unknown state from the routers' point of view.
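
A rough illustration of that ping failure (the function below is hypothetical, not the actual vshard code), assuming the health report is derived from box.info.replication: the stale master looks for an upstream to its own UUID, which never exists, so no status can be reported.

-- Hypothetical sketch: the upstream lookup comes up empty on the old master.
local function upstream_status_to(master_uuid)
    for _, r in pairs(box.info.replication) do
        -- An instance never has an upstream to itself, so when
        -- master_uuid == box.info.uuid this loop finds nothing.
        if r.uuid == master_uuid and r.upstream ~= nil then
            return r.upstream.status
        end
    end
    return nil -- the routers treat the missing status as 'unknown'
end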

However, the problem should resolve on its own after 40 seconds (REPLICA_NOACTIVITY_TIMEOUT + MASTER_SEARCH_IDLE_INTERVAL * 2), since by then the connection and the cached master are dropped.
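
For reference, the 40-second estimate adds up as follows, assuming defaults of REPLICA_NOACTIVITY_TIMEOUT = 30 and MASTER_SEARCH_IDLE_INTERVAL = 5 (values inferred only from the figure above, see vshard.consts for the real ones):

-- Hypothetical defaults, chosen only to match the estimate above.
local REPLICA_NOACTIVITY_TIMEOUT = 30
local MASTER_SEARCH_IDLE_INTERVAL = 5
-- The idle connection is dropped after REPLICA_NOACTIVITY_TIMEOUT, plus up to
-- two master search idle intervals before the cached master is re-resolved.
assert(REPLICA_NOACTIVITY_TIMEOUT + MASTER_SEARCH_IDLE_INTERVAL * 2 == 40)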

Reproducer:

diff --git a/test/storage-luatest/auto_master_2_test.lua b/test/storage-luatest/auto_master_2_test.lua
index 1309fce..76da2c7 100644
--- a/test/storage-luatest/auto_master_2_test.lua
+++ b/test/storage-luatest/auto_master_2_test.lua
@@ -105,3 +105,57 @@ test_group.test_turn_off_and_on = function(g)
         ilt.assert(ivshard.storage.internal.this_replicaset.is_master_auto)
     end)
 end
+
+test_group.test_master_search_in_services = function(g)
+    vtest.cluster_rebalancer_enable(g)
+    g.replica_1_b:exec(function(uuid)
+        -- Simulate router's ping, which forces the node to search for master.
+        ivshard.storage._call('info', {timeout = 10, with_health = true})
+        local internal = ivshard.storage.internal
+        ilt.assert_not_equals(internal.this_replicaset.master, nil)
+        ilt.assert_equals(internal.this_replicaset.master.id, uuid)
+    end, {g.replica_1_a:instance_uuid()})
+
+    g.replica_1_a:exec(function()
+        local internal = ivshard.storage.internal
+        ilt.assert_not_equals(internal.this_replicaset.master, nil)
+        ilt.assert_equals(internal.this_replicaset.master.id, box.info.uuid)
+        box.cfg{read_only = true}
+        ilt.helpers.retrying({}, function()
+            internal.instance_watch_fiber:wakeup()
+            ilt.assert_equals(internal.is_master, false)
+        end)
+        -- The replica_1_a is in an inconsistent state: it's not the master,
+        -- but the replicaset thinks it is.
+        ilt.assert_not_equals(internal.this_replicaset.master, nil)
+        ilt.assert_equals(internal.this_replicaset.master.id, box.info.uuid)
+
+        -- Test that the router's ping doesn't fix the issue.
+        ivshard.storage._call('info', {timeout = 10, with_health = true})
+        local internal = ivshard.storage.internal
+        ilt.assert_not_equals(internal.this_replicaset.master, nil)
+        ilt.assert_equals(internal.this_replicaset.master.id, box.info.uuid)
+
+        -- But this problem persists only until the master is dropped.
+
+        -- ivconst.REPLICA_NOACTIVITY_TIMEOUT = 1
+        -- ilt.helpers.retrying({timeout = 10}, function()
+        --     ivshard.storage._call('info', {timeout = 10, with_health = true})
+        --     ilt.assert_equals(internal.this_replicaset.master, nil)
+        -- end)
+    end)
+
+    g.replica_1_b:exec(function()
+        local internal = ivshard.storage.internal
+        box.cfg{read_only = false}
+        ilt.helpers.retrying({}, function()
+            internal.instance_watch_fiber:wakeup()
+            ilt.assert_equals(internal.is_master, true)
+        end)
+        ilt.helpers.retrying({}, function()
+            internal.rebalancer_fiber:wakeup()
+            ilt.assert(false)
+        end)
+    end)
+    vtest.cluster_rebalancer_disable(g)
+end

Logs on the replica_1_b:

2025-10-23 11:10:43.069 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.069 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.070 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.071 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.071 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.072 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.072 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.072 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master
2025-10-23 11:10:43.073 [46032] main/152/vshard.rebalancer/vshard.replicaset replicaset.lua:1144 W> Replica 00000000-0000-0000-0000-000000000001 in replicaset 00000000-0000-0000-0000-000000000003 reports self as both master and not master

Labels

bug (Something isn't working), critical (The issue is critical and should be fixed ASAP), customer, storage
