Fix MQTT test flake in Khepri mixed version mode (backport #13235) (backport #13238) #13240

mergify · 2025-02-11T21:19:21Z

The following test flaked in CI under Khepri in mixed version mode:

make -C deps/rabbitmq_mqtt ct-v5 t=cluster_size_3:will_delay_node_restart RABBITMQ_METADATA_STORE=khepri SECONDARY_DIST=rabbitmq_server-4.0.5 FULL=1

The first node took exactly 30 seconds for draining:

2025-02-10 15:00:09.550824+00:00 [debug] <0.1449.0> MQTT accepting TCP connection <0.1449.0> (127.0.0.1:33376 -> 127.0.0.1:27005)
2025-02-10 15:00:09.550992+00:00 [debug] <0.1449.0> Received a CONNECT, client ID: sub0, username: undefined, clean start: true, protocol version: 5, keepalive: 60, property names: ['Session-Expiry-Interval']
2025-02-10 15:00:09.551134+00:00 [debug] <0.1449.0> MQTT connection 127.0.0.1:33376 -> 127.0.0.1:27005 picked vhost using plugin_configuration_or_default_vhost
2025-02-10 15:00:09.551219+00:00 [debug] <0.1449.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal
2025-02-10 15:00:09.551530+00:00 [info] <0.1449.0> Accepted MQTT connection 127.0.0.1:33376 -> 127.0.0.1:27005 for client ID sub0
2025-02-10 15:00:09.551651+00:00 [debug] <0.1449.0> Received a SUBSCRIBE with subscription(s) [{mqtt_subscription,<<"my/topic">>,
2025-02-10 15:00:09.551651+00:00 [debug] <0.1449.0>                                             {mqtt_subscription_opts,0,false,
2025-02-10 15:00:09.551651+00:00 [debug] <0.1449.0>                                              false,0,undefined}}]
2025-02-10 15:00:09.556233+00:00 [debug] <0.896.0> RabbitMQ metadata store: follower leader cast - redirecting to {rabbitmq_metadata,'rmq-ct-mqtt-cluster_size_3-2-27054@localhost'}
2025-02-10 15:00:09.561518+00:00 [debug] <0.1456.0> MQTT accepting TCP connection <0.1456.0> (127.0.0.1:33390 -> 127.0.0.1:27005)
2025-02-10 15:00:09.561634+00:00 [debug] <0.1456.0> Received a CONNECT, client ID: will, username: undefined, clean start: true, protocol version: 5, keepalive: 60, property names: ['Session-Expiry-Interval']
2025-02-10 15:00:09.561715+00:00 [debug] <0.1456.0> MQTT connection 127.0.0.1:33390 -> 127.0.0.1:27005 picked vhost using plugin_configuration_or_default_vhost
2025-02-10 15:00:09.561828+00:00 [debug] <0.1456.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal
2025-02-10 15:00:09.562596+00:00 [info] <0.1456.0> Accepted MQTT connection 127.0.0.1:33390 -> 127.0.0.1:27005 for client ID will
2025-02-10 15:00:09.565743+00:00 [warning] <0.1460.0> This node is being put into maintenance (drain) mode
2025-02-10 15:00:09.565833+00:00 [debug] <0.1460.0> Marking the node as undergoing maintenance
2025-02-10 15:00:09.570772+00:00 [info] <0.1460.0> Marked this node as undergoing maintenance
2025-02-10 15:00:09.570904+00:00 [info] <0.1460.0> Asked to suspend 9 client connection listeners. No new client connections will be accepted until these listeners are resumed!
2025-02-10 15:00:09.572268+00:00 [warning] <0.1460.0> Suspended all listeners and will no longer accept client connections
2025-02-10 15:00:09.572317+00:00 [warning] <0.1460.0> Closed 0 local client connections
2025-02-10 15:00:09.572418+00:00 [warning] <0.1449.0> MQTT disconnecting client <<"127.0.0.1:33376 -> 127.0.0.1:27005">> with client ID 'sub0', reason: maintenance
2025-02-10 15:00:09.572414+00:00 [warning] <0.1000.0> Closed 2 local (Web) MQTT client connections
2025-02-10 15:00:09.572499+00:00 [warning] <0.1456.0> MQTT disconnecting client <<"127.0.0.1:33390 -> 127.0.0.1:27005">> with client ID 'will', reason: maintenance
2025-02-10 15:00:09.572866+00:00 [alert] <0.1000.0> Closed 0 local STOMP client connections
2025-02-10 15:00:09.577432+00:00 [debug] <0.1456.0> scheduled delayed Will Message to topic my/topic for MQTT client ID will to be sent in 10000 ms
2025-02-10 15:00:12.991328+00:00 [debug] <0.1469.0> Will reconcile virtual host processes on all cluster members...
2025-02-10 15:00:12.991443+00:00 [debug] <0.1469.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes
2025-02-10 15:00:12.992497+00:00 [debug] <0.1469.0> Done with virtual host processes reconciliation (run 3)
2025-02-10 15:00:16.511733+00:00 [debug] <0.1476.0> Will reconcile virtual host processes on all cluster members...
2025-02-10 15:00:16.511864+00:00 [debug] <0.1476.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes
2025-02-10 15:00:16.514293+00:00 [debug] <0.1476.0> Done with virtual host processes reconciliation (run 4)
2025-02-10 15:00:24.897477+00:00 [debug] <0.1479.0> Will reconcile virtual host processes on all cluster members...
2025-02-10 15:00:24.897607+00:00 [debug] <0.1479.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes
2025-02-10 15:00:24.898483+00:00 [debug] <0.1479.0> Done with virtual host processes reconciliation (run 5)
2025-02-10 15:00:24.898527+00:00 [debug] <0.1479.0> Will reschedule virtual host process reconciliation after 30 seconds
2025-02-10 15:00:32.994347+00:00 [debug] <0.1484.0> Will reconcile virtual host processes on all cluster members...
2025-02-10 15:00:32.994474+00:00 [debug] <0.1484.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes
2025-02-10 15:00:32.996539+00:00 [debug] <0.1484.0> Done with virtual host processes reconciliation (run 6)
2025-02-10 15:00:32.996585+00:00 [debug] <0.1484.0> Will reschedule virtual host process reconciliation after 30 seconds
2025-02-10 15:00:39.576325+00:00 [info] <0.1460.0> Will transfer leadership of 0 quorum queues with current leader on this node
2025-02-10 15:00:39.576456+00:00 [info] <0.1460.0> Leadership transfer for quorum queues hosted on this node has been initiated
2025-02-10 15:00:39.576948+00:00 [info] <0.1460.0> Will stop local follower replicas of 0 quorum queues on this node
2025-02-10 15:00:39.576990+00:00 [info] <0.1460.0> Stopped all local replicas of quorum queues hosted on this node
2025-02-10 15:00:39.577120+00:00 [info] <0.1460.0> Will transfer leadership of metadata store with current leader on this node
2025-02-10 15:00:39.577282+00:00 [info] <0.1460.0> Khepri clustering: transferring leadership to node 'rmq-ct-mqtt-cluster_size_3-2-27054@localhost'
2025-02-10 15:00:39.577424+00:00 [info] <0.1460.0> Khepri clustering: skipping leadership transfer, leader is already in node 'rmq-ct-mqtt-cluster_size_3-2-27054@localhost'
2025-02-10 15:00:39.577547+00:00 [info] <0.1460.0> Leadership transfer for metadata store on this node has been done. The new leader is 'rmq-ct-mqtt-cluster_size_3-2-27054@localhost'
2025-02-10 15:00:39.577674+00:00 [info] <0.1460.0> Node is ready to be shut down for maintenance or upgrade
2025-02-10 15:00:39.595638+00:00 [notice] <0.64.0> SIGTERM received - shutting down
2025-02-10 15:00:39.595638+00:00 [notice] <0.64.0>
2025-02-10 15:00:39.595758+00:00 [debug] <0.44.0> Running rabbit_prelaunch:shutdown_func() as part of `kernel` shutdown

Running the same test locally revealed that rabbit_maintenance:status_consistent_read/1 takes exactly 30 seconds to complete.

The test case assumes a Will Delay higher than the time it takes to drain and shut down the node. Hence, this commit increases the Will Delay time from 10 seconds to 40 seconds.

This is an automatic backport of pull request #13235 done by Mergify.

This is an automatic backport of pull request #13238 done by Mergify.

The following test flaked in CI under Khepri in mixed version mode: ``` make -C deps/rabbitmq_mqtt ct-v5 t=cluster_size_3:will_delay_node_restart RABBITMQ_METADATA_STORE=khepri SECONDARY_DIST=rabbitmq_server-4.0.5 FULL=1 ``` The first node took exactly 30 seconds for draining: ``` 2025-02-10 15:00:09.550824+00:00 [debug] <0.1449.0> MQTT accepting TCP connection <0.1449.0> (127.0.0.1:33376 -> 127.0.0.1:27005) 2025-02-10 15:00:09.550992+00:00 [debug] <0.1449.0> Received a CONNECT, client ID: sub0, username: undefined, clean start: true, protocol version: 5, keepalive: 60, property names: ['Session-Expiry-Interval'] 2025-02-10 15:00:09.551134+00:00 [debug] <0.1449.0> MQTT connection 127.0.0.1:33376 -> 127.0.0.1:27005 picked vhost using plugin_configuration_or_default_vhost 2025-02-10 15:00:09.551219+00:00 [debug] <0.1449.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal 2025-02-10 15:00:09.551530+00:00 [info] <0.1449.0> Accepted MQTT connection 127.0.0.1:33376 -> 127.0.0.1:27005 for client ID sub0 2025-02-10 15:00:09.551651+00:00 [debug] <0.1449.0> Received a SUBSCRIBE with subscription(s) [{mqtt_subscription,<<"my/topic">>, 2025-02-10 15:00:09.551651+00:00 [debug] <0.1449.0> {mqtt_subscription_opts,0,false, 2025-02-10 15:00:09.551651+00:00 [debug] <0.1449.0> false,0,undefined}}] 2025-02-10 15:00:09.556233+00:00 [debug] <0.896.0> RabbitMQ metadata store: follower leader cast - redirecting to {rabbitmq_metadata,'rmq-ct-mqtt-cluster_size_3-2-27054@localhost'} 2025-02-10 15:00:09.561518+00:00 [debug] <0.1456.0> MQTT accepting TCP connection <0.1456.0> (127.0.0.1:33390 -> 127.0.0.1:27005) 2025-02-10 15:00:09.561634+00:00 [debug] <0.1456.0> Received a CONNECT, client ID: will, username: undefined, clean start: true, protocol version: 5, keepalive: 60, property names: ['Session-Expiry-Interval'] 2025-02-10 15:00:09.561715+00:00 [debug] <0.1456.0> MQTT connection 127.0.0.1:33390 -> 127.0.0.1:27005 picked vhost using plugin_configuration_or_default_vhost 2025-02-10 15:00:09.561828+00:00 [debug] <0.1456.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal 2025-02-10 15:00:09.562596+00:00 [info] <0.1456.0> Accepted MQTT connection 127.0.0.1:33390 -> 127.0.0.1:27005 for client ID will 2025-02-10 15:00:09.565743+00:00 [warning] <0.1460.0> This node is being put into maintenance (drain) mode 2025-02-10 15:00:09.565833+00:00 [debug] <0.1460.0> Marking the node as undergoing maintenance 2025-02-10 15:00:09.570772+00:00 [info] <0.1460.0> Marked this node as undergoing maintenance 2025-02-10 15:00:09.570904+00:00 [info] <0.1460.0> Asked to suspend 9 client connection listeners. No new client connections will be accepted until these listeners are resumed! 2025-02-10 15:00:09.572268+00:00 [warning] <0.1460.0> Suspended all listeners and will no longer accept client connections 2025-02-10 15:00:09.572317+00:00 [warning] <0.1460.0> Closed 0 local client connections 2025-02-10 15:00:09.572418+00:00 [warning] <0.1449.0> MQTT disconnecting client <<"127.0.0.1:33376 -> 127.0.0.1:27005">> with client ID 'sub0', reason: maintenance 2025-02-10 15:00:09.572414+00:00 [warning] <0.1000.0> Closed 2 local (Web) MQTT client connections 2025-02-10 15:00:09.572499+00:00 [warning] <0.1456.0> MQTT disconnecting client <<"127.0.0.1:33390 -> 127.0.0.1:27005">> with client ID 'will', reason: maintenance 2025-02-10 15:00:09.572866+00:00 [alert] <0.1000.0> Closed 0 local STOMP client connections 2025-02-10 15:00:09.577432+00:00 [debug] <0.1456.0> scheduled delayed Will Message to topic my/topic for MQTT client ID will to be sent in 10000 ms 2025-02-10 15:00:12.991328+00:00 [debug] <0.1469.0> Will reconcile virtual host processes on all cluster members... 2025-02-10 15:00:12.991443+00:00 [debug] <0.1469.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes 2025-02-10 15:00:12.992497+00:00 [debug] <0.1469.0> Done with virtual host processes reconciliation (run 3) 2025-02-10 15:00:16.511733+00:00 [debug] <0.1476.0> Will reconcile virtual host processes on all cluster members... 2025-02-10 15:00:16.511864+00:00 [debug] <0.1476.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes 2025-02-10 15:00:16.514293+00:00 [debug] <0.1476.0> Done with virtual host processes reconciliation (run 4) 2025-02-10 15:00:24.897477+00:00 [debug] <0.1479.0> Will reconcile virtual host processes on all cluster members... 2025-02-10 15:00:24.897607+00:00 [debug] <0.1479.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes 2025-02-10 15:00:24.898483+00:00 [debug] <0.1479.0> Done with virtual host processes reconciliation (run 5) 2025-02-10 15:00:24.898527+00:00 [debug] <0.1479.0> Will reschedule virtual host process reconciliation after 30 seconds 2025-02-10 15:00:32.994347+00:00 [debug] <0.1484.0> Will reconcile virtual host processes on all cluster members... 2025-02-10 15:00:32.994474+00:00 [debug] <0.1484.0> Will make sure that processes of 1 virtual hosts are running on all reachable cluster nodes 2025-02-10 15:00:32.996539+00:00 [debug] <0.1484.0> Done with virtual host processes reconciliation (run 6) 2025-02-10 15:00:32.996585+00:00 [debug] <0.1484.0> Will reschedule virtual host process reconciliation after 30 seconds 2025-02-10 15:00:39.576325+00:00 [info] <0.1460.0> Will transfer leadership of 0 quorum queues with current leader on this node 2025-02-10 15:00:39.576456+00:00 [info] <0.1460.0> Leadership transfer for quorum queues hosted on this node has been initiated 2025-02-10 15:00:39.576948+00:00 [info] <0.1460.0> Will stop local follower replicas of 0 quorum queues on this node 2025-02-10 15:00:39.576990+00:00 [info] <0.1460.0> Stopped all local replicas of quorum queues hosted on this node 2025-02-10 15:00:39.577120+00:00 [info] <0.1460.0> Will transfer leadership of metadata store with current leader on this node 2025-02-10 15:00:39.577282+00:00 [info] <0.1460.0> Khepri clustering: transferring leadership to node 'rmq-ct-mqtt-cluster_size_3-2-27054@localhost' 2025-02-10 15:00:39.577424+00:00 [info] <0.1460.0> Khepri clustering: skipping leadership transfer, leader is already in node 'rmq-ct-mqtt-cluster_size_3-2-27054@localhost' 2025-02-10 15:00:39.577547+00:00 [info] <0.1460.0> Leadership transfer for metadata store on this node has been done. The new leader is 'rmq-ct-mqtt-cluster_size_3-2-27054@localhost' 2025-02-10 15:00:39.577674+00:00 [info] <0.1460.0> Node is ready to be shut down for maintenance or upgrade 2025-02-10 15:00:39.595638+00:00 [notice] <0.64.0> SIGTERM received - shutting down 2025-02-10 15:00:39.595638+00:00 [notice] <0.64.0> 2025-02-10 15:00:39.595758+00:00 [debug] <0.44.0> Running rabbit_prelaunch:shutdown_func() as part of `kernel` shutdown ``` Running the same test locally revealed that [rabbit_maintenance:status_consistent_read/1](https://github.com/rabbitmq/rabbitmq-server/blob/55ae91809433d9e6edfcc98563bcb2f0736ee79e/deps/rabbit/src/rabbit_maintenance.erl#L131) takes exactly 30 seconds to complete. The test case assumes a Will Delay higher than the time it takes to drain and shut down the node. Hence, this commit increases the Will Delay time from 10 seconds to 40 seconds. (cherry picked from commit 38cba9d) (cherry picked from commit fa8669f)

michaelklishin added this to the 4.0.7 milestone Feb 11, 2025

michaelklishin merged commit c11fd97 into v4.0.x Feb 11, 2025
1 check was pending

michaelklishin deleted the mergify/bp/v4.0.x/pr-13238 branch February 11, 2025 21:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix MQTT test flake in Khepri mixed version mode (backport #13235) (backport #13238) #13240

Fix MQTT test flake in Khepri mixed version mode (backport #13235) (backport #13238) #13240

mergify bot commented Feb 11, 2025

Fix MQTT test flake in Khepri mixed version mode (backport #13235) (backport #13238) #13240

Fix MQTT test flake in Khepri mixed version mode (backport #13235) (backport #13238) #13240

Conversation

mergify bot commented Feb 11, 2025