Allow a previously reset node to rejoin its original cluster #13643

SimonUnge · 2025-03-27T21:59:53Z

If a cluster member gets its local state wiped for whatever reason, it has a hard time rejoining the cluster: the old cluster members still consider the node a member and reject the join request (when Mnesia is used).

Proposed Changes

Mnesia: On failure due to 'already a member', ask to leave the cluster first and retry.
Khepri: no-op. Khepri is less strict already, and rabbit_khepri:can_join would accept a join request from a node that is already a member
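
As a rough illustration of the Mnesia-side idea (not the actual RabbitMQ code, which is Erlang), the leave-and-retry flow can be modelled in a few lines of Python. The names Cluster, AlreadyMember and join_with_leave_retry are hypothetical and exist only for this sketch; the cluster is reduced to a set of node names.

```python
class AlreadyMember(Exception):
    """Raised when the cluster still lists the joining node as a member."""


class Cluster:
    """Toy stand-in for the membership view held by the remaining nodes."""

    def __init__(self, members):
        self.members = set(members)

    def join(self, node):
        # The remaining members reject a node they already know about.
        if node in self.members:
            raise AlreadyMember(node)
        self.members.add(node)

    def forget(self, node):
        # Models asking the cluster to drop the stale membership record.
        self.members.discard(node)


def join_with_leave_retry(cluster, node):
    """Join; on 'already a member', leave first and retry exactly once."""
    try:
        cluster.join(node)
        return "joined"
    except AlreadyMember:
        cluster.forget(node)
        cluster.join(node)
        return "rejoined"
```

A wiped node that the cluster still remembers takes the second path, while a genuinely new node joins on the first attempt.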

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

Bug fix (non-breaking change which fixes issue #NNNN)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause an observable behavior change in existing systems)
Documentation improvements (corrections, new content, etc)
Cosmetic change (whitespace, formatting, etc)
Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

I have read the CONTRIBUTING.md document
I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
I have added tests that prove my fix is effective or that my feature works
All tests pass locally with my changes
If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

I would like early feedback here on whether this naive approach is even OK, whether there should be a limited number of retries, and whether the logic should live in rabbit_mnesia or in rabbit_db_cluster.
It feels a bit wonky that a function called can_join_cluster would also try to leave a cluster and try again, so perhaps it would be better if rabbit_db_cluster:join instead initiates the leave and retry request?
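
To make the two open questions concrete, here is a small Python sketch of the alternative layout: a side-effect-free check plus a join function that owns the leave-and-retry step with a bounded retry count. It is only a model under stated assumptions. The function names mirror the ones discussed above, but the bodies, the return shapes and the max_retries parameter are hypothetical and are not taken from rabbit_db_cluster.

```python
def can_join(members, node):
    """Pure check: report the obstacle instead of trying to fix it."""
    if node in members:
        return ("error", "already_member")
    return ("ok", sorted(members))


def join(members, node, max_retries=1):
    """Join, doing the leave-and-retry here rather than inside can_join."""
    attempts = 0
    while True:
        status, detail = can_join(members, node)
        if status == "ok":
            members.add(node)
            return ("ok", attempts)
        if attempts >= max_retries:
            # Give up and surface the reason reported by can_join.
            return ("error", detail)
        # Hypothetical "leave" request sent to the existing cluster.
        members.discard(node)
        attempts += 1
```

With this split, can_join never mutates membership, and the retry limit lives in one place.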

michaelklishin · 2025-03-27T22:30:09Z

It's a reasonable idea but I think that rabbit_db_cluster:join/2 (or its Mnesia-specific codepath) does seem like a better place. Plus there is rabbit_db_cluster:can_join/1 that might need a comment.

SimonUnge · 2025-03-27T23:02:38Z

Yeah, I agree the current position is a bit odd. I'll move it, with comments!

michaelklishin

rabbit_db_cluster now fails gmake dialyze:

rabbit_db_cluster.erl:239:9: The pattern 
          Error = {error, _} can never match since previous clauses completely covered the type 
          {'error', {'inconsistent_cluster', string()}} |
          {'ok', 'already_member' | [atom()]}

lukebakken · 2025-03-31T18:57:45Z

Works on my machine! Test process:

cd rabbitmq-server
git checkout main
make PLUGINS='rabbitmq_management' NODES=3 start-cluster

Note that cluster is running.

./sbin/rabbitmqctl -n rabbit-2@shostakovich shutdown
rm -rf /tmp/rabbitmq-test-instances/rabbit-2@shostakovich/
make PLUGINS='rabbitmq_management' NODES=3 start-cluster

Note that rabbit-2 has not joined the cluster, and that the "thinks that it's a cluster member, but node... disagrees" message is in the logs. rabbit-2 eventually starts standalone.

./sbin/rabbitmqctl -n rabbit-2@shostakovich shutdown
rm -rf /tmp/rabbitmq-test-instances/rabbit-2@shostakovich/
git checkout su_aws/try_to_leave_cluster_before_joining
make FULL=1
make PLUGINS='rabbitmq_management' NODES=3 start-cluster

Note that rabbit-2 has successfully joined the cluster.

SimonUnge · 2025-03-31T19:07:55Z

Looking into failing tests.

SimonUnge · 2025-03-31T19:11:39Z

Just trying to figure out if I messed up some test results that expect this to fail...

kjnilsson · 2025-04-01T15:37:26Z

Khepri: no-op. Khepri is less strict already, and rabbit_khepri:can_join would accept a join request from a node that is already a member

In this case, Khepri would first remove itself and then join. This ensures it rejoins as a new member.

Anyhow, it makes sense that Mnesia would also perform a similar set of steps.

michaelklishin · 2025-04-01T17:05:35Z

The Selenium suite failure is due to an npm dependency installation failure, not anything in this PR.

michaelklishin changed the title ~~Allow confused node to rejoin cluster.~~ Allow a previously reset node to rejoin its original cluster Mar 27, 2025

michaelklishin marked this pull request as draft March 27, 2025 23:20

Mnesia: Ask to leave a cluster and retry to join if cluster already consider node a member. Khepri: no-op. Khepri is less strict already, and rabbit_khepri:can_join would accept a join request from a node that is already a member

dd49cbe

SimonUnge force-pushed the su_aws/try_to_leave_cluster_before_joining branch from 96332e1 to dd49cbe Compare March 28, 2025 21:24

michaelklishin requested changes Mar 29, 2025

SimonUnge added 2 commits March 31, 2025 17:52

Fix dialyzer issue.

9ba545c

Return the exception

e1f2865

SimonUnge added 2 commits March 31, 2025 21:16

Dont handle the exception just let it out there

cdeabe2

Update spec, noconnection is also a possible error

36eb6ca

lukebakken assigned lukebakken and SimonUnge Mar 31, 2025

michaelklishin marked this pull request as ready for review April 1, 2025 01:22

michaelklishin reviewed Apr 1, 2025

deps/rabbit/src/rabbit_mnesia.erl (outdated, resolved)

michaelklishin self-requested a review April 1, 2025 16:04

Naming #13643

e6bc6a4

michaelklishin added backport-v4.0.x backport-v4.1.x labels Apr 1, 2025

michaelklishin added this to the 4.1.0 milestone Apr 1, 2025

michaelklishin approved these changes Apr 1, 2025

michaelklishin merged commit e83c286 into main Apr 1, 2025
272 of 273 checks passed

michaelklishin deleted the su_aws/try_to_leave_cluster_before_joining branch April 1, 2025 17:20

mergify bot pushed a commit that referenced this pull request Apr 1, 2025

Naming #13643

b0eaa57

(cherry picked from commit e6bc6a4)

mergify bot mentioned this pull request Apr 1, 2025

Allow a previously reset node to rejoin its original cluster (backport #13643) #13667

Merged

michaelklishin added a commit that referenced this pull request Apr 1, 2025

Merge pull request #13667 from rabbitmq/mergify/bp/v4.1.x/pr-13643

2a34a6f

Allow a previously reset node to rejoin its original cluster (backport #13643)

mergify bot pushed a commit that referenced this pull request Apr 1, 2025

Naming #13643

6d45ee8

(cherry picked from commit e6bc6a4) (cherry picked from commit b0eaa57)

mergify bot mentioned this pull request Apr 1, 2025

Allow a previously reset node to rejoin its original cluster (backport #13643) (backport #13667) #13669

Merged

michaelklishin added a commit that referenced this pull request Apr 1, 2025

Merge pull request #13669 from rabbitmq/mergify/bp/v4.0.x/pr-13667

8e998c4

Allow a previously reset node to rejoin its original cluster (backport #13643) (backport #13667)
