Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow a previously reset node to rejoin its original cluster #13643

Merged
merged 6 commits into from
Apr 1, 2025

Conversation

SimonUnge
Copy link
Member

If a cluster member for whatever reason gets its local state wiped, it has a hard time re-joining the cluster, as the old cluster members will think the node is already a member and reject the request (if mnesia is used).

Proposed Changes

Mnesia: On failure due to 'already a member', ask to leave the cluster first and retry.
Khepri: no-op. Khepri is less strict already, and rabbit_khepri:can_join would accept a join request from a node that is already a member

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

I would like early feedback here, as to if this naive approach is even OK, if there should be a limited set of retries, and if the logic should live in rabbit_mnesia or in rabbit_db_cluster?
It feels a bit wonky that a function called can_join_cluster would also try to leave a cluster and try again, so perhaps it would be better if rabbit_db_cluster:join instead initiates the leave and retry request?

@michaelklishin michaelklishin changed the title Allow confused node to rejoin cluster. Allow a previously reset node to rejoin its original cluster Mar 27, 2025
@michaelklishin
Copy link
Member

michaelklishin commented Mar 27, 2025

It's a reasonable idea but I think that rabbit_db_cluster:join/2 (or its Mnesia-specific codepath) does seem like a better place. Plus there is rabbit_db_cluster:can_join/1 that might need a comment.

@SimonUnge
Copy link
Member Author

Yeah, I agree the current position is a but odd. I'll move it, with comments!

@michaelklishin michaelklishin marked this pull request as draft March 27, 2025 23:20
…onsider node a member.

Khepri: no-op. Khepri is less strict already, and rabbit_khepri:can_join would accept a join request from a node that is already a member
@SimonUnge SimonUnge force-pushed the su_aws/try_to_leave_cluster_before_joining branch from 96332e1 to dd49cbe Compare March 28, 2025 21:24
Copy link
Member

@michaelklishin michaelklishin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rabbit_db_cluster now fails gmake dialyze:

rabbit_db_cluster.erl:239:9: The pattern 
          Error = {error, _} can never match since previous clauses completely covered the type 
          {'error', {'inconsistent_cluster', string()}} |
          {'ok', 'already_member' | [atom()]}

@lukebakken
Copy link
Collaborator

lukebakken commented Mar 31, 2025

Works on my machine! Test process:

cd rabbitmq-server
git checkout main
make PLUGINS='rabbitmq_management' NODES=3 start-cluster

Note that cluster is running.

./sbin/rabbitmqctl -n rabbit-2@shostakovich shutdown
rm -rf /tmp/rabbitmq-test-instances/rabbit-2@shostakovich/
make PLUGINS='rabbitmq_management' NODES=3 start-cluster

Note that rabbit-2 is not jointed to the cluster, and that the "thinks that it's a cluster member, but node..." disagrees is in the logs. rabbit-2 starts standalone, eventually.

./sbin/rabbitmqctl -n rabbit-2@shostakovich shutdown
rm -rf /tmp/rabbitmq-test-instances/rabbit-2@shostakovich/
git checkout su_aws/try_to_leave_cluster_before_joining
make FULL=1
make PLUGINS='rabbitmq_management' NODES=3 start-cluster

Note that rabbit-2 has successfully joined the cluster.

@SimonUnge
Copy link
Member Author

Looking into failing tests.

@SimonUnge
Copy link
Member Author

Works on my machine! Test process:

cd rabbitmq-server
git checkout main
make PLUGINS='rabbitmq_management' NODES=3 start-cluster

Note that cluster is running.

./sbin/rabbitmqctl -n rabbit-2@shostakovich shutdown
rm -rf /tmp/rabbitmq-test-instances/rabbit-2@shostakovich/
make PLUGINS='rabbitmq_management' NODES=3 start-cluster

Note that rabbit-2 is not jointed to the cluster, and that the "thinks that it's a cluster member, but node..." disagrees is in the logs. rabbit-2 starts standalone, eventually.

./sbin/rabbitmqctl -n rabbit-2@shostakovich shutdown
rm -rf /tmp/rabbitmq-test-instances/rabbit-2@shostakovich/
git checkout su_aws/try_to_leave_cluster_before_joining
make FULL=1
make PLUGINS='rabbitmq_management' NODES=3 start-cluster

Note that rabbit-2 has successfully joined the cluster.

Just trying to figure out if I messed up some test results that expect this to fail...

@michaelklishin michaelklishin marked this pull request as ready for review April 1, 2025 01:22
@kjnilsson
Copy link
Contributor

kjnilsson commented Apr 1, 2025

Khepri: no-op. Khepri is less strict already, and rabbit_khepri:can_join would accept a join request from a node that is already a member

in this case khepri would first remove itself then join. This ensures it rejoins as a new member.

Anyhow it makes sense that mnesia would also perform a similar set of steps.

@michaelklishin michaelklishin self-requested a review April 1, 2025 16:04
@michaelklishin
Copy link
Member

The Selenium suite failure is due to an npm dependency installation failure, not anything in this PR.

@michaelklishin michaelklishin merged commit e83c286 into main Apr 1, 2025
272 of 273 checks passed
@michaelklishin michaelklishin deleted the su_aws/try_to_leave_cluster_before_joining branch April 1, 2025 17:20
mergify bot pushed a commit that referenced this pull request Apr 1, 2025
(cherry picked from commit e6bc6a4)
michaelklishin added a commit that referenced this pull request Apr 1, 2025
Allow a previously reset node to rejoin its original cluster (backport #13643)
mergify bot pushed a commit that referenced this pull request Apr 1, 2025
(cherry picked from commit e6bc6a4)
(cherry picked from commit b0eaa57)
michaelklishin added a commit that referenced this pull request Apr 1, 2025
Allow a previously reset node to rejoin its original cluster (backport #13643) (backport #13667)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants