Conversation

jonathanCaamano
Contributor

@jonathanCaamano jonathanCaamano commented Jun 26, 2025

This closes #1876

As discussed with @Zerpet and @mkuratczyk in the issue, this adds logic to allow scaling RabbitMQ to zero replicas.

It also adds logic to prevent a scale-down when opting back out of zero.

A new annotation, rabbitmq.com/before-zero-replicas-configured, stores the replica count that was configured before RabbitMQ was scaled to zero.

With this annotation we verify that the desired replica count after the zero state is equal to or greater than the replica count before the zero state. If the verification fails, the request is treated like a scale-down (see the sketch below).
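
A minimal sketch of that verification, assuming hypothetical names (the PR's actual logic lives in controllers/reconcile_scale_zero.go):

// Illustrative sketch only; the function and constant names are hypothetical.
package controllers

import "strconv"

const beforeZeroReplicasAnnotation = "rabbitmq.com/before-zero-replicas-configured"

// scaleFromZeroIsScaleDown reports whether scaling back up from zero to
// desired replicas would effectively be a scale-down, i.e. desired is lower
// than the replica count saved before the cluster was scaled to zero.
func scaleFromZeroIsScaleDown(annotations map[string]string, desired int32) (bool, error) {
	raw, ok := annotations[beforeZeroReplicasAnnotation]
	if !ok {
		// The cluster was never scaled to zero; nothing to compare against.
		return false, nil
	}
	before, err := strconv.ParseInt(raw, 10, 32)
	if err != nil {
		return false, err
	}
	return desired < int32(before), nil
}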

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

Additional Context

Local Testing

Please ensure you run the unit, integration and system tests before approving the PR.

To run the unit and integration tests:

$ make unit-tests integration-tests

You will need to target a k8s cluster and have the operator deployed for running the system tests.

For example, for a Kubernetes context named dev-bunny:

$ kubectx dev-bunny
$ make destroy deploy-dev
# wait for operator to be deployed
$ make system-tests

@mkuratczyk
Contributor

Thanks for the PR. Just FYI, I will certainly test this soon, but I need to finish some other things first.

@mkuratczyk
Contributor

Some initial feedback:

  1. ALLREPLICASREADY shows "true" when all replicas are stopped
# deploy a cluster, set replicas to 0, and then get the cluster:
> kubectl get rmq
NAME   ALLREPLICASREADY   RECONCILESUCCESS   AGE
rmq    True               True               13m

I think it should be set to False when scaled to 0 (a minimal sketch of this rule follows after this list).

  2. Attempting to scale up from zero to a lower number of replicas than before scaling to zero leads to an error:
2025-07-08T17:33:30+02:00	ERROR	Cluster Scale down not supported; tried to scale cluster from 3 nodes to 1 nodes	{"controller": "rabbitmqcluster", "controllerGroup": "rabbitmq.com", "controllerKind": "RabbitmqCluster", "RabbitmqCluster": {"name":"rmq","namespace":"default"}, "namespace": "default", "name": "rmq", "reconcileID": "338516a2-8aeb-447e-97fd-92e1774ae64d", "error": "UnsupportedOperation"}
github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).recordEventsAndSetCondition
	/Users/mkuratczyk/workspace/cluster-operator/controllers/reconcile_scale_zero.go:90
github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).scaleDownFromZero
	/Users/mkuratczyk/workspace/cluster-operator/controllers/reconcile_scale_zero.go:57
github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).Reconcile
	/Users/mkuratczyk/workspace/cluster-operator/controllers/rabbitmqcluster_controller.go:216
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
	/Users/mkuratczyk/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
	/Users/mkuratczyk/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:340
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
	/Users/mkuratczyk/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1
	/Users/mkuratczyk/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202

(as we discussed, this is not expected to work, but the stack trace shouldn't be there unless there's a good reason for it)

  3. Attempting to scale from zero up to a number of replicas higher than before scaling down to zero works, which surprised me.
    Steps: deploy a 3-node cluster, set replicas to 0, then set replicas to 5. I don't see any reason for this to cause problems on 4.1+ thanks to the new peer discovery mechanism, but I guess it could cause issues with older RabbitMQ versions. Not sure what to do about this one yet. Perhaps we should keep it as is and just warn that using it with older RabbitMQ versions is risky.
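
A minimal sketch of the condition rule from point 1, with a simplified signature (the operator's real status handling lives in its own status package; this is not its actual API):

package status

import corev1 "k8s.io/api/core/v1"

// allReplicasReady is illustrative only: a cluster deliberately scaled to
// zero has no ready replicas, so AllReplicasReady should read False rather
// than vacuously True.
func allReplicasReady(specReplicas, readyReplicas int32) corev1.ConditionStatus {
	if specReplicas == 0 {
		return corev1.ConditionFalse
	}
	if readyReplicas == specReplicas {
		return corev1.ConditionTrue
	}
	return corev1.ConditionFalse
}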

@jonathanCaamano
Contributor Author

Hello @mkuratczyk,

  1. Sure, I'll make some changes so that ALLREPLICASREADY is set to false.

  2. Here I follow the same flow as a regular scale-down, as already defined in the code (if the previously configured replica count is 3 and you now try to set 1, that represents a scale-down). So, if you'd like, I can change it and remove the stack trace.

  3. About the RabbitMQ versions: the version in my cluster is 3.13 and it's working properly. Maybe you see something I don't?

Thank you for the feedback

@mkuratczyk
Contributor

  1. If there's a stack trace when a scale-down is attempted (without scaling down to zero), then I think ideally we should just fix that for both cases. Alternatively, you can ignore it and we can deal with this separately.

  2. I'm not saying it will never work, more that it could lead to random problems. Say we have 1 node, scale to zero and then scale to 3. What if the two new nodes start first for some reason? I think they could form a new cluster, at least in some cases. With 4.1+, that should not happen, since all nodes will wait for the node/pod with -0 suffix:
    https://www.rabbitmq.com/blog/2025/04/04/new-k8s-peer-discovery

@jonathanCaamano
Contributor Author

jonathanCaamano commented Jul 14, 2025

Hello @mkuratczyk!

I made some changes:

1. ALLREPLICASREADY is now false when the cluster is scaled to zero.
2. I tried to change this, but it needs further analysis and possibly a change to how the global logger works.
3. We changed the approach: when you scale RabbitMQ back up from zero, the replica count has to match the count from before the zero state; if you want to scale up further, you first have to restore the previously configured replicas. This avoids the problems you mentioned and always respects the annotation (see the sketch below).
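
A minimal sketch of the restore-only rule from point 3, again with hypothetical names; note that it tightens the earlier "equal or greater" check to strict equality:

// Hypothetical sketch, not the PR's actual code.
package controllers

import "strconv"

// scaleFromZeroAllowed permits scaling up from zero only back to the replica
// count recorded in the before-zero annotation.
func scaleFromZeroAllowed(annotations map[string]string, desired int32) (bool, error) {
	raw, ok := annotations["rabbitmq.com/before-zero-replicas-configured"]
	if !ok {
		return true, nil // never scaled to zero, nothing to enforce
	}
	before, err := strconv.ParseInt(raw, 10, 32)
	if err != nil {
		return false, err
	}
	return desired == int32(before), nil
}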

Kind regards

@mkuratczyk
Contributor

Thanks. My only additional feedback is that the error message is a bit cryptic ("Cluster Scale from zero to other replicas than before configured not supported; tried to scale cluster from 3 nodes to 5 nodes"). Perhaps "unsupported operation: when scaling from zero, you can only restore the previous number of replicas (3)"?

@Zerpet @ansd @MirahImage any thoughts about this PR?

@jonathanCaamano
Contributor Author

Hello,

I changed the logger.

@Zerpet Zerpet self-requested a review July 18, 2025 12:08
Member

@Zerpet Zerpet left a comment


Thank you for contributing this PR! I left some comments with feedback that I would like to be addressed before merging.

@jonathanCaamano jonathanCaamano requested a review from Zerpet July 30, 2025 07:38
Member

@Zerpet Zerpet left a comment


This is looking good 👍 I'm going to do some manual QA and I will merge afterwards. Thank you!

@jonathanCaamano
Contributor Author

Thanks!
I'll wait for your QA feedback.

Member

@Zerpet Zerpet left a comment


Nothing to report from the QA; it all worked as expected. Perhaps some may find it surprising that the AllReplicasReady condition is set to false when scaled to zero, but I'm not against this behaviour.

@Zerpet Zerpet modified the milestones: 2.15.0, 2.16.0 Jul 30, 2025
@Zerpet Zerpet merged commit ee0f974 into rabbitmq:main Jul 30, 2025
13 checks passed
Zerpet added a commit to Zerpet/rabbitmq-website that referenced this pull request Aug 5, 2025