Kubernetes: access to vhost '/' refused for user '': vhost '/' is down #14208
-
Describe the bug

Hi,

We are using two versions of RabbitMQ, 4.1.0 and 3.13.1, and this problem happened with both. To recover RabbitMQ, we need to delete the vhost data file. I have found some issues related to this problem but no definitive solution, only this workaround. It often happens when we update our application and RabbitMQ restarts.

rabbitmq:
  image: registry.local:5000/test/docker-images/rabbitmq:3.13.1-management-alpine
  hostname: "dashboard_rabbit"
  healthcheck:
    test: ['CMD', 'rabbitmq-diagnostics', '-q', 'ping']
    interval: 60s
    timeout: 5s
    retries: 3
  environment:
    - RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS=-rabbit log_levels [{connection,error}]
    - RABBITMQ_NODENAME=rabbit@dashboard_rabbit
  networks:
    test-dashboard:
      aliases:
        - test-dashboard_rabbitmq
    traefik-local:
      aliases:
        - test-dashboard_rabbitmq
    supervision:
      aliases:
        - test-dashboard_rabbitmq
  extra_hosts:
    - "dashboard_rabbit:127.0.0.1"
  volumes:
    - /mnt/test-data/int/data/test/ins-dashboard/dashboard/2.7.0-rc1/11-default-dashboard.conf:/etc/rabbitmq/conf.d/11-default-dashboard.conf
    - /mnt/test-data/int/data/test/ins-dashboard/dashboard/2.7.0-rc1/rabbitmq/queues/:/var/lib/rabbitmq/mnesia/
  labels:
    - test.client=inf
  deploy:
    update_config:
      order: start-first
      failure_action: rollback
      delay: 10s
    rollback_config:
      parallelism: 0
      order: stop-first
    restart_policy:
      condition: any
      delay: 5s
      max_attempts: 0
      window: 120s
    mode: replicated
    replicas: 1
    placement:
      constraints:
        - node.role==worker
        - node.labels.infra == ovh
        - node.labels.swarmtype == swarmvip
    resources:
      limits:
        memory: 10gb
        cpus: '0'
      reservations:
        memory: 256M
        cpus: '0'
    labels:
      - traefik.enable=true
      - traefik.internal=true
      - traefik.http.routers.test_dashboard_xeyn5mx8efdq-rabbitmq-router.rule=Host("dashboard-rabbitmq.int.test.local")
      - traefik.http.routers.test_dashboard_xeyn5mx8efdq-rabbitmq-router.entrypoints=internalweb
      - traefik.http.services.test_dashboard_xeyn5mx8efdq-rabbitmq-services.loadbalancer.server.port=15672

The RabbitMQ data is located on an NFS share and we did not have any problem before.

Reproduction steps
Expected behavior

Understand why this happens and how to fix it definitively.

Additional context

Feel free to ask.
Replies: 2 comments
-
RabbitMQ 3.13 has been out of community support for close to a year now. Duplicate of #10052. Time to upgrade to 4.1.x, the only series covered by community support.
-
I recall discussing this with another core team member and our conclusion was the following: if the mounted volume is not yet ready for writes by the time the node boots, it will fail to seed the data and the virtual host will then fail to start. This is pretty clearly hinted at by one of the function names: rabbit_variable_queue:do_start_msg_store/4.

We have never seen this behavior outside of Kubernetes, and RabbitMQ nodes do not do anything creative when it comes to initializing the schema data store or the CQ message store, so there is nothing to "fix once and for all" in RabbitMQ.

A while ago we considered adding optional delays before and after the node boots, for very different reasons. The former might help here. Alternatively, you can inject a startup pause using a Kubernetes-specific method, for example an init container that verifies that the volume is ready (writable); a sketch follows below.

To our knowledge, this behavior was never reported by those who use our Kubernetes Cluster Operator, most likely because the Operator introduces a startup delay to work around a widely known, unfortunate CoreDNS caching default. A similar delay will likely also help with volumes that are not ready early enough.
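To make the init container idea concrete, here is a minimal sketch of a pod spec fragment, assuming a StatefulSet-style deployment with a volume named data mounted at /var/lib/rabbitmq/mnesia. The container name, image, and probe file are hypothetical illustrations, not taken from this thread or from the Cluster Operator's manifests.

```yaml
# Hypothetical pod spec fragment (not the Cluster Operator's manifest): the init
# container delays RabbitMQ's boot until the data volume accepts writes, which is
# exactly the condition discussed above.
spec:
  initContainers:
    - name: wait-for-data-volume            # assumed name
      image: busybox:1.36                   # any image with a POSIX shell works
      command:
        - sh
        - -c
        - |
          # Keep probing until a test write on the mounted volume succeeds.
          until touch /var/lib/rabbitmq/mnesia/.volume-ready 2>/dev/null; do
            echo "data volume not writable yet, retrying..."
            sleep 2
          done
          rm -f /var/lib/rabbitmq/mnesia/.volume-ready
      volumeMounts:
        - name: data                        # assumed volume (PVC) name
          mountPath: /var/lib/rabbitmq/mnesia
  containers:
    - name: rabbitmq
      image: rabbitmq:4.1-management
      volumeMounts:
        - name: data
          mountPath: /var/lib/rabbitmq/mnesia
```

In the Docker Swarm setup above, the same write probe could be approximated with an entrypoint wrapper script that retries a test write against the NFS mount before exec'ing the stock RabbitMQ entrypoint.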