[Questions] Classic queue conversion takes very long time after upgrade to 4.0 #12848
---
For reference: this issue (I assume it's the same) was reported once before: https://groups.google.com/g/rabbitmq-users/c/8Ag1jnnLhWw/m/2ogIiQC5AgAJ. Unfortunately we never managed to identify the root cause. I'm relatively sure the problem is that CQv1 had a bug (perhaps fixed years ago?) that under some circumstances left files with no valid messages behind. CQv1 queues affected by this issue might have no messages whatsoever, while still having lots and lots of files that the CQv1->CQv2 migration process needs to go through.

Was 3.12.6 the initial version for this environment, or was it upgraded before, from some older 3.x version? Given that we know about quite a lot of successful CQv1->CQv2 migrations (and obviously we ran lots of tests before shipping this), my assumption is that this CQv1 issue was likely fixed years ago (perhaps unknowingly), and only environments that used to run some old 3.x version and were upgraded many times over the years could have queues with those unnecessary files. But if this can happen on 3.12, that'd be useful info.

What would be helpful is to try to identify affected queues (in this or other environments) that you have NOT yet tried to migrate, and to analyse why they have so many files while they are still "happily" running CQv1. Would it be possible for you to try to find such queues? Basically, we are looking for CQv1 queues that have a disproportionately large number of files relative to the number of messages ready in those queues. If you can find queues like that, can you find any commonalities between them (some particular policies or something)? Perhaps if we understand what led to these files not being deleted, we could special-case them for the conversion?
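To help hunt for such queues, here is a minimal sketch of a script that flags queue directories containing an unusually large number of index segment files. The directory layout (one subdirectory per queue under a `queues/` root, e.g. `<data-dir>/msg_stores/vhosts/<hash>/queues/`) and the `.idx` extension for CQv1 index segments are assumptions about a typical RabbitMQ data directory; verify both against your installation before relying on the output.

```python
#!/usr/bin/env python3
"""Flag queue directories with many index segment files.

ASSUMPTIONS (verify on your node): CQv1 index segments use the
``.idx`` extension, and each queue has its own subdirectory under
a ``queues/`` root inside the node's data directory.
"""
import sys
from pathlib import Path


def count_segment_files(queue_dir: Path, ext: str = ".idx") -> int:
    """Count index segment files directly inside one queue directory."""
    return sum(1 for p in queue_dir.iterdir() if p.suffix == ext)


def flag_suspicious(queues_root: Path, threshold: int = 50):
    """Yield (queue_dir, count) for directories at or above the threshold."""
    for queue_dir in sorted(queues_root.iterdir()):
        if not queue_dir.is_dir():
            continue
        n = count_segment_files(queue_dir)
        if n >= threshold:
            yield queue_dir, n


if __name__ == "__main__" and len(sys.argv) > 1:
    for d, n in flag_suspicious(Path(sys.argv[1])):
        print(f"{d}: {n} segment files")
```

Flagged directories can then be cross-checked against `rabbitmqctl list_queues name messages_ready` to see whether the file count is disproportionate to the number of ready messages; mapping a hashed directory name back to a queue name still has to be done manually.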
---
We've seen this issue again and I managed to gather some evidence. A single-node cluster (with many years of history) was upgraded from 3.13.7 to 4.0.7. One queue was taking forever to start during startup. After 2 hours the queue still had an empty queue folder, so I killed the queue process, after which it started up fine.
Logs
The queue was looping in
First I noticed the
Some important parts of the state, pretty printed:
My assumption is that if there is a queue with only transient messages, then after a clean RabbitMQ shutdown and startup the transient messages, and hence all segment files, will be deleted. But then I was able to observe this behaviour on main as well (although with slight differences, e.g. if the queue is empty then ). I believe @lhoguin can confirm or deny this theory and would know the remedy immediately.
---
RabbitMQ version used: 4.0.3
Erlang version used: 26.2.x
Operating system (distribution) used: linux
How is RabbitMQ deployed? Debian package
rabbitmq-diagnostics status output
See https://www.rabbitmq.com/docs/cli to learn how to use rabbitmq-diagnostics
Logs from node 1 (with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Logs from node 2 (if applicable, with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Logs from node 3 (if applicable, with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
rabbitmq.conf
See https://www.rabbitmq.com/docs/configure#config-location to learn how to find rabbitmq.conf file location
Steps to deploy RabbitMQ cluster
Steps to reproduce the behavior in question
advanced.config
See https://www.rabbitmq.com/docs/configure#config-location to learn how to find advanced.config file location
Application code
# PASTE CODE HERE, BETWEEN BACKTICKS
Kubernetes deployment file
What problem are you trying to solve?
A single-node cluster was upgraded to 4.0.3 (in multiple steps: 3.12.6 -> 3.12.14 -> 3.13.7 -> 4.0.3).
During startup after the upgrade, the classic queues were converted from v1 to v2.
However, some queues got into an infinite (or at least very long) loop.
Two queues were still looping, logging the message below, after 50 minutes:
Before the upgrade there were fewer than 20 messages across the queues, and about 300 queues on the cluster.
observer output
This is the process info of the first queue, called multiple times (I have only repeated the relevant changing sections).
The stack trace remained the same, always in delete_segment_file. However, reductions were increasing (so the process was not hanging), and memory was fluctuating slightly around 1 GB.
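The diagnostic reasoning here (reductions still climbing means the process is busy-looping, not blocked) can be codified into a small helper. A minimal sketch, assuming a `sample` callable (hypothetical; e.g. something that fetches the reduction count via `erlang:process_info/2` through `rabbitmqctl eval`, which is deliberately left abstract here):

```python
import time
from typing import Callable


def is_making_progress(sample: Callable[[], int],
                       interval_s: float = 1.0) -> bool:
    """Sample a monotonically increasing work counter (e.g. Erlang
    reductions) twice; if it grew, the process is busy-looping rather
    than hung on a blocking call."""
    first = sample()
    time.sleep(interval_s)
    return sample() > first
```

In this incident the counter kept growing between samples, which is what ruled out a deadlock and pointed at a very long (or unbounded) loop instead.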
process info
Unfortunately the node then ran out of memory (probably because of the tracing), after which it started up successfully.
Before the OOM there was only a journal.jif file in the queue directory of each of the 2 queues. All the entries in it are ACK-ed and DEL-ed.
I wonder if anything can be deduced from this much information about why the conversion took so long (and whether it would have finished eventually).