[Questions] Classic queue overdelivers to closed channel with global prefetch #13229
gomoripeti asked this question in Questions
RabbitMQ version used
other (please specify)
Erlang version used
26.2.x
Operating system (distribution) used
ubuntu
How is RabbitMQ deployed?
Debian package
rabbitmq-diagnostics status output
See https://www.rabbitmq.com/docs/cli to learn how to use rabbitmq-diagnostics
Logs from node 1 (with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Logs from node 2 (if applicable, with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Logs from node 3 (if applicable, with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
rabbitmq.conf
See https://www.rabbitmq.com/docs/configure#config-location to learn how to find rabbitmq.conf file location
Steps to deploy RabbitMQ cluster
deb package
Steps to reproduce the behavior in question
_
advanced.config
See https://www.rabbitmq.com/docs/configure#config-location to learn how to find advanced.config file location
Application code
# PASTE CODE HERE, BETWEEN BACKTICKS
Kubernetes deployment file
What problem are you trying to solve?
Note: this issue happens when a channel uses global QoS, which is a deprecated feature (however, it is still permitted on latest main and 4.0.x).

When AMQP 0.9.1 global QoS is used, the queue process asks the limiter process of the consumer channel whether it can deliver the next message. However, if the limiter process does not exist or is shutting down, `can_send` allows the delivery (a minimal sketch of this pattern is included below). This puzzling behaviour results in the queue delivering messages to a dead consumer beyond the original prefetch count, until it runs out of credit (200 by default). While the queue process is executing `run_message_queue` in a loop, it does not handle `DOWN` messages from terminated consumer channels. There is also the extra cost of the many `can_send` calls if the queue and channel processes are on different nodes. And the unacknowledged messages have to be loaded and kept in memory, so the queue process memory gets bloated.

The problem we have observed in production is a classic queue getting into a "bad" state: it has a long (~10K) internal gen_server2 message queue, it holds a lot of messages (let's say 100K), and more and more of them go into the unacked state although none are received on the consuming client side. Normally there are 60 consumers on the queue (each from a separate connection/channel). The client application tries to scale out by adding a few more consumers, but `basic.consume` always times out, so the connections are closed and the subscription is retried. Because the queue is always far behind on processing the `basic.consume` and channel-down events, it can never recover (or is very slow to recover after all the clients are stopped from retrying).

Stacktrace of the queue process is dominantly:
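For reference, here is a minimal sketch of the `can_send` path described above. This is illustrative only: the module, helper, and message shapes are my assumptions, not the actual rabbit_limiter.erl source. The point is the exit handler whose fallback (`ExitValue`) is hard-coded to `true`, so a limiter that has already terminated is indistinguishable from one that grants credit.

```erlang
%% Illustrative sketch only -- not the real rabbit_limiter.erl code.
%% Module, function and message names are assumptions made for this example.
-module(limiter_sketch).
-export([can_send/2]).

%% The queue process asks the consumer channel's limiter whether it may
%% deliver the next message. If the limiter process is already dead
%% (channel closed), the exit handler returns the hard-coded fallback
%% `true`, i.e. "go ahead and deliver".
can_send(LimiterPid, AckRequired) ->
    with_exit_handler(
      fun () -> true end,   %% ExitValue = true: dead limiter => keep sending
      fun () ->
              gen_server:call(LimiterPid, {can_send, self(), AckRequired})
      end).

%% Run Thunk; if the callee is gone (noproc / normal / shutdown exits),
%% return the handler's value instead of crashing the caller.
with_exit_handler(Handler, Thunk) ->
    try
        Thunk()
    catch
        exit:{noproc, _}        -> Handler();
        exit:{normal, _}        -> Handler();
        exit:{shutdown, _}      -> Handler();
        exit:{{shutdown, _}, _} -> Handler()
    end.
```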
My questions:
1. Should `ExitValue = true` in `rabbitmq-server/deps/rabbit/src/rabbit_limiter.erl` (Line 221 in a87036b) be changed to `false`, to prevent sending messages to dead consumers? (I think that `true` is a leftover from 2008, when the limiter process was shut down when it was not used; today the limiter is always running next to the channel process, but is marked by the queue as `dormant` if unused.)
2. Does `ExitValue = true` affect any other code path than AMQP 0.9.1 consumers with global QoS (e.g. AMQP 1.0 consumers)?
3. Would a PR changing `ExitValue` from `true` to `false` be accepted?
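For illustration, the change asked about in the last question would amount to flipping that fallback, so a dead limiter suspends delivery instead of granting it. This is a sketch under the same assumptions as the example above, not a tested patch against rabbit_limiter.erl:

```erlang
%% Hypothetical variant (sketch): treat a dead limiter as "do not send",
%% so the queue stops delivering once the consumer channel is gone.
can_send(LimiterPid, AckRequired) ->
    with_exit_handler(
      fun () -> false end,  %% ExitValue = false: dead limiter => stop sending
      fun () ->
              gen_server:call(LimiterPid, {can_send, self(), AckRequired})
      end).
```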