bug with cluster name / node IP metric aggregation #44

Open
JoeTaylor95 opened this issue Apr 27, 2023 · 5 comments

@JoeTaylor95

If the broker is running in an ASG, or nodes are otherwise likely to be replaced:

The cluster name is likely to change along with the node IP, so these really shouldn't be aggregated into the metric dimensions: when creating an alarm, these values will change whenever a node is replaced and there's a new leader.

This makes it impossible to create an alarm based on these metrics.

A fix for this would be to remove the metric aggregation and to have the cluster name parametrised.
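
To illustrate the problem (a sketch only; the namespace and dimension values below are made up, not taken from the plugin), an alarm pinned to these dimensions ends up hard-coding the node's IP-based name:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Illustrative values: the namespace and the dimension names/values depend on
# how the exporter is configured. The point is that the Node value here is the
# IP-derived node name, which disappears when the ASG replaces the instance.
cloudwatch.put_metric_alarm(
    AlarmName="rabbitmq-file-descriptors-ip-10-0-1-23",
    Namespace="RabbitMQ",
    MetricName="FileDescriptors",
    Dimensions=[
        {"Name": "Cluster", "Value": "rabbit@ip-10-0-1-23"},
        {"Name": "Node", "Value": "rabbit@ip-10-0-1-23"},
    ],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=30000,
    ComparisonOperator="GreaterThanThreshold",
)
# Once the node is replaced, no data is published for these dimension values
# any more and the alarm sits in INSUFFICIENT_DATA instead of watching the
# new node.
```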

@noxdafox
noxdafox commented May 2, 2023

Hello,

The issue at hand is not clear as stated. Could you please clarify it further?

The plugin does not aggregate metrics; this is done by the broker itself via the rabbitmq-management plugin.

You can already customize the dimensions by setting your preferred namespace configuration value.
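
For reference, that looks roughly like the following in rabbitmq.conf (a sketch from memory; check the exporter's README for the exact key name):

```ini
# rabbitmq.conf — sketch only; verify the exact key name against the
# rabbitmq-cloudwatch-exporter README.
cloudwatch_exporter.namespace = my-rabbitmq-metrics
```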

@JoeTaylor95

Hi,

The issue is with the metrics that are aggregated: both the node and the cluster name are included as dimensions. If an alarm is created, the metric filter has to include the aggregated dimensions, so both the node and the cluster values are needed. The problem comes when the node is replaced: that value (rightfully) changes.

A solution for this would be to use a custom cluster name and a custom node name, or to remove the node value from the aggregated dimensions, since the cluster name can be changed.

@noxdafox
noxdafox commented May 2, 2023

Can you highlight which of the aggregated metrics you are interested in?

What kind of alarms are you trying to set up with CW metric filters?

Are you aware that you can set node names yourself via the RABBITMQ_NODENAME environment variable, so they remain static when the ASG rotates the instances?
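
For example, a minimal sketch in rabbitmq-env.conf (the hostname is illustrative and must resolve to the instance):

```sh
# /etc/rabbitmq/rabbitmq-env.conf — illustrative hostname.
# In this file the RABBITMQ_ prefix is dropped; exporting RABBITMQ_NODENAME in
# the service environment has the same effect.
NODENAME=rabbit@mq-node-1.internal.example.com
```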

@JoeTaylor95

Sure. I'm creating alarms from the following CW aggregation [Cluster, Limit, Metric, Node, Type], specifically FileDescriptors (threshold > 30000), but I'm also looking at DiskFree and Memory.

Currently the ASG replaces instances (mainly for system patching, so this happens at least once a month) and the node hostnames use the AWS defaults, which are derived from the private LAN IP. That part is fine.

The issue I have is that when there are multiple nodes in a cluster, the alarms can be aggregated, but the node value is irrelevant to me; I only care about the cluster name, since the metrics above follow each other across the cluster.

Plus, when creating an alarm, if the filter is set to cluster name Xyz and FileDescriptors > 30k, then it covers all active nodes within the cluster.

If, for example, I were to use the node name instead, it has to be unique, so it doesn't work well with CW alarms: preserving the node name would conflict with a node that is having its connections drained.

I hope this makes sense. Effectively, whenever there are scaling actions in the ASG, every CW alarm needs updating accordingly.

A fix for this would be to allow removing the node value so it's not included in the aggregation, or to publish an additional metric aggregation where this value is not included.

Also, the cluster name would need to be customisable, but I think this is already defined at the RabbitMQ clustering level.
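
On that last point, the cluster name can indeed be set on the RabbitMQ side; for example (the name itself is illustrative):

```sh
# Give the cluster a stable, human-chosen name instead of the default, which
# is derived from the first node's name (the value here is illustrative).
rabbitmqctl set_cluster_name my-production-cluster
```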

@noxdafox
noxdafox commented Dec 8, 2024

Sorry for the delay in responding to this.

I still fail to see the value of this request. The node-specific metrics have to be assigned to a cluster node, otherwise they will simply be collapsed together. Removing the Node dimension would make the metric itself meaningless.

In your example, you are suggesting setting an alarm for FileDescriptors growing beyond 30k. If you remove the Node dimension from the picture, you will only be able to set an alarm on the number of open file descriptors across the whole cluster.

Yet the number of file descriptors is a property of a single server. What if 4 out of 5 nodes have just a few thousand descriptors open and one has most of them? That node will be unhealthy, yet your alarm won't trigger.

I do understand the challenge coming from the fact that AWS auto-scaling replaces nodes, but the solution is not to be addressed at the metric level; it belongs at the cluster management level.

You can solve this problem in several ways; a couple of examples:
