bug with cluster name / node IP metric aggregation #44

Open
JoeTaylor95 opened this issue Apr 27, 2023 · 5 comments

@JoeTaylor95

If the broker is running in an ASG, or nodes are otherwise likely to be replaced:

The cluster name is likely to change along with the node IP, so these really shouldn't be aggregated into the metric dimensions: when creating an alarm, these values will change whenever a node is replaced and there's a new leader.

This makes it impossible to create an alarm based on these metrics.

A fix for this would be to remove the metric aggregation and to have the cluster name parametrised.
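
To illustrate the problem (a sketch only; the namespace and dimension values below are made up, not taken from the plugin), an alarm pinned to these dimensions ends up hard-coding the node's IP-based name:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Illustrative values: the namespace and the dimension names/values depend on
# how the exporter is configured. The point is that the Node value here is the
# IP-derived node name, which disappears when the ASG replaces the instance.
cloudwatch.put_metric_alarm(
    AlarmName="rabbitmq-file-descriptors-ip-10-0-1-23",
    Namespace="RabbitMQ",
    MetricName="FileDescriptors",
    Dimensions=[
        {"Name": "Cluster", "Value": "rabbit@ip-10-0-1-23"},
        {"Name": "Node", "Value": "rabbit@ip-10-0-1-23"},
    ],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=30000,
    ComparisonOperator="GreaterThanThreshold",
)
# Once the node is replaced, no data is published for these dimension values
# any more and the alarm sits in INSUFFICIENT_DATA instead of watching the
# new node.
```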

@noxdafox
noxdafox commented May 2, 2023

Hello,

The issue at hand is not clear as stated. Could you please clarify it further?

The plugin does not aggregate metrics; this is done by the broker itself via the rabbitmq-management plugin.

You can already customize the dimensions by setting your preferred namespace configuration value.
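
For reference, that looks roughly like the following in rabbitmq.conf (a sketch from memory; check the exporter's README for the exact key name):

```ini
# rabbitmq.conf — sketch only; verify the exact key name against the
# rabbitmq-cloudwatch-exporter README.
cloudwatch_exporter.namespace = my-rabbitmq-metrics
```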

@JoeTaylor95

Hi,

The issue is with the metrics that are aggregated: both the node and the cluster name are included as dimensions. If an alarm is created, the metric filter has to include the aggregated dimensions, so both the node and the cluster values are needed. The problem comes when the node is replaced: that value (rightfully) changes.

A solution for this would be to use a custom cluster name and a custom node name, or to remove the node value from the aggregated dimensions, since the cluster name can be changed.

@noxdafox
noxdafox commented May 2, 2023

Can you highlight which of the aggregated metrics you are interested in?

What kind of alarms are you trying to set up with CW metric filters?

Are you aware that you can set node names yourself via the RABBITMQ_NODENAME environment variable, so they remain static when the ASG rotates the instances?
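
For example, a minimal sketch in rabbitmq-env.conf (the hostname is illustrative and must resolve to the instance):

```sh
# /etc/rabbitmq/rabbitmq-env.conf — illustrative hostname.
# In this file the RABBITMQ_ prefix is dropped; exporting RABBITMQ_NODENAME in
# the service environment has the same effect.
NODENAME=rabbit@mq-node-1.internal.example.com
```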

@JoeTaylor95

Sure. I'm creating alarms from the following CW aggregation [Cluster, Limit, Metric, Node, Type], specifically FileDescriptors (threshold > 30000), but I'm also looking at DiskFree and Memory.

Currently the ASG replaces instances (mainly for system patching, so this happens at least once a month) and the node hostnames use the AWS defaults, which are derived from the private LAN IP. That part is fine.

The issue I have is that when there are multiple nodes in a cluster, the alarms can be aggregated, but the node value is irrelevant to me; I only care about the cluster name, since the metrics above follow each other across the cluster.

Plus, when creating an alarm, if the filter is set to cluster name Xyz and FileDescriptors > 30k, then it covers all active nodes within the cluster.

If, for example, I were to use the node name instead, it has to be unique, so it doesn't work well with CW alarms: preserving the node name would conflict with a node that is having its connections drained.

I hope this makes sense. Effectively, whenever there are scaling actions in the ASG, every CW alarm needs updating accordingly.

A fix for this would be to allow removing the node value so it's not included in the aggregation, or to publish an additional metric aggregation where this value is not included.

Also, the cluster name would need to be customisable, but I think this is already defined at the RabbitMQ clustering level.
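
On that last point, the cluster name can indeed be set on the RabbitMQ side; for example (the name itself is illustrative):

```sh
# Give the cluster a stable, human-chosen name instead of the default, which
# is derived from the first node's name (the value here is illustrative).
rabbitmqctl set_cluster_name my-production-cluster
```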

@noxdafox
noxdafox commented Dec 8, 2024

Sorry for the delay in responding to this.

I still fail to see the value of this request. The node-specific metrics have to be assigned to a cluster node, otherwise they will simply be collapsed together. Removing the Node dimension would make the metric itself meaningless.

In your example, you are suggesting setting an alarm for FileDescriptors growing beyond 30k. If you remove the Node dimension from the picture, you will only be able to set an alarm on the number of open file descriptors across the whole cluster.

Yet the number of file descriptors is a property of a single server. What if 4 out of 5 nodes have just a few thousand descriptors open and one has most of them? That node will be unhealthy, yet your alarm won't trigger.

I do understand the challenge coming from the fact that AWS auto-scaling replaces nodes, but the solution is not to be addressed at the metric level; it belongs at the cluster management level.

You can solve this problem in several ways; a couple of examples:
