RiaPradeep

Currently, Xinfra Monitor is designed to run on a single machine. To monitor a Kafka cluster, you would typically run a single monitor instance on a separate machine. If that machine goes down or crashes, no metrics are reported for the cluster until the monitor is restarted.

We (the Kafka team at Bloomberg) propose a modification to Xinfra Monitor that makes it highly available and avoids this problem. A new service, HAMonitoringService, uses Kafka’s AbstractCoordinator to manage multiple simultaneously running instances of Xinfra Monitor, all of which join the same group. The group coordinator selects one instance to run the monitor (including the internal services that produce metrics). If that instance goes down, one of the other instances in the group can quickly take over reporting.

We’ve deployed these changes internally over the past two months with promising results. The monitor consistently reports metrics; any gap in metrics has been less than 5 minutes.

Specifically, these changes create HAMonitoringService, which instantiates and polls the HAMonitoringCoordinator. All instances of Xinfra Monitor will join a group this coordinator manages. The coordinator picks one group member to report metrics, and that instance will start Xinfra Monitor (as defined here). All other instances will stop Xinfra Monitor (defined here).
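The start/stop behavior described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `LeadershipToggle`, `onAssignment`, and the two `Runnable` callbacks are hypothetical names, and the real service obtains the selection result from the coordinator rather than from a boolean argument.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper: runs start/stop callbacks when this instance's selection changes. */
final class LeadershipToggle {
    private final Runnable startMonitor;
    private final Runnable stopMonitor;
    // True while this instance is the one reporting metrics.
    private final AtomicBoolean reporting = new AtomicBoolean(false);

    LeadershipToggle(Runnable startMonitor, Runnable stopMonitor) {
        this.startMonitor = startMonitor;
        this.stopMonitor = stopMonitor;
    }

    /** Assumed to be called after each group rebalance with whether this member was selected. */
    void onAssignment(boolean selected) {
        if (selected && reporting.compareAndSet(false, true)) {
            startMonitor.run();   // newly selected: begin reporting
        } else if (!selected && reporting.compareAndSet(true, false)) {
            stopMonitor.run();    // no longer selected: stop reporting
        }
        // Otherwise the state is unchanged and neither callback runs.
    }

    boolean isReporting() {
        return reporting.get();
    }
}
```

Making the transition idempotent means repeated rebalances that pick the same member do not restart the monitor.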

The HA option can be configured in the .config file like any other service. If no HA config is specified, Xinfra Monitor will run normally.

For example, including the following in the config file would run Xinfra Monitor with this feature:

"HA-monitoring-service": {
  "class.name": "com.linkedin.xinfra.monitor.services.HAMonitoringService",
  "bootstrap.servers": "<connection to kafka cluster>",
  "group.id": "HA-monitoring-group"
}

With this start/stop approach, an instance may start reporting, stop reporting, and later start again. To support that, some services now perform part of their instantiation in the start method instead of the constructor.
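One reason such a move can be necessary: a `java.util.concurrent.ExecutorService` cannot accept new tasks after it has been shut down, so a service that creates its executor in the constructor cannot be started a second time. The sketch below illustrates the pattern with a hypothetical `RestartableService`; it is not one of the services changed in this PR.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical service that can be started, stopped, and started again. */
final class RestartableService {
    // Created in start() rather than the constructor, so each start gets a fresh executor.
    private ExecutorService executor;

    synchronized void start() {
        if (executor != null) {
            return; // already running
        }
        executor = Executors.newSingleThreadExecutor();
        executor.submit(() -> {
            // Placeholder for the work that produces metrics.
        });
    }

    synchronized void stop() {
        if (executor == null) {
            return; // already stopped
        }
        executor.shutdownNow(); // a shut-down executor cannot be reused
        executor = null;
    }

    synchronized boolean isRunning() {
        return executor != null;
    }
}
```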
