|
| 1 | +Despite CockroachDB's various [built-in safeguards against failure](high-availability.html), it is critical to actively monitor the overall health and performance of a cluster running in production and to create alerting rules that promptly send notifications when there are events that require investigation or intervention. |
| 2 | + |
| 3 | +### Configure Prometheus |
| 4 | + |
| 5 | +Every node of a CockroachDB cluster exports granular timeseries metrics formatted for easy integration with [Prometheus](https://prometheus.io/), an open source tool for storing, aggregating, and querying timeseries data. This section shows you how to orchestrate Prometheus as part of your Kubernetes cluster and pull these metrics into Prometheus for external monitoring. |
| 6 | + |
| 7 | +This guidance is based on [CoreOS's Prometheus Operator](https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/getting-started.md), which allows a Prometheus instance to be managed using native Kubernetes concepts. |
| 8 | + |
| 9 | +<section class="filter-content" markdown="1" data-scope="gke-hosted"> |
| 10 | +{{site.data.alerts.callout_info}} |
| 11 | +Before starting, make sure the email address associated with your Google Cloud account is part of the `cluster-admin` RBAC group, as shown in [Step 1. Start Kubernetes](#step-1-start-kubernetes). |
| 12 | +{{site.data.alerts.end}} |
| 13 | +</section> |
| 14 | + |
| 15 | +1. From your local workstation, edit the `cockroachdb` service to add the `prometheus: cockroachdb` label: |
| 16 | + |
| 17 | + {% include copy-clipboard.html %} |
| 18 | + ~~~ shell |
| 19 | + $ kubectl label svc cockroachdb prometheus=cockroachdb |
| 20 | + ~~~ |
| 21 | + |
| 22 | + ~~~ |
| 23 | + service "cockroachdb" labeled |
| 24 | + ~~~ |
| 25 | + |
| 26 | + This ensures that there is a Prometheus job and monitoring data only for the `cockroachdb` service, not for the `cockroach-public` service. |
| 27 | + |
| 28 | +2. Install [CoreOS's Prometheus Operator](https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.20/bundle.yaml): |
| 29 | +
|
| 30 | + {% include copy-clipboard.html %} |
| 31 | + ~~~ shell |
| 32 | + $ kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.20/bundle.yaml |
| 33 | + ~~~ |
| 34 | +
|
| 35 | + ~~~ |
| 36 | + clusterrolebinding "prometheus-operator" created |
| 37 | + clusterrole "prometheus-operator" created |
| 38 | + serviceaccount "prometheus-operator" created |
| 39 | + deployment "prometheus-operator" created |
| 40 | + ~~~ |
| 41 | +
|
| 42 | +3. Confirm that the `prometheus-operator` has started: |
| 43 | +
|
| 44 | + {% include copy-clipboard.html %} |
| 45 | + ~~~ shell |
| 46 | + $ kubectl get deploy prometheus-operator |
| 47 | + ~~~ |
| 48 | +
|
| 49 | + ~~~ |
| 50 | + NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE |
| 51 | + prometheus-operator 1 1 1 1 1m |
| 52 | + ~~~ |
| 53 | +
|
| 54 | +4. Use our [`prometheus.yaml`](https://github.com/cockroachdb/cockroach/blob/master/cloud/kubernetes/prometheus/prometheus.yaml) file to create the various objects necessary to run a Prometheus instance: |
| 55 | +
|
| 56 | + {% include copy-clipboard.html %} |
| 57 | + ~~~ shell |
| 58 | + $ kubectl apply -f https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/prometheus/prometheus.yaml |
| 59 | + ~~~ |
| 60 | +
|
| 61 | + ~~~ |
| 62 | + clusterrole "prometheus" created |
| 63 | + clusterrolebinding "prometheus" created |
| 64 | + servicemonitor "cockroachdb" created |
| 65 | + prometheus "cockroachdb" created |
| 66 | + ~~~ |
| 67 | +
|
| 68 | +5. Access the Prometheus UI locally and verify that CockroachDB is feeding data into Prometheus: |
| 69 | +
|
| 70 | + 1. Port-forward from your local machine to the pod running Prometheus: |
| 71 | +
|
| 72 | + {% include copy-clipboard.html %} |
| 73 | + ~~~ shell |
| 74 | + $ kubectl port-forward prometheus-cockroachdb-0 9090 |
| 75 | + ~~~ |
| 76 | +
|
| 77 | + 2. Go to [http://localhost:9090](http://localhost:9090) in your browser. |
| 78 | +
|
| 79 | + 3. To verify that each CockroachDB node is connected to Prometheus, go to **Status > Targets**. The screen should look like this: |
| 80 | +
|
| 81 | + <img src="{{ 'images/v2.1/kubernetes-prometheus-targets.png' | relative_url }}" alt="Prometheus targets" style="border:1px solid #eee;max-width:100%" /> |
| 82 | +
|
| 83 | + 4. To verify that data is being collected, go to **Graph**, enter the `sys_uptime` variable in the field, click **Execute**, and then click the **Graph** tab. The screen should look like this: |
| 84 | +
|
| 85 | + <img src="{{ 'images/v2.1/kubernetes-prometheus-graph.png' | relative_url }}" alt="Prometheus graph" style="border:1px solid #eee;max-width:100%" /> |
| 86 | +
|
| 87 | + {{site.data.alerts.callout_success}} |
| 88 | + Prometheus auto-completes CockroachDB time series metrics for you, but if you want to see a full listing, with descriptions, port-forward as described in {% if page.secure == true %}[Access the Admin UI](#step-6-access-the-admin-ui){% else %}[Access the Admin UI](#step-5-access-the-admin-ui){% endif %} and then point your browser to [http://localhost:8080/_status/vars](http://localhost:8080/_status/vars). |
| 89 | +
|
| 90 | + For more details on using the Prometheus UI, see their [official documentation](https://prometheus.io/docs/introduction/getting_started/). |
| 91 | + {{site.data.alerts.end}} |
| 92 | +
|
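As a command-line complement to the `/_status/vars` tip above, the following is a minimal sketch for pulling a single metric out of that listing. The `metric_value` function is a hypothetical helper, not part of CockroachDB or Prometheus. It assumes the endpoint returns the standard Prometheus text format, that label values contain no spaces, and that sample lines carry no trailing timestamp.

```shell
# Hypothetical helper (not part of CockroachDB or Prometheus): print the
# value(s) of one metric from Prometheus text-format input read on stdin.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)         # drop any {label="..."} suffix
      if (metric == name) print $NF    # assumes the last field is the value
    }
  '
}

# Example usage, assuming the Admin UI port-forward on 8080 is active:
#   curl -s http://localhost:8080/_status/vars | metric_value sys_uptime
```

A metric exported with several label sets prints one value per line, so the output may need further filtering before it is compared against a threshold.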
| 93 | +### Configure Alertmanager |
| 94 | +
|
| 95 | +Active monitoring helps you spot problems early, but it is also essential to send notifications when there are events that require investigation or intervention. This section shows you how to use [Alertmanager](https://prometheus.io/docs/alerting/alertmanager/) and CockroachDB's starter [alerting rules](https://github.com/cockroachdb/cockroach/blob/master/cloud/kubernetes/prometheus/alert-rules.yaml) to do this. |
| 96 | + |
| 97 | +1. Download our <a href="https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/prometheus/alertmanager-config.yaml" download><code>alertmanager-config.yaml</code></a> configuration file. |
| 98 | + |
| 99 | +2. Edit the `alertmanager-config.yaml` file to [specify the desired receivers for notifications](https://prometheus.io/docs/alerting/configuration/). Initially, the file contains a dummy web hook. |
| 100 | + |
| 101 | +3. Add this configuration to the Kubernetes cluster as a secret, renaming it to `alertmanager.yaml` and labelling it to make it easier to find: |
| 102 | + |
| 103 | + {% include copy-clipboard.html %} |
| 104 | + ~~~ shell |
| 105 | + $ kubectl create secret generic alertmanager-cockroachdb --from-file=alertmanager.yaml=alertmanager-config.yaml |
| 106 | + ~~~ |
| 107 | + |
| 108 | + ~~~ |
| 109 | + secret "alertmanager-cockroachdb" created |
| 110 | + ~~~ |
| 111 | + |
| 112 | + {% include copy-clipboard.html %} |
| 113 | + ~~~ shell |
| 114 | + $ kubectl label secret alertmanager-cockroachdb app=cockroachdb |
| 115 | + ~~~ |
| 116 | + |
| 117 | + ~~~ |
| 118 | + secret "alertmanager-cockroachdb" labeled |
| 119 | + ~~~ |
| 120 | + |
| 121 | + {{site.data.alerts.callout_danger}} |
| 122 | + The name of the secret, `alertmanager-cockroachdb`, must match the name used in the `alertmanager.yaml` file. If they differ, the Alertmanager instance will start without configuration, and nothing will happen. |
| 123 | + {{site.data.alerts.end}} |
| 124 | + |
| 125 | +4. Use our [`alertmanager.yaml`](https://github.com/cockroachdb/cockroach/blob/master/cloud/kubernetes/prometheus/alertmanager.yaml) file to create the various objects necessary to run an Alertmanager instance, including a ClusterIP service so that Prometheus can forward alerts: |
| 126 | + |
| 127 | + {% include copy-clipboard.html %} |
| 128 | + ~~~ shell |
| 129 | + $ kubectl apply -f https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/prometheus/alertmanager.yaml |
| 130 | + ~~~ |
| 131 | + |
| 132 | + ~~~ |
| 133 | + alertmanager "cockroachdb" created |
| 134 | + service "alertmanager-cockroachdb" created |
| 135 | + ~~~ |
| 136 | + |
| 137 | +5. Verify that Alertmanager is running: |
| 138 | + |
| 139 | + 1. Port-forward from your local machine to the pod running Alertmanager: |
| 140 | + |
| 141 | + {% include copy-clipboard.html %} |
| 142 | + ~~~ shell |
| 143 | + $ kubectl port-forward alertmanager-cockroachdb-0 9093 |
| 144 | + ~~~ |
| 145 | + |
| 146 | + 2. Go to [http://localhost:9093](http://localhost:9093) in your browser. The screen should look like this: |
| 147 | + |
| 148 | + <img src="{{ 'images/v2.1/kubernetes-alertmanager-home.png' | relative_url }}" alt="Alertmanager" style="border:1px solid #eee;max-width:100%" /> |
| 149 | + |
| 150 | +6. Ensure that the Alertmanagers are visible to Prometheus by opening [http://localhost:9090/status](http://localhost:9090/status). The screen should look like this: |
| 151 | + |
| 152 | + <img src="{{ 'images/v2.1/kubernetes-prometheus-alertmanagers.png' | relative_url }}" alt="Prometheus Alertmanagers" style="border:1px solid #eee;max-width:100%" /> |
| 153 | + |
| 154 | +7. Add CockroachDB's starter [alerting rules](https://github.com/cockroachdb/cockroach/blob/master/cloud/kubernetes/prometheus/alert-rules.yaml): |
| 155 | +
|
| 156 | + {% include copy-clipboard.html %} |
| 157 | + ~~~ shell |
| 158 | + $ kubectl apply -f https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/prometheus/alert-rules.yaml |
| 159 | + ~~~ |
| 160 | +
|
| 161 | + ~~~ |
| 162 | + prometheusrule "prometheus-cockroachdb-rules" created |
| 163 | + ~~~ |
| 164 | +
|
| 165 | +8. Ensure that the rules are visible to Prometheus by opening [http://localhost:9090/rules](http://localhost:9090/rules). The screen should look like this: |
| 166 | +
|
| 167 | + <img src="{{ 'images/v2.1/kubernetes-prometheus-alertrules.png' | relative_url }}" alt="Prometheus alert rules" style="border:1px solid #eee;max-width:100%" /> |
| 168 | +
|
| 169 | +9. Verify that the example alert is firing by opening [http://localhost:9090/alerts](http://localhost:9090/alerts). The screen should look like this: |
| 170 | +
|
| 171 | + <img src="{{ 'images/v2.1/kubernetes-prometheus-alerts.png' | relative_url }}" alt="Prometheus alerts" style="border:1px solid #eee;max-width:100%" /> |
| 172 | +
|
| 173 | +10. To remove the example alert: |
| 174 | +
|
| 175 | + 1. Use the `kubectl edit` command to open the rules for editing: |
| 176 | +
|
| 177 | + {% include copy-clipboard.html %} |
| 178 | + ~~~ shell |
| 179 | + $ kubectl edit prometheusrules prometheus-cockroachdb-rules |
| 180 | + ~~~ |
| 181 | +
|
| 182 | + 2. Remove the `dummy.rules` block and save the file: |
| 183 | +
|
| 184 | + ~~~ |
| 185 | + - name: rules/dummy.rules |
| 186 | + rules: |
| 187 | + - alert: TestAlertManager |
| 188 | + expr: vector(1) |
| 189 | + ~~~ |
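For step 2 of the Alertmanager procedure above, it may help to see what a receiver definition can look like. The sketch below writes a separate example file so that the downloaded `alertmanager-config.yaml` is left untouched. The file name, the receiver name `ops-webhook`, and the URL are placeholders chosen for illustration, and the actual starter file may be structured differently.

```shell
# Write a hypothetical example configuration to a separate file.
# The receiver name and URL below are placeholders, not CockroachDB defaults.
cat > alertmanager-config.example.yaml <<'EOF'
route:
  receiver: ops-webhook          # default receiver for all alerts
receivers:
- name: ops-webhook
  webhook_configs:
  - url: http://alert-receiver.example.com/hook   # placeholder endpoint
EOF
```

Other receiver types, such as email, follow the same `receivers` list structure. Once the real `alertmanager-config.yaml` has been edited along these lines, it is loaded into the cluster as the secret described in step 3.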