@@ -1038,7 +1038,7 @@ Several stock health configurations use host variables to reference dimensions f

##### Prometheus Collector Variables

For metrics collected by the go.d `prometheus` collector, each unique Prometheus label set usually produces a separate chart. The chart ID is built from the metric name followed by `-label=value` pairs for every label (e.g. `kubelet_volume_stats_used_bytes-persistentvolumeclaim=my-pvc`). In the Netdata chart registry, the prefix comes from the go.d job `FullName`: it is `prometheus.<metric_name>-<label_set>` only when the job name is literally `prometheus`; otherwise it is `prometheus_<job_name>.<metric_name>-<label_set>` (for example, `prometheus_local.<metric_name>-<label_set>` or `prometheus_kubelet.<metric_name>-<label_set>`). For summary and histogram metric families, the collector may also emit related chart IDs such as `<id>`, `<id>_sum`, and `<id>_count`, so verify the exact chart ID you want to reference.
For metrics collected by the go.d `prometheus` collector, each unique Prometheus label set usually produces a separate chart. The chart ID is built from the metric name followed by `-label=value` pairs for every label (e.g. `kubelet_volume_stats_used_bytes-persistentvolumeclaim=my-pvc`); characters in a label value that are not chart-ID-safe, such as `.`, are replaced with `_` in the chart ID, while the chart's label keeps the original value (so `addr="10.0.0.1"` yields `…-addr=10_0_0_1`). In the Netdata chart registry, the prefix comes from the go.d job `FullName`: it is `prometheus.<metric_name>-<label_set>` only when the job name is literally `prometheus`; otherwise it is `prometheus_<job_name>.<metric_name>-<label_set>` (for example, `prometheus_local.<metric_name>-<label_set>` or `prometheus_kubelet.<metric_name>-<label_set>`). Summary and histogram families also emit separate `_sum` and `_count` charts; the suffix is part of the metric name, so the IDs are `<metric_name>_sum-<label_set>` and `<metric_name>_count-<label_set>` (just `<metric_name>_sum` / `<metric_name>_count` when the series has no labels), while histogram buckets are dimensions of the base `<metric_name>` chart. Verify the exact chart ID you want to reference.

Because Prometheus chart IDs typically contain hyphens and `=` characters, use the `${...}` brace form to reference them in `calc`/`warn`/`crit` expressions — the unbraced `$var` form stops parsing at `-`. Apply the same rule for both the common `prometheus_<job_name>` prefix and the special-case plain `prometheus` prefix, including any `_sum` or `_count` chart variants.
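
The naming rules above can be sketched in a few lines of Python. This is only an illustration of the description in this section, not the collector's code: `chart_id` and `sanitize` are hypothetical names, the set of "chart-ID-safe" characters is an assumption (the text only confirms that `.` becomes `_`), and labels are emitted in the order given.

```python
import re


def sanitize(value):
    # Assumption: letters, digits, '_' and '-' are safe; everything else becomes '_'.
    # The documentation only confirms the '.' -> '_' replacement.
    return re.sub(r"[^A-Za-z0-9_-]", "_", value)


def chart_id(job_name, metric_name, labels):
    """Build a registry chart ID from a go.d job name, a metric name, and its labels."""
    # The plain "prometheus" prefix is used only when the job is literally named "prometheus".
    prefix = "prometheus" if job_name == "prometheus" else "prometheus_" + job_name
    # Every label contributes a "-label=value" pair; an unlabeled series has no suffix.
    label_set = "".join("-{}={}".format(k, sanitize(v)) for k, v in labels.items())
    return "{}.{}{}".format(prefix, metric_name, label_set)


print(chart_id("kubelet", "kubelet_volume_stats_used_bytes",
               {"persistentvolumeclaim": "my-pvc"}))
print(chart_id("prometheus", "up", {"addr": "10.0.0.1"}))
```

Because the resulting IDs contain `-` and `=`, they are exactly the kind of name that needs the `${...}` brace form in expressions.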

@@ -100,6 +100,7 @@ Use the `edit-config` script to safely edit configuration files. It automaticall
:::

1. Open the Agent's health notification config:

```bash
sudo ./edit-config health_alarm_notify.conf
```
@@ -109,6 +110,7 @@ Use the `edit-config` script to safely edit configuration files. It automaticall
3. Define recipients per **role** (see below).

4. Restart the Agent for changes to take effect:

```bash
sudo systemctl restart netdata
```
@@ -286,7 +288,7 @@ role_recipients_email[sysadmin]="disabled"
If left empty, the default recipient for that method is used.
</details>

<details>
<details id="alert-severity-filtering">
<summary><strong>Alert Severity Filtering</strong></summary><br/>

You can limit certain recipients to only receive **critical** alerts:
@@ -298,11 +300,47 @@ role_recipients_email[sysadmin]="user1@example.com user2@example.com|critical"
This setup:

- Sends all alerts to `user1@example.com`
- Sends only critical-related alerts to `user2@example.com`
- Sends notifications to `user2@example.com` only once the alarm reaches CRITICAL, then continues sending status changes (including WARNING and CLEAR) until the alarm is cleared.

Works for all supported methods: email, Slack, Telegram, Twilio, Discord, etc.
</details>

<details>
<summary><strong>Controlling Recovered (CLEAR) Notifications</strong></summary><br/>

When an alert returns to normal, Netdata sends a **CLEAR** (recovered) notification. You can control when and whether these are sent.

**Default behavior:** Netdata suppresses a CLEAR notification unless the alert's previous status (`old_status`) was WARNING or CRITICAL. This prevents noise from alerts that change state without ever reaching a problem state.

**Enable CLEAR for all transitions:** If your downstream system handles deduplication, set `clear_alarm_always` in `health_alarm_notify.conf` to override the default suppression and send a CLEAR notification regardless of the previous status:

```ini
clear_alarm_always='YES'
```

**Filter by CRITICAL history with the `|critical` modifier:** As described in [Alert Severity Filtering](#alert-severity-filtering) above, `|critical` forwards notifications only for alerts that have reached CRITICAL status. This affects both WARNING and CLEAR:

- **WARNING** notifications are suppressed unless the alarm has previously reached CRITICAL.
- **CLEAR** notifications are only sent when the alert previously passed through CRITICAL. If the alert only went through WARNING → CLEAR, the CLEAR is not forwarded.

```ini
role_recipients_email[sysadmin]="admin@example.com|critical"
```

**Suppress all CLEAR notifications:** Use the `|noclear` modifier to completely block CLEAR notifications for a recipient while still receiving WARNING and CRITICAL alerts:

```ini
role_recipients_email[sysadmin]="admin@example.com|noclear"
```

You can combine modifiers. This example notifies only for alarms that have reached CRITICAL (WARNING is suppressed until then), and excludes CLEAR notifications entirely:

```ini
role_recipients_email[sysadmin]="admin@example.com|critical|noclear"
```
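
The rules in this section can be summarized as a small decision function. The Python sketch below only models the behavior described above; it is not the notification script's implementation. The function name `should_notify` and the `reached_critical` flag (whether the alarm has hit CRITICAL since it was last clear) are hypothetical.

```python
def should_notify(new_status, old_status, reached_critical,
                  modifiers=(), clear_alarm_always=False):
    """Return True if a recipient with the given modifiers would be notified."""
    if new_status == "CLEAR":
        # |noclear blocks every CLEAR notification for this recipient.
        if "noclear" in modifiers:
            return False
        # Default suppression: CLEAR is sent only after WARNING or CRITICAL,
        # unless clear_alarm_always overrides it.
        if not clear_alarm_always and old_status not in ("WARNING", "CRITICAL"):
            return False
    if "critical" in modifiers:
        # |critical: nothing is sent until the alarm has reached CRITICAL;
        # after that, later status changes are forwarded as well.
        return new_status == "CRITICAL" or reached_critical
    return True


# WARNING -> CLEAR for a "|critical" recipient: the CLEAR is not forwarded.
print(should_notify("CLEAR", "WARNING", False, modifiers=("critical",)))
# CRITICAL -> CLEAR for a "|critical|noclear" recipient: the CLEAR is blocked.
print(should_notify("CLEAR", "CRITICAL", True, modifiers=("critical", "noclear")))
```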

</details>

<details>
<summary><strong>Proxy Settings</strong></summary><br/>

@@ -411,21 +449,25 @@ Here are solutions for common alert notification issues:
### Email Notifications Not Working

1. Verify your email configuration:

```bash
grep -E "SEND_EMAIL|DEFAULT_RECIPIENT_EMAIL" /etc/netdata/health_alarm_notify.conf
```

2. Check if the system can send mail:

```bash
echo "Test" | mail -s "Test Email" your@email.com
```

3. Look for errors in the Netdata log:

```bash
tail -f /var/log/netdata/error.log | grep "alarm notify"
```

4. Test with debugging enabled:

```bash
sudo su -s /bin/bash netdata
export NETDATA_ALARM_NOTIFY_DEBUG=1
@@ -435,11 +477,13 @@ Here are solutions for common alert notification issues:
### Slack Notifications Failing

1. Verify your webhook URL is correct:

```bash
grep -E "SLACK_WEBHOOK_URL" /etc/netdata/health_alarm_notify.conf
```

2. Check for network connectivity to Slack:

```bash
curl -X POST -H "Content-type: application/json" --data '{"text":"Test"}' YOUR_WEBHOOK_URL
```
@@ -449,11 +493,13 @@ Here are solutions for common alert notification issues:
### PagerDuty Integration Issues

1. Verify your service key:

```bash
grep -E "PAGERDUTY_SERVICE_KEY" /etc/netdata/health_alarm_notify.conf
```

2. Test the PagerDuty API directly:

```bash
curl -H "Content-Type: application/json" -X POST -d '{"service_key":"YOUR_SERVICE_KEY","event_type":"trigger","description":"Test"}' https://events.pagerduty.com/generic/2010-04-15/create_event.json
```
@@ -98,7 +98,7 @@ The following options can be defined globally: update_every, autodetection_retry
| **Limits** | max_time_series | Global time series limit. If an endpoint returns more time series than this, the data is not processed. | 2000 | no |
| | max_time_series_per_metric | Per-metric time series limit. Metrics with more time series than this are skipped. | 200 | no |
| **Customization** | [fallback_type](#option-customization-fallback-type) | Fallback type rules for untyped metrics. | | no |
| | label_prefix | Optional prefix added to all labels of all charts. Labels will be formatted as `prefix_name`. | | no |
| | [relabeling](#option-customization-relabeling) | Prometheus-compatible metric relabeling, applied before charts are built. | | no |
| **HTTP Auth** | username | Username for Basic HTTP authentication. | | no |
| | password | Password for Basic HTTP authentication. | | no |
| | bearer_token_file | Path to a file containing a bearer token (used for `Authorization: Bearer`). | | no |
@@ -155,6 +155,36 @@ fallback_type:
```


<a id="option-customization-relabeling"></a>
##### relabeling

A list of relabeling blocks. Each block applies a list of Prometheus
`metric_relabel_configs` rules to the metrics whose name matches `match`. See the
[relabeling reference](https://github.com/netdata/netdata/blob/master/src/go/plugin/go.d/collector/prometheus/relabel/README.md)
for the full action set and more examples.

- `match`: Netdata simple patterns matched against the full metric name — including
any `_bucket`/`_sum`/`_count` suffix, so prefer globs like `app_lat*` over an exact
`app_lat` (space-separated; `*` matches any sequence, `?` any character, a leading
`!` negates). Use `*` to target every metric. Required.
- `metric_relabel_configs`: Prometheus relabel rules (`source_labels`, `separator`,
`regex`, `modulus`, `target_label`, `replacement`, `action`), applied in order to
the scraped samples before charts are built.

Relabeling that would corrupt a histogram or summary — splitting it, dropping a
component, mutating the `le`/`quantile` label, or merging two families — is rejected.
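
To see why a glob is safer than an exact name in `match`, the following Python sketch approximates simple-pattern matching as described in the list above. It is a rough model, not Netdata's matcher: `simple_match` is a hypothetical helper, and it assumes that patterns are evaluated left to right and that the first pattern that matches decides the result.

```python
import re


def _to_regex(pattern):
    # Translate '*' (any sequence) and '?' (any single character); escape the rest.
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def simple_match(patterns, name):
    """Return True if `name` is selected by the space-separated `patterns`."""
    for pat in patterns.split():
        negate = pat.startswith("!")
        if negate:
            pat = pat[1:]
        if _to_regex(pat).fullmatch(name):
            # Assumption: the first matching pattern wins; a leading '!' negates it.
            return not negate
    return False


print(simple_match("app_lat", "app_lat_bucket"))   # exact name misses the suffixed series
print(simple_match("app_lat*", "app_lat_bucket"))  # glob covers _bucket/_sum/_count
```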

```yaml
relabeling:
  - match: 'http_*'
    metric_relabel_configs:
      - source_labels: [code]
        regex: '(\d)\d\d'
        target_label: code_class
        replacement: '${1}xx'
```



</details>

@@ -286,6 +316,55 @@ jobs:
```
</details>

###### Metric relabeling

Derive a `code_class` label (2xx, 4xx, ...) from the `code` label on metrics whose names match `http_*`.

<details open>
<summary>Config</summary>

```yaml
jobs:
  - name: local
    url: http://127.0.0.1:9090/metrics
    relabeling:
      - match: 'http_*'
        metric_relabel_configs:
          - source_labels: [code]
            regex: '(\d)\d\d'
            target_label: code_class
            replacement: '${1}xx'

```
</details>
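
As a rough illustration of what the default `replace` action does with this rule, here is a Python sketch. It is a simplified model under stated assumptions, not the collector's code: `relabel_replace` is a hypothetical helper, only the `${name}` reference form is handled, and Python's `re` stands in for the RE2 syntax Prometheus uses. It follows the Prometheus conventions that the regex must match the whole joined source value and that `;` is the default separator.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label, replacement,
                    separator=";"):
    """Apply one Prometheus-style `replace` rule to a label dict and return a copy."""
    # Source label values are joined with the separator; missing labels count as "".
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # relabel regexes are fully anchored
    out = dict(labels)
    if match is None:
        return out  # no match: the sample is left unchanged
    # Convert ${1}-style references to Python's \g<1> form before expanding.
    template = re.sub(r"\$\{(\w+)\}", lambda g: r"\g<%s>" % g.group(1), replacement)
    out[target_label] = match.expand(template)
    return out


print(relabel_replace({"code": "404"}, ["code"], r"(\d)\d\d", "code_class", "${1}xx"))
```

A value such as `code="abc"` does not match `(\d)\d\d`, so no `code_class` label is added in that case.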

###### Rename labels that collide with Netdata's reserved labels

When these metrics are re-exported in Prometheus format, Netdata adds its own `instance`,
`family`, `chart`, and `dimension` labels. If the scraped endpoint already uses one of those
names, the re-export emits a duplicate label and a downstream Prometheus rejects the scrape.
Rename the colliding labels to avoid it (the use case the former `label_prefix` option served).


<details open>
<summary>Config</summary>

```yaml
jobs:
  - name: coredns
    url: http://127.0.0.1:9153/metrics
    relabeling:
      - match: '*'
        metric_relabel_configs:
          - regex: '(instance|family)'
            action: labelmap
            replacement: 'coredns_$1'
          - regex: '(instance|family)'
            action: labeldrop

```
</details>
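
The two-step rename above (copy with `labelmap`, then remove the originals with `labeldrop`) can be modeled in Python as follows. This is a sketch of the standard Prometheus semantics for these two actions, not the collector's implementation; `labelmap` and `labeldrop` here are hypothetical helper functions, and Python's `re` stands in for RE2.

```python
import re


def labelmap(labels, regex, replacement):
    """Copy every label whose name fully matches `regex` to the name built from `replacement`."""
    # Convert $1-style references to Python's \g<1> form.
    template = re.sub(r"\$(\d+)", lambda g: r"\g<%s>" % g.group(1), replacement)
    out = dict(labels)
    for name, value in labels.items():
        match = re.fullmatch(regex, name)
        if match:
            out[match.expand(template)] = value
    return out


def labeldrop(labels, regex):
    """Remove every label whose name fully matches `regex`."""
    return {k: v for k, v in labels.items() if not re.fullmatch(regex, k)}


scraped = {"instance": "10.0.0.1:9153", "family": "A", "zone": "eu"}
renamed = labeldrop(labelmap(scraped, "(instance|family)", "coredns_$1"),
                    "(instance|family)")
print(renamed)
```

Running `labeldrop` alone would discard the values; running `labelmap` alone would leave the colliding names in place, which is why the example uses both rules in that order.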



## Alerts
81 changes: 80 additions & 1 deletion docs/Collecting Metrics/Collectors/Applications/AuthLog.mdx
@@ -98,7 +98,7 @@ The following options can be defined globally: update_every, autodetection_retry
| **Limits** | max_time_series | Global time series limit. If an endpoint returns more time series than this, the data is not processed. | 2000 | no |
| | max_time_series_per_metric | Per-metric time series limit. Metrics with more time series than this are skipped. | 200 | no |
| **Customization** | [fallback_type](#option-customization-fallback-type) | Fallback type rules for untyped metrics. | | no |
| | label_prefix | Optional prefix added to all labels of all charts. Labels will be formatted as `prefix_name`. | | no |
| | [relabeling](#option-customization-relabeling) | Prometheus-compatible metric relabeling, applied before charts are built. | | no |
| **HTTP Auth** | username | Username for Basic HTTP authentication. | | no |
| | password | Password for Basic HTTP authentication. | | no |
| | bearer_token_file | Path to a file containing a bearer token (used for `Authorization: Bearer`). | | no |
@@ -155,6 +155,36 @@ fallback_type:
```


<a id="option-customization-relabeling"></a>
##### relabeling

A list of relabeling blocks. Each block applies a list of Prometheus
`metric_relabel_configs` rules to the metrics whose name matches `match`. See the
[relabeling reference](https://github.com/netdata/netdata/blob/master/src/go/plugin/go.d/collector/prometheus/relabel/README.md)
for the full action set and more examples.

- `match`: Netdata simple patterns matched against the full metric name — including
any `_bucket`/`_sum`/`_count` suffix, so prefer globs like `app_lat*` over an exact
`app_lat` (space-separated; `*` matches any sequence, `?` any character, a leading
`!` negates). Use `*` to target every metric. Required.
- `metric_relabel_configs`: Prometheus relabel rules (`source_labels`, `separator`,
`regex`, `modulus`, `target_label`, `replacement`, `action`), applied in order to
the scraped samples before charts are built.

Relabeling that would corrupt a histogram or summary — splitting it, dropping a
component, mutating the `le`/`quantile` label, or merging two families — is rejected.

```yaml
relabeling:
  - match: 'http_*'
    metric_relabel_configs:
      - source_labels: [code]
        regex: '(\d)\d\d'
        target_label: code_class
        replacement: '${1}xx'
```



</details>

@@ -286,6 +316,55 @@ jobs:
```
</details>

###### Metric relabeling

Derive a `code_class` label (2xx, 4xx, ...) from the `code` label on metrics whose names match `http_*`.

<details open>
<summary>Config</summary>

```yaml
jobs:
  - name: local
    url: http://127.0.0.1:9090/metrics
    relabeling:
      - match: 'http_*'
        metric_relabel_configs:
          - source_labels: [code]
            regex: '(\d)\d\d'
            target_label: code_class
            replacement: '${1}xx'

```
</details>

###### Rename labels that collide with Netdata's reserved labels

When these metrics are re-exported in Prometheus format, Netdata adds its own `instance`,
`family`, `chart`, and `dimension` labels. If the scraped endpoint already uses one of those
names, the re-export emits a duplicate label and a downstream Prometheus rejects the scrape.
Rename the colliding labels to avoid it (the use case the former `label_prefix` option served).


<details open>
<summary>Config</summary>

```yaml
jobs:
  - name: coredns
    url: http://127.0.0.1:9153/metrics
    relabeling:
      - match: '*'
        metric_relabel_configs:
          - regex: '(instance|family)'
            action: labelmap
            replacement: 'coredns_$1'
          - regex: '(instance|family)'
            action: labeldrop

```
</details>



## Alerts