|
# Step 1 - Start initial Prometheus servers

Thanos is meant to scale and extend vanilla Prometheus. This means that you can gradually, without disruption, deploy Thanos on top of your existing Prometheus setup.

Let's start our tutorial by spinning up three Prometheus servers. Why three?
The real advantage of Thanos shows when you need to scale out Prometheus beyond a single replica. Some reasons for scale-out might be:

* Adding functional sharding because of high metric cardinality
* Need for high availability of Prometheus, e.g. for rolling upgrades
* Aggregating queries from multiple clusters

For this course, let's imagine the following situation:

1. We have one Prometheus server in some `eu1` cluster.
2. We have 2 replica Prometheus servers in some `us1` cluster that scrape the same targets.

Let's start this initial Prometheus setup for now.

## Prometheus Configuration Files

Now, we will prepare configuration files for all Prometheus instances.

Click on a snippet and it will be copied to your clipboard.

Switch to the Editor tab, create a `prometheus0_eu1.yml` file, and paste the code below into it.

First, for the EU Prometheus server that scrapes itself:

```
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: eu1
    replica: 0

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['172.17.0.1:9090']
```{{copy}}

For the second cluster, we set up two replicas.

Make a `prometheus0_us1.yml` file and paste the code below into it:

```
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: us1
    replica: 0

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['172.17.0.1:9091','172.17.0.1:9092']
```{{copy}}

Now make a `prometheus1_us1.yml` file for the second replica and paste the code below into it:

```
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: us1
    replica: 1

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['172.17.0.1:9091','172.17.0.1:9092']
```{{copy}}

**NOTE**: Every Prometheus instance must have a globally unique set of identifying labels. These labels are important, as they represent a certain "stream" of data (e.g. in the form of TSDB blocks). It is within those exact external labels that compactions and downsampling are performed, the Querier filters its Store APIs, and further sharding options, deduplication, and potential multi-tenancy capabilities become available. External labels are not easy to edit retroactively, so it's important to provide a compatible set of them from the start, in order for Thanos to aggregate data across all the available instances.
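
Once the instances are up (we start them in the next section), you can sanity-check the external labels each server ended up with via Prometheus's `/api/v1/status/config` endpoint. A minimal sketch, assuming `curl` and `jq` are available in this environment:

```
# Print the external_labels section of each instance's effective config.
# Ports 9090-9092 match the instances started in the next section.
for port in 9090 9091 9092; do
  echo "--- Prometheus on port ${port} ---"
  curl -s "http://127.0.0.1:${port}/api/v1/status/config" \
    | jq -r '.data.yaml' | grep -A 3 'external_labels:'
done
```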

## Starting Prometheus Instances

Let's now start three containers representing our three different Prometheus instances.

Please note the extra flags we're passing to Prometheus:

* `--web.enable-admin-api` allows Thanos Sidecar to get metadata from Prometheus, such as external labels.
* `--web.enable-lifecycle` allows Thanos Sidecar to reload Prometheus configuration and rule files, if used (see the sketch after this list).
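
As an illustration of what `--web.enable-lifecycle` unlocks: once an instance is running, a POST to its `/-/reload` endpoint makes Prometheus re-read its configuration and rule files. A minimal sketch against the EU1 instance we start below:

```
# /-/reload is only served when --web.enable-lifecycle is set;
# without the flag, Prometheus refuses the request.
curl -X POST http://127.0.0.1:9090/-/reload
```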

Execute the following commands:

### Prepare "persistent volumes"

```
mkdir -p prometheus0_eu1_data prometheus0_us1_data prometheus1_us1_data
```{{execute}}

### Deploying "EU1"

```
docker run -d --net=host --rm \
    -v $(pwd)/prometheus0_eu1.yml:/etc/prometheus/prometheus.yml \
    -v $(pwd)/prometheus0_eu1_data:/prometheus \
    -u root \
    --name prometheus-0-eu1 \
    quay.io/prometheus/prometheus:v2.14.0 \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/prometheus \
    --web.listen-address=:9090 \
    --web.external-url={{TRAFFIC_HOST1_9090}} \
    --web.enable-lifecycle \
    --web.enable-admin-api && echo "Prometheus EU1 started!"
```{{execute}}
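
If you want to confirm the container came up cleanly before moving on, a quick, optional check of its recent logs (the last lines should mention that the server is ready to receive web requests):

```
# Inspect the most recent log lines from the EU1 container.
docker logs prometheus-0-eu1 2>&1 | tail -5
```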

NOTE: We are using a recent Prometheus image so that we can benefit from the latest remote read protocol.

### Deploying "US1"

```
docker run -d --net=host --rm \
    -v $(pwd)/prometheus0_us1.yml:/etc/prometheus/prometheus.yml \
    -v $(pwd)/prometheus0_us1_data:/prometheus \
    -u root \
    --name prometheus-0-us1 \
    quay.io/prometheus/prometheus:v2.14.0 \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/prometheus \
    --web.listen-address=:9091 \
    --web.external-url={{TRAFFIC_HOST1_9091}} \
    --web.enable-lifecycle \
    --web.enable-admin-api && echo "Prometheus 0 US1 started!"
```{{execute}}

and

```
docker run -d --net=host --rm \
    -v $(pwd)/prometheus1_us1.yml:/etc/prometheus/prometheus.yml \
    -v $(pwd)/prometheus1_us1_data:/prometheus \
    -u root \
    --name prometheus-1-us1 \
    quay.io/prometheus/prometheus:v2.14.0 \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/prometheus \
    --web.listen-address=:9092 \
    --web.external-url={{TRAFFIC_HOST1_9092}} \
    --web.enable-lifecycle \
    --web.enable-admin-api && echo "Prometheus 1 US1 started!"
```{{execute}}

## Setup Verification

Once started, you should be able to reach all of those Prometheus instances:

* [Prometheus-0 EU1]({{TRAFFIC_HOST1_9090}}/)
* [Prometheus-0 US1]({{TRAFFIC_HOST1_9091}}/)
* [Prometheus-1 US1]({{TRAFFIC_HOST1_9092}}/)
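
If you prefer the terminal, the same check can be scripted against each instance's `/-/ready` endpoint, which returns HTTP 200 once the server is ready to serve traffic. A minimal sketch:

```
# Expect three lines ending in "200" once all instances are up.
for port in 9090 9091 9092; do
  echo -n "port ${port}: "
  curl -s -o /dev/null -w "%{http_code}\n" "http://127.0.0.1:${port}/-/ready"
done
```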

## Additional info

Why would one need multiple Prometheus instances?

* High Availability (multiple replicas)
* Scaling ingestion: Functional Sharding
* Multi cluster/environment architecture
|
## Problem statement: Global view challenge

Let's try to play with this setup a bit. You are free to query any metrics; however, let's try to fetch one particular piece of information from our multi-cluster setup: **How many series (metrics) do we collect overall, across all the Prometheus instances we have?**

Tip: Look for the `prometheus_tsdb_head_series` metric.
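
There is no single endpoint to ask yet, which is exactly the problem. One manual approach is to query each instance's HTTP API separately and add the numbers up yourself. A sketch, assuming `jq` is available (note that the two `us1` replicas scrape the same targets, so naively adding every returned value double-counts those series):

```
# Query prometheus_tsdb_head_series on each instance and print the raw values.
for port in 9090 9091 9092; do
  echo "--- Prometheus on port ${port} ---"
  curl -s "http://127.0.0.1:${port}/api/v1/query?query=prometheus_tsdb_head_series" \
    | jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"'
done
```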

🕵️‍♂️

Try to get this information from the current setup!

To see the answer to this question, click SHOW SOLUTION below.
|
## Next

Great! We now have 3 Prometheus instances running.

In the next steps, we will learn how to install Thanos on top of our initial Prometheus setup to solve the problems shown in the challenge.