Commit 81807ff

restructured part 7 and added SLO definition

apietsch-splunk authored and a-staebler committed Jul 18, 2024
1 parent 1262c8c commit 81807ff

Showing 12 changed files with 167 additions and 0 deletions.
@@ -0,0 +1,48 @@
---
title: Use Tags with Dashboards
linkTitle: 1. ...with Dashboards
weight: 1
time: 5 minutes
---

### Dashboards

Navigate to **Metric Finder**, then type in the name of the tag, which is `credit_score_category` (remember that the dots in the tag name were replaced by underscores when the Monitoring MetricSet was created). You'll see that multiple metrics include this tag as a dimension:

![Metric Finder](../../images/metric_finder.png)

By default, **Splunk Observability Cloud** calculates several metrics using the trace data it receives. See [Learn about MetricSets in APM](https://docs.splunk.com/observability/en/apm/span-tags/metricsets.html) for more details.

By creating an MMS, `credit_score_category` was added as a dimension to these metrics, which means that this dimension can now be used for alerting and dashboards.

To see how, let's click on the metric named `service.request.duration.ns.p99`, which brings up the following chart:

![Service Request Duration](../../images/service_request_duration_chart.png)

Add filters for `sf_environment`, `sf_service`, and `sf_dimensionalized`. Then set the **Extrapolation policy** to `Last value` and the **Display units** to `Nanosecond`:

![Chart with Seconds](../../images/chart_settings.png)

With these settings, the chart allows us to visualize the service request duration by credit score category:

![Duration by Credit Score](../../images/duration_by_credit_score.png)

Now we can see the duration by credit score category. In this example, the red line represents the `exceptional` category, and we can see that the duration for these requests sometimes goes all the way up to 5 seconds.

The orange line represents the `very good` category and has very fast response times.

The green line represents the `poor` category and has response times between 2 and 3 seconds.
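
If you want to see what the chart builder created, a chart like this can also be expressed as a SignalFlow program. The following Python snippet just assembles and prints such a program as a string; it is a rough sketch of what the chart above does (the filter values and the `extrapolation` argument are assumptions based on this workshop's settings, not an export from the UI):

```python
# A rough SignalFlow equivalent of the chart built above (a sketch, not exported
# from the UI). Each credit_score_category value produces its own plot line,
# because the Monitoring MetricSet created one time series per category.
chart_program = """
duration = data('service.request.duration.ns.p99',
                filter=filter('sf_environment', 'tagging-workshop-yourname')
                   and filter('sf_service', 'creditcheckservice')
                   and filter('sf_dimensionalized', 'true'),
                extrapolation='last_value'      # mirrors the 'Last value' policy
               ).publish(label='p99 request duration (ns)')
"""
print(chart_program)
```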

It may be useful to save this chart on a dashboard for future reference. To do this, click on the **Save as...** button and provide a name for the chart:

![Save Chart As](../../images/save_chart_as.png)

When asked which dashboard to save the chart to, let's create a **new** one named `Credit Check Service - Your Name` (substituting your actual name):

![Create Dashboard](../../images/create_dashboard.png)

Now we can see the chart on our dashboard, and can add more charts as needed to monitor our credit check service:

![Credit Check Service Dashboard](../../images/credit_check_service_dashboard.png)


@@ -0,0 +1,28 @@
---
title: Use Tags with Alerting
linkTitle: 2. ...with Alerting
weight: 2
time: 3 minutes
---

### Alerts

It's great that we have a dashboard to monitor the response times of the credit check service by credit score, but we don't want to stare at a dashboard all day.

Let's create an alert so we can be notified proactively if customers with `exceptional` credit scores encounter slow requests.

To create this alert, click on the little bell in the top right-hand corner of the chart, then select **New detector from chart**:

![New Detector From Chart](../../images/new_detector_from_chart.png)

Let's call the detector `Latency by Credit Score Category`. Set the environment to your environment name (i.e. `tagging-workshop-yourname`) then select `creditcheckservice` as the service. Since we only want to look at performance for customers with `exceptional` credit scores, add a filter using the `credit_score_category` dimension and select `exceptional`:

![Create New Detector](../../images/create_new_detector.png)

For the alert condition, we select **Sudden Change** instead of **Static threshold** to make the example more vivid.

![Alert Condition: Sudden Change](../../images/alert_condition_suddenchange.png)

We can then set the remainder of the alert details as we normally would. The key thing to remember here is that without capturing a tag with the credit score category and indexing it, we wouldn't be able to alert at this granular level, but would instead be forced to bucket all customers together, regardless of their importance to the business.

Unless you actually want to be notified, there is no need to finish this wizard. You can close it by clicking the **X** in the top right corner of the wizard pop-up.
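
For teams that prefer to manage detectors as code, the same idea can be sketched against the Splunk Observability Cloud (SignalFx) REST API. Treat the snippet below as a hedged sketch only: it uses a simple static threshold instead of the **Sudden Change** condition from the wizard, the realm, token, threshold, and notification list are placeholders, and the exact `programText` and payload shape are assumptions rather than an export from the UI.

```python
import requests

REALM = "us1"               # placeholder realm
TOKEN = "YOUR_API_TOKEN"    # placeholder org token with API write permissions

# Simplified static-threshold stand-in for the wizard's "Sudden Change" condition,
# filtering on the indexed credit_score_category dimension.
program_text = """
latency = data('service.request.duration.ns.p99',
               filter=filter('sf_environment', 'tagging-workshop-yourname')
                  and filter('sf_service', 'creditcheckservice')
                  and filter('credit_score_category', 'exceptional')
              ).publish(label='exceptional p99 latency')
detect(when(latency > 3000000000)).publish('exceptional latency too high')
"""

detector = {
    "name": "Latency by Credit Score Category",
    "programText": program_text,
    "rules": [{
        "detectLabel": "exceptional latency too high",
        "severity": "Major",
        "notifications": [],   # add email/Slack/webhook notifications here
    }],
}

response = requests.post(
    f"https://api.{REALM}.signalfx.com/v2/detector",
    headers={"X-SF-Token": TOKEN},
    json=detector,
)
response.raise_for_status()
print("Created detector", response.json().get("id"))
```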
@@ -0,0 +1,54 @@
---
title: Use Tags with Service Level Objectives
linkTitle: 3. ...with SLOs
weight: 3
time: 10 minutes
---

We can now use the Monitoring MetricSet we created together with Service Level Objectives, in a similar way to how we used it with dashboards and detectors/alerts before. For that, let's be clear about some key concepts:

## Key Concepts of Service Level Monitoring

([skip](#creating-a-new-service-level-objective) if you know this)

|Concept|Definition|Examples|
|---|---|---|
|Service level indicator (SLI)|An SLI is a quantitative measurement showing some health of a service, expressed as a metric or combination of metrics.|*Availability SLI*: Proportion of requests that resulted in a successful response<br>*Performance SLI*: Proportion of requests that loaded in < 100 ms|
|Service level objective (SLO)|An SLO defines a target for an SLI and a compliance period over which that target must be met. An SLO contains 3 elements: an SLI, a target, and a compliance period. Compliance periods can be calendar, such as monthly, or rolling, such as past 30 days.|*Availability SLI over a calendar period*: Our service must respond successfully to 95% of requests in a month<br>*Performance SLI over a rolling period*: Our service must respond to 99% of requests in < 100 ms over a 7-day period|
|Service level agreement (SLA)|An SLA is a contractual agreement that indicates service levels your users can expect from your organization. If an SLA is not met, there can be financial consequences.|A customer service SLA indicates that 90% of support requests received on a normal support day must have a response within 6 hours.|
|Error budget|A measurement of how your SLI performs relative to your SLO over a period of time. Error budget measures the difference between actual and desired performance. It determines how unreliable your service might be during this period and serves as a signal when you need to take corrective action.|Our service can respond to 1% of requests in >100 ms over a 7-day period.|
|Burn rate|A unitless measurement of how quickly a service consumes the error budget during the compliance window of the SLO. Burn rate makes the SLO and error budget actionable, showing service owners when a current incident is serious enough to page an on-call responder.|For an SLO with a 30-day compliance window, a constant burn rate of 1 means your error budget is used up in exactly 30 days (see the sketch after this table).|
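
To make the error budget and burn rate rows more concrete, here is a small arithmetic sketch. The 99% target matches the SLO we create below; all request counts are made up for illustration:

```python
# Error budget and burn rate, using made-up traffic numbers for illustration.
slo_target = 0.99                     # 99% of requests must meet the objective
error_budget = 1 - slo_target         # so 1% of requests may miss it

# Compliance window so far (hypothetical counts):
total_requests = 1_000_000
bad_requests = 6_000                  # requests that missed the objective

budget_consumed = bad_requests / (error_budget * total_requests)
# -> 0.6, i.e. 60% of the error budget is already used up

# Burn rate over a recent short window (hypothetical counts):
recent_total, recent_bad = 10_000, 300
burn_rate = (recent_bad / recent_total) / error_budget
# -> 3.0: at this pace the budget is gone in a third of the compliance window

print(f"{budget_consumed:.0%} of error budget used, current burn rate {burn_rate:.1f}")
```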

## Creating a new Service Level Objective

There is an easy-to-follow wizard to create a new Service Level Objective (SLO). In the left navigation, follow the link **Detectors & SLOs**. From there, select the third tab, **SLOs**, and click the blue **Create SLO** button on the right.

![Create new SLO](../../images/slo_0_create.png)

The wizard guides you through a few simple steps. If everything in the previous sections worked out, you will have no problems here. ;)

In our case we want to use `Service & endpoint` as our **Metric type** instead of `Custom metric`. Filter the **Environment** down to the environment you are using in this workshop (i.e. `tagging-workshop-yourname`) and select `creditcheckservice` from the **Service and endpoint** list. Our **Indicator type** for this workshop will be `Request latency`, not `Request success`.

Now we can select our **Filters**. Since we are using `Request latency` as the **Indicator type**, which is a metric of the APM service, we can filter on `credit.score.category`. Feel free to try out what happens when you set the **Indicator type** to `Request success`.

Today we are only interested in our `exceptional` credit scores, so select that as the filter.

![Choose Service or Metric for SLO](../../images/slo_1_choose.png)

In the next step we define the objective we want to reach. For the `Request latency` type, we define the **Target (%)**, the **Latency (ms)**, and the **Compliance Window**. Set these to `99`, `100`, and `Last 7 days`. This will give us a good idea of what we are already achieving.

The result may come as a shock. Feel free to play around with the numbers to see how well we meet the objective and how much error budget we have left to burn.

![Define Objective for SLO](../../images/slo_2_define_objective.png)
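
Restated in code, the objective from this step says: over the last 7 days, at least 99% of `creditcheckservice` requests must complete in under 100 ms. Here is a tiny sketch of that check with made-up latency samples:

```python
# Latency SLI check for the objective above; the latency samples are invented.
latencies_ms = [38, 41, 55, 62, 70, 85, 90, 96, 120, 2400]

target_pct = 99.0            # Target (%)
threshold_ms = 100.0         # Latency (ms)

good = sum(1 for latency in latencies_ms if latency < threshold_ms)
sli = 100 * good / len(latencies_ms)          # 80.0% in this toy sample

status = "met" if sli >= target_pct else "missed"
print(f"SLI = {sli:.1f}% vs target {target_pct}% -> objective {status}")
```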

The third step gives us the chance to alert (aka annoy) people who should be aware of this SLO so they can initiate countermeasures. These "people" can also be mechanisms like ITSM systems or webhooks that trigger automatic remediation steps.

Activate all categories you want to alert on and add recipients to the different alerts.

![Define Alerting for SLO](../../images/slo_3_define_alerting.png)

The final step is naming the SLO. Have your own naming convention ready for this. In our case we would name it `creditcheckservice:score:exceptional:YOURNAME` and click the **Create** button, **BUT** you can also **just cancel the wizard** by clicking anywhere in the left navigation and confirming to **Discard changes**.

![Name and Save the SLO](../../images/slo_4_name_and_save.png)

And with that we have (*nearly*) successfully created an SLO, including the alerting in case we miss our goals.
@@ -0,0 +1,37 @@
---
title: Use Tags for Monitoring
linkTitle: 7. Use Tags for Monitoring
weight: 7
time: 15 minutes
---


Earlier, we created a **Troubleshooting Metric Set** on the `credit.score.category` tag, which allowed us to use **Tag Spotlight** with that tag and identify a pattern to explain why some users received a poor experience.

In this section of the workshop, we'll explore a related concept: **Monitoring MetricSets**.

## What are Monitoring MetricSets?

**Monitoring MetricSets** go beyond troubleshooting and allow us to use tags for alerting, dashboards and SLOs.

## Create a Monitoring MetricSet

(**note**: *your workshop instructor will do the following for you, but observe the steps*)

Let's navigate to **Settings** -> **APM MetricSets**, and click the edit button (i.e. the little pencil) beside the MetricSet for `credit.score.category`.

![edit APM MetricSet](../images/edit_apm_metricset.png)

Check the box beside **Also create Monitoring MetricSet**, then click **Start Analysis**.

![Monitoring MetricSet](../images/monitoring_metricset.png)

The `credit.score.category` tag appears again as a **Pending MetricSet**. After a few moments, a checkmark should appear. Click this checkmark to enable the **Pending MetricSet**.

![pending APM MetricSet](../images/update_pending_apm_metricset.png)

## Using Monitoring MetricSets

This mechanism creates a new dimension from the tag on a number of metrics, which can be used to filter those metrics based on the values of that new dimension. **Important**: To differentiate between the original tag and the copy, the dots in the tag name are replaced by underscores for the new dimension. The metrics therefore get a dimension named `credit_score_category`, not `credit.score.category`.
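
As a quick illustration of the renaming, and of how the new dimension can then be used in a filter (the metric name and the SignalFlow snippet are illustrative assumptions):

```python
# The MMS dimension name is simply the span tag name with dots replaced by underscores.
span_tag = "credit.score.category"
mms_dimension = span_tag.replace(".", "_")    # -> "credit_score_category"

# A hypothetical SignalFlow filter using the new dimension:
program = (
    "data('service.request.count', "
    f"filter=filter('{mms_dimension}', 'exceptional')).publish()"
)
print(program)
```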

Next, let's explore how we can use this **Monitoring MetricSet**.