sysdig basic monitoring guidline documentation #86

Closed
wants to merge 19 commits into from

Conversation

w8896699

No description provided.

@w8896699 w8896699 changed the title documentation check point at Saturation sysdig basic monitoring guidline documentation Jul 21, 2022
@w8896699 w8896699 requested a review from ShellyXueHan July 21, 2022 23:40
@w8896699 w8896699 marked this pull request as ready for review July 21, 2022 23:40
@ShellyXueHan ShellyXueHan left a comment

There are many grammar issues in this doc; please proofread before opening a PR next time. I can also see you are taking content from other resources. In that case it's better to reference the original document instead of duplicating it. As for the PromQL samples, I think they are too basic. For example, the resource usage ones only reflect the overall namespace status, whereas useful metrics would be grouped by workload or service component. It would also be better to provide a full query for the network-request monitors instead of a single metric. All in all, I think this needs some rework. Let's chat about it next week before you put more effort in.
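
As a rough illustration of the two requests above (usage grouped by workload, and a full request query rather than a bare metric), here is a minimal Python sketch that assembles PromQL strings. The metric names `sysdig_container_cpu_cores_used` and `http_requests_total`, the `code` label, and the `kube_namespace_name` / `kube_workload_name` labels are assumptions for illustration, as is the namespace `abc123-prod`; check them against the metrics actually available in your Sysdig team before using the queries.

```python
def usage_by_workload(metric: str, namespace: str, agg: str = "sum") -> str:
    """Build a PromQL query that aggregates a usage metric per workload.

    Label names are assumed, not confirmed by the doc under review.
    """
    selector = f'kube_namespace_name="{namespace}"'
    return f"{agg} by (kube_workload_name) ({metric}{{{selector}}})"


def http_success_ratio(metric: str, namespace: str, window: str = "5m") -> str:
    """Build a PromQL query for the share of requests that did not return 5xx.

    Assumes a counter with a `code` label holding the HTTP status.
    """
    selector = f'kube_namespace_name="{namespace}"'
    good = f'sum(rate({metric}{{{selector},code!~"5.."}}[{window}]))'
    total = f"sum(rate({metric}{{{selector}}}[{window}]))"
    return f"{good} / {total}"


if __name__ == "__main__":
    # Hypothetical namespace and metric names, for illustration only.
    print(usage_by_workload("sysdig_container_cpu_cores_used", "abc123-prod"))
    print(http_success_ratio("http_requests_total", "abc123-prod"))
```

Grouping by workload keeps one noisy component from hiding behind a healthy namespace total, which seems to be the point of the comment.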

@@ -0,0 +1,118 @@
---
title: Sysdig Monitoring Guildline for shared services and apps

Suggested change
title: Sysdig Monitoring Guildline for shared services and apps
title: Sysdig Monitoring Guideline for Platform Shared Services


slug: sysdig-monitor-setup

description: Default monitoring standards that will be applied to all our services and apps.

Suggested change
description: Default monitoring standards that will be applied to all our services and apps.
description: Service Golden Signals - monitoring standards and best practices that will be applied to Platform Shared Services.

sort_order: 2
---

The four golden signals of SRE are latency, traffic, errors, and saturation. SRE’s golden signals define what it means for the system to be “healthy.” And our monitoring standard will be build based on those four aspect.

Suggested change
The four golden signals of SRE are latency, traffic, errors, and saturation. SRE’s golden signals define what it means for the system to be “healthy.” And our monitoring standard will be build based on those four aspect.
The four golden signals of Site Reliability Engineering (SRE) are latency, traffic, errors, and saturation. SRE’s golden signals define what it means for the system to be “healthy”. The following monitoring standard will be built based on those four aspects.


# Using PromQL

The Prometheus Query Language (PromQL) is the defacto standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. And building dashboard in sysdig is havily rely on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/).

Suggested change
The Prometheus Query Language (PromQL) is the defacto standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. And building dashboard in sysdig is havily rely on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/).
The Prometheus Query Language (PromQL) is the de facto standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. Building dashboards in Sysdig relies heavily on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/).



# Resources monitoring with Sysdig(Saturation)

Suggested change
# Resources monitoring with Sysdig(Saturation)
## Resources monitoring with Sysdig (Saturation)

@w8896699 w8896699 requested a review from caggles August 28, 2023 17:27


#### Registry
The registry is an application where teams can submit requests for provisioning namespaces in OpenShift 4 (OCP4) clusters. The registry allows teams to:

let's call it BC Platform Services Product Registry for more accurate naming ;)


maybe helpful to describe the project structure: API, web frontend, DB, automation provisioner, etc...

Comment on lines 32 to 34
- Request that their project namespace be created in additional clusters;
- Request other resources such as KeyCloak realms or Artifactory pull-through repositories; and
- Receive management from the platform services team.

Does the Registry actually do items 2 and 3 (with KeyCloak realms)? Also, what does "Receive management" mean?

- Request other resources such as KeyCloak realms or Artifactory pull-through repositories; and
- Receive management from the platform services team.

More details about the Registry app and its workings can be found [here](https://github.com/bcgov/platform-services-registry/blob/master/docs/Whole-project-workflow.md).

the link is invalid

* Retrieving all product information from DB should be less than 8 sec.
* Retrieving 30 product information from DB should be less than 2 sec.
* Web, API, and DB should be up 99.5% of the time
* DB should have a backup every 30 mins

can you explain why a backup every 30 mins, for example which SLA does this match to?



#### SLO
The definition of s SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. Google has a really [good doc](https://sre.google/workbook/implementing-slos/#:~:text=For%20example%2C%20if%20you%20have,50%25%20of%20the%20error%20budget.) for how to gets the start.

Suggested change
The definition of s SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. Google has a really [good doc](https://sre.google/workbook/implementing-slos/#:~:text=For%20example%2C%20if%20you%20have,50%25%20of%20the%20error%20budget.) for how to gets the start.
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. Google has a really [good doc](https://sre.google/workbook/implementing-slos/#:~:text=For%20example%2C%20if%20you%20have,50%25%20of%20the%20error%20budget.) on how to get started.


Based on the Google link, it's important to know how to calculate the error budget if people are going to use the error-budget-based approach to reliability. Can you provide more details on that?

* Retrieving 30 product information from DB should be less than 2 sec.
* Web, API, and DB should be up 99.5% of the time
* DB should have a backup every 30 mins
* Provisioner jobs can be completed within 40 mins

App teams won't understand that provisioner jobs correspond to "Update product requests" in the SLA if there is no explanation of how things work. It might be better to call them automation jobs for OCP Project Set change requests.


The registry has the following monitoring standards built based on those four aspects.

* Number of successful HTTP requests / total HTTP requests (success rate) should be greater than 99.99%

How do these SLIs relate to the SLO and SLA defined earlier? For example, why 99.99%? Does it correspond to "Web, API, and DB should be up 99.5% of the time"?


The Prometheus Query Language (PromQL) is the standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. Building dashboards in Sysdig relies heavily on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/).

### Team Scope

why is this section named as team scope?

@w8896699 w8896699 requested review from ShellyXueHan and removed request for ksummersill2 August 30, 2023 00:10
@w8896699 w8896699 requested a review from Pilargit12 September 6, 2023 16:30
@w8896699

@Pilargit12

@@ -0,0 +1,322 @@
---
title: SRE Guideline for Platform Shared Services

In the PR, the .md file is still named Sysdig-Monitoring-Guidline-for-shared-services-and-apps. Please make sure to update it correctly; I believe @ShellyXueHan made a suggestion.

Slug of the website should also be all lower case: sysdig-monitor-setup


sort_order: 2
---


Last updated: Month, day, year

Alignment with publishing new pages, corrections for active voice and simple language guidelines.

Missing: Related pages content please review
Pilargit12 and others added 2 commits September 26, 2023 11:55
Uptime link wasn't working, corrected this.
@Pilargit12 Pilargit12 self-requested a review September 26, 2023 21:57
@Pilargit12 Pilargit12 left a comment

Looks good to me, I wonder if other team members will have time to review this new doc.

@w8896699

@ShellyXueHan may I merge this? been hanging for a long while


Once we establish an SLA that we know will keep our users satisfied, the SLO becomes the minimum commitment we make. As a result, it's in our best interest to identify and address any issues before they breach our SLA, allowing us time to fix them. Breaking this commitment often carries consequences.

Once again, I'll use the Registry as an example, and we'll consider monthly periods:

can you explain what a monthly period is? it should be a continuous 30 day window instead of calendar months.


## Service Level Indicators (SLIs)

SLI golden signals consist of request latency, availability, error rate, and system throughput. These metrics establish the criteria for determining the system's "health." It's essential to grasp the connection and difference between SLIs, SLOs, and SLAs.

Suggested change
SLI golden signals consist of request latency, availability, error rate, and system throughput. These metrics establish the criteria for determining the system's "health." It's essential to grasp the connection and difference between SLIs, SLOs, and SLAs.
SLI golden signals consist of request latency, availability, error rate, and system throughput. These metrics establish the criteria for determining the system's "health". It's essential to grasp the connection and difference between SLIs, SLOs, and SLAs.


so what's the difference and connection between SLI, SLO and SLA?


And so on.

**Performance Monitoring Tools:** You can use tools like Prometheus, Grafana, or New Relic to continuously watch and display system performance metrics, including response times.

Could you add the relevant doc links so people know how to get started?



## Runbook
A 99.5% uptime target leaves a downtime allowance of only 7 minutes and 12 seconds per day (0.5% of 24 hours). This is where runbooks prove invaluable. In SRE, the objective is to automate as many processes as feasible. In cloud operations, a runbook is a series of steps that SREs carry out to accomplish a specific task, such as incident response, cost management, or addressing performance problems.
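
The 7 minutes 12 seconds figure is 0.5% of a 24-hour day. A small Python sketch of that arithmetic follows, under the assumption that the allowance is simply (1 − target) × window. The function name `allowed_downtime` is made up for this example and is not part of the doc under review.

```python
from datetime import timedelta


def allowed_downtime(target: float, window: timedelta) -> timedelta:
    """Return the downtime permitted by an availability target over a window.

    Assumes the allowance is (1 - target) * window, with target as a
    fraction (0.995 for 99.5%).
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    return window * (1.0 - target)


if __name__ == "__main__":
    # Daily allowance for the 99.5% target quoted in the text.
    print(allowed_downtime(0.995, timedelta(days=1)))
    # The same target over a rolling 30-day window.
    print(allowed_downtime(0.995, timedelta(days=30)))
```

Over a rolling 30-day window the same target allows about 3 hours 36 minutes in total, which may be a more practical way to state the allowance than a per-day figure.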

could you include a brief explanation on runbook, like an automated script that ....

- Scale up and terminate the old pod.
- Notify the development team to assess the network status.

For more insights on automation runbooks, refer to this [source in Xenonstack](https://www.xenonstack.com/insights/automation-runbook-for-sre). We are also introducing [runwhen](https://www.runwhen.com/) on our platform to aid in this automation process.

we don't use runwhen ourselves, but you could say here's a cool tool for that ;)

```
Error Budget = 1 − 0.95 = 0.05, or 5%
```
This means that the service can be "unreliable" or "down" for 5% of the time without violating the SLO.

also need to set the length of the calculation time window
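
To make the window explicit, here is a hedged Python sketch that computes error budget consumption from request counts over a rolling window. The data shape (a mapping of day to total and failed request counts), the function name `error_budget_status`, and the sample numbers are assumptions for illustration only; the 0.95 target is the one from the formula above.

```python
from datetime import date, timedelta


def error_budget_status(daily_counts, slo_target, window_days, today):
    """Summarise error budget use over the last `window_days` days.

    daily_counts maps a date to (total_requests, failed_requests).
    Only days inside the rolling window ending on `today` are counted.
    """
    window_start = today - timedelta(days=window_days - 1)
    total = failed = 0
    for day, (day_total, day_failed) in daily_counts.items():
        if window_start <= day <= today:
            total += day_total
            failed += day_failed

    budget = 1.0 - slo_target           # e.g. 1 - 0.95 = 0.05
    allowed_failures = budget * total   # failures the SLO tolerates
    consumed = failed / allowed_failures if allowed_failures else 0.0
    return {
        "total_requests": total,
        "failed_requests": failed,
        "allowed_failures": allowed_failures,
        "consumed": consumed,           # 1.0 means the budget is used up
    }


if __name__ == "__main__":
    # Hypothetical traffic: 1000 requests and 10 failures per day for 40 days.
    counts = {date(2023, 9, 1) + timedelta(days=i): (1000, 10) for i in range(40)}
    print(error_budget_status(counts, 0.95, 30, date(2023, 10, 10)))
```

Changing `window_days` changes which days count toward the budget, so the same failure history can look healthy over 30 days and exhausted over 7.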

In this document, we've covered several essential aspects of Site Reliability Engineering (SRE). From delving into monitoring in detail to implementing tools like Uptime.com and creating useful Runbooks, our focus is on ensuring the continuous and smooth operation of our systems.

**Quick recap**:
1. **Monitoring**: It's not just about observing; it's our initial defense against issues. The sooner we notice something, the faster we can address it.

and the less impact it causes!

@Pilargit12

@w8896699 what happened to this branch? Do we need to keep it open?

@Pilargit12

Pilargit12 commented Sep 20, 2024

I am doing a cleaning of tech doc branches that were never reviewed again, nor merged (maybe this is too late). @w8896699 or @ShellyXueHan can this be closed or branch deleted?

@Pilargit12 Pilargit12 closed this Sep 20, 2024

3 participants