sysdig basic monitoring guidline documentation #86
This doc has many grammar issues; please proofread before opening a PR next time. I can also see that you are taking content from other resources. In that case, it's better to reference the original document than to duplicate it. The PromQL samples are too basic. For example, the resource usage queries only reflect the overall namespace status, whereas useful metrics would be grouped by workload or service component. It would also be better to provide a full query for the network request monitors instead of a single metric. All in all, I think this needs some rework. Let's chat about it next week before you put in more effort.
@@ -0,0 +1,118 @@
---
title: Sysdig Monitoring Guildline for shared services and apps |
title: Sysdig Monitoring Guildline for shared services and apps | |
title: Sysdig Monitoring Guideline for Platform Shared Services |
slug: sysdig-monitor-setup
description: Default monitoring standards that will be applied to all our services and apps. |
description: Default monitoring standards that will be applied to all our services and apps. | |
description: Service Golden Signal - monitoring standards and best practise that will be applied to Platform Shared Services. |
sort_order: 2
---
The four golden signals of SRE are latency, traffic, errors, and saturation. SRE’s golden signals define what it means for the system to be “healthy.” And our monitoring standard will be build based on those four aspect. |
The four golden signals of SRE are latency, traffic, errors, and saturation. SRE’s golden signals define what it means for the system to be “healthy.” And our monitoring standard will be build based on those four aspect. | |
The four golden signals of Site Reliability Engineering (SRE) are latency, traffic, errors, and saturation. SRE’s golden signals define what it means for the system to be “healthy”. The following monitoring standard will be built based on those four aspects. |
# Using PromQL
The Prometheus Query Language (PromQL) is the defacto standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. And building dashboard in sysdig is havily rely on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/). |
The Prometheus Query Language (PromQL) is the defacto standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. And building dashboard in sysdig is havily rely on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/). | |
The Prometheus Query Language (PromQL) is the defacto standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. And building dashboard in Sysdig is heavily relying on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/). |
# Resources monitoring with Sysdig(Saturation) |
# Resources monitoring with Sysdig(Saturation) | |
## Resources monitoring with Sysdig (Saturation) |
#### Registry
The registry is an application where teams can submit requests for provisioning namespaces in OpenShift 4 (OCP4) clusters. The registry allows teams to: |
Let's call it BC Platform Services Product Registry for more accurate naming ;)
maybe helpful to describe the project structure: API, web frontend, DB, automation provisioner, etc...
- Request that their project namespace be created in additional clusters;
- Request other resources such as KeyCloak realms or Artifactory pull-through repositories; and
- Receive management from the platform services team.
Does the registry actually do items 2 and 3 (with KeyCloak realms)? Also, what does "Receive management" mean?
More details about the Registry app and its workings can be found [here](https://github.com/bcgov/platform-services-registry/blob/master/docs/Whole-project-workflow.md). |
the link is invalid
* Retrieving all product information from DB should be less than 8 sec.
* Retrieving 30 product information from DB should be less than 2 sec.
* Web, API, and DB should be up 99.5% of the time
* DB should have a backup every 30 mins
Can you explain why a backup every 30 mins? For example, which SLA does this map to?
#### SLO
The definition of s SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. Google has a really [good doc](https://sre.google/workbook/implementing-slos/#:~:text=For%20example%2C%20if%20you%20have,50%25%20of%20the%20error%20budget.) for how to gets the start. |
The definition of s SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. Google has a really [good doc](https://sre.google/workbook/implementing-slos/#:~:text=For%20example%2C%20if%20you%20have,50%25%20of%20the%20error%20budget.) for how to gets the start. | |
The definition of SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. Google has a really [good doc](https://sre.google/workbook/implementing-slos/#:~:text=For%20example%2C%20if%20you%20have,50%25%20of%20the%20error%20budget.) for how to gets the start. |
Based on the Google link, it's important to know how to calculate the error budget if people are to use the error budget-based approach for reliability. Can you provide more details on that?
* Retrieving 30 product information from DB should be less than 2 sec.
* Web, API, and DB should be up 99.5% of the time
* DB should have a backup every 30 mins
* Provisioner jobs can be completed within 40 mins
App teams won't understand that provisioner jobs correspond to "Update product requests" in the SLA if there is no explanation of how things work. It might be better to name them automation jobs for OCP Project Set change requests.
The registry has the following monitoring standards built based on those four aspects.

* Number of successful HTTP requests / total HTTP requests (success rate) should be greater than 99.99%
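The success-rate SLI in the bullet above can be sketched as follows. This is a minimal illustration, not the Registry's actual implementation: the function names `success_rate` and `meets_target`, the sample request counts, and the choice to treat an empty window as healthy are all assumptions made for the example.

```python
def success_rate(successful: int, total: int) -> float:
    """Return the fraction of requests that succeeded in a window.

    Assumption: a window with no traffic counts as fully healthy (1.0).
    """
    if successful < 0 or total < 0 or successful > total:
        raise ValueError("counts must satisfy 0 <= successful <= total")
    if total == 0:
        return 1.0
    return successful / total


def meets_target(successful: int, total: int, target: float = 0.9999) -> bool:
    """Check the SLI against a target success rate (99.99% by default)."""
    return success_rate(successful, total) >= target


if __name__ == "__main__":
    # Hypothetical counts: 5 failures out of 100,000 requests.
    print(meets_target(99_995, 100_000))
    # Hypothetical counts: 20 failures out of 100,000 requests.
    print(meets_target(99_980, 100_000))
```

At a 99.99% target, only 10 failed requests are allowed per 100,000, so the threshold should be chosen with the service's real traffic volume in mind.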
How do these SLIs relate to the SLO and SLA defined earlier? For example, why 99.99%? Does it correspond to "Web, API, and DB should be up 99.5% of the time"?
The Prometheus Query Language (PromQL) is the standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. And building a dashboard in Sysdig is heavily reliant on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/).

### Team Scope
why is this section named as team scope?
@@ -0,0 +1,322 @@
---
title: SRE Guideline for Platform Shared Services |
In the PR, the .md file name still reads Sysdig-Monitoring-Guidline-for-shared-services-and-apps. Please make sure to update it correctly; I believe @ShellyXueHan made a suggestion.
The website slug should also be all lower case: sysdig-monitor-setup
sort_order: 2
---
Last updated: Month, day, year
Alignment with the guidelines for publishing new pages, plus corrections for active voice and simple language. Missing: related pages content; please review.
The Uptime link wasn't working; I corrected it.
Looks good to me, I wonder if other team members will have time to review this new doc.
@ShellyXueHan may I merge this? It's been hanging for a long while.
src/docs/app-monitoring/sysdig-monitoring-guideline-for-shared-services-and-apps.md
Once we establish an SLA that we know will keep our users satisfied, the SLO becomes the minimum commitment we make. As a result, it's in our best interest to identify and address any issues before they breach our SLA, allowing us time to fix them. Breaking this commitment often carries consequences.

Once again, I'll use the Registry as an example, and we'll consider monthly periods:
Can you explain what a monthly period is? It should be a continuous 30-day window rather than a calendar month.
## Service Level Indicators (SLIs)
SLI golden signals consist of request latency, availability, error rate, and system throughput. These metrics establish the criteria for determining the system's "health." It's essential to grasp the connection and difference between SLIs, SLOs, and SLAs. |
SLI golden signals consist of request latency, availability, error rate, and system throughput. These metrics establish the criteria for determining the system's "health." It's essential to grasp the connection and difference between SLIs, SLOs, and SLAs. | |
SLI golden signals consist of request latency, availability, error rate, and system throughput. These metrics establish the criteria for determining the system's "health". It's essential to grasp the connection and difference between SLIs, SLOs, and SLAs. |
so what's the difference and connection between SLI, SLO and SLA?
And so on.
**Performance Monitoring Tools:** You can use tools like Prometheus, Grafana, or New Relic to continuously watch and display system performance metrics, including response times |
Could you add the relevant doc links so people know how to get started?
## Runbook
To achieve 99.5% uptime, we have a daily allowance of just 7 minutes and 12 seconds for downtime. This is where RunBooks prove invaluable. In SRE, the objective is to automate as many processes as feasible. In the realm of cloud operations, Runbooks consist of a series of steps carried out by SREs to accomplish specific tasks. These tasks can encompass incident responses, cost management, addressing performance challenges, and more. |
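The 7 minutes 12 seconds figure above follows from taking 0.5% of a 24-hour day. A minimal sketch of that arithmetic is below; the helper name `downtime_allowance` is hypothetical and only illustrates the calculation.

```python
from datetime import timedelta


def downtime_allowance(availability: float, window: timedelta) -> timedelta:
    """Return the downtime permitted in `window` at the given availability.

    `availability` is a fraction, e.g. 0.995 for 99.5% uptime.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return window * (1.0 - availability)


if __name__ == "__main__":
    # 99.5% uptime over one day, and over a continuous 30-day window.
    print(downtime_allowance(0.995, timedelta(days=1)))
    print(downtime_allowance(0.995, timedelta(days=30)))
```

Over one day the allowance is 432 seconds (7 minutes 12 seconds); over a continuous 30-day window the same target allows 3 hours 36 minutes in total.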
could you include a brief explanation on runbook, like an automated script that ....
- Scale up and terminate the old pod.
- Notify the development team to assess the network status.
For more insights on automation runbooks, refer to this [source in Xenonstack](https://www.xenonstack.com/insights/automation-runbook-for-sre). We are also introducing [runwhen](https://www.runwhen.com/) on our platform to aid in this automation process. |
we don't use runwhen ourselves, but you could say here's a cool tool for that ;)
```
Error Budget = 1 − 0.95 = 0.05 or 5%
```
This means that the service can be "unreliable" or "down" for 5% of the time without violating the SLO.
also need to set the length of the calculation time window
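To make the time-window point concrete, here is a minimal sketch that turns the 5% error budget into a duration and tracks how much of it remains. The function names `error_budget` and `budget_remaining`, the 30-day window, and the 18 hours of downtime are assumptions chosen for illustration, not values taken from the doc.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the total unreliability allowed in `window` for an SLO fraction."""
    if not 0.0 <= slo < 1.0:
        raise ValueError("slo must be in [0, 1); a 100% SLO leaves no budget")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> float:
    """Return the fraction of the error budget still unspent.

    A negative result means the SLO has been violated for this window.
    """
    return 1.0 - downtime / error_budget(slo, window)


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed continuous 30-day window
    print(error_budget(0.95, window))
    # Hypothetical: 18 hours of downtime so far in the window.
    print(budget_remaining(0.95, window, timedelta(hours=18)))
```

With a 95% SLO and a 30-day window the budget is 36 hours, so 18 hours of downtime leaves about half of it. The same SLO over a shorter window yields a proportionally smaller budget, which is why the window length has to be stated alongside the percentage.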
In this document, we've covered several essential aspects of Site Reliability Engineering (SRE). From delving into monitoring in detail to implementing tools like Uptime.com and creating useful Runbooks, our focus is on ensuring the continuous and smooth operation of our systems.

**Quick recap**:
1. **Monitoring**: It's not just about observing; it's our initial defense against issues. The sooner we notice something, the faster we can address it.
and the less impact it causes!
…-services-and-apps.md Co-authored-by: Shelly Han <[email protected]>
@w8896699 what happened to this branch? Do we need to keep it open?
I am cleaning up tech doc branches that were never reviewed again or merged (maybe this is too late). @w8896699 or @ShellyXueHan, can this be closed or the branch deleted?