sysdig basic monitoring guidline documentation #86

Closed

w8896699 wants to merge 19 commits into main from sysdig-monitoring-docu

Contributor

w8896699 commented Jul 20, 2022

No description provided.

w8896699 added 3 commits

July 20, 2022 15:15


          documentation check point at Saturation

d15e2ac


          merge conflict

3e6e63c


          sysdig basic monitoring guidline

b693d9d

w8896699 changed the title ~~documentation check point at Saturation~~ sysdig basic monitoring guidline documentation

w8896699 requested a review from ShellyXueHan

July 21, 2022 23:40

w8896699 marked this pull request as ready for review

July 21, 2022 23:40

ShellyXueHan requested changes

Contributor

ShellyXueHan left a comment

There are many grammar issues in this doc; please proofread before opening a PR next time. I can also see you are taking content from other resources. In that case it's better to reference the original document instead of duplicating it. The PromQL samples are too basic: for example, the resource usage ones only reflect overall namespace status, whereas useful metrics would be grouped by workload or service component. It would also be better to provide a full query for the network-request monitors instead of a single metric. All in all, I think this needs some rework. Let's chat about it next week before you put more effort in.

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated

		@@ -0,0 +1,118 @@
		---
		title: Sysdig Monitoring Guildline for shared services and apps

Contributor

ShellyXueHan Jul 22, 2022

Suggested change

      
            title: Sysdig Monitoring Guildline for shared services and apps
          
            title: Sysdig Monitoring Guideline for Platform Shared Services

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated


		slug: sysdig-monitor-setup

		description: Default monitoring standards that will be applied to all our services and apps.

Contributor

ShellyXueHan Jul 22, 2022

Suggested change

      
            description: Default monitoring standards that will be applied to all our services and apps.
          
            description: Service Golden Signal - monitoring standards and best practise that will be applied to Platform Shared Services.

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated

+              sort_order: 2
+              ---
+              The four golden signals of SRE are latency, traffic, errors, and saturation. SRE’s golden signals define what it means for the system to be “healthy.”  And our monitoring standard will be build based on those four aspect.

Contributor

ShellyXueHan Jul 22, 2022

Suggested change

      
            The four golden signals of SRE are latency, traffic, errors, and saturation. SRE’s golden signals define what it means for the system to be “healthy.”  And our monitoring standard will be build based on those four aspect.
          
            The four golden signals of Site Reliability Engineering (SRE) are latency, traffic, errors, and saturation. SRE’s golden signals define what it means for the system to be “healthy”. The following monitoring standard will be built based on those four aspects.

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated


		# Using PromQL

		The Prometheus Query Language (PromQL) is the defacto standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. And building dashboard in sysdig is havily rely on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/).

Contributor

ShellyXueHan Jul 22, 2022

Suggested change

      
            The Prometheus Query Language (PromQL) is the defacto standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. And building dashboard in sysdig is havily rely on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/).
          
            The Prometheus Query Language (PromQL) is the defacto standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. And building dashboard in Sysdig is heavily relying on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/).

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated


		The Prometheus Query Language (PromQL) is the defacto standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. And building dashboard in sysdig is havily rely on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/).

		# Resources monitoring with Sysdig(Saturation)

Contributor

ShellyXueHan Jul 22, 2022

Suggested change

      
            # Resources monitoring with Sysdig(Saturation)
          
            ## Resources monitoring with Sysdig (Saturation)

w8896699 added 7 commits

July 29, 2022 12:37


          doc update

69face5


          app name parameter

f342be3


          checkpoint

4c8e9b8


          checkpint


          checkpint

632d8d0


          checkpint

7ad5b8d


          grama fix

bf8d075

w8896699 requested a review from ShellyXueHan

October 26, 2022 01:44

ShellyXueHan requested a review from ksummersill2

October 26, 2022 16:48

w8896699 added 2 commits

December 9, 2022 13:57


          add dashboard documentation

720fe05


          doc update

a24fbaa

w8896699 requested a review from caggles

August 28, 2023 17:27

ShellyXueHan reviewed

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated



		#### Registry
		The registry is an application where teams can submit requests for provisioning namespaces in OpenShift 4 (OCP4) clusters. The registry allows teams to:

Contributor

ShellyXueHan Aug 28, 2023

let's call it BC Platform Services Product Registry for more accurate naming ;)

Contributor

ShellyXueHan Aug 28, 2023

maybe helpful to describe the project structure: API, web frontend, DB, automation provisioner, etc...

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated

Comment on lines 32 to 34

+              - Request that their project namespace be created in additional clusters;
+              - Request other resources such as KeyCloak realms or Artifactory pull-through repositories; and
+              - Receive management from the platform services team.

Contributor

ShellyXueHan Aug 28, 2023

Does the registry actually do numbers 2 and 3 (with KeyCloak realms)? Also, what does "Receive management" mean?

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated

+              - Request other resources such as KeyCloak realms or Artifactory pull-through repositories; and
+              - Receive management from the platform services team.
+              More details about the Registry app and its workings can be found [here](https://github.com/bcgov/platform-services-registry/blob/master/docs/Whole-project-workflow.md).

Contributor

ShellyXueHan Aug 28, 2023

the link is invalid

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated

+              * Retrieving all product information from DB should be less than 8 sec.
+              * Retrieving 30 product information from DB should be less than 2 sec.
+              * Web, API, and DB should be up 99.5% of the time
+              * DB should have a backup every 30 mins

Contributor

ShellyXueHan Aug 28, 2023

can you explain why a backup every 30 mins, for example which SLA does this match to?

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated



		#### SLO
		The definition of s SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. Google has a really [good doc](https://sre.google/workbook/implementing-slos/#:~:text=For%20example%2C%20if%20you%20have,50%25%20of%20the%20error%20budget.) for how to gets the start.

Contributor

ShellyXueHan Aug 28, 2023

Suggested change

      
            The definition of s SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. Google has a really [good doc](https://sre.google/workbook/implementing-slos/#:~:text=For%20example%2C%20if%20you%20have,50%25%20of%20the%20error%20budget.) for how to gets the start.
          
            The definition of SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. Google has a really [good doc](https://sre.google/workbook/implementing-slos/#:~:text=For%20example%2C%20if%20you%20have,50%25%20of%20the%20error%20budget.) for how to gets the start.

Contributor

ShellyXueHan Aug 28, 2023

based on the google link, it's important to know how to calculate error budget if ppl were to use the error budget–based approach for reliability, can you provide more details on that end?

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated

+              * Retrieving 30 product information from DB should be less than 2 sec.
+              * Web, API, and DB should be up 99.5% of the time
+              * DB should have a backup every 30 mins
+              * Provisioner jobs can be completed within 40 mins

Contributor

ShellyXueHan Aug 28, 2023

App teams won't understand that provisioner jobs correspond to "Update product requests" in the SLA if there is no explanation of how things work. It might be better to name it "automation jobs for OCP Project Set change requests".

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated


		The registry has the following monitoring standards built based on those four aspects.

		* Number of successful HTTP requests / total HTTP requests (success rate) should be greater than 99.99%

Contributor

ShellyXueHan Aug 28, 2023

How do these SLIs relate to the SLO and SLA defined earlier? For example, why 99.99%? Does it correspond to "Web, API, and DB should be up 99.5% of the time"?
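
To make the question concrete: a request success rate and an uptime percentage measure different things, so the 99.99% and 99.5% figures are not directly comparable. Below is a minimal Python sketch of the success-rate SLI as worded in the doc (successful HTTP requests divided by total HTTP requests). The function names, the sample counts, and the treatment of zero traffic are assumptions for illustration and are not part of the PR.

```python
def success_rate(successful: int, total: int) -> float:
    """Return successful / total HTTP requests as a fraction between 0 and 1."""
    if successful < 0 or total < 0 or successful > total:
        raise ValueError("counts must satisfy 0 <= successful <= total")
    if total == 0:
        # Assumption: a period with no traffic is treated as fully successful.
        return 1.0
    return successful / total


def meets_target(successful: int, total: int, target: float = 0.9999) -> bool:
    """Check the measured success rate against the target (99.99% in the doc)."""
    return success_rate(successful, total) >= target


# Made-up sample counts: 5 failures in 100,000 requests.
print(f"{success_rate(99_995, 100_000):.4%}")
print(meets_target(99_995, 100_000))
```

An uptime SLO would instead be computed from time spent unavailable, which is why the doc needs to state how the two targets connect.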

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated


		The Prometheus Query Language (PromQL) is the standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data. And building a dashboard in Sysdig is heavily reliant on PromQL. The PromQL language is documented at [Prometheus Query Basics](https://prometheus.io/docs/prometheus/latest/querying/basics/).

		### Team Scope

Contributor

ShellyXueHan Aug 28, 2023

why is this section named as team scope?

w8896699 added 3 commits

August 29, 2023 17:02


          documentation update

01c5a09


          documentation update

d43508a


          documentation update

2a19642

w8896699 requested review from ShellyXueHan and removed request for ksummersill2

August 30, 2023 00:10

w8896699 requested a review from Pilargit12

September 6, 2023 16:30

Contributor Author

w8896699 commented Sep 25, 2023

Pilargit12 reviewed

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated

		@@ -0,0 +1,322 @@
		---
		title: SRE Guideline for Platform Shared Services

Contributor

Pilargit12 Sep 26, 2023

In the PR, the .md file name still reads Sysdig-Monitoring-Guidline-for-shared-services-and-apps; please make sure to update it correctly. I believe @ShellyXueHan made a suggestion.

Slug of the website should also be all lower case: sysdig-monitor-setup

Pilargit12 reviewed

src/docs/app-monitoring/Sysdig-Monitoring-Guidline-for-shared-services-and-apps.md Outdated


		sort_order: 2
		---

Contributor

Pilargit12 Sep 26, 2023

Last updated: Month, day, year


          Content revision

0035f20

Alignment with publishing new pages, corrections for active voice and simple language guidelines.

Missing: Related pages content please review

Pilargit12 and others added 2 commits

September 26, 2023 11:55


          small change to on this page links

3e10d2d

Uptime link wasn't working, corrected this.


          add related pages

139b242

Pilargit12 self-requested a review

September 26, 2023 21:57

Pilargit12 approved these changes

Contributor

Pilargit12 left a comment

Looks good to me, I wonder if other team members will have time to review this new doc.

Contributor Author

w8896699 commented Jan 29, 2024

@ShellyXueHan may I merge this? been hanging for a long while

ShellyXueHan reviewed

src/docs/app-monitoring/sysdig-monitoring-guideline-for-shared-services-and-apps.md Outdated

src/docs/app-monitoring/sysdig-monitoring-guideline-for-shared-services-and-apps.md


		Once we establish an SLA that we know will keep our users satisfied, the SLO becomes the minimum commitment we make. As a result, it's in our best interest to identify and address any issues before they breach our SLA, allowing us time to fix them. Breaking this commitment often carries consequences.

		Once again, I'll use the Registry as an example, and we'll consider monthly periods:

Contributor

ShellyXueHan Jan 30, 2024

can you explain what a monthly period is? it should be a continuous 30 day window instead of calendar months.
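
A continuous 30-day window can be sketched as follows: only downtime that overlaps the trailing 30 days before the evaluation time counts, regardless of calendar month boundaries. The function name and the outage tuples are hypothetical, and the dates are made-up sample data.

```python
from datetime import datetime, timedelta


def downtime_in_window(outages, now, window=timedelta(days=30)):
    """Sum the outage time that falls inside the trailing `window` ending at `now`.

    `outages` is an iterable of (begin, end) datetime pairs.
    """
    window_start = now - window
    total = timedelta()
    for begin, end in outages:
        # Clip each outage to the window before counting it.
        overlap_start = max(begin, window_start)
        overlap_end = min(end, now)
        if overlap_end > overlap_start:
            total += overlap_end - overlap_start
    return total


now = datetime(2024, 1, 30, 12, 0)
outages = [
    # Started before the window opened; only the last hour is counted.
    (datetime(2023, 12, 31, 11, 0), datetime(2023, 12, 31, 13, 0)),
    # Fully inside the window.
    (datetime(2024, 1, 15, 0, 0), datetime(2024, 1, 15, 0, 30)),
]
print(downtime_in_window(outages, now))  # 1:30:00
```

With calendar months, the December outage would be dropped entirely on January 1; the rolling window keeps counting it until 30 days have passed.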

ShellyXueHan reviewed

src/docs/app-monitoring/sysdig-monitoring-guideline-for-shared-services-and-apps.md


		## Service Level Indicators (SLIs)

		SLI golden signals consist of request latency, availability, error rate, and system throughput. These metrics establish the criteria for determining the system's "health." It's essential to grasp the connection and difference between SLIs, SLOs, and SLAs.

Contributor

ShellyXueHan Jan 30, 2024

Suggested change

      
            SLI golden signals consist of request latency, availability, error rate, and system throughput. These metrics establish the criteria for determining the system's "health." It's essential to grasp the connection and difference between SLIs, SLOs, and SLAs.
          
            SLI golden signals consist of request latency, availability, error rate, and system throughput. These metrics establish the criteria for determining the system's "health". It's essential to grasp the connection and difference between SLIs, SLOs, and SLAs.

Contributor

ShellyXueHan Jan 30, 2024

so what's the difference and connection between SLI, SLO and SLA?

src/docs/app-monitoring/sysdig-monitoring-guideline-for-shared-services-and-apps.md


		And so on.

		Performance Monitoring Tools: You can use tools like Prometheus, Grafana, or New Relic to continuously watch and display system performance metrics, including response times

Contributor

ShellyXueHan Jan 30, 2024

could you add the relevant doc links so ppl know how to get started?

src/docs/app-monitoring/sysdig-monitoring-guideline-for-shared-services-and-apps.md



		## Runbook
		To achieve 99.5% uptime, we have a daily allowance of just 7 minutes and 12 seconds for downtime. This is where RunBooks prove invaluable. In SRE, the objective is to automate as many processes as feasible. In the realm of cloud operations, Runbooks consist of a series of steps carried out by SREs to accomplish specific tasks. These tasks can encompass incident responses, cost management, addressing performance challenges, and more.

Contributor

ShellyXueHan Jan 30, 2024

could you include a brief explanation on runbook, like an automated script that ....
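
The 7 minutes 12 seconds quoted in the Runbook section is 0.5% of a 24-hour day. The short Python sketch below reproduces that arithmetic; `downtime_allowance` is a hypothetical helper name, and the 30-day figure is shown only for comparison.

```python
from datetime import timedelta


def downtime_allowance(slo: float, window: timedelta) -> timedelta:
    """Return the downtime permitted within `window` for an availability target `slo` (0..1)."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be between 0 and 1")
    # Round to whole seconds to avoid floating-point noise in the result.
    return timedelta(seconds=round(window.total_seconds() * (1.0 - slo)))


print(downtime_allowance(0.995, timedelta(days=1)))   # 0:07:12
print(downtime_allowance(0.995, timedelta(days=30)))  # 3:36:00
```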

src/docs/app-monitoring/sysdig-monitoring-guideline-for-shared-services-and-apps.md

+                - Scale up and terminate the old pod.
+                - Notify the development team to assess the network status.
+              For more insights on automation runbooks, refer to this [source in Xenonstack](https://www.xenonstack.com/insights/automation-runbook-for-sre). We are also introducing [runwhen](https://www.runwhen.com/) on our platform to aid in this automation process.

Contributor

ShellyXueHan Jan 30, 2024

we don't use runwhen ourselves, but you could say here's a cool tool for that ;)

src/docs/app-monitoring/sysdig-monitoring-guideline-for-shared-services-and-apps.md

+              ```
+              Error Budget=1−0.95=0.05 or 5%
+              ```
+              This means that the service can be "unreliable" or "down" for 5% of the time without violating the SLO.

Contributor

ShellyXueHan Jan 30, 2024

also need to set the length of the calculation time window
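
One way to attach a time window to the error budget is sketched below in Python. It uses the 95% SLO from the hunk above; the 30-day window, the 18 hours of downtime, and the function names are assumptions chosen for illustration.

```python
from datetime import timedelta


def error_budget(slo: float) -> float:
    """Error budget as a fraction: 1 - SLO."""
    if not 0.0 <= slo < 1.0:
        raise ValueError("slo must be in [0, 1); a 100% SLO leaves no budget")
    return 1.0 - slo


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> float:
    """Fraction of the error budget still unspent in `window` (negative once the SLO is breached)."""
    allowed_seconds = window.total_seconds() * error_budget(slo)
    return 1.0 - downtime.total_seconds() / allowed_seconds


window = timedelta(days=30)  # assumed window length
print(f"{error_budget(0.95):.2%}")                                      # 5.00%
print(f"{budget_remaining(0.95, window, timedelta(hours=18)):.0%}")     # 50%
```

With a 95% SLO over 30 days the budget is 36 hours of downtime, so 18 hours of downtime spends half of it.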

src/docs/app-monitoring/sysdig-monitoring-guideline-for-shared-services-and-apps.md

+              In this document, we've covered several essential aspects of Site Reliability Engineering (SRE). From delving into monitoring in detail to implementing tools like Uptime.com and creating useful Runbooks, our focus is on ensuring the continuous and smooth operation of our systems.
+              **Quick recap**:
+. **Monitoring**: It's not just about observing; it's our initial defense against issues. The sooner we notice something, the faster we can address it.

Contributor

ShellyXueHan Jan 30, 2024

and the less impact it causes!

w8896699 mentioned this pull request

Update SLA Documentation for Host Tier bcgov/developer-experience#4649

Closed


          Update src/docs/app-monitoring/sysdig-monitoring-guideline-for-shared…

680e682

…-services-and-apps.md

Co-authored-by: Shelly Han <[email protected]>

Contributor

Pilargit12 commented Aug 2, 2024

@w8896699 what happened to this branch ? Do we need to keep it open?

Pilargit12 assigned w8896699

Contributor

Pilargit12 commented Sep 20, 2024 (edited)

I am cleaning up tech doc branches that were never reviewed again or merged (maybe this is too late). @w8896699 or @ShellyXueHan, can this be closed or the branch deleted?

Pilargit12 closed this
