title | slug | description | keywords | page_purpose | audience | author | content_owner | sort_order |
---|---|---|---|---|---|---|---|---|
SRE Guideline for Platform Shared Services | sysdig-monitor-setup | Service Golden Signal - monitoring standards and best practices that will be applied to Platform Shared Services. | SRE, Sysdig, Sysdig monitor, SLI, monitoring, OpenShift monitoring, developer guide, team guide, team, configure | Documented monitoring metrics and monitoring tools that will describe the approach to monitoring and alerting that will be applied to all platform-shared services and apps. | developer, technical lead | Billy Li | Billy Li | 2 |
Last updated: September 26, 2023
SRE, or Site Reliability Engineering, plays a crucial role in making sure an application runs smoothly by rapidly restoring the system to its normal state. We can use software to see exactly how healthy an application or system is and fix any problems before they affect stakeholders.
In this document, we'll look at the fundamental idea behind SRE and show you how to use it with the Registry application as an example.
- B.C. Platform Services Product Registry
- Setting up SRE
- Service Level Agreement (SLA)
- Service Level Objective (SLO)
- Service Level Indicators (SLIs)
- Resources monitoring with Sysdig (Saturation)
- Using PromQL
- CPU
- Latency
- Traffic monitoring
- Errors
- Importance of Monitoring and Alerting in SRE
- Sysdig dashboard
- Uptime.com
- Runbook
- How to calculate error budget
- Conclusion
- Related pages
The registry is an application that lets teams ask for namespaces in OpenShift 4 (OCP4) clusters. Here's what you can do with the registry:
- Allowing teams to request the creation of new project namespaces in specific clusters
- Enabling teams to update project contact information, manage resource quotas, and handle other metadata
- Facilitating the request for access to various resources, including ACS, Vault, and Artifactory repositories
- Allowing both the platform services team and AG to manage and supervise project sets.
The technology stack for the registry comprises a React front-end, a Node.js backend, a MongoDB database, and an automation tool named "Provisioner."
SRE involves deploying, configuring, and monitoring the app. It also includes ensuring services in production are available, managing latency, handling changes, responding to emergencies, and managing capacity of services in production. To ensure optimal performance and reliability, we employ various tools and methodologies that adhere to SRE principles.
The customer should always be at the center of every aspect of your customer agreement. Even though an incident might involve addressing ten different issues on the back end, from the client's perspective, what truly matters is that the system operates as expected. Your SLAs and SLOs should reflect this reality. It's important to confine your commitments to high-level, user-facing functions and always use straightforward language in SLAs.
Based on these principles, we can establish some SLAs for the Registry:
- Normal users should be able to load the Registry dashboard successfully within 5 seconds.
- Admin users should be able to load the Registry dashboard successfully within 13 seconds.
- Approved product requests should be provisioned within an hour.
- The application should be available online 99% of the time.
- Updates to product requests should be processed within an hour.
An SLO is a specified target value or a range of values for a service level that gets measured through an SLI. Google offers a valuable workbook, written by Steven Thurgood and David Ferguson with Alex Hidalgo and Betsy Beyer, to assist you in implementing SLOs. Towards the end of this document, we will also delve into calculating the Error Budget in more detail.
Determining what you want to assure your customers is all about deciding how dependable you want your service to be based on your customers' expectations. For instance, if your SLA specifies that customers should receive a response to their requests within 300 milliseconds, your SLO might set a goal for response times to be within 200 milliseconds. Choosing the right SLO can be a challenge.
Once we establish an SLA that we know will keep our users satisfied, the SLO becomes the minimum commitment we make. As a result, it's in our best interest to identify and address any issues before they breach our SLA, allowing us time to fix them. Breaking this commitment often carries consequences.
Once again, I'll use the Registry as an example, and we'll consider monthly periods:
- The retrieval of all product information on the dashboard should take less than 5 seconds
- Retrieving information for 30 products on the dashboard should take less than 2 seconds
- Web, API, and DB services should be operational 99.5% of the time
- The database should have a backup created every 30 minutes
- Automation jobs for OCP Project Set change requests should be completed within 40 minutes
In case we need to disrupt these objectives or schedule maintenance windows, we must communicate the reasons for doing so in the #internal-devops-registry channel.
The frequency of backups, such as every 30 minutes, directly links to a system's Recovery Point Objective (RPO), which serves as a critical metric in disaster recovery and business continuity planning. RPO establishes the maximum allowable data loss, measured in time. It addresses the question: "How much data can we lose before it starts affecting our business operations?" The reason for configuring it at a 30-minute interval is the unique nature of the Registry app: the Registry typically "backs up" most of its crucial data in GitHub repositories, and the provisioner usually processes each request in less than 30 minutes.
Why is this Important? Choosing the right RPO is vital for business continuity planning. Here's why:
- With a 30-minute RPO: In case of data loss, the team can recover data up to the moment of the last backup, which is a maximum of 30 minutes before the incident.
- Without a defined RPO (or with infrequent backups): Without a clear RPO or with backups done infrequently, there's a substantial risk of losing a significant amount of data. This loss could lead to severe consequences for the business, ranging from financial setbacks to damage to its reputation.
Having a backup created every 30 minutes corresponds to an RPO of 30 minutes. This is a good fit for systems where a maximum of 30 minutes of data loss can be tolerated in the event of a disaster. The decision about the appropriate RPO (and, by extension, backup frequency) should be grounded in business requirements and the potential consequences of data loss.
SLI golden signals consist of request latency, availability, error rate, and system throughput. These metrics establish the criteria for determining the system's "health." It's essential to grasp the connection and difference between SLIs, SLOs, and SLAs.
There are multiple ways to measure an SLI, and each method has its own strengths and limitations. These methods are often called SLI Implementations. To illustrate, let's take the example of page-loading time as an SLI. This SLI can be put into practice using various approaches, including:
- Utilizing the latency field within the request log of the application server
- Making use of metrics that the application server directly provides
- Extracting metrics from a load balancer positioned in front of the application servers
- Implementing a black-box monitoring service that assesses how long it takes for our system to respond
- Incorporating code in the user's web browser to report how quickly the page loaded for them.
The registry has the following monitoring standards built based on those four aspects.
1. Data Accuracy:
- SLI: The registry will accurately and promptly retrieve 99.8% of product entries
- Reason: Accurate data is crucial for both users and systems that depend on the registry. Maintaining the highest level of data accuracy minimizes errors and fosters trust in the system
2. Response Time for Data Retrieval:
- SLI: The registry will complete 99.5% of data retrieval requests in less than 500 milliseconds
- Reason: Users expect a quick response when querying the registry. Swift data retrieval ensures user satisfaction and efficient downstream operations
3. Provisioning Success Rate:
- SLI: The registry will successfully process 98% of approved product provisioning requests without errors
- Reason: Users rely on the system to seamlessly provision products. A high success rate ensures the reliability of the registry's provisioning functionality
And so on.
Performance Monitoring Tools: You can use tools like Prometheus, Grafana, or New Relic to continuously watch and display system performance metrics, including response times
Logging: Make sure the application records the time it takes for each data retrieval request. Periodically analyze these logs or use log aggregation tools like the ELK Stack (Elasticsearch, Logstash, Kibana) to gain insights
Threshold Alerts: Create alerts to inform system administrators or engineers when response times exceed the defined threshold
For the Registry, you can utilize Sysdig and Uptime.com to gather SLI metrics and establish alerts.
Saturation provides a broad perspective on how the system is being used. It helps us understand how much more capacity the service can handle and when it's operating at its maximum capacity. Since many systems start deteriorating before reaching 100% utilization, we must also establish a reference point for an "ideal" utilization percentage. What level of saturation guarantees optimal service performance and availability for users?
We keep an eye on resources such as CPU, RAM, and storage to monitor system performance.
The Prometheus Query Language (PromQL) is the standard for querying Prometheus metric data. PromQL is designed to let the user select and aggregate time-series data, and building a dashboard in Sysdig relies heavily on it. The PromQL language is documented at Prometheus Query Basics. To start monitoring with Sysdig, please read this documentation.
Get application CPU usage by using:
avg(avg_over_time(sysdig_container_cpu_cores_used{$__scope, kube_pod_label_app="<YOUR_APP_LABEL_NAME>", kube_statefulset_label_app="<YOUR_APP_LABEL_NAME>"}[$__interval]))
Get application requested CPU by using:
avg(avg_over_time(kube_pod_sysdig_resource_requests_cpu_cores{$__scope, kube_pod_label_app="<YOUR_APP_LABEL_NAME>", kube_statefulset_label_app="<YOUR_APP_LABEL_NAME>"}[$__interval]))
We can learn how many resources the application is using versus how much it requested, and derive resource utilization from that. CPU Used vs Requested (Utilization) is the percentage between sysdig_container_cpu_cores_used and kube_pod_sysdig_resource_requests_cpu_cores.
sum(last_over_time(sysdig_container_cpu_cores_used{kube_cluster_name=~$Cluster,kube_namespace_name=~$Namespace, kube_deployment_label_app = "<YOUR_APP_LABEL_NAME>"}[$__interval])) / (sum(last_over_time(kube_pod_sysdig_resource_requests_cpu_cores{kube_cluster_name=~$Cluster,kube_namespace_name=~$Namespace, kube_deployment_label_app = "<YOUR_APP_LABEL_NAME>"}[$__interval])) ) * 100
CPU Used vs Limited (Threshold) is the percentage between sysdig_container_cpu_cores_used and kube_pod_sysdig_resource_limits_cpu_cores. From it, we can learn how much more CPU is still available to the application.
sum(last_over_time(sysdig_container_cpu_cores_used{kube_cluster_name=~$Cluster,kube_namespace_name=~$Namespace}[$__interval])) / (sum(last_over_time(kube_pod_sysdig_resource_limits_cpu_cores{kube_cluster_name=~$Cluster,kube_namespace_name=~$Namespace}[$__interval])) ) * 100
Regarding Utilization, our goal is to maximize it as much as possible. However, achieving a consistent level of 80% or higher isn't always feasible. We'll make our best efforts to achieve this target while ensuring that the application meets other SLOs.
As for handling namespace limitations, we can establish an alert system. When utilization reaches 80%, it will send a notification via RocketChat, indicating the need to allocate additional resources since it's nearing capacity. We can then either allocate more resources or implement a horizontal auto-scaling approach based on the specific circumstances.
sum(last_over_time(sysdig_container_cpu_cores_used{kube_cluster_name=~"silver",kube_namespace_name=~"platform-registry-prod"}[10s])) / (sum(last_over_time(kube_pod_sysdig_resource_limits_cpu_cores{kube_cluster_name=~"silver",kube_namespace_name=~"platform-registry-prod"}[10s])) ) > 0.8
Similar to CPU, RAM monitoring will also focus on Limitations and Utilization.
Application memory usage can be retrieved with the query:
avg(avg_over_time(sysdig_container_memory_used_bytes{$__scope, kube_pod_label_app="<YOUR_APP_LABEL_NAME>", kube_statefulset_label_app="<YOUR_APP_LABEL_NAME>"}[$__interval]))
Application requested memory can be retrieved with the query:
avg(avg_over_time(kube_pod_sysdig_resource_requests_memory_bytes{$__scope, kube_pod_label_app="<YOUR_APP_LABEL_NAME>", kube_statefulset_label_app="<YOUR_APP_LABEL_NAME>"}[$__interval]))
Utilization is calculated as the ratio of sysdig_container_memory_used_bytes to kube_pod_sysdig_resource_requests_memory_bytes; achieving 80% or above is still desired.
sum(last_over_time(sysdig_container_memory_used_bytes{kube_cluster_name=~$Cluster,kube_namespace_name=~$Namespace, kube_pod_label_app="<YOUR_APP_LABEL_NAME>", kube_statefulset_label_app="<YOUR_APP_LABEL_NAME>"}[$__interval])) / (sum(last_over_time(kube_pod_sysdig_resource_requests_memory_bytes{kube_cluster_name=~$Cluster,kube_namespace_name=~$Namespace, kube_pod_label_app="<YOUR_APP_LABEL_NAME>", kube_statefulset_label_app="<YOUR_APP_LABEL_NAME>"}[$__interval]))) * 100
The threshold for the namespace is the ratio between sysdig_container_memory_used_bytes and kube_pod_sysdig_resource_limits_memory_bytes:
sum(last_over_time(sysdig_container_memory_used_bytes{kube_cluster_name=~$Cluster,kube_namespace_name=~$Namespace}[$__interval])) / (sum(last_over_time(kube_pod_sysdig_resource_limits_memory_bytes{kube_cluster_name=~$Cluster,kube_namespace_name=~$Namespace}[$__interval]))) * 100
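As with CPU, this ratio can back an alert. A minimal sketch, mirroring the CPU alert above and assuming the same silver cluster and platform-registry-prod namespace, fires when memory usage exceeds 80% of the namespace limit:
sum(last_over_time(sysdig_container_memory_used_bytes{kube_cluster_name=~"silver",kube_namespace_name=~"platform-registry-prod"}[10s])) / (sum(last_over_time(kube_pod_sysdig_resource_limits_memory_bytes{kube_cluster_name=~"silver",kube_namespace_name=~"platform-registry-prod"}[10s]))) > 0.8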
Many services regard request latency, which measures the time it takes to provide a response to a request, as a crucial SLI. Other commonly used SLIs include the error rate, often expressed as a fraction of all incoming requests, and system throughput, typically measured in requests per second. These measurements are frequently aggregated. In other words, raw data is collected over a measurement window and then transformed into rates, averages, or percentiles.
Establish a standard for what constitutes "acceptable" latency rates. Then, keep an eye on the latency of successful requests compared to failed requests to assess system health. Monitoring latency across the entire system can assist in pinpointing underperforming services and enables teams to promptly detect incidents. Analyzing latency in error cases can also expedite the identification of incidents, allowing teams to respond swiftly.
Ideally, the SLI should directly measure a service level we're interested in. However, there are instances where we can only use a proxy measurement because obtaining or interpreting the desired metric might be challenging. In our situation, client-side latency is frequently a more user-focused metric, but we may only have the capability to measure latency at the server. Therefore, the amount of time it takes to handle requests is the metric that concerns us, and we can easily obtain this value by retrieving the maximum value of sysdig_connection_net_request_time (which represents the average time to serve a network request). We can then establish an alert based on our SLI and SLO. Typically, the lower bound should be less than or equal to the SLI, which in turn should be less than or equal to the upper bound.
For instance, if we aim to keep response times under 3 seconds as our SLO upper limit, we can set an alert on sysdig_connection_net_request_time for any request that takes longer than 5 seconds to be serviced. This will trigger a notification, prompting us to take action, such as increasing resources or refining algorithms.
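A minimal sketch of such an alert, assuming the registry's production namespace on the silver cluster and that sysdig_connection_net_request_time is reported in seconds (adjust the threshold if your environment reports a different unit):
max(max_over_time(sysdig_connection_net_request_time{kube_cluster_name=~"silver",kube_namespace_name=~"platform-registry-prod"}[$__interval])) > 5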
It’s a good idea to add metrics or metric labels that allow the dashboards to break down served traffic by status code (unless the metrics your service uses for SLI purposes already include this information). Here are some recommendations:
- For HTTP traffic, monitor all response codes, even if they don’t provide enough signal for alerting, because some can be triggered by incorrect client behavior.
- If you apply rate limits or quota limits to your users, monitor aggregates of how many requests were denied due to lack of quota.
Therefore, we will build our monitoring on
- Number of HTTP Requests
Using sysdig_connection_net_request_count to monitor the total number of network requests per second. Note that this value may exceed the sum of inbound and outbound requests because it includes requests over internal connections, so you may want to explicitly select your application's API pod with a label (see the example query after this list).
- Number of Sessions
Using sysdig_connection_net_connection_total_count to monitor the average/maximum number of open sessions for the application
- Transaction Speed
Using sysdig_connection_net_total_bytes to monitor the upload and download speed of the application.
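For instance, a request-volume panel might use a query along these lines — a sketch that assumes your API pods carry the same app label used in the queries above:
sum(avg_over_time(sysdig_connection_net_request_count{kube_cluster_name=~$Cluster,kube_namespace_name=~$Namespace, kube_pod_label_app="<YOUR_APP_LABEL_NAME>"}[$__interval]))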
We must monitor error rates throughout the entire system, including individual services. Whether these errors result from manually defined logic or explicit issues like failed HTTP requests, detecting them early allows us to improve our SLO compliance and minimize application downtime.
The rate of failed requests is very important. The number of errors encountered by network system calls, such as connect(), send(), and recv(), at a specified time can be obtained by using
sum(rate(sysdig_container_net_http_error_count[$__interval]))
When we observe a high number of errors, we should anticipate the possibility of a logic bug or API/DB connection issue. We aim to attain a 99.95% availability rate, so if the error rate reaches 5%, Sysdig will need to issue notifications.
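A sketch of such an alert is shown below; it assumes sysdig_container_net_http_request_count is available as the matching request-count metric (the exact metric names exposed may differ in your Sysdig environment):
sum(rate(sysdig_container_net_http_error_count{kube_cluster_name=~"silver",kube_namespace_name=~"platform-registry-prod"}[$__interval])) / sum(rate(sysdig_container_net_http_request_count{kube_cluster_name=~"silver",kube_namespace_name=~"platform-registry-prod"}[$__interval])) > 0.05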
It's also crucial to establish which errors are critical and which pose fewer risks. This assists teams in evaluating the actual health of the service from a customer's perspective and enables prompt action to address recurring errors.
Site Reliability Engineering (SRE) stresses the importance of preserving and enhancing the dependability, availability, and performance of applications and systems. A pivotal element within this practice is monitoring and alerting.
Monitoring offers a real-time view of a system's health and performance. By constantly observing system metrics and logs, we gain insights into how the system responds to different conditions and workloads. This data is incredibly valuable, not only for spotting issues but also for proactive performance improvement and capacity planning.
Alerting, on the other hand, serves as the mechanism that notifies the relevant parties when something goes wrong. Effective alerting ensures that potential problems are flagged and addressed promptly, often before users even notice. In the world of SRE, quick response times can mean the difference between a minor glitch and a major outage.
Together, monitoring and alerting create a feedback loop that enables teams to maintain high service levels and meet their Service Level Objectives (SLOs). As we delve deeper into tools like "Uptime.com" and concepts like "Runbook," it's crucial to remember the foundational role that monitoring and alerting play in guaranteeing the reliability and resilience of our systems.
One thing I recommend starting with is to leverage what someone has already built. We can always use a dashboard template that has been pre-built by the Sysdig dashboard team. And here is a video demonstrating how to set it up.
In the dashboard library, you can find dashboards tailored for different purposes. If you wish to edit a dashboard, select it, then click on the Copy to My Dashboards button at the top right to make it your own and modify the queries as needed. If you find a particular dashboard useful, you can click on the Star button in the top right corner. This will save it to your favorites, allowing you to access it quickly in the future.
Uptime.com is a comprehensive website monitoring platform. It provides real-time insights into your website's availability, performance, and functionality. By continuously checking websites from multiple locations around the globe, Uptime.com ensures that end-users have the best possible experience.
Key features of Uptime.com include:
- Availability Monitoring: Checks if your site is accessible and notifies you immediately if it detects any downtime.
- Performance Monitoring: Measures site speed, helping identify bottlenecks that could impact user experience.
- Domain Health Check: Monitors SSL certificates, domain expirations, and more to ensure the health and security of your domain.
- Transactional Tests: Simulates user paths and interactions on your site, ensuring critical processes like logins or shopping cart checkouts work flawlessly.
By integrating tools like Uptime.com into the SRE toolkit, teams can proactively address issues, ensuring optimal user experience and meeting established Service Level Objectives (SLOs).
We can set up a transactional test for the registry. This test continually loads the dashboard page to ensure all components are successfully loaded. Uptime.com provides insights into the uptime and downtime of your application and can display the percentage of uptime over a specified period. Additionally, it can send out notifications, enabling us to respond promptly.
To achieve 99.5% uptime, we have a daily allowance of just 7 minutes and 12 seconds for downtime. This is where RunBooks prove invaluable. In SRE, the objective is to automate as many processes as feasible. In the realm of cloud operations, Runbooks consist of a series of steps carried out by SREs to accomplish specific tasks. These tasks can encompass incident responses, cost management, addressing performance challenges, and more.
One of the primary duties of SREs is to respond to incidents. What actions might an SRE carry out during an outage? Here's a typical sequence:
- Trigger
- Troubleshooting
- Root cause analysis
- Fix
Runbooks can include embedded logic, such as if-else statements and loops, as well as other functionalities beyond being a mere list of actions. For example, they can incorporate features like waiting for a resource. Essentially, a Runbook offers instructions on the steps to be taken or the processes for an automated system to follow when an SLO is breached. This ensures that we adhere to our SLA and avoid violations.
For a simple example:
- When there are a few requests to the database that fail or time out:
  - Execute a command in our backup-container to save the current data in our database.
  - Allocate more resources to the primary service.
- When the dashboard load time nears our SLO:
  - Scale up and terminate the old pod.
  - Notify the development team to assess the network status.
For more insights on automation runbooks, refer to this source in Xenonstack. We are also introducing runwhen on our platform to aid in this automation process.
Calculate the error budget:
Error Budget = 1 − SLO
For example, with an SLO of 95%, the error budget would be:
Error Budget = 1 − 0.95 = 0.05, or 5%
This means that the service can be "unreliable" or "down" for 5% of the time without violating the SLO.
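The error budget translates directly into allowed downtime. With the 99.5% uptime objective used for the registry, the budget is 0.5%, which works out to:
0.005 × 24 h × 60 min = 7.2 minutes (7 minutes and 12 seconds) of downtime per day, or roughly 3.6 hours over a 30-day month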
For more information, I highly recommend reading this documentation.
In this document, we've covered several essential aspects of Site Reliability Engineering (SRE). From delving into monitoring in detail to implementing tools like Uptime.com and creating useful Runbooks, our focus is on ensuring the continuous and smooth operation of our systems.
Quick recap:
- Monitoring: It's not just about observing; it's our initial defense against issues. The sooner we notice something, the faster we can address it.
- Alerts: When things go awry, receiving an early alert makes a significant difference. This is where tools like Uptime.com come into play.
- Tools and techniques: Speaking of tools, they're not just fancy additions. They're crucial components that help us complete our tasks effectively.
Moving forward:
- Explore the Tools: If you haven't already, get hands-on experience with the tools we discussed. This will give you a better understanding of how they fit into the bigger picture.
- Share Feedback: If you try something out and it works (or doesn't), please share your experience on Rocketchat. We can work together to figure it out. The more we exchange knowledge, the better we become.
- Stay Informed: SRE is a rapidly evolving field. Keep an eye out for updates, new tools, and innovative techniques.
- Collaborate: Do you have questions or find yourself stuck? Reach out for assistance. We're a team, and often, two minds are better than one.