Monitoring & Service Level Indicators
This course explores the subject of monitoring and service level indicators (SLIs) and how both work together to allow you to measure and track whether stated service level objectives are being met. By the end of this course, you'll have a clear understanding of monitoring and SLIs, and how to apply both of them correctly within your own organization.
If you have any feedback relating to this course, please contact us at email@example.com.
- Understand the difference between service level indicators (SLIs) and service level objectives (SLOs)
- Understand the difference between monitoring and observability
- Learn how monitoring can be applied to your workloads
Anyone interested in learning about SRE and its fundamentals
Software Engineers interested in learning about how to use and apply SRE within an operations environment
DevOps practitioners interested in understanding the role of SRE and how to consider using it within their own organization
To get the most out of this learning path, you should have a basic understanding of DevOps, software development, and the software development lifecycle.
Link to the YouTube video referenced in this course: Implementing SRE practices on Azure: SLI/SLO deep dive - BRK4025
Welcome back. In this course, I'm going to review the subject of Monitoring and Service Level Indicators and how both work together to allow you to measure and track whether stated service level objectives are being met or not. By the end of this course, you'll have a clear understanding of monitoring and SLIs, and how to apply both of them correctly within your own organization.
Before I start reviewing monitoring and service level indicators, take some time to think about what monitoring you already have in place and what you're actually monitoring. For example, do you perform technical monitoring such as CPU utilization and/or disk space monitoring? Maybe your monitoring is less technical and more about what makes users happy.
Again, as an example, maybe the user doesn't care if disk space is running out, but instead they only care that there is a problem preventing them from completing their transaction. With these considerations in mind, how do they translate into SLIs, which, technically speaking, equate to monitoring? And then onto SLOs, which, as we know, focus on the user experience. We need to specify our SLOs and then use the monitoring at our disposal to check our SLIs, to make sure that we are not breaching those particular SLOs.
Let's now begin our journey into SLIs. SLI stands for service level indicator, and SLIs are related to SLOs, service level objectives. SLIs are all about measurement. The quote provided here by plumbr.io, a real-time user monitoring platform, communicates that, "SLIs are the ways for engineers to communicate quantitative data about systems." A common thread running throughout this learning path is that there should be a single source of truth that everyone across the organization can trust.
Monitoring tools provide such a mechanism to collect and aggregate various data points, which are then used to derive and calculate SLI information. SLIs are things that need to be tracked and recorded, so we can check whether our SLOs are being met or not.
Let's now consider an example to solidify our understanding. In an earlier example, we decided that 99.9% of web requests per month should be successful. This was the service level objective. Following along, if there are 1 million web requests in a given month, then up to 1,000 of those can fail. This is the error budget. Therefore, in this example, the service level indicator is web requests, plural. So we need a way to track and record this data.
SLI measurement. While many numbers can function as an SLI, it is generally recommended to treat the SLI as the ratio of two numbers: the number of good events divided by the total number of events. Consider our previous example. This then equates to the number of successful HTTP web requests divided by the total number of HTTP requests. This calculation should be performed continually on current data, which allows us to aggregate it over time and cross-check it against a stated service level objective.
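To make the ratio concrete, here is a minimal Python sketch using the figures from the earlier example (1 million requests in the month, a 99.9% objective). The function names and the 400 failed requests are made up for illustration, and treating a period with no traffic as fully successful is an assumption, not a rule.

```python
def sli_ratio(good_events: int, total_events: int) -> float:
    """SLI as the number of good events divided by the total number of events."""
    if total_events == 0:
        return 1.0  # assumption: no traffic is treated as meeting the objective
    return good_events / total_events


def error_budget(total_events: int, slo_target: float) -> int:
    """How many events may fail in the period before the SLO is breached."""
    return round(total_events * (1 - slo_target))


def budget_remaining(failed_events: int, total_events: int, slo_target: float) -> int:
    """Error budget left after the failures seen so far (negative means breached)."""
    return error_budget(total_events, slo_target) - failed_events


# Figures from the example: 1 million requests, 99.9% objective.
# The 400 failed requests are an illustrative value only.
total = 1_000_000
failed = 400
slo = 0.999

sli = sli_ratio(total - failed, total)
print(f"SLI: {sli:.4%}, SLO met: {sli >= slo}")
print(f"Error budget: {error_budget(total, slo)} requests")
print(f"Budget remaining: {budget_remaining(failed, total, slo)} requests")
```

Keeping the ratio and the budget as separate functions mirrors the split between the indicator (what is measured) and the objective (what it is checked against).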
Now, how do we actually go about collecting and calculating SLI measurements? Many indicator metrics are naturally gathered on the server side. In this scenario, consider using a monitoring system such as Prometheus, or performing periodic log analysis. For example, consider analyzing HTTP 500 error responses as a fraction of all HTTP requests. Some service level indicators may also need client-side data collection. If you're not measuring behavior on the client side, you'll potentially miss a range of problems that affect end users but don't affect server-side metrics.
SLI measurement also needs to be time-bound in some particular way. The time horizon may vary depending on the organization and the SLO. For web requests per month, the time horizon is clear. SLOs such as, let's say, successful bank payments, may require a broader horizon if bank payments are only made once or twice per month. Mapping SLOs to SLIs is simple. We use monitoring tools to measure SLIs constantly, aggregating across suitable time periods.
Then, we use our calculated SLIs to tell us if we're meeting our defined SLOs or not. They also tell us how much of our error budget is left, if any. Take some time to review the following YouTube-hosted video, titled, "Implementing SRE practices on Azure". This presentation was given at the Microsoft Ignite conference. And, in particular, start the video from around the 23-and-a-half-minute mark, where it emphasizes having the customer at the front and center of your SLO and SLI configurations. Let's now move on to monitoring.
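As a rough illustration of periodic log analysis, the following Python sketch computes the share of requests that did not return a server error. It assumes access logs in Common Log Format, where the status code follows the quoted request line, and it counts every 5xx status as a failure rather than only 500. The sample lines and the function name are invented for the example.

```python
import re

# Assumption: Common Log Format, where the status code follows the quoted request.
STATUS_RE = re.compile(r'"\s(\d{3})\b')


def availability_from_logs(lines):
    """Return good requests / total requests, or None if no request was parsed."""
    total = 0
    server_errors = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        total += 1
        if match.group(1).startswith("5"):
            server_errors += 1
    if total == 0:
        return None
    return (total - server_errors) / total


# Invented sample data for illustration.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /cart HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "POST /pay HTTP/1.1" 500 87',
    '10.0.0.3 - - [01/Jan/2024:00:00:03 +0000] "GET /home HTTP/1.1" 200 2048',
    '10.0.0.4 - - [01/Jan/2024:00:00:04 +0000] "GET /missing HTTP/1.1" 404 120',
]

print(availability_from_logs(SAMPLE_LOG))
```

Whether a 404 counts as a good event is a decision for each service. Here it is treated as good because the server handled the request correctly.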
Let's start by providing definitions for important terms often used and spoken of when talking about monitoring. Starting with monitoring itself, system monitoring is the use of a hardware or a software component to monitor the system resources and performance of a computer system. Next up, telemetry.
Telemetry is the highly automated communications process by which measurements are made and other data collected at remote or inaccessible points and transmitted to receiving equipment for monitoring. Wikipedia provides the following related quote, "In IT, we typically have a large number of end points, servers, network interfaces, applications, etc, where we need to collect data and then aggregate that data somewhere. This is why appropriate telemetry is needed."
Application performance management, APM, is the monitoring and management of performance and availability of software applications. APM strives to detect and diagnose application performance problems and to maintain an expected level of service. A takeaway from this is that there is an implicit link between the term level of service and the language used in SRE, e.g. service level objective and service level indicators.
SLOs define the level of service expected and SLIs show the level of service being received. When we lift up the cover on monitoring, we'll find various moving parts that collectively work together. In this conceptual diagram, the core engine is what coordinates monitoring activity and is where all of the monitoring data comes together. There are agents installed across all of the services and infrastructure that are subject to monitoring. The agents gather and pass the data to the core, typically in the form of log files, which may contain information such as data streams and events about what is happening within the service or infrastructure.
A UI, or user interface, provides a visual display of the current health of the overall environment, providing you with the ability to then drill down into various components of the service as required. Anomaly detection is where the rule of what is right and wrong is established. For example, CPU thresholds. Here, you may want your servers to run at less than 70% CPU utilization. If the rule is violated, then an anomaly is detected. An alert is raised when an anomaly is detected and someone, perhaps a person, or something, perhaps a channel, is made aware. A person can then respond accordingly.
Graphing is where we can visually display data points across a longer period of time to show trends and usage. For example, graphing helps us to see busy periods, which may point to potential problems, or quiet periods, where we can scale back resources. The following slide here shows common and popular SLI supporting tools currently available within the SLI marketplace ecosystem. Here, we can see that there are both commercial and open source tools available for monitoring and tracking service level indicators.
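The anomaly rule and the alert step described here can be sketched in a few lines of Python. The 70% threshold comes from the example above. The host names, the sample readings, and the `notify` callback are hypothetical stand-ins for whatever your monitoring tool provides.

```python
CPU_THRESHOLD_PERCENT = 70.0  # rule from the example: run at less than 70% CPU


def detect_anomalies(cpu_by_host, threshold=CPU_THRESHOLD_PERCENT):
    """Return the hosts whose CPU reading violates the rule, in sorted order."""
    return sorted(host for host, cpu in cpu_by_host.items() if cpu >= threshold)


def send_alerts(anomalous_hosts, notify):
    """Make a person or a channel aware of each detected anomaly."""
    for host in anomalous_hosts:
        notify(f"CPU utilization rule violated on {host}")


# Hypothetical readings gathered by agents.
readings = {"web-1": 45.0, "web-2": 91.5, "db-1": 70.0}

sent = []  # stands in for a pager or chat channel
send_alerts(detect_anomalies(readings), sent.append)
print(sent)
```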
Monitoring systems typically provide both the core and agents, which are deployed into an environment. Monitoring tools provide you with the ability to set thresholds and detect anomalies. Nagios, Prometheus and Catchpoint are popular examples. Graphing tools provide the trend analysis and visualization of data points. Grafana and Collectd are both examples. Logging tools typically accumulate logs or records from both services and infrastructure, and provide insights into how things are going.
An agent installed on a server can log activity at very quick intervals, perhaps milliseconds, or only when events take place. Collected logging data can then be aggregated by logging tools, like Logstash. Alerting tools, such as PagerDuty, take care of contacting people or systems when needed. When it comes to performing effective SLI monitoring, consider the following site reliability quote, given by LinkedIn, "We need to make sure that monitoring is effective without drowning in a sea of non-actionable alerts. The path to success is to instrument everything, but only monitor what truly matters."
The key takeaways from this quote are, as the scale and reach of your services and infrastructure grow, then so does the amount of required monitoring, to the point where we have an ocean of alerts. Capturing all this data is important but we need to apply some intelligence to what we really monitor and care about. This becomes what is considered observability, which we shall discuss very shortly.
Before we do so, consider the following case story quote provided by Trivago, a popular travel booking service. Trivago here emphasizes that their SLOs are focused on the user experience, and not the technology. This is a very important point. Previously, hotel search response times were taking too long as too many hotels were being returned per search. The number of search results is not a fixed number. Rather, for them, the number of hotel results per search is dynamic and changes depending on the performance of the current SLI.
For example, if response times are fast, then more search results are returned. However, if response times are slow, then fewer search results are returned. Trivago's monitoring and SLIs are widespread. They can spot performance issues beyond their own networks. For example, those which may reside within an ISP's network or within a content distribution network, CDN, anywhere in the world. If and when Trivago performs diagnostic or root cause analysis, then all collected monitoring data points are considered. And finally, Trivago established a feedback loop directly to developers to help them fine tune the system and allow it to converge towards the optimal solution.
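The idea of tying the number of search results to current response times can be sketched as a simple feedback rule. To be clear, this is not Trivago's implementation. The function name, the 500 ms target, the step size, and the limits are all hypothetical values chosen only to show the shape of the idea.

```python
def next_result_limit(
    current_limit,
    latency_ms,
    target_latency_ms=500.0,  # hypothetical latency target
    step=10,                  # hypothetical adjustment per evaluation
    min_limit=10,
    max_limit=100,
):
    """Shrink the result count when responses are slow, grow it when they are fast."""
    if latency_ms > target_latency_ms:
        proposed = current_limit - step
    else:
        proposed = current_limit + step
    # Keep the limit inside the allowed range.
    return max(min_limit, min(max_limit, proposed))


print(next_result_limit(50, latency_ms=800))  # slow responses: fewer results
print(next_result_limit(50, latency_ms=200))  # fast responses: more results
```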
All right, let's now move on and discuss the related idea of observability.
When talking about service health at scale, you'll often hear the term observability used as an extension of monitoring. Monitoring is focused on things that we anticipate will go wrong, creating thresholds of acceptability and alerts when they are breached. Observability extends on this idea, but is more focused on externalizing service data so that you can infer what the current state of that service is.
With observability, we don't respond to individual alerts since we're constantly observing and reacting to the current state of the service. If, for example, the current state begins to degrade in any way, then we can proactively try and find problems, before they cause outages. Monitoring, on the other hand, waits for an outage to take place and then informs us that something has happened.
Reviewing the term monitor, we find that it is actually derived from a verb, an action, and means to observe. When it comes to monitoring, consider the following quote given by CA. "Monitoring is a verb, something we perform against our applications and systems to determine their state. From basic fitness tests and whether they're up or down, to more proactive performance health checks. We monitor applications to detect problems and anomalies." Extending the verb usage of monitoring, this quote clearly shows that we often monitor against things we know might go wrong, the so-called known-unknowns.
The term observable, on the other hand, is actually derived from a noun, a thing. Again, CA provides the following quote regarding observability, "Observability, as a noun, is a property of a system. It's a measure of how well internal states of systems can be inferred from knowledge of its external outputs. Therefore, if our IT systems don't adequately externalize their state, then even the best monitoring can fall short."
Extending the noun usage of observability, this quote shows that systems need to make enough data available to be observable to the outside world, in order to hopefully detect issues and anomalies that we can't plan for, the so-called unknown-unknowns.
Let's now consider the question, why is observability important? Monitoring has historically been performed at the application or component level. For example, is the application running or not? Or, is CPU usage spiking? When it comes to scaling services, various SLI challenges are encountered which monitoring alone cannot solve. For example, consider dynamic or auto-scaling architectures, transient containers that may only exist for fractions of a second, and massive dependencies across interlaced services, all of which mean that monitoring at a discrete component level becomes more of a challenge and may, in fact, be the wrong thing to monitor. Observability addresses these particular challenges.
Observability is all about improved alerting. We want to improve alerting through SRE, by having user-facing SLOs that are then measured through SLIs. We can move alerting away from noisy, repetitive, technically-focused alerts to ones that are only triggered when the service level objective is being compromised.
Point one. Traditional monitoring tools set a static threshold for each metric. Every time that threshold is breached, you receive one alert. This can add up to having too many alerts. An improved approach is to generate one alert for a group of metrics associated with a system or application.
Point two. Machine learning techniques can be applied to your monitoring data, to offer you an idea of how your environment normally performs, making it easier to separate true alerts from false ones. This is the normal state.
Point three. The final step to improving your signal-to-noise ratio is utilizing multi-criteria alerting. In short, the more specific conditions you set on an alert rule, the less likely it is to trigger.
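The three points above can be sketched together in Python. This is a simplified illustration under stated assumptions: the baseline check uses a plain mean and standard deviation as a stand-in for the machine learning techniques mentioned in point two, and every name, threshold, and sample value below is hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def group_alerts(breaches):
    """Point one: collapse per-metric breaches into one alert per system.

    `breaches` is a list of (system, metric) pairs.
    """
    by_system = defaultdict(list)
    for system, metric in breaches:
        by_system[system].append(metric)
    return {system: sorted(metrics) for system, metrics in by_system.items()}


def is_abnormal(value, history, max_deviations=3.0):
    """Point two: compare a value against the learned normal state.

    Statistical stand-in for a learned model; needs at least two history points.
    """
    normal = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != normal
    return abs(value - normal) / spread > max_deviations


def should_alert(error_rate, minutes_sustained, requests_per_minute):
    """Point three: fire only when every condition on the rule holds."""
    return (
        error_rate > 0.01               # hypothetical: more than 1% errors
        and minutes_sustained >= 5      # hypothetical: sustained for 5 minutes
        and requests_per_minute >= 100  # hypothetical: enough traffic to matter
    )


breaches = [("checkout", "cpu"), ("checkout", "latency"),
            ("search", "disk"), ("checkout", "errors")]
print(group_alerts(breaches))  # one entry per system instead of four alerts

latency_history = [100, 102, 98, 101, 99]  # invented "normal" latencies in ms
print(is_abnormal(103, latency_history), is_abnormal(140, latency_history))

print(should_alert(0.05, minutes_sustained=1, requests_per_minute=500))
```

Each extra condition in `should_alert` narrows the cases in which the rule fires, which is the trade-off point three describes.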
When describing what observability is, it's useful to state what it looks like. Distributed tracing is about getting data points for a service that spans multiple components and microservices. Event logging provides a standard, centralized way for applications and/or operating systems to record important software and hardware events. Internal performance data can be extracted from APM, application performance management tools, giving details of how the application itself is performing. User experience involves understanding the user journey and what makes it successful. There are fewer paging alerts, because we know what normal looks like for a service, and only alert if normal is degraded. And finally, we can also look to ask what-if questions.
For example, if we were to power off a server, what would be the impact on the end user experience? When we add observability into the overall SRE equation, alongside both SLOs and SLIs, the following picture emerges. SLOs or service level objectives must be set within the business and relate to the users' experience. SLIs provide the data on how we are performing and will go both up and down. Observability shows us the normal state of a service. Rather than react to individual monitors, we instead look at the impact on normal.
For example, a server outage may degrade performance by only a small amount but we still don't breach the SLO. Therefore, we don't need to call out for the server outage. Likewise, if we see the normal state decreasing over a period of time, then we should, in fact, investigate it proactively. We can even use failure testing to perform what-if analysis. For example, what if we were to lose two servers? And finally, we can also use the normal state of a service to amend our current SLOs.
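The distinction between paging on an SLO breach and proactively investigating a declining normal state can be sketched as two separate checks. This is a minimal Python illustration. The window size, the tolerance, and the sample SLI history are hypothetical values, not recommendations.

```python
from statistics import mean


def should_page(current_sli, slo_target):
    """Page someone only when the objective itself is breached."""
    return current_sli < slo_target


def is_degrading(sli_history, window=3, tolerance=0.0002):
    """True when the recent average is lower than the preceding one by more than tolerance."""
    if len(sli_history) < 2 * window:
        return False  # not enough data to compare two windows
    earlier = sli_history[-2 * window:-window]
    recent = sli_history[-window:]
    return mean(earlier) - mean(recent) > tolerance


# Invented SLI readings: performance dips after a server outage,
# but stays above a 99.9% objective.
history = [0.9998, 0.9997, 0.9998, 0.9994, 0.9992, 0.9991]
slo = 0.999

print("Page on-call:", should_page(history[-1], slo))
print("Investigate proactively:", is_degrading(history))
```

In this sample nobody is called out, because the SLO still holds, yet the downward trend is flagged for investigation.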
In the example above, we may wish to change the target, as normally, it only takes 38 seconds. We could increase the percentage of users or decrease the time expected. Honeycomb, a dedicated tool for introspecting and interrogating distributed systems, provides the following quote describing their platform and product, "This rich ecosystem of introspection and instrumentation is not particularly biased towards the traditional monitoring stack's concerns of actionable alerts and outages."
Here, the takeaway from this statement is that observability is all about collecting the data points that allow you to proactively ask questions about the health of the overall service, enabling you to check that things are okay rather than waiting for and reacting to alerts and outages. Consider now performing the following six-step exercise. In step one, consider a product or service that you have previously worked on.
Spend a bit of time mapping out the product or service user journeys. Think about what the user is trying to achieve and the workflow steps involved in doing it.
The second step involves determining which of the user journeys is most important.
Step three requires you to define what good looks like for your service from a user's perspective. For example, if you're hosting a web service, good means your web service is available and fast. Or if your service provides a type of publishing platform, good may mean how fast your service publishes data or how fresh the data is.
Step four, now attempt to draw a high-level system diagram for the product or service in question. Don't get too detailed. Try and stay as high-level as possible. The diagram should show the major system components for the main user journey.
Step five, define potential SLIs and identify points in your service where you can measure them. These SLIs must reflect your user's definition of good captured earlier. Using your system diagram which you just created, show where, how, and what metrics can be collected. These metrics will form your SLIs. Then, mark out your SLIs over a period of time. For example, a moving hourly window where your SLIs show system performance for the previous hour.
And step six, the last step, think about moving from monitoring to observability. What are the user journeys, what constitutes normal, and how can you detect the overall health of the service? And what would you do if normal health is deteriorating?
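For the moving hourly window mentioned in step five, one possible approach is to keep only the events from the last hour and recompute the ratio on demand. The Python sketch below assumes events arrive in non-decreasing timestamp order, with timestamps in seconds. The class and method names are made up for the example.

```python
from collections import deque


class RollingSLI:
    """Good/total ratio over a moving time window (one hour by default)."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, was_good), oldest first
        self._good = 0

    def record(self, timestamp, was_good):
        """Add one event; timestamps are assumed to be non-decreasing."""
        self._events.append((timestamp, was_good))
        if was_good:
            self._good += 1
        self._evict(timestamp)

    def _evict(self, now):
        """Drop events that have fallen out of the window."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_good = self._events.popleft()
            if was_good:
                self._good -= 1

    def value(self, now):
        """Return the SLI for the window ending at `now`, or None if it is empty."""
        self._evict(now)
        if not self._events:
            return None
        return self._good / len(self._events)


sli = RollingSLI()
sli.record(0, was_good=False)    # a failed request at t=0s
sli.record(1800, was_good=True)  # successes later in the hour
sli.record(3000, was_good=True)
print(sli.value(now=3000))  # the failure is still inside the window
print(sli.value(now=3600))  # the failure has aged out
```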
Okay, that completes this course. In this course, you learned about what an SLI is and then how to use and leverage monitoring and observability to correctly measure the current performance and what normal is for a service, and whether it is meeting the defined SLOs or not.
Okay, close this course and I'll see you shortly in the next one.
Jeremy is a Content Lead Architect and DevOps SME here at Cloud Academy where he specializes in developing DevOps technical training documentation.
He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 25+ years. In recent times, Jeremy has been focused on DevOps, Cloud (AWS, GCP, Azure), Security, Kubernetes, and Machine Learning.
Jeremy holds professional certifications for AWS, GCP, Terraform, Kubernetes (CKA, CKAD, CKS).