Service Level Objectives and Error Budgets
This course will broaden your knowledge of service level objectives (SLOs) and error budgets. An SLO is a goal for how well a product or service should operate. By the end of this course, you will have a clear understanding of both an SLO and an error budget and how the two of them are used together to balance service reliability with the pace of innovation.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Learn about service level objectives
- Understand how to calculate an error budget
Anyone interested in learning about SRE and its fundamentals
Software Engineers interested in learning about how to use and apply SRE within an operations environment
DevOps practitioners interested in understanding the role of SRE and how to consider using it within their own organization
To get the most out of this learning path, you should have a basic understanding of DevOps, software development, and the software development lifecycle.
Link to YouTube video referenced in the course: Risk and Error Budgets
Welcome back. In this course, I'm going to broaden your knowledge around service level objectives and error budgets. By the end of this course, you will have a clear understanding of both an SLO and an error budget and how the two of them are used together to balance service reliability with the pace of innovation.
Let's begin by answering the question: what is an SLO? An SLO, or service level objective, is a goal for how well a product or service should operate. SLOs are tightly related to the user experience. If one or several SLOs are being met, then the user will be happy, and this is what we're striving to achieve. Setting and measuring service level objectives is a key aspect of the SRE role. When it comes to tracking SLOs, the most widely used one is availability. Products and services can have multiple SLOs, and in most cases, the more the better. So to summarize, SLOs are all about making the end user experience better.
Recalling from the first course, SLIs, or service level indicators, are used to measure whether we are hitting an SLO. A monitoring tool or service usually provides the data used for checking compliance against a particular SLO. SLOs are all about the business. Consider the following quote from the VP of operations at Evernote, "Before getting into the technical details of an SLO, it is important to start the conversation from your customer's point of view. What promises are you trying to uphold?"
Let's now consider a real-world service level objective example provided by Anchorfree. They decide that 99.9% of web requests per month should be successful. This is an example of a service level objective. Now, if there are one million web requests in a particular month, then up to 1000 of those are allowed to fail. This is an example of an error budget. Failure to hit an SLO must have consequences.
If, for example, more than 1000 web requests fail in a particular month, then some remediation work must take place. This is an example of an error budget policy. The example just given is fairly typical when talking about SLOs. In all cases, the number of web requests will vary, so basic math is used to work out the number that are allowed to fail, the error budget. If the error budget is breached, then some engineering work needs to be prioritized to ensure the error budget is not breached in a future month. The remediation could mean, for example, that more infrastructure is deployed to handle the volume of requests.
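The arithmetic behind this first example can be sketched in a few lines of Python. This is a minimal illustration only: the function names error_budget and needs_remediation are hypothetical and are not taken from the course or from any particular monitoring tool.

```python
from fractions import Fraction


def error_budget(total_requests: int, slo_target: str) -> int:
    """Return how many requests may fail while still meeting the SLO.

    slo_target is the success ratio written as a string (for example
    "0.999") so that the calculation avoids floating-point rounding.
    """
    allowed_failure_ratio = 1 - Fraction(slo_target)
    return int(total_requests * allowed_failure_ratio)


def needs_remediation(failed_requests: int, budget: int) -> bool:
    """Error budget policy check: has the month's budget been breached?"""
    return failed_requests > budget


budget = error_budget(1_000_000, "0.999")
print(budget)                           # 1000
print(needs_remediation(1_200, budget)) # True
```

Because the number of requests varies from month to month, the budget would be recalculated for each period from that period's actual request count.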
Let's now consider a second example. In this example, the service has an average login rate of 1000 per hour over a rolling 31-day period, or month. Doing the math on this results in 744,000 individual logins, that is, 31 times 24 times 1000. Now, we want 99% of logins each month to be successful. This, again, is a service level objective. It equates to being allowed to lose roughly 7,440 logins a month. This, again, is the error budget. If more than 7,440 logins are lost in any particular month, then we have breached the defined error budget. We use a service level indicator, or SLI, to tell us how many actual logins we are getting in a particular month.
If, for example, the total number of actual logins in a particular month were 726,560, that would be 17,440 short of the expected 744,000, which would mean that our error budget of 7,440 was exceeded. Again, failure to hit an SLO must have consequences. In this case, we instigate a business protection period, preventing new releases. This is the error budget policy.
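The login example can be worked through the same way. The sketch below assumes, as the example implies, that lost logins are counted as the gap between the expected 744,000 logins and the actual figure reported by the SLI. The variable names are illustrative only.

```python
HOURS_PER_DAY = 24
WINDOW_DAYS = 31          # rolling 31-day period
LOGINS_PER_HOUR = 1000    # average login rate
SLO_PERCENT = 99          # 99% of logins should succeed

# Expected logins in the window: 31 * 24 * 1000
expected_logins = WINDOW_DAYS * HOURS_PER_DAY * LOGINS_PER_HOUR

# Error budget: the 1% of logins that we can afford to lose
error_budget = expected_logins * (100 - SLO_PERCENT) // 100

# SLI reading for the month, taken from the example
actual_logins = 726_560
lost_logins = expected_logins - actual_logins

budget_exceeded = lost_logins > error_budget

print(expected_logins)   # 744000
print(error_budget)      # 7440
print(lost_logins)       # 17440
print(budget_exceeded)   # True
```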
Let's now consider a third, non-tech SLO example. In this case, it's all about managing service desk support tickets. In this example, it is proposed that 75% of support tickets should complete automatically, that is, with no manual effort or human involvement. When the SLO is breached and there are too many support tickets that require manual intervention, say, for example, over 25%, then the organization needs to prioritize engineering effort to put automation in place. This is the error budget policy.
For example, new user tickets, those which grant new users access to a system or service, may currently be handled manually, involving manual user setup. If there are fewer than 250 of these per month, then things are fine. However, if we go over 250 a month, then engineering effort is needed to automate the creation of new users. Lots of manual effort to deal with tickets is called toil. The topic of toil will be addressed in the next course. Again, it needs to be emphasized that SLOs are a key part of SRE.
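The ticket example follows the same pattern. The sketch below assumes a monthly volume of 1000 tickets, which is not stated in the course but is the volume at which a 25% manual allowance works out to the 250 tickets mentioned. The function names are hypothetical.

```python
def manual_ticket_budget(total_tickets: int, automated_target_percent: int) -> int:
    """Return how many tickets may be handled manually within the SLO."""
    return total_tickets * (100 - automated_target_percent) // 100


def needs_automation_work(manual_tickets: int, budget: int) -> bool:
    """Error budget policy check: prioritize automation once over budget."""
    return manual_tickets > budget


MONTHLY_TICKETS = 1000  # assumed volume, not given in the course

budget = manual_ticket_budget(MONTHLY_TICKETS, 75)
print(budget)                              # 250
print(needs_automation_work(240, budget))  # False
print(needs_automation_work(260, budget))  # True
```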
Consider the following quote given by the co-founder of Blameless, an SRE platform: "SLOs are the most important component of SRE. Defining and establishing an SLO will refocus your organization on the right target." When it comes to the adoption of SLOs within the industry, the 2019 Catchpoint SRE survey indicated that 72% of respondents used availability as an SLO, 47% used response time, and 46% used latency. Interestingly, 27% of organizations that claim to be doing SRE do not have any SLOs. So the question here is, how can they be doing SRE effectively?
Availability is the same as uptime. For example, does a service respond to a request? Response time is the total time it takes from when a user makes a request until they receive a response. Latency is the delay incurred in communicating a message. This is the time the message spends on the wire.
Let's now move on and spend some time discussing error budgets. Consider the following quote given by the VP of 24x7 Engineering at Google, "100% is the wrong reliability target for basically everything." The key intention behind this quote is to show that systems and services are vulnerable and that the cost and effort involved in getting closer to 100% increases considerably. Allowing for some percentage points gives a team an allowance to do necessary work.
Before we move on, I'd again like to draw your attention to the following YouTube-hosted video created by Google. It talks about risk and error budgets, and I highly recommend reviewing it. Now, when considering error budgets, there are both good and bad parts that need to be considered. If an error budget goes over budget, then someone somewhere is having to work overtime and/or respond to out-of-hours issues.
For example, if you're not hitting 99.9% of HTTP requests within any given month, this likely suggests scalability issues, and therefore Ops will often need to do something. On the other hand, SRE practices encourage you to strategically burn the budget to zero every month, whether it's for feature launches or architectural changes. This way, you know you are running as fast as you can, maximizing your velocity without compromising availability.
Question: should error budgets be fixed? The negotiation to relax the SLO error budget bridges the gap and improves communication and understanding between Dev, Ops, and the business. However, take care when negotiating your budgets. High-risk deployments or large big-bang changes have a greater likelihood of issues and therefore more chance of the error budget being blown. This should encourage the lean preference for small changes in order to stay within the error budget. In some cases, the error budget may need to change to accommodate complex releases, but this needs to be agreed between Dev, Ops, and the business.
Let's now focus our discussion on error budget policies.
When organizations were surveyed as to how their businesses were impacted due to a missed SLO, 70% responded with a loss of revenue, 57% responded with a drop in employee productivity, 49% said loss of customers, and 36% responded with some form of social media backlash. As expected, missing an SLO can and will have serious consequences for the business. Because SLOs are defined from the user's perspective, the users are impacted when SLOs are breached, and they can use all kinds of channels to communicate the impact, tarnish the reputation of the organization, and ultimately affect the performance of the business.
The following quote provided here by Google gives an example of an error budget policy, or consequence. In this case, it states that, "There will be no new feature launches allowed. Sprint planning may only pull post-mortem action items from the backlog. The software development team must meet with the SRE team daily to outline their improvements."
In this next example, which focuses on availability, an organization sets an availability SLO of 99.9%. Therefore, every month, this allows for approximately 43 minutes of outages, the error budget. New feature releases, patches, and planned and unplanned downtime all need to be squeezed into these 43 minutes. This example emphasizes how little time is available when setting a three nines, or 99.9%, target. If you were to increase this to four nines, that is 99.99%, then this drops to 4.32 minutes per month. If you were to go with five nines for your SLO, the time window becomes so small that realistically there is no time for any downtime.
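The downtime figures above can be reproduced with a short calculation. The sketch below assumes a 30-day month, which is the window that yields 43.2 and 4.32 minutes. The function name downtime_budget_minutes is hypothetical.

```python
from fractions import Fraction


def downtime_budget_minutes(slo_percent: str, days: int = 30) -> float:
    """Return the allowed downtime in minutes for an availability SLO.

    slo_percent is passed as a string (for example "99.9") so that the
    subtraction is exact. The 30-day default window is an assumption.
    """
    minutes_in_window = days * 24 * 60
    allowed_downtime_ratio = (100 - Fraction(slo_percent)) / 100
    return float(minutes_in_window * allowed_downtime_ratio)


for target in ("99.9", "99.99", "99.999"):
    print(f"{target}% -> {downtime_budget_minutes(target):.3f} minutes per month")
# 99.9% -> 43.200 minutes per month
# 99.99% -> 4.320 minutes per month
# 99.999% -> 0.432 minutes per month
```

At five nines, the monthly allowance is under half a minute, or roughly 26 seconds, which is why there is realistically no room for downtime at that target.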
At this stage, pause here and consider the following question. What error budget policies would you use to enforce an availability SLO? To help you along, here are some ideas that you may wish to consider: one, minimize deployment downtime; two, minimize infrastructure outages; and three, address scalability and/or performance issues. Now, with these considerations in mind, how could you achieve this? Perhaps, for example, you could use automation, the cloud, immutable infrastructure, load balancing, caching, and zero-downtime and/or A/B deployments.
In the case study quote provided here, the Home Depot SRE Director communicates the importance of setting SLOs across the business. Doing so requires wide involvement from various stakeholders. Accomplishing this results in the following benefits: one, clearly understood SLOs across the organization; two, wider involvement in setting SLOs; and three, a joint responsibility model across Dev and Ops. Having become quite experienced in setting SLOs within their own organization, they created a catchy acronym, VALET, which I'll now discuss on the following slide.
When it comes to defining your own SLO, consider the acronym VALET, V-A-L-E-T. The letters that make up this acronym can be used to address different dimensions or perspectives of an SLO. V stands for volume, for example, traffic volume. A stands for availability, L stands for latency, E stands for errors, and T stands for tickets. Each dimension has a distinctly different area of concern and has an appropriate SLO, error budget, and error budget policy established, as seen here.
Now, consider screenshotting or printing out this matrix to help you when you're formulating your own SLOs. Okay, that completes this course. In this course, you learned what a service level objective is, what an error budget is, and what an error budget policy is. And you learned how these are used together to balance service reliability with the pace of innovation.
Okay, close this course and I'll see you shortly in the next one.
Jeremy is the DevOps Content Lead at Cloud Academy where he specializes in developing technical training documentation for DevOps.
He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 20+ years. In recent times, Jeremy has been focused on DevOps, Cloud, Security, and Machine Learning.
Jeremy holds professional certifications for both the AWS and GCP cloud platforms.