This course provides an introduction to Site Reliability Engineering (SRE), including background, general principles, and practices. It also describes the relationship between SRE and DevOps. The content in this course will help prepare you for the Google “Professional Cloud DevOps Engineer” certification exam.
If you have any comments or feedback, feel free to reach out to us at support@cloudacademy.com.
Learning Objectives
- Learn about Site Reliability Engineering (SRE)
- Understand its core vocabulary, principles, and practices
- Discover how to use SRE to implement DevOps principles
Intended Audience
- Anyone interested in learning about Site Reliability Engineering and its fundamentals
- DevOps practitioners who want to understand the role of Site Reliability Engineer
- Engineers interested in obtaining the Google “Professional Cloud DevOps Engineer” certification
Prerequisites
- A basic understanding of DevOps
- A basic understanding of the software development life cycle
Next, I will talk about what an error budget is, what it is used for, and how it can help provide direction for a team's future activities. Recall that the fourth goal of DevOps is to implement gradual change. The number one source of outages is change, whether that is adding new features, applying security patches, or deploying new hardware. Any of these things can potentially impact your uptime.
So the question is, how do you strike the right balance between change and stability? Extremely high stability is expensive and can drive your innovation down and prices up, maybe beyond what your customers are willing to bear. But extremely high rates of change can drive your failure rates up and your customers elsewhere.
So what, then, is the right reliability target for your system? This is an important question, but it isn't technical. It's really a question for the business. You need to understand things such as: How much can the service fail before it begins to have a significant negative impact? How quickly do we need to be able to release new features? And what type of resources are available, and how many? These answers will most likely need to come from your product team, as they require an understanding of your users' behavior, your business needs, and your product roadmap.
Once you've determined the right target for reliability, you can enforce it by using an error budget. An error budget works in a similar way to a monetary budget. A certain amount of errors or downtime is allocated to each service. As long as the errors or downtime of the service do not exceed the error budget, the service is considered to be reliable enough.
So how exactly do you spend your error budget? Well, as failures increase, the error budget is consumed. If a service goes down and is unresponsive, that downtime is subtracted from the budget. Also, if the performance of a service drops below an acceptable threshold, the time it takes to restore performance is subtracted from the budget as well.
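To make the bookkeeping concrete, here is a minimal sketch of how spending a budget might be tracked. The `ErrorBudget` class, its field names, and the 60-minute allowance are all hypothetical and chosen only for illustration; real SRE tooling usually derives these numbers from monitoring data.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical tracker: minutes of allowed unreliability in one period."""
    allowed_minutes: float
    spent_minutes: float = 0.0

    def record_incident(self, minutes: float) -> None:
        """Subtract an outage, or a period of degraded performance, from the budget."""
        if minutes < 0:
            raise ValueError("incident duration cannot be negative")
        self.spent_minutes += minutes

    @property
    def remaining_minutes(self) -> float:
        return self.allowed_minutes - self.spent_minutes

    @property
    def exhausted(self) -> bool:
        return self.remaining_minutes <= 0


# Assumed allowance of 60 minutes for the period (illustrative only).
budget = ErrorBudget(allowed_minutes=60.0)
budget.record_incident(20.0)  # service down and unresponsive for 20 minutes
budget.record_incident(15.0)  # performance below threshold for 15 minutes
print(f"Remaining budget: {budget.remaining_minutes} minutes")
```

Both kinds of incident draw on the same pool, which mirrors the idea that an outage and a degraded period each count against the budget.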
The longer your service is down or degraded, the more is subtracted from the error budget. And as your error budget shrinks, your team should respond by shifting resources away from adding new features and onto making more reliability improvements. This could include changing feature priority, reallocating developers to different projects or even delaying certain releases to a later time.
If failures continue to increase, your error budget might be in danger of being completely depleted. In this case, your team should hold all new deployments and focus completely on restoring service to an acceptable range. At this point, you may end up with all of your SREs and developers working one hundred percent on stability fixes and improvements.
Finally, once failures begin to decrease, your uptime will stabilize and your error budget will begin to replenish. It is at this point that your team can begin to shift resources back towards new development. So as you can see, with an error budget your team can commit to releasing features as quickly as is safe, where safe means staying within your budget.
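The cycle just described can be sketched as a simple policy function. This is only an illustration: the function name `release_policy` and the 50% and 10% thresholds are assumptions, not values from the course, and each team would agree on its own thresholds.

```python
def release_policy(remaining_fraction: float,
                   caution_below: float = 0.5,
                   freeze_below: float = 0.1) -> str:
    """Map the unspent fraction of the error budget to a team posture.

    The thresholds are hypothetical defaults chosen for illustration.
    """
    if remaining_fraction <= freeze_below:
        # Budget nearly or fully depleted: hold all new deployments.
        return "freeze deployments"
    if remaining_fraction <= caution_below:
        # Budget shrinking: move effort from features to reliability work.
        return "shift toward reliability"
    # Plenty of budget left: keep shipping features.
    return "normal feature work"


for remaining in (0.9, 0.4, 0.05):
    print(f"{remaining:.0%} of budget left: {release_policy(remaining)}")
```

As the budget replenishes, the same function moves the team back from a freeze toward normal feature work without any separate rule.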
Just like a monetary budget, picking the right amount is critical. Even small changes to an error budget can have a significant effect. Let's say that you wanted to ensure that a service was highly available and set the error budget at 0.01%. That would mean that your service could be down for only about four and a half minutes per month. Now, with a budget this small, your team is going to be mostly focused on reliability, and changes will be limited. One small problem could consume the entire budget. However, if you're willing to accept a larger 1% error budget, your service can now be down for a little over seven hours per month. This would allow for more frequent and riskier changes.
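The downtime figures can be checked with a little arithmetic. The sketch below assumes a 30-day month, and `allowed_downtime_minutes` is a hypothetical helper name; under that assumption the two budgets come to roughly 4.3 minutes and 7.2 hours, in line with the rounded figures quoted here.

```python
def allowed_downtime_minutes(error_budget_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an error budget over a period of `days` days."""
    total_minutes = days * 24 * 60  # 43,200 minutes in an assumed 30-day month
    return total_minutes * error_budget_percent / 100


tight = allowed_downtime_minutes(0.01)  # 0.01% error budget
loose = allowed_downtime_minutes(1.0)   # 1% error budget

print(f"0.01% budget: {tight:.2f} minutes per month")
print(f"1% budget: {loose / 60:.1f} hours per month")
```

A hundredfold increase in the budget gives a hundredfold increase in permitted downtime, which is why even small changes to the percentage matter so much.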
Just like a monetary budget, the team is likely to spend the entire budget, so setting the budget too large could result in unnecessary downtime. Smaller budgets allow greater stability at the risk of slowing down the rollout of new features, while larger budgets allow the quicker release of features at the risk of longer and more frequent outages. No matter the size, error budgets push teams towards making smaller, more gradual changes. A small deployment gone bad can be mitigated much more easily. A large deployment gone bad can exceed the budget, freeze development, and break schedules.
Daniel began his career as a Software Engineer, focusing mostly on web and mobile development. After twenty years of dealing with insufficient training and fragmented documentation, he decided to use his extensive experience to help the next generation of engineers.
Daniel has spent his most recent years designing and running technical classes for both Amazon and Microsoft. Today at Cloud Academy, he is working on building out an extensive Google Cloud training library.
When he isn’t working or tinkering in his home lab, Daniel enjoys BBQing, target shooting, and watching classic movies.