Principles of SRE
This course provides an introduction to Site Reliability Engineering (SRE), including background, general principles, and practices. It also describes the relationship between SRE and DevOps. The content in this course will help prepare you for the Google “Professional Cloud DevOps Engineer” certification exam.
If you have any comments or feedback, feel free to reach out to us at firstname.lastname@example.org.
- Learn about Site Reliability Engineering (SRE)
- Understand its core vocabulary, principles, and practices
- Discover how to use SRE to implement DevOps principles
- Anyone interested in learning about Site Reliability Engineering and its fundamentals
- DevOps practitioners who want to understand the role of Site Reliability Engineer
- Engineers interested in obtaining the Google “Professional Cloud DevOps Engineer” certification
- A basic understanding of DevOps
- A basic understanding of the software development life cycle
The fifth and final goal of DevOps is to measure everything. To be successful, you need to set goals, and you need a way to measure your progress towards meeting them. To help with this, Site Reliability Engineering has three defined metrics: SLOs, SLIs, and SLAs. First, I will talk about SLOs, or Service Level Objectives, and how you can use them to define success.
A Service Level Objective is a goal that your business aspires to meet and intends to take action to defend. The error budget for a service is directly related to the Service Level Objective. Your error budget represents the percentage of time that your service can be down, while your SLO represents the percentage of time that your service should be up. Here is a simple formula to help explain: Your error budget plus your SLO will equal 100%. An error budget of 2% implies an SLO of 98%.
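To make the formula concrete, here is a minimal Python sketch of the arithmetic. The function names and the 30-day measurement window are assumptions chosen for illustration; they are not part of any SRE tool or standard.

```python
def error_budget(slo_percent: float) -> float:
    """Return the error budget (in percent) implied by an SLO.

    Uses the relationship from the text: error budget + SLO = 100%.
    """
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("SLO must be between 0 and 100 percent")
    return 100.0 - slo_percent


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Translate the error budget into minutes of downtime per window.

    The 30-day default window is an assumption for illustration only.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget(slo_percent) / 100.0


if __name__ == "__main__":
    print(error_budget(98.0))              # budget implied by a 98% SLO
    print(allowed_downtime_minutes(98.0))  # minutes of downtime in 30 days
```

With a 98% SLO, the budget is 2%, which over a 30-day window works out to 864 minutes of allowable downtime.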
SLOs provide a clear signal that your service is performing successfully. There is always some small amount of error in any system. Without clearly defined limits, you won't know if your current error rate is high enough to constitute a serious issue or not. You also won't be able to accurately prioritize improvements.
Services usually have multiple SLOs associated with them. Typical SLOs cover things like availability and latency (response time). For example, you might have an SLO that requires 99% of all web server responses to return something other than a 500 error. Or you might have an SLO that requires 95% of your home page requests to be served in under 200 milliseconds.
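The two example SLOs can be checked with a few lines of Python. This is only a sketch: the `Request` record, the function names, and the choice to count any status code of 500 or above as a server error are assumptions made for illustration, not a description of how any particular monitoring system works.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request


def availability_percent(requests: List[Request]) -> float:
    """Percentage of responses that were not server errors (status < 500)."""
    if not requests:
        raise ValueError("no requests to measure")
    good = sum(1 for r in requests if r.status < 500)
    return 100.0 * good / len(requests)


def fast_percent(requests: List[Request], threshold_ms: float) -> float:
    """Percentage of requests served in under threshold_ms milliseconds."""
    if not requests:
        raise ValueError("no requests to measure")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100.0 * fast / len(requests)


def meets_slo(measured_percent: float, slo_percent: float) -> bool:
    """An SLO is met when the measured value is at or above the target."""
    return measured_percent >= slo_percent
```

For example, `meets_slo(availability_percent(reqs), 99.0)` checks the first objective, and `meets_slo(fast_percent(reqs, 200.0), 95.0)` checks the second.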
SLOs are not intended to define ideal, best-case performance. A good rule of thumb is that your SLOs should represent the lowest level of reliability that you can get away with. You can pick all kinds of different objectives, but not every objective is useful. SLOs need to be meaningful. Meeting your SLOs should result in happy users. Missing an SLO should result in unhappy users. Failing to meet an SLO can potentially have serious consequences: damaged reputations, drops in revenue, or even a loss of customers.
SLOs also need to be attainable, measurable, and repeatable. Picking an objective you cannot possibly achieve or reproduce is useless and just causes needless frustration. Also, SLOs need to be understandable and controllable. You need to know how to achieve your objectives, and have the ability to make the changes necessary to do so. Correctly setting and measuring service level objectives is a key aspect of the SRE role.
SLOs not only assist in measuring your success, but they can also be used to create powerful feedback loops. They show you which parts of your system need improvement, and by how much, allowing you to easily identify trouble spots and prioritize work. By tracking your current performance versus your SLOs, you will get instant feedback on any changes and will be able to know with confidence what your team should be working on next.
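One simple way to picture this feedback loop is to rank services by how much of their error budget they have used. The sketch below assumes you already have a target SLO and a measured value for each service; the service names, numbers, and function names are hypothetical.

```python
from typing import Dict, List, Tuple


def budget_consumed(slo_percent: float, measured_percent: float) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    budget = 100.0 - slo_percent
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (100.0 - measured_percent) / budget


def prioritize(services: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Rank services from most to least error budget consumed.

    `services` maps a service name to (slo_percent, measured_percent).
    """
    ranked = [
        (name, budget_consumed(slo, measured))
        for name, (slo, measured) in services.items()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    current = {
        "checkout": (99.0, 98.0),
        "search": (95.0, 97.5),
        "home": (99.0, 99.75),
    }
    for name, used in prioritize(current):
        print(f"{name}: {used:.0%} of error budget used")
```

Here the checkout service has used more than its entire budget, so it lands at the top of the list, which is the signal that it should be worked on next.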
Daniel began his career as a Software Engineer, focusing mostly on web and mobile development. After twenty years of dealing with insufficient training and fragmented documentation, he decided to use his extensive experience to help the next generation of engineers.
Daniel has spent his most recent years designing and running technical classes for both Amazon and Microsoft. Today at Cloud Academy, he is working on building out an extensive Google Cloud training library.
When he isn’t working or tinkering in his home lab, Daniel enjoys BBQing, target shooting, and watching classic movies.