This course provides an introduction to Site Reliability Engineering (SRE), including background, general principles, and practices. It also describes the relationship between SRE and DevOps. The content in this course will help prepare you for the Google “Professional Cloud DevOps Engineer” certification exam.
If you have any comments or feedback, feel free to reach out to us at support@cloudacademy.com.
Learning Objectives
- Learn about Site Reliability Engineering (SRE)
- Understand its core vocabulary, principles, and practices
- Discover how to use SRE to implement DevOps principles
Intended Audience
- Anyone interested in learning about Site Reliability Engineering and its fundamentals
- DevOps practitioners who want to understand the role of a Site Reliability Engineer
- Engineers interested in obtaining the Google “Professional Cloud DevOps Engineer” certification
Prerequisites
- A basic understanding of DevOps
- A basic understanding of the software development life cycle
An SLO by itself is not very useful. You need to compare your objectives against your current performance. This is what a service level indicator, or SLI, is used for. SLIs are the metrics of your system tracked over time. Similar to SLOs, service level indicators are reported as percentages. SLIs range from zero to 100%, where zero means nothing works and 100% means everything is working perfectly.
The basic formula for calculating an SLI is the number of good events divided by the total number of events, multiplied by 100. So let's say you have an SLO that requires 95% of your homepage requests to be served in under 200 milliseconds. If your current SLI were only 94%, that would mean your service is performing below minimum expectations and that this problem needs to be fixed. If the SLI were 96% instead, then the service would be working as expected.
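To make the formula concrete, here is a minimal sketch in Python. The function names (`compute_sli`, `meets_slo`) and the sample latencies are made up for illustration; only the 200 millisecond threshold and the 95% objective come from the example above.

```python
def compute_sli(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage: good events / total events * 100."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return 100 * good_events / total_events


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """An SLI at or above the objective means the service is meeting its SLO."""
    return sli_percent >= slo_percent


# Hypothetical sample of homepage request latencies, in milliseconds.
latencies_ms = [120, 180, 250, 90, 199, 200, 150, 310, 175, 60]

THRESHOLD_MS = 200   # "served in under 200 milliseconds"
SLO_PERCENT = 95.0   # "95% of your homepage requests"

# A "good event" here is a request served in under the threshold.
good = sum(1 for ms in latencies_ms if ms < THRESHOLD_MS)
sli = compute_sli(good, len(latencies_ms))

print(f"SLI: {sli:.1f}%  SLO: {SLO_PERCENT:.1f}%  meets SLO: {meets_slo(sli, SLO_PERCENT)}")
```

In practice the good and total counts would come from your monitoring tool rather than a hard-coded list, but the comparison against the SLO works the same way.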
SLOs and SLIs allow you to quickly understand which services are performing well and which are experiencing problems. They also let you know how severe any detected problems are. SREs typically use monitoring tools or services to track and monitor SLIs. Just like your SLOs, your service level indicators should be focused on measuring the customer experience.
A good SLI should rise when customers are happy and fall when they are unhappy. If a metric can change and not significantly impact the customer experience, then it probably isn't worth tracking via an SLI. Now, there are four golden signals of monitoring: latency, traffic, errors, and saturation. If you can only measure a few metrics, focus on these four.
Latency tells you how quickly a certain percentage of requests can be fulfilled. Traffic tells you how much demand is being placed on your system. Errors tell you the rate of requests that fail, and saturation tells you how full your service is. There are many other types of SLIs, and different systems use different types. Here are some examples. Typical SLIs for serving systems include availability, quality, and latency. Typical SLIs for data processing include coverage, correctness, freshness, and throughput. And typical SLIs for storage systems include durability, throughput, and latency.
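As a rough illustration of the four golden signals, the sketch below derives each one from a small batch of request records. Everything here is an assumption for the sake of the example: the `Request` record, the `golden_signals` function, the sample data, and especially the choice to approximate saturation as traffic divided by a known capacity. Real systems often measure saturation from resources such as CPU, memory, or queue depth instead.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took to serve
    ok: bool            # False if the request failed


def golden_signals(requests, window_seconds, capacity_rps, latency_threshold_ms):
    """Summarize latency, traffic, errors, and saturation for one time window."""
    total = len(requests)
    if total == 0 or window_seconds <= 0 or capacity_rps <= 0:
        raise ValueError("need at least one request and positive window/capacity")

    fast = sum(1 for r in requests if r.duration_ms < latency_threshold_ms)
    failed = sum(1 for r in requests if not r.ok)
    traffic_rps = total / window_seconds

    return {
        # Latency: percentage of requests fulfilled faster than the threshold.
        "latency_pct_under_threshold": 100 * fast / total,
        # Traffic: demand placed on the system, in requests per second.
        "traffic_rps": traffic_rps,
        # Errors: percentage of requests that failed.
        "error_rate_pct": 100 * failed / total,
        # Saturation (assumed definition): share of request capacity in use.
        "saturation_pct": 100 * traffic_rps / capacity_rps,
    }


# Hypothetical two-second window of traffic.
sample = [
    Request(100, True),
    Request(150, True),
    Request(300, False),
    Request(180, True),
]
signals = golden_signals(sample, window_seconds=2, capacity_rps=10, latency_threshold_ms=200)
for name, value in signals.items():
    print(f"{name}: {value}")
```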
By setting the right SLOs and tracking the right SLIs, you create a clear path forward for success.
Daniel began his career as a Software Engineer, focusing mostly on web and mobile development. After twenty years of dealing with insufficient training and fragmented documentation, he decided to use his extensive experience to help the next generation of engineers.
Daniel has spent his most recent years designing and running technical classes for both Amazon and Microsoft. Today at Cloud Academy, he is working on building out an extensive Google Cloud training library.
When he isn’t working or tinkering in his home lab, Daniel enjoys BBQing, target shooting, and watching classic movies.