This course provides an introduction to Site Reliability Engineering (SRE), including background, general principles, and practices. It also describes the relationship between SRE and DevOps. The content in this course will help prepare you for the Google “Professional Cloud DevOps Engineer” certification exam.
If you have any comments or feedback, feel free to reach out to us at support@cloudacademy.com.
Learning Objectives
- Learn about Site Reliability Engineering (SRE)
- Understand its core vocabulary, principles, and practices
- Discover how to use SRE to implement DevOps principles
Intended Audience
- Anyone interested in learning about Site Reliability Engineering and its fundamentals
- DevOps practitioners who want to understand the role of a Site Reliability Engineer
- Engineers interested in obtaining the Google “Professional Cloud DevOps Engineer” certification
Prerequisites
- A basic understanding of DevOps
- A basic understanding of the software development life cycle
A Service Level Agreement, or SLA, is a guarantee you make to your customers. It is a contract that spells out the consequences of failing to meet the objectives it contains. SLOs and SLAs are similar, but they should not be the same. Both are objectives, but an SLO is an internal objective, used only within the team. If the team fails to meet an SLO, it may slow down deployments or hold a blameless postmortem.
SLA violations are shared with your customers and usually require some form of compensation, such as a credit or refund. SLOs and SLAs should also differ because SLOs are supposed to be stricter than SLAs. You want to be notified of any problems and have a chance to address them well before they affect your customers. Ideally, the three metrics should exist on a spectrum. You want your SLIs to be higher than both your SLOs and SLAs. This means you are meeting your objectives and your service is performing as expected. You also want your SLOs to be higher than your SLAs. If your SLI drops below your SLO, you have missed your internal objective and need to take steps to resolve the issue. If your SLI drops below your SLA, you need to notify your customers and offer them compensation. It's better to break an internal objective than one that is visible to your customers.
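The spectrum described above can be sketched as a small check. This is only an illustration: the function name `classify`, the status labels, and the sample percentages are hypothetical, and it assumes an availability-style SLI where a higher number is better (a latency SLI would reverse the comparisons).

```python
def classify(sli: float, slo: float, sla: float) -> str:
    """Report which commitment, if any, the measured SLI currently breaks.

    All three values are percentages where higher is better (an assumption).
    """
    # The SLO is meant to be stricter than the SLA, so reject any other setup.
    if slo <= sla:
        raise ValueError("the SLO should be stricter (higher) than the SLA")
    if sli >= slo:
        return "ok"            # meeting the internal objective
    if sli >= sla:
        return "slo_violated"  # internal problem only: fix it before customers notice
    return "sla_violated"      # customer-facing breach: notify and compensate


if __name__ == "__main__":
    # Hypothetical targets: 99.9% internal objective, 99.5% customer promise.
    for measured in (99.95, 99.7, 99.0):
        print(measured, classify(measured, slo=99.9, sla=99.5))
```

The middle band, below the SLO but still above the SLA, is the early-warning window the text describes.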
SLAs need to be carefully set. Making them too high means you're more likely to violate them. Making them too low means your customers may feel less confident in your ability to deliver a quality service. SLAs that are too close to SLOs mean that you're less likely to catch a problem in time to prevent it. However, if your SLAs are too far away from your SLOs, that too can be a problem.
The SLO you run at tends to become the SLA everyone expects. So let's say you offered a 95% SLA, but were consistently delivering 99.99% for a long period of time. If your SLI then dropped to 98%, you might get customer complaints. Even though you only promised 95%, you proved that you can actually deliver more, and your customers have come to depend on that higher level of service. To avoid this issue, Google recommends deliberately adding downtime to services to prevent them from being overly available. As you can probably tell, picking the right SLOs and SLAs can be tricky. You may need to track your SLIs for a while, and then use the average to help define realistic SLOs and SLAs. In any case, Site Reliability Engineering recommends making sure they are all part of your system requirements. If you already have a production system but don't have them clearly defined, then defining them should be your highest priority.
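To make the 95%, 99.99%, and 98% figures more concrete, the sketch below converts an availability percentage into the downtime it permits, and averages a set of tracked SLI readings as a starting point for choosing targets. The function names, the 30-day period, and the weekly readings are assumptions made for illustration, not part of the course material.

```python
from statistics import mean

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over the period."""
    return (100.0 - availability_pct) / 100.0 * period_days * MINUTES_PER_DAY


def baseline_sli(history: list[float]) -> float:
    """Average of tracked SLI readings, a starting point for picking targets."""
    if not history:
        raise ValueError("need at least one SLI reading")
    return mean(history)


if __name__ == "__main__":
    # The three availability levels from the example in the text.
    for label, pct in [("promised SLA", 95.0),
                       ("usual delivery", 99.99),
                       ("after the drop", 98.0)]:
        minutes = allowed_downtime_minutes(pct)
        print(f"{label}: {pct}% allows {minutes:.1f} minutes per 30 days")

    # Hypothetical weekly availability readings gathered before setting targets.
    weekly = [99.2, 99.6, 99.4, 99.8]
    print(f"baseline SLI: {baseline_sli(weekly):.2f}%")
```

Over 30 days, 95% permits roughly 2,160 minutes (36 hours) of downtime, 98% roughly 864 minutes, and 99.99% only about 4.3 minutes. That gap is why customers used to 99.99% notice a drop to 98% even though it is still well inside the promised 95%.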
Daniel began his career as a Software Engineer, focusing mostly on web and mobile development. After twenty years of dealing with insufficient training and fragmented documentation, he decided to use his extensive experience to help the next generation of engineers.
Daniel has spent his most recent years designing and running technical classes for both Amazon and Microsoft. Today at Cloud Academy, he is working on building out an extensive Google Cloud training library.
When he isn’t working or tinkering in his home lab, Daniel enjoys BBQing, target shooting, and watching classic movies.