Principles of SRE
This course provides an introduction to Site Reliability Engineering (SRE), including background, general principles, and practices. It also describes the relationship between SRE and DevOps. The content in this course will help prepare you for the Google “Professional Cloud DevOps Engineer” certification exam.
If you have any comments or feedback, feel free to reach out to us at firstname.lastname@example.org.
Learning Objectives
- Learn about Site Reliability Engineering (SRE)
- Understand its core vocabulary, principles, and practices
- Discover how to use SRE to implement DevOps principles
Intended Audience
- Anyone interested in learning about Site Reliability Engineering and its fundamentals
- DevOps practitioners who want to understand the role of Site Reliability Engineer
- Engineers interested in obtaining the Google “Professional Cloud DevOps Engineer” certification
Prerequisites
- A basic understanding of DevOps
- A basic understanding of the software development life cycle
So let's talk about what a Site Reliability Engineer is and what their responsibilities include. DevOps and SRE have a lot in common, but there are some differences. DevOps is a company-wide culture, and it is supposed to be everyone's job. In contrast, Site Reliability Engineering is a specific job role. In other words, the old role of operator is replaced by a new one: the Site Reliability Engineer.
A Site Reliability Engineer is basically the result of asking a software engineer to design an operations team. This means that the new role requires both experience in software development and a strong knowledge of operations. Someone who acts as an SRE will spend about half of their time doing ops-related work, such as monitoring and responding to production issues, being on-call, or performing manual interventions. The other half of their time will be spent on development tasks, such as building new features, scaling systems, or writing automation.
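The roughly half-and-half split described above can be checked with a little arithmetic. The following is only a minimal sketch: the `WeekLog` type, the function names, and the choice to treat 50% as a ceiling on ops work are all assumptions made for illustration, not something the course prescribes.

```python
from dataclasses import dataclass

# "About half" comes from the transcript; treating it as a hard ceiling
# is an assumption made for this sketch.
OPS_TIME_TARGET = 0.5


@dataclass
class WeekLog:
    """Hypothetical record of how one SRE spent a week."""
    ops_hours: float  # monitoring, on-call, manual interventions
    dev_hours: float  # features, scaling work, automation


def ops_share(log: WeekLog) -> float:
    """Return the fraction of logged time spent on ops-related work."""
    total = log.ops_hours + log.dev_hours
    if total == 0:
        return 0.0
    return log.ops_hours / total


def over_ops_budget(log: WeekLog, target: float = OPS_TIME_TARGET) -> bool:
    """True when ops work takes up more than the target share of the week."""
    return ops_share(log) > target
```

For example, `over_ops_budget(WeekLog(ops_hours=30, dev_hours=10))` is true, which in this sketch would be a signal that too little time is left for development work.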
A Site Reliability Engineer views operations as a software problem, and uses software engineering approaches to solve issues. One key difference from the old operator role is that both SREs and developers share the responsibility of maintaining production. Developers are no longer allowed to write some code and then throw it over the wall, expecting operations to figure it out and make it work. Instead, SREs build the tools that developers use to compile, test, and deploy their code. If something breaks, the tools written by the SRE team will detect the problem and alert everyone. And when it comes time to fix the issue, developers and SREs work together to come up with a solution. So when an incident is detected, SREs help coordinate the response.
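To make the "detect and alert" idea a bit more concrete, here is a minimal sketch of the kind of check such tooling might run. The error-rate metric, the 5% threshold, and the function names are all hypothetical; real monitoring systems are considerably more involved.

```python
# Hypothetical threshold: alert when more than 5% of requests fail.
DEFAULT_ERROR_THRESHOLD = 0.05


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Return the fraction of requests that failed in a time window."""
    if total_requests == 0:
        return 0.0  # no traffic means nothing to measure
    return failed_requests / total_requests


def should_alert(
    total_requests: int,
    failed_requests: int,
    threshold: float = DEFAULT_ERROR_THRESHOLD,
) -> bool:
    """True when the observed error rate exceeds the threshold."""
    return error_rate(total_requests, failed_requests) > threshold
```

With these assumed values, `should_alert(1000, 100)` is true (a 10% error rate), while `should_alert(1000, 10)` is false (1%).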
The person who declares the incident takes on the role of Incident Commander or IC. It is the IC's job to direct the high-level state of the incident. To assist, the IC will assign two other roles: first, an Operations Lead or OL, and second, a Communications Lead or CL. The OL and the CL both report back to the IC. The Operations Lead role exists to lead the team that will be investigating and ultimately resolving the issue. The engineers doing the actual work report their progress back to the OL. And while the IC and OL are working to mitigate and resolve the incident, the CL is busy keeping everyone informed and answering questions.
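The reporting structure just described can be sketched as a small data model. This is an illustrative assumption only: the `Incident` class, its field names, and the `declare_incident` helper are hypothetical and do not correspond to any particular incident-management tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical model of the roles named in the transcript."""
    summary: str
    commander: str                               # Incident Commander (IC)
    operations_lead: Optional[str] = None        # OL, assigned by the IC
    communications_lead: Optional[str] = None    # CL, assigned by the IC
    responders: List[str] = field(default_factory=list)  # engineers doing the work

    def reports_to(self, person: str) -> Optional[str]:
        """Return who the given person reports to during the incident."""
        if person in (self.operations_lead, self.communications_lead):
            return self.commander        # OL and CL report back to the IC
        if person in self.responders:
            return self.operations_lead  # engineers report progress to the OL
        return None                      # the IC (or an outsider) has no lead here


def declare_incident(summary: str, declared_by: str) -> Incident:
    """Whoever declares the incident becomes the Incident Commander."""
    return Incident(summary=summary, commander=declared_by)
```

A short usage example, with made-up names:

```python
incident = declare_incident("Checkout errors", declared_by="alice")
incident.operations_lead = "bob"
incident.communications_lead = "carol"
incident.responders.append("dave")
print(incident.reports_to("dave"))   # bob
print(incident.reports_to("carol"))  # alice
```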
Clear communication is critical, and it needs to be made a high priority. Do not wait until after the incident has ended. The CL should immediately establish clear channels of communication, provide regular updates to the response team as well as stakeholders, and handle any inquiries.
Site Reliability Engineers don't just react to problems; they are proactive as well. SREs are heavily involved in the development process. In the design phase, SREs help establish best practices, identify architectural mistakes, and even help co-design parts of the service.
During development, SREs should be building the tools that everyone will eventually use to manage and maintain the service, including monitoring and alerting. After deployment, SREs verify that the service is stable and performs as planned. And finally, once a service has been deprecated, SREs will help transition users from the old service to a new one, if available, as well as help clean up configurations and documentation.
Now, remember when I said that the first goal of DevOps is to reduce organizational silos? Site Reliability Engineering accomplishes this by involving SREs in development work and developers in operations work.
Daniel began his career as a Software Engineer, focusing mostly on web and mobile development. After twenty years of dealing with insufficient training and fragmented documentation, he decided to use his extensive experience to help the next generation of engineers.
Daniel has spent his most recent years designing and running technical classes for both Amazon and Microsoft. Today at Cloud Academy, he is working on building out an extensive Google Cloud training library.
When he isn’t working or tinkering in his home lab, Daniel enjoys BBQing, target shooting, and watching classic movies.