Principles of SRE
This course provides an introduction to Site Reliability Engineering (SRE), including background, general principles, and practices. It also describes the relationship between SRE and DevOps. The content in this course will help prepare you for the Google “Professional Cloud DevOps Engineer” certification exam.
If you have any comments or feedback, feel free to reach out to us at firstname.lastname@example.org.
Learning Objectives
- Learn about Site Reliability Engineering (SRE)
- Understand its core vocabulary, principles, and practices
- Discover how to use SRE to implement DevOps principles
Intended Audience
- Anyone interested in learning about Site Reliability Engineering and its fundamentals
- DevOps practitioners who want to understand the role of a Site Reliability Engineer
- Engineers interested in obtaining the Google “Professional Cloud DevOps Engineer” certification
Prerequisites
- A basic understanding of DevOps
- A basic understanding of the software development life cycle
To better understand site reliability engineering, or SRE, you first need to understand a little about its history as well as its relationship to DevOps. Historically, building a production system involved two different teams: developers and operators. Developers were responsible for writing new software and updating existing software. Operators were responsible for deploying that software to production and monitoring it.
Now, operators were not programmers. They did not touch the code, and typically they didn't even look at it. But they did understand how to assemble all the software components and make them work together to produce a service. They also knew how to scale and maintain everything. If anything went wrong, it was the operators who sprang into action to resolve the problem. Each role had a different set of skills and focused on different priorities. These differences often led to conflict.
Developers would spend their days fixing bugs, adding new features, and constantly evolving the code. They were focused on agility and always wanted to make bigger and more frequent software updates. Operators spent their time fixing production issues and generally trying to keep everything running smoothly. They were focused on stability and always wanted smaller and less frequent updates. It was pretty common for operators to implement procedures to slow down the rate of change, while developers would try to bypass or ignore them. It was from this conflict that the idea for DevOps was born.
DevOps is a set of practices, guidelines, and culture designed to reduce the gap between software development and software operations. The basic idea is that if we can get these two groups to work together, we can increase the company's overall throughput and growth. To achieve this, DevOps established five goals. Number one, reduce organizational silos. The separation of development and operations meant that there was very little collaboration or cross-training. At times these two groups were even working at cross purposes. Number two, accept failure as normal. Neither people nor systems are perfect. People who are scared of failure also tend to fear change, and a company that does not change will not grow. Number three, implement gradual changes. Big changes are harder, riskier, and take longer to recover from in the event of failure. Smaller incremental changes are just the opposite: easier, safer, and quicker to recover from. Number four, leverage tooling and automation. Manual work simply does not scale, and too much of it can be very costly. The right tooling and automation free people up to do more interesting and valuable work. Number five, measure everything. Growth requires making the right changes. You need to constantly measure your performance to make sure you're making the right decisions.
Now, the problem with DevOps, as I previously mentioned, is that it is very broad and does not explicitly define how to implement these goals. That is where site reliability engineering, or SRE, comes in. Site reliability engineering evolved at Google independently of the DevOps movement, but it happens to embody many of the same goals. It also has a much more prescriptive way of achieving those goals. You can think of DevOps as a philosophy and site reliability engineering as one specific implementation of that philosophy. So, if you want to implement DevOps practices on Google Cloud Platform, it is valuable to understand SRE.
Site reliability engineering implements the DevOps goals with the following practices. Number one, creating the site reliability engineer role. This is a new role that replaces operators, and it focuses on sharing the responsibility for production with developers. Number two, holding blameless postmortems. These are retrospective meetings held after incidents to learn what failed and how to prevent those failures from happening again. Number three, defining and enforcing an error budget. Budgeting your money results in less spending and ensures that the right bills get paid. In much the same way, an error budget encourages smaller changes and ensures the right balance of growth and stability is maintained. Number four, identifying and working to reduce toil. Site reliability engineering defines menial operational tasks as toil, and it provides mechanisms for measuring and reducing it. Number five, tracking service level metrics and goals. These include SLIs, SLOs, and SLAs.
The following sections will cover each of these five concepts in more detail.
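As a rough illustration of the error budget idea introduced above, here is a minimal Python sketch. It assumes the commonly used definition in which the budget is the share of requests allowed to fail under a success-rate SLO, that is, one minus the SLO target. The class name `ErrorBudget`, its fields, and the "pause releases once the budget is spent" policy are hypothetical choices made for this example and are not taken from the course.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: tracks an error budget for a request-based SLO."""

    slo_target: float    # assumed form: 0.999 means 99.9% of requests should succeed
    total_requests: int  # requests observed in the measurement window

    def allowed_failures(self) -> int:
        # The budget is whatever the SLO does not require to succeed.
        return round((1.0 - self.slo_target) * self.total_requests)

    def remaining(self, failed_requests: int) -> int:
        # A negative result means the budget has been overspent.
        return self.allowed_failures() - failed_requests

    def can_release(self, failed_requests: int) -> bool:
        # Assumed policy: keep shipping changes only while budget is left.
        return self.remaining(failed_requests) > 0


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000)
print(budget.allowed_failures())  # 1000
print(budget.remaining(400))      # 600
print(budget.can_release(1200))   # False
```

In this sketch, a team that has spent its budget would slow down releases and focus on stability until the measurement window resets, which is one way the balance between growth and stability described above could be enforced.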
Daniel began his career as a Software Engineer, focusing mostly on web and mobile development. After twenty years of dealing with insufficient training and fragmented documentation, he decided to use his extensive experience to help the next generation of engineers.
Daniel has spent his most recent years designing and running technical classes for both Amazon and Microsoft. Today at Cloud Academy, he is working on building out an extensive Google Cloud training library.
When he isn’t working or tinkering in his home lab, Daniel enjoys BBQing, target shooting, and watching classic movies.