Site Reliability Engineering Principles on GCP




This course provides an introduction to Site Reliability Engineering (SRE), including background, general principles, and practices. It also describes the relationship between SRE and DevOps. The content in this course will help prepare you for the Google “Professional Cloud DevOps Engineer” certification exam.

If you have any comments or feedback, feel free to reach out to us at support@cloudacademy.com.

Learning Objectives

  • Learn about Site Reliability Engineering (SRE)
  • Understand its core vocabulary, principles, and practices
  • Discover how to use SRE to implement DevOps principles

Intended Audience

  • Anyone interested in learning about Site Reliability Engineering and its fundamentals
  • DevOps practitioners who want to understand the role of Site Reliability Engineer
  • Engineers interested in obtaining the Google “Professional Cloud DevOps Engineer” certification


Prerequisites

  • A basic understanding of DevOps
  • A basic understanding of the software development life cycle

In this section, I will define what toil is and explain how it can be mitigated using automation. Earlier, I mentioned that the third goal of DevOps is to leverage tooling and automation. A lot of traditional operations work was manual, repetitive, and labor-intensive. Common examples include resetting passwords, responding to alerts, rolling out patches, restarting servers, and copying and pasting commands from a playbook. SRE calls this type of work toil.

Toil occurs every time an operator needs to manually touch a system during normal operations. Keep in mind that toil is not a synonym for boring or frustrating. Filling out an expense report may not be fun, but that does not make it toil. Instead, toil is work that is tied to running a production service and tends to be manual, repetitive, automatable, tactical, and devoid of long-term value. The amount of toil increases linearly as the service grows and, if ignored, can grow out of control until your entire team is consumed by it. Too much toil leads to career stagnation, boredom, and burnout.

You can group SRE activities into four main categories: software engineering, systems engineering, overhead, and toil. Software engineering includes things like writing automation scripts, creating tools, or modifying infrastructure code to make it more robust. Systems engineering includes things like installing updates, configuring servers, or setting up load balancers. Overhead is administrative work that's not directly tied to running a service and includes things like conducting interviews, attending meetings, and completing peer reviews. Toil is the work directly tied to running a service, and it's repetitive, manual, et cetera.
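To make the four categories concrete, here is a minimal sketch of how a team might tally logged hours by category and see what share goes to toil. The category names, the function names, and the sample week are all hypothetical and exist only for illustration; they do not come from any SRE tool.

```python
from collections import defaultdict

# Hypothetical labels mirroring the four groups described above.
CATEGORIES = {"software_engineering", "systems_engineering", "overhead", "toil"}


def hours_by_category(entries):
    """Sum logged hours per category from (category, hours) pairs."""
    totals = defaultdict(float)
    for category, hours in entries:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        totals[category] += hours
    return dict(totals)


def toil_fraction(entries):
    """Return the share of logged time spent on toil (0.0 if nothing is logged)."""
    totals = hours_by_category(entries)
    total = sum(totals.values())
    return totals.get("toil", 0.0) / total if total else 0.0


# Made-up week for one engineer.
week = [
    ("toil", 6.0),                   # password resets, manual restarts
    ("software_engineering", 18.0),  # writing automation scripts
    ("systems_engineering", 8.0),    # load balancer setup
    ("overhead", 8.0),               # meetings, interviews
]

print(f"toil share: {toil_fraction(week):.0%}")
```

Tracking a number like this over time is one way to notice toil growing before it consumes the team.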

The SRE discipline aims to reduce toil through automation. SREs try to identify repeatable tasks and write programs to reproduce the work. This means creating things like scheduled jobs instead of manually running scripts, automated monitoring tools instead of manually monitoring and rebooting unresponsive servers, continuous integration and continuous deployment pipelines instead of manually testing and deploying new code, and auto-scaling infrastructure instead of manually provisioning new hardware.
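As one illustration of replacing a manual task with a program, the sketch below restarts any server that fails a health check. It is an assumption-laden toy: `restart_unresponsive`, `fake_health_check`, and `fake_restart` are hypothetical names, and the in-memory `status` table stands in for real probes and real restart commands, which would depend on your platform.

```python
def restart_unresponsive(servers, is_healthy, restart):
    """Restart every server that fails its health check and return their names."""
    restarted = []
    for server in servers:
        if not is_healthy(server):
            restart(server)
            restarted.append(server)
    return restarted


# Stand-in state: True means the server responds, False means it does not.
status = {"web-1": True, "web-2": False, "web-3": True}


def fake_health_check(name):
    """Hypothetical probe; a real one might issue an HTTP request."""
    return status[name]


def fake_restart(name):
    """Hypothetical restart; a real one would call your platform's tooling."""
    status[name] = True


restarted = restart_unresponsive(sorted(status), fake_health_check, fake_restart)
print("restarted:", restarted)
```

Passing the check and restart actions in as functions keeps the loop testable without touching real machines. Running it from a scheduled job, rather than by hand, is what removes the operator from the loop.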

While automation is extremely helpful, not every task is worth automating. But if a repetitive task can be automated, it probably should be. And once it has been, you've freed up time for future development efforts. Ideally, SREs will spend no more than about half their time on toil and the rest on reducing toil for themselves and others.
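One rough way to judge whether a task is worth automating is to compare the manual time it would cost over some period with the time needed to build the automation. The function below is a back-of-the-envelope sketch with made-up numbers, not a formula from the SRE literature, and it ignores ongoing maintenance of the automation itself.

```python
def hours_saved(minutes_per_run, runs_per_week, weeks, hours_to_automate):
    """Return manual hours avoided minus the one-time cost of automating.

    A positive result suggests the automation pays for itself within the period.
    """
    manual_hours = minutes_per_run * runs_per_week * weeks / 60
    return manual_hours - hours_to_automate


# A 10-minute task done 15 times a week, over a year, versus 16 hours of scripting.
frequent = hours_saved(minutes_per_run=10, runs_per_week=15, weeks=52, hours_to_automate=16)

# A 5-minute task done once a week, with the same 16 hours of scripting.
rare = hours_saved(minutes_per_run=5, runs_per_week=1, weeks=52, hours_to_automate=16)

print(f"frequent task: {frequent:+.1f} hours")
print(f"rare task: {rare:+.1f} hours")
```

With these illustrative numbers, the frequent task comes out well ahead, while the rare one does not recover its cost within a year.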

About the Author

Daniel began his career as a Software Engineer, focusing mostly on web and mobile development. After twenty years of dealing with insufficient training and fragmented documentation, he decided to use his extensive experience to help the next generation of engineers.

Daniel has spent his most recent years designing and running technical classes for both Amazon and Microsoft. Today at Cloud Academy, he is working on building out an extensive Google Cloud training library.

When he isn’t working or tinkering in his home lab, Daniel enjoys BBQing, target shooting, and watching classic movies.