Getting Started With Site Reliability Engineering

Much has been written and discussed about SRE (Site Reliability Engineering) from what it is, how to do it, and how it’s the same (or different) as DevOps. Google coined the term, defined the profession, and wrote the book on it. Their “Site Reliability Engineering” book covers the ideas behind SRE and Google’s internal practices, which work well for them. Let’s put the Google specifics aside for a moment and instead focus on ideas, responsibilities, and objectives. Taking a step back from specific implementations, this post reviews the prerequisites required to bootstrap an SRE team that fits your organization. Before we dive into it, check out Cloud Academy’s Recipe for DevOps Success webinar in collaboration with Capital One and don’t forget to have a look at Cloud Roster, the job role matrix that shows you what kind of skills a DevOps Engineer should master to land their dream job. If you’re a company, we suggest reading The Four Tactics for Cultural Change in DevOps Adoption.

The What and Why Behind SRE

SRE is a way to build and run reliable production systems in increasingly complex technical environments. SRE acknowledges that running successful production systems is a specific skill that’s different than other engineering disciplines. Ben Treynor, the founder of the SRE team at Google, describes SRE responsibilities in an interview for the SRE book:

[the] SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

Site Reliability Engineers require software development and operations skills. They’re expected to write software that assists with deployment and production operations and also debug software in production environments. A cursory look at SRE job posts shows new hires are expected to be fluent in a programming language (such as Go or Node.js), configuration management, and automation tools (such as Ansible, Chef, or Puppet) and cloud infrastructure (like AWS, Azure, or GCP). Experience with containers and container orchestration like Mesos or Kubernetes is a common job requirement too.

The interdisciplinary skill set is useful throughout the SDLC and overlaps with other technical team members. It may also cause SRE to become an organization’s junk drawer for work that doesn’t map clearly onto existing teams. It also means that these skills will be less effective if they’re not focused on clear goals and defined responsibilities.

Framing Responsibilities with SLOs

Generally, SRE’s goal is to promote system reliability and efficiency throughout the SDLC. Doing SRE well means tracking and assessing progress against metrics. Service Level Objectives (SLOs) are the entry point to reliability for many organizations and are provided in one way or another. They may already be written down, quantified and tracked, or they may be something as simple as an unspoken idea that the website must be up during work hours. SLO’s frame SRE’s operational work and they’re fundamental in doing so.

Stephen Thorne, SRE at Google, echoes this point in his talk titled “Getting Started with SRE” from the DevOps Enterprise Summit 2018.

You can’t run an effective site reliability engineering org unless you’re monitoring and reporting on your SLOs and actually worrying about the reliability of your system. It just doesn’t make any sense.

Setting measurable SLOs is the first checkpoint in getting started with SRE. Put bluntly, if your organization does not have written, measured, and reported SLO’s then it’s not ready for SRE. SLOs also need consequences since they’re worthless without enforcement. This provides SRE’s leverage to prioritize work that directly impacts SLOs as opposed to other work. The good news is that any organization can create and enforce SLOs (however they tend to carry different weight in a 2 person startup compared to a thousand strong enterprise organization).

SRE’s prime responsibility is ensuring their systems meet SLOs and many other things follow from that. That leads to the next question: how does SRE achieve this?

Day-to-Day SRE

SRE strives to reduce toil in their day to day work which continuously improves their efficiency and dependent teams. (Also note, that continuous improvement is a fundamental DevOps principle that connects SRE to the larger DevOps movement.) The SRE book defines toil as:

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. 1

It’s not work that engineers don’t want to do. Toil is an inhibitor that should be reduced in all possible areas. SRE’s should always maximize their automation skills to reduce manual work, enabling an SRE team to scale out while maintaining consistency across their systems. Reducing toil is a powerful idea since it expands out to capture the day to day work of maintaining logging and metric systems, standing up new services, reporting SLOs, and/or adding CICD pipelines to other systems. This is all important work that other teams need but someone has to set up for them.

Google found that capping their SRE time to 50% on toil and the other 50% on project work (such a driving improvement or supporting existing teams) was a key factor in successful SRE implementations. Capping the work sets a clear limit on how painful toil may be and it exposes a clear priority in addressing toil that habitually pushes against the limit. It also enforces the idea that SRE is more than just toil and encourages a shared responsibility model. If the SREs are overwhelmed with toil, then work can be distributed across other teams. This sheds load from the SRE team while exposing other engineers to the reality of running their own systems in production.

Capping toil is the second checkpoint in getting started with SRE. Stephen Thorne reiterates this point in the talk mentioned earlier:

if you’re not capping that toil and allowing them to actually go and implement that [monitoring] work, then all they’re doing is getting overloaded with toil and then they won’t be able to do any project work. The next time they need to do some things to improve the reliability of the system, they’re too overloaded. I think any org with one or a thousand SREs must be able to apply this principle. There must be this ability for the SREs to address the toil and do the project work.2

After these two checkpoints, it’s up to management and leadership to form teams and set responsibilities.

Moving Towards Site Reliability Engineering

When you have SLOs, a declared cap on toil and a plan to handle overflow, then it’s time to consider what SRE looks like for your organization. There are three common models:

  1. A centralized SRE team (like a Google)
  2. A decentralized SRE team
  3. SREs embedded in teams

There is no one correct answer. The best fit varies by organization size and specific goals. Consider a simple example. An 8 person team may not require a dedicated SRE, and it certainly doesn’t mandate a dedicated SRE team. Conversely, there’s an inflection point where a dedicated SRE team makes sense and embedding SRE into existing teams makes sense. You must consider the trade-offs before making a decision.

VictorOps see SRE differently. They consider SRE a behavior rather than a dedicated role. Their goal is to build a culture of reliability into their engineers instead of into a specific team. They accomplished this by building a cross-functional council. Here’s Jason Hand from VictorOps in the eBook “Build the Resilient Future Faster: Creating a Culture of Reliability“:

For VictorOps, the SRE mentality would need to be central to the culture of our entire organization. The responsibility of owning the scalability and reliability of the product (VictorOps) from a customer experience point of view doesn’t rest solely on an SRE team or individual engineer. Rather than assigning the SRE role and responsibility to a specific team or individual, we chose to assemble a cross-functional panel of engineers, support leads, and product representatives referred to as the SRE council.

VictorOps came to this conclusion by surveying SREs at other companies and determining what seemed right for them. You should do this before getting started with SRE since implementations of SRE ideas vary wildly between different organizations. There is no gold standard, just what’s effective for your organization and yielding results. Learning from other teams is a great way to avoid pitfalls.

Regardless of how SRE is structured within your organization, you’ll need buy-in from leadership and engineers. Management must enforce consequences for missed SLOs, breaching caps on toil, and defining clear boundaries between SRE and other teams. Introducing SRE can be a major organizational change and when so will only be successful if supported at the highest levels.

Next Steps

Let’s review the checkpoints we’ve established along the way to getting started with SRE. First and foremost is to establish, monitor, and report on SLOs. SLOs provides the foundation for building and maintaining reliable systems. Second is the cap on toil which ensures SREs are focused on continuous improvements throughout the system and not on low-value toil work. Lastly, there’s the collaborative effort of documenting responsibilities and building organizational buy-in.

Once you’re through these gates it’s time to consider the initial goals. Jason Hand, from VictorOps, poses a series of exercises. First, ask the team what keeps them up at night? The answer brings skeletons out of the closet. That kickstarts the process and allows new SREs to navigate their responsibilities while improving reliability.

  1. https://landing.google.com/sre/book/chapters/eliminating-toil.html ↩︎
  2. https://itrevolution.com/getting-started-with-sre-stephen-thorne-google/ ↩︎

Enjoyed this post? You might also like: 

 

Avatar

Written by

Adam Hawkins

Passionate traveler (currently in Bangalore, India), Trance addict, Devops, Continuous Deployment advocate. I lead the SRE team at Saltside where we manage ~400 containers in production. I also manage Slashdeploy.


Related Posts

Alisha Reyes
Alisha Reyes
— July 2, 2020

New Content: AWS, Azure, Typescript, Java, Docker, 13 New Labs, and Much More

This month, our Content Team released a whopping 13 new labs in real cloud environments! If you haven't tried out our labs, you might not understand why we think that number is so impressive. Our labs are not “simulated” experiences — they are real cloud environments using accounts on A...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Alisha Reyes
Alisha Reyes
— June 11, 2020

New Content: AZ-500 and AZ-400 Updates, 3 Google Professional Exam Preps, Practical ML Learning Path, C# Programming, and More

This month, our Content Team released tons of new content and labs in real cloud environments. Not only that, but we introduced our very first highly interactive "Office Hours" webinar. This webinar, Acing the AWS Solutions Architect Associate Certification, started with a quick overvie...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Luca Casartelli
Luca Casartelli
— June 1, 2020

DevOps: Why Is It Important to Decouple Deployment From Release?

Deployment and release In enterprise organizations, releases are the final step of a long process that, historically, could take months — or even worse — years. Small companies and startups aren’t immune to this. Minimum viable product (MVP) over MVP and fast iterations could lead to t...

Read more
  • decoupling
  • Deployment
  • DevOps
  • engineering
  • Release
Luca Casartelli
Luca Casartelli
— May 14, 2020

DevOps Principles: My Journey as a Software Engineer

I spent the last month reading The DevOps Handbook, a great book regarding DevOps principles, and how tech organizations evolved and succeeded in applying them.As a software engineer, you may think that DevOps is a bunch of people that deploy your code on production, and who are alw...

Read more
  • DevOps
  • DevOps principles
Michael Dehoyos
Michael Dehoyos
— May 13, 2020

Linux and DevOps: The Most Suitable Distribution

Modern Linux and DevOps have much in common from a philosophy perspective. Both are focused on functionality, scalability, as well as on the constant possibility of growth and improvement. While Windows may still be the most widely used operating system, and by extension the most common...

Read more
  • DevOps
  • Linux
Avatar
Logan Rakai
— April 7, 2020

How to Effectively Use Azure DevOps

Azure DevOps is a suite of services that collaborate on software development following DevOps principles. The services in Azure DevOps are:Azure Repos for hosting Git repositories for source control of your code Azure Boards for planning and tracking your work using proven agil...

Read more
  • Azure
  • DevOps
Simran Arora
Simran Arora
— October 29, 2019

Docker vs. Virtual Machines: Differences You Should Know

What are the differences between Docker and virtual machines? In this article, we'll compare the differences and provide our insights to help you decide between the two. Before we get started discussing Docker vs. Virtual Machines comparisons, let us first explain the basics.  What is ...

Read more
  • Containers
  • DevOps
  • Docker
  • virtual machines
Avatar
Adam Hawkins
— October 24, 2019

DevOps: From Continuous Delivery to Continuous Experimentation

Imagine this scenario. Your team built a continuous delivery pipeline. Team members deploy multiple times a day. Telemetry warns the team about production issues before they become outages. Automated tests ensure known regressions don't enter production. Team velocity is consistent and ...

Read more
  • continuous delivery
  • continuous experimentation
  • DevOps
Avatar
Adam Hawkins
— September 13, 2019

How Google, HP, and Etsy Succeed with DevOps

DevOps is currently well developed, and there are many examples of companies adopting it to improve their existing practices and explore new frontiers. In this article, we'll take a look at case studies and use cases from Google, HP, and Etsy. These companies are having success with Dev...

Read more
  • Continuous Learning
  • DevOps
  • Velocity
Chris Gambino
Chris Gambino
— August 28, 2019

How to Accelerate Development in the Cloud

Understanding how to accelerate development in the cloud can prevent typical challenges that developers face in a traditional enterprise. While there are many benefits to switching to a cloud-first model, the most immediate one is accelerated development and testing. The road blocks tha...

Read more
  • deploy
  • deployment acceleration
  • development
  • DevOps
Avatar
Adam Hawkins
— August 9, 2019

DevSecOps: How to Secure DevOps Environments

Security has been a friction point when discussing DevOps. This stems from the assumption that DevOps teams move too fast to handle security concerns. This makes sense if Information Security (InfoSec) is separate from the DevOps value stream, or if development velocity exceeds the band...

Read more
  • AWS
  • cloud security
  • DevOps
  • DevSecOps
  • Security
Valery Calderón Briz
Valery Calderón Briz
— August 8, 2019

Understanding Python Datetime Handling

Communicating dates and times with another person is pretty simple... right?“See you at 6 o’clock on Monday” sounds understandable.But was it a.m. or p.m.? And was your friend in the same time zone as you when you said that? When we need to use and store dates and times on Pytho...

Read more
  • DevOps
  • Python
  • Python datetime
  • Unix timestamp