Getting Started With Site Reliability Engineering

Much has been written and said about Site Reliability Engineering (SRE): what it is, how to do it, and how it is the same as (or different from) DevOps. Google coined the term, defined the profession, and wrote the book on it. Their “Site Reliability Engineering” book covers the ideas behind SRE and Google’s internal practices, which work well for them. Let’s put the Google specifics aside for a moment and instead focus on ideas, responsibilities, and objectives. Taking a step back from specific implementations, this post reviews the prerequisites for bootstrapping an SRE team that fits your organization.

The What and Why Behind SRE

SRE is a way to build and run reliable production systems in increasingly complex technical environments. SRE acknowledges that running successful production systems is a specific skill that’s different from other engineering disciplines. Ben Treynor, the founder of the SRE team at Google, describes SRE responsibilities in an interview for the SRE book:

[the] SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

Site Reliability Engineers require both software development and operations skills. They’re expected to write software that assists with deployment and production operations, and also to debug software in production environments. A cursory look at SRE job posts shows that new hires are expected to be fluent in a programming language or runtime (such as Go or Node.js), configuration management and automation tools (such as Ansible, Chef, or Puppet), and cloud infrastructure (such as AWS, Azure, or GCP). Experience with containers and container orchestration tools like Mesos or Kubernetes is a common job requirement too.

This interdisciplinary skill set is useful throughout the software development life cycle (SDLC) and overlaps with that of other technical team members. It may also cause SRE to become an organization’s junk drawer for work that doesn’t map clearly onto existing teams. It also means that these skills will be less effective if they’re not focused on clear goals and defined responsibilities.

Framing Responsibilities with SLOs

Generally, SRE’s goal is to promote system reliability and efficiency throughout the SDLC. Doing SRE well means tracking and assessing progress against metrics. Service Level Objectives (SLOs) are the entry point to reliability for many organizations, and they exist in one form or another. They may already be written down, quantified, and tracked, or they may be as simple as an unspoken idea that the website must be up during work hours. SLOs frame SRE’s operational work, and they are fundamental to it.

Stephen Thorne, SRE at Google, echoes this point in his talk titled “Getting Started with SRE” from the DevOps Enterprise Summit 2018.

You can’t run an effective site reliability engineering org unless you’re monitoring and reporting on your SLOs and actually worrying about the reliability of your system. It just doesn’t make any sense.

Setting measurable SLOs is the first checkpoint in getting started with SRE. Put bluntly, if your organization does not have written, measured, and reported SLOs, then it’s not ready for SRE. SLOs also need consequences, since they’re worthless without enforcement. Enforcement gives SREs leverage to prioritize work that directly impacts SLOs over other work. The good news is that any organization can create and enforce SLOs (although they tend to carry different weight in a two-person startup than in a thousand-strong enterprise organization).
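As a rough illustration of what “written, measured, and reported” can look like, here is a minimal Python sketch. It assumes an availability SLO expressed as the fraction of successful requests in a reporting window; the `SLO` class, the `checkout-availability` name, the 99.9% target, and the request counts are all hypothetical and only serve the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A hypothetical availability objective for one service."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded in the reporting window."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return good_requests / total_requests


def report(slo: SLO, good_requests: int, total_requests: int) -> str:
    """Return a one-line SLO report suitable for a dashboard or email."""
    measured = availability(good_requests, total_requests)
    status = "MET" if measured >= slo.target else "MISSED"
    return f"{slo.name}: measured {measured:.4%}, target {slo.target:.4%}, {status}"


if __name__ == "__main__":
    # Hypothetical numbers for one reporting window.
    checkout = SLO(name="checkout-availability", target=0.999)
    print(report(checkout, good_requests=998_700, total_requests=1_000_000))
```

In practice the request counts would come from your monitoring system rather than being typed in, and the report would go wherever the people who enforce the SLO will actually see it.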

SRE’s prime responsibility is ensuring their systems meet SLOs and many other things follow from that. That leads to the next question: how does SRE achieve this?

Day-to-Day SRE

SREs strive to reduce toil in their day-to-day work, which continuously improves their own efficiency and that of the teams that depend on them. (Note that continuous improvement is a fundamental DevOps principle that connects SRE to the larger DevOps movement.) The SRE book defines toil as:

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. 1

Toil is not simply work that engineers don’t want to do. It is an inhibitor that should be reduced wherever possible. SREs should apply their automation skills to reduce manual work, which lets an SRE team scale out while maintaining consistency across its systems. Reducing toil is a powerful idea because it extends to the day-to-day work of maintaining logging and metric systems, standing up new services, reporting SLOs, and adding CI/CD pipelines to other systems. This is all important work that other teams need but that someone has to set up for them.

Google found that capping SRE time at 50% toil, with the other 50% going to project work (such as driving improvements or supporting existing teams), was a key factor in successful SRE implementations. Capping the work sets a clear limit on how painful toil may be, and it makes addressing toil that habitually pushes against the limit a clear priority. It also reinforces the idea that SRE is more than just toil and encourages a shared responsibility model. If the SREs are overwhelmed with toil, then work can be distributed across other teams. This sheds load from the SRE team while exposing other engineers to the reality of running their own systems in production.
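A cap is only useful if someone checks it. The following Python sketch shows one way to do that, assuming each engineer logs hours as either toil or project work; the names, the hours, and the idea of tracking per engineer are hypothetical, and only the 50% figure comes from the text above.

```python
TOIL_CAP = 0.50  # the 50% cap on toil described above


def toil_fraction(toil_hours: float, project_hours: float) -> float:
    """Share of logged time that was spent on toil."""
    total = toil_hours + project_hours
    if total == 0:
        return 0.0  # assumption: no logged time counts as no toil
    return toil_hours / total


def over_cap(hours_by_engineer: dict[str, tuple[float, float]],
             cap: float = TOIL_CAP) -> list[str]:
    """Return the engineers whose toil share exceeds the cap, sorted by name."""
    return sorted(
        name
        for name, (toil, project) in hours_by_engineer.items()
        if toil_fraction(toil, project) > cap
    )


if __name__ == "__main__":
    # Hypothetical (toil_hours, project_hours) logged over one sprint.
    logged = {
        "alice": (30.0, 50.0),
        "bob": (48.0, 32.0),
        "carol": (40.0, 40.0),
    }
    print(over_cap(logged))  # only bob is above 50%
```

Anyone this check flags is a signal to shed load to other teams or to prioritize automating the toil that keeps recurring.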

Capping toil is the second checkpoint in getting started with SRE. Stephen Thorne reiterates this point in the talk mentioned earlier:

if you’re not capping that toil and allowing them to actually go and implement that [monitoring] work, then all they’re doing is getting overloaded with toil and then they won’t be able to do any project work. The next time they need to do some things to improve the reliability of the system, they’re too overloaded. I think any org with one or a thousand SREs must be able to apply this principle. There must be this ability for the SREs to address the toil and do the project work. 2

After these two checkpoints, it’s up to management and leadership to form teams and set responsibilities.

Moving Towards Site Reliability Engineering

When you have SLOs, a declared cap on toil, and a plan to handle overflow, then it’s time to consider what SRE looks like for your organization. There are three common models:

  1. A centralized SRE team (like Google’s)
  2. A decentralized SRE team
  3. SREs embedded in teams

There is no one correct answer. The best fit varies by organization size and specific goals. Consider a simple example. An eight-person team may not require a dedicated SRE, and it certainly doesn’t mandate a dedicated SRE team. Conversely, as an organization grows there is an inflection point where a dedicated SRE team, or SREs embedded into existing teams, starts to make sense. You must consider the trade-offs before making a decision.

VictorOps sees SRE differently. They consider SRE a behavior rather than a dedicated role. Their goal is to build a culture of reliability into their engineers instead of into a specific team. They accomplished this by building a cross-functional council. Here’s Jason Hand from VictorOps in the eBook “Build the Resilient Future Faster: Creating a Culture of Reliability”:

For VictorOps, the SRE mentality would need to be central to the culture of our entire organization. The responsibility of owning the scalability and reliability of the product (VictorOps) from a customer experience point of view doesn’t rest solely on an SRE team or individual engineer. Rather than assigning the SRE role and responsibility to a specific team or individual, we chose to assemble a cross-functional panel of engineers, support leads, and product representatives referred to as the SRE council.

VictorOps came to this conclusion by surveying SREs at other companies and determining what seemed right for them. You should do the same before getting started with SRE, since implementations of SRE ideas vary wildly between organizations. There is no gold standard, just what is effective for your organization and yields results. Learning from other teams is a great way to avoid pitfalls.

Regardless of how SRE is structured within your organization, you’ll need buy-in from leadership and engineers. Management must enforce consequences for missed SLOs and breached toil caps, and must define clear boundaries between SRE and other teams. Introducing SRE can be a major organizational change, and when it is, it will only succeed if it is supported at the highest levels.

Next Steps

Let’s review the checkpoints we’ve established along the way to getting started with SRE. First and foremost is to establish, monitor, and report on SLOs. SLOs provide the foundation for building and maintaining reliable systems. Second is the cap on toil, which ensures SREs are focused on continuous improvements throughout the system and not on low-value toil. Lastly, there’s the collaborative effort of documenting responsibilities and building organizational buy-in.

Once you’re through these gates, it’s time to consider the initial goals. Jason Hand, from VictorOps, poses a series of exercises. First, ask the team what keeps them up at night. The answer brings skeletons out of the closet. That kickstarts the process and allows new SREs to navigate their responsibilities while improving reliability.

  1. https://landing.google.com/sre/book/chapters/eliminating-toil.html
  2. https://itrevolution.com/getting-started-with-sre-stephen-thorne-google/


Written by

Adam Hawkins

Passionate traveler (currently in Bangalore, India), Trance addict, Devops, Continuous Deployment advocate. I lead the SRE team at Saltside where we manage ~400 containers in production. I also manage Slashdeploy.
