Much has been written and discussed about SRE (Site Reliability Engineering) from what it is, how to do it, and how it’s the same (or different) as DevOps. Google coined the term, defined the profession, and wrote the book on it. Their “Site Reliability Engineering” book covers the ideas behind SRE and Google’s internal practices, which work well for them. Let’s put the Google specifics aside for a moment and instead focus on ideas, responsibilities, and objectives. Taking a step back from specific implementations, this post reviews the prerequisites required to bootstrap an SRE team that fits your organization. Before we dive into it, check out Cloud Academy’s Recipe for DevOps Success webinar in collaboration with Capital One and don’t forget to have a look at Cloud Roster, the job role matrix that shows you what kind of skills a DevOps Engineer should master to land their dream job. If you’re a company, we suggest reading The Four Tactics for Cultural Change in DevOps Adoption.
The What and Why Behind SRE
SRE is a way to build and run reliable production systems in increasingly complex technical environments. SRE acknowledges that running successful production systems is a specific skill that’s different than other engineering disciplines. Ben Treynor, the founder of the SRE team at Google, describes SRE responsibilities in an interview for the SRE book:
[the] SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Site Reliability Engineers require software development and operations skills. They’re expected to write software that assists with deployment and production operations and also debug software in production environments. A cursory look at SRE job posts shows new hires are expected to be fluent in a programming language (such as Go or Node.js), configuration management, and automation tools (such as Ansible, Chef, or Puppet) and cloud infrastructure (like AWS, Azure, or GCP). Experience with containers and container orchestration like Mesos or Kubernetes is a common job requirement too.
The interdisciplinary skill set is useful throughout the SDLC and overlaps with other technical team members. It may also cause SRE to become an organization’s junk drawer for work that doesn’t map clearly onto existing teams. It also means that these skills will be less effective if they’re not focused on clear goals and defined responsibilities.
Framing Responsibilities with SLOs
Generally, SRE’s goal is to promote system reliability and efficiency throughout the SDLC. Doing SRE well means tracking and assessing progress against metrics. Service Level Objectives (SLOs) are the entry point to reliability for many organizations and are provided in one way or another. They may already be written down, quantified and tracked, or they may be something as simple as an unspoken idea that the website must be up during work hours. SLO’s frame SRE’s operational work and they’re fundamental in doing so.
Stephen Thorne, SRE at Google, echoes this point in his talk titled “Getting Started with SRE” from the DevOps Enterprise Summit 2018.
You can’t run an effective site reliability engineering org unless you’re monitoring and reporting on your SLOs and actually worrying about the reliability of your system. It just doesn’t make any sense.
Setting measurable SLOs is the first checkpoint in getting started with SRE. Put bluntly, if your organization does not have written, measured, and reported SLO’s then it’s not ready for SRE. SLOs also need consequences since they’re worthless without enforcement. This provides SRE’s leverage to prioritize work that directly impacts SLOs as opposed to other work. The good news is that any organization can create and enforce SLOs (however they tend to carry different weight in a 2 person startup compared to a thousand strong enterprise organization).
SRE’s prime responsibility is ensuring their systems meet SLOs and many other things follow from that. That leads to the next question: how does SRE achieve this?
SRE strives to reduce toil in their day to day work which continuously improves their efficiency and dependent teams. (Also note, that continuous improvement is a fundamental DevOps principle that connects SRE to the larger DevOps movement.) The SRE book defines toil as:
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. 1
It’s not work that engineers don’t want to do. Toil is an inhibitor that should be reduced in all possible areas. SRE’s should always maximize their automation skills to reduce manual work, enabling an SRE team to scale out while maintaining consistency across their systems. Reducing toil is a powerful idea since it expands out to capture the day to day work of maintaining logging and metric systems, standing up new services, reporting SLOs, and/or adding CICD pipelines to other systems. This is all important work that other teams need but someone has to set up for them.
Google found that capping their SRE time to 50% on toil and the other 50% on project work (such a driving improvement or supporting existing teams) was a key factor in successful SRE implementations. Capping the work sets a clear limit on how painful toil may be and it exposes a clear priority in addressing toil that habitually pushes against the limit. It also enforces the idea that SRE is more than just toil and encourages a shared responsibility model. If the SREs are overwhelmed with toil, then work can be distributed across other teams. This sheds load from the SRE team while exposing other engineers to the reality of running their own systems in production.
Capping toil is the second checkpoint in getting started with SRE. Stephen Thorne reiterates this point in the talk mentioned earlier:
if you’re not capping that toil and allowing them to actually go and implement that [monitoring] work, then all they’re doing is getting overloaded with toil and then they won’t be able to do any project work. The next time they need to do some things to improve the reliability of the system, they’re too overloaded. I think any org with one or a thousand SREs must be able to apply this principle. There must be this ability for the SREs to address the toil and do the project work.2
After these two checkpoints, it’s up to management and leadership to form teams and set responsibilities.
Moving Towards Site Reliability Engineering
When you have SLOs, a declared cap on toil and a plan to handle overflow, then it’s time to consider what SRE looks like for your organization. There are three common models:
- A centralized SRE team (like a Google)
- A decentralized SRE team
- SREs embedded in teams
There is no one correct answer. The best fit varies by organization size and specific goals. Consider a simple example. An 8 person team may not require a dedicated SRE, and it certainly doesn’t mandate a dedicated SRE team. Conversely, there’s an inflection point where a dedicated SRE team makes sense and embedding SRE into existing teams makes sense. You must consider the trade-offs before making a decision.
VictorOps see SRE differently. They consider SRE a behavior rather than a dedicated role. Their goal is to build a culture of reliability into their engineers instead of into a specific team. They accomplished this by building a cross-functional council. Here’s Jason Hand from VictorOps in the eBook “Build the Resilient Future Faster: Creating a Culture of Reliability“:
For VictorOps, the SRE mentality would need to be central to the culture of our entire organization. The responsibility of owning the scalability and reliability of the product (VictorOps) from a customer experience point of view doesn’t rest solely on an SRE team or individual engineer. Rather than assigning the SRE role and responsibility to a specific team or individual, we chose to assemble a cross-functional panel of engineers, support leads, and product representatives referred to as the SRE council.
VictorOps came to this conclusion by surveying SREs at other companies and determining what seemed right for them. You should do this before getting started with SRE since implementations of SRE ideas vary wildly between different organizations. There is no gold standard, just what’s effective for your organization and yielding results. Learning from other teams is a great way to avoid pitfalls.
Regardless of how SRE is structured within your organization, you’ll need buy-in from leadership and engineers. Management must enforce consequences for missed SLOs, breaching caps on toil, and defining clear boundaries between SRE and other teams. Introducing SRE can be a major organizational change and when so will only be successful if supported at the highest levels.
Let’s review the checkpoints we’ve established along the way to getting started with SRE. First and foremost is to establish, monitor, and report on SLOs. SLOs provides the foundation for building and maintaining reliable systems. Second is the cap on toil which ensures SREs are focused on continuous improvements throughout the system and not on low-value toil work. Lastly, there’s the collaborative effort of documenting responsibilities and building organizational buy-in.
Once you’re through these gates it’s time to consider the initial goals. Jason Hand, from VictorOps, poses a series of exercises. First, ask the team what keeps them up at night? The answer brings skeletons out of the closet. That kickstarts the process and allows new SREs to navigate their responsibilities while improving reliability.
- https://landing.google.com/sre/book/chapters/eliminating-toil.html ↩︎
- https://itrevolution.com/getting-started-with-sre-stephen-thorne-google/ ↩︎
Enjoyed this post? You might also like:
- Learning Path: DevOps Playbook – Moving to a DevOps Culture
- Learning Path: DevOps Fundamentals
- How DevOps Transforms Software Testing
- The Benefits of Cloud Containers
New Content: Azure DP-100 Certification, Alibaba Cloud Certified Associate Prep, 13 Security Labs, and Much More
This past month our Content Team served up a heaping spoonful of new and updated content. Not only did our experts release the brand new Azure DP-100 Certification Learning Path, but they also created 18 new hands-on labs — and so much more! New content on Cloud Academy At any time, y...
Docker Image Security: Get it in Your Sights
For organizations and individuals alike, the adoption of Docker is increasing exponentially with no signs of slowing down. Why is this? Because Docker provides a whole host of features that make it easy to create, deploy, and manage your applications. This useful technology is especiall...
Constant Content: Cloud Academy’s Q3 2020 Roadmap
Hello — Andy Larkin here, VP of Content at Cloud Academy. I am pleased to release our roadmap for the next three months of 2020 — August through October. Let me walk you through the content we have planned for you and how this content can help you gain skills, get certified, and...
New Content: Alibaba, Azure AZ-303 and AZ-304, Site Reliability Engineering (SRE) Foundation, Python 3 Programming, 16 Hands-on Labs, and Much More
This month our Content Team did an amazing job at publishing and updating a ton of new content. Not only did our experts release the brand new AZ-303 and AZ-304 Certification Learning Paths, but they also created 16 new hands-on labs — and so much more! New content on Cloud Academy At...
New Content: AWS, Azure, Typescript, Java, Docker, 13 New Labs, and Much More
This month, our Content Team released a whopping 13 new labs in real cloud environments! If you haven't tried out our labs, you might not understand why we think that number is so impressive. Our labs are not “simulated” experiences — they are real cloud environments using accounts on A...
New Content: AZ-500 and AZ-400 Updates, 3 Google Professional Exam Preps, Practical ML Learning Path, C# Programming, and More
This month, our Content Team released tons of new content and labs in real cloud environments. Not only that, but we introduced our very first highly interactive "Office Hours" webinar. This webinar, Acing the AWS Solutions Architect Associate Certification, started with a quick overvie...
DevOps: Why Is It Important to Decouple Deployment From Release?
Deployment and release In enterprise organizations, releases are the final step of a long process that, historically, could take months — or even worse — years. Small companies and startups aren’t immune to this. Minimum viable product (MVP) over MVP and fast iterations could lead to t...
DevOps Principles: My Journey as a Software Engineer
I spent the last month reading The DevOps Handbook, a great book regarding DevOps principles, and how tech organizations evolved and succeeded in applying them. As a software engineer, you may think that DevOps is a bunch of people that deploy your code on production, and who are alw...
Linux and DevOps: The Most Suitable Distribution
Modern Linux and DevOps have much in common from a philosophy perspective. Both are focused on functionality, scalability, as well as on the constant possibility of growth and improvement. While Windows may still be the most widely used operating system, and by extension the most common...
How to Effectively Use Azure DevOps
Azure DevOps is a suite of services that collaborate on software development following DevOps principles. The services in Azure DevOps are: Azure Repos for hosting Git repositories for source control of your code Azure Boards for planning and tracking your work using proven agil...
Docker vs. Virtual Machines: Differences You Should Know
What are the differences between Docker and virtual machines? In this article, we'll compare the differences and provide our insights to help you decide between the two. Before we get started discussing Docker vs. Virtual Machines comparisons, let us first explain the basics. What is ...
DevOps: From Continuous Delivery to Continuous Experimentation
Imagine this scenario. Your team built a continuous delivery pipeline. Team members deploy multiple times a day. Telemetry warns the team about production issues before they become outages. Automated tests ensure known regressions don't enter production. Team velocity is consistent and ...