Much has been written and discussed about SRE (Site Reliability Engineering) from what it is, how to do it, and how it’s the same (or different) as DevOps. Google coined the term, defined the profession, and wrote the book on it. Their “Site Reliability Engineering” book covers the ideas behind SRE and Google’s internal practices, which work well for them. Let’s put the Google specifics aside for a moment and instead focus on ideas, responsibilities, and objectives. Taking a step back from specific implementations, this post reviews the prerequisites required to bootstrap an SRE team that fits your organization. Before we dive into it, check out Cloud Academy’s Recipe for DevOps Success webinar in collaboration with Capital One and don’t forget to have a look at Cloud Roster, the job role matrix that shows you what kind of skills a DevOps Engineer should master to land their dream job. If you’re a company, we suggest reading The Four Tactics for Cultural Change in DevOps Adoption.
The What and Why Behind SRE
SRE is a way to build and run reliable production systems in increasingly complex technical environments. SRE acknowledges that running successful production systems is a specific skill that’s different than other engineering disciplines. Ben Treynor, the founder of the SRE team at Google, describes SRE responsibilities in an interview for the SRE book:
[the] SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Site Reliability Engineers require software development and operations skills. They’re expected to write software that assists with deployment and production operations and also debug software in production environments. A cursory look at SRE job posts shows new hires are expected to be fluent in a programming language (such as Go or Node.js), configuration management, and automation tools (such as Ansible, Chef, or Puppet) and cloud infrastructure (like AWS, Azure, or GCP). Experience with containers and container orchestration like Mesos or Kubernetes is a common job requirement too.
The interdisciplinary skill set is useful throughout the SDLC and overlaps with other technical team members. It may also cause SRE to become an organization’s junk drawer for work that doesn’t map clearly onto existing teams. It also means that these skills will be less effective if they’re not focused on clear goals and defined responsibilities.
Framing Responsibilities with SLOs
Generally, SRE’s goal is to promote system reliability and efficiency throughout the SDLC. Doing SRE well means tracking and assessing progress against metrics. Service Level Objectives (SLOs) are the entry point to reliability for many organizations and are provided in one way or another. They may already be written down, quantified and tracked, or they may be something as simple as an unspoken idea that the website must be up during work hours. SLO’s frame SRE’s operational work and they’re fundamental in doing so.
Stephen Thorne, SRE at Google, echoes this point in his talk titled “Getting Started with SRE” from the DevOps Enterprise Summit 2018.
You can’t run an effective site reliability engineering org unless you’re monitoring and reporting on your SLOs and actually worrying about the reliability of your system. It just doesn’t make any sense.
Setting measurable SLOs is the first checkpoint in getting started with SRE. Put bluntly, if your organization does not have written, measured, and reported SLO’s then it’s not ready for SRE. SLOs also need consequences since they’re worthless without enforcement. This provides SRE’s leverage to prioritize work that directly impacts SLOs as opposed to other work. The good news is that any organization can create and enforce SLOs (however they tend to carry different weight in a 2 person startup compared to a thousand strong enterprise organization).
SRE’s prime responsibility is ensuring their systems meet SLOs and many other things follow from that. That leads to the next question: how does SRE achieve this?
SRE strives to reduce toil in their day to day work which continuously improves their efficiency and dependent teams. (Also note, that continuous improvement is a fundamental DevOps principle that connects SRE to the larger DevOps movement.) The SRE book defines toil as:
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. 1
It’s not work that engineers don’t want to do. Toil is an inhibitor that should be reduced in all possible areas. SRE’s should always maximize their automation skills to reduce manual work, enabling an SRE team to scale out while maintaining consistency across their systems. Reducing toil is a powerful idea since it expands out to capture the day to day work of maintaining logging and metric systems, standing up new services, reporting SLOs, and/or adding CICD pipelines to other systems. This is all important work that other teams need but someone has to set up for them.
Google found that capping their SRE time to 50% on toil and the other 50% on project work (such a driving improvement or supporting existing teams) was a key factor in successful SRE implementations. Capping the work sets a clear limit on how painful toil may be and it exposes a clear priority in addressing toil that habitually pushes against the limit. It also enforces the idea that SRE is more than just toil and encourages a shared responsibility model. If the SREs are overwhelmed with toil, then work can be distributed across other teams. This sheds load from the SRE team while exposing other engineers to the reality of running their own systems in production.
Capping toil is the second checkpoint in getting started with SRE. Stephen Thorne reiterates this point in the talk mentioned earlier:
if you’re not capping that toil and allowing them to actually go and implement that [monitoring] work, then all they’re doing is getting overloaded with toil and then they won’t be able to do any project work. The next time they need to do some things to improve the reliability of the system, they’re too overloaded. I think any org with one or a thousand SREs must be able to apply this principle. There must be this ability for the SREs to address the toil and do the project work.2
After these two checkpoints, it’s up to management and leadership to form teams and set responsibilities.
Moving Towards Site Reliability Engineering
When you have SLOs, a declared cap on toil and a plan to handle overflow, then it’s time to consider what SRE looks like for your organization. There are three common models:
- A centralized SRE team (like a Google)
- A decentralized SRE team
- SREs embedded in teams
There is no one correct answer. The best fit varies by organization size and specific goals. Consider a simple example. An 8 person team may not require a dedicated SRE, and it certainly doesn’t mandate a dedicated SRE team. Conversely, there’s an inflection point where a dedicated SRE team makes sense and embedding SRE into existing teams makes sense. You must consider the trade-offs before making a decision.
VictorOps see SRE differently. They consider SRE a behavior rather than a dedicated role. Their goal is to build a culture of reliability into their engineers instead of into a specific team. They accomplished this by building a cross-functional council. Here’s Jason Hand from VictorOps in the eBook “Build the Resilient Future Faster: Creating a Culture of Reliability“:
For VictorOps, the SRE mentality would need to be central to the culture of our entire organization. The responsibility of owning the scalability and reliability of the product (VictorOps) from a customer experience point of view doesn’t rest solely on an SRE team or individual engineer. Rather than assigning the SRE role and responsibility to a specific team or individual, we chose to assemble a cross-functional panel of engineers, support leads, and product representatives referred to as the SRE council.
VictorOps came to this conclusion by surveying SREs at other companies and determining what seemed right for them. You should do this before getting started with SRE since implementations of SRE ideas vary wildly between different organizations. There is no gold standard, just what’s effective for your organization and yielding results. Learning from other teams is a great way to avoid pitfalls.
Regardless of how SRE is structured within your organization, you’ll need buy-in from leadership and engineers. Management must enforce consequences for missed SLOs, breaching caps on toil, and defining clear boundaries between SRE and other teams. Introducing SRE can be a major organizational change and when so will only be successful if supported at the highest levels.
Let’s review the checkpoints we’ve established along the way to getting started with SRE. First and foremost is to establish, monitor, and report on SLOs. SLOs provides the foundation for building and maintaining reliable systems. Second is the cap on toil which ensures SREs are focused on continuous improvements throughout the system and not on low-value toil work. Lastly, there’s the collaborative effort of documenting responsibilities and building organizational buy-in.
Once you’re through these gates it’s time to consider the initial goals. Jason Hand, from VictorOps, poses a series of exercises. First, ask the team what keeps them up at night? The answer brings skeletons out of the closet. That kickstarts the process and allows new SREs to navigate their responsibilities while improving reliability.
- https://landing.google.com/sre/book/chapters/eliminating-toil.html ↩︎
- https://itrevolution.com/getting-started-with-sre-stephen-thorne-google/ ↩︎
Enjoyed this post? You might also like:
- Learning Path: DevOps Playbook – Moving to a DevOps Culture
- Learning Path: DevOps Fundamentals
- How DevOps Transforms Software Testing
- The Benefits of Cloud Containers
DevOps and Agile: Understanding the Relationship
Agile development used to be front and center in the conversation about software development. Now, DevOps has taken over the conversation. How do agile and DevOps relate? Both ideas began as ways to improve different aspects of software development. Agile embraced the changing nature of...
What DevOps Means for Risk Management
What Does DevOps Mean for Risk Management?Adopting DevOps makes the unfamiliar uneasy in two areas. One, they see an inherently risky choice between speed and quality and second, they are concerned that the quick iterations of DevOps may break compliance rules or introduce security vu...
How DevOps Transforms Software Testing
Testing is arguably the most important aspect of software development. Whether manual or automated, testing ensures the software works as expected. Broken software causes production outages, unsatisfied customers, refunds, decreased trust, or even complete financial collapse. Testing mi...
From Monolith to Serverless – The Evolving Cloudscape of Compute
Containers can help fragment monoliths into logical, easier to use workloads. The AWS Summit New York was held on July 17 and Cloud Academy sponsored my trip to the event. As someone who covers enterprise cloud technologies and services, the recent Amazon Web Services event was an insig...
Four Tactics for Cultural Change in DevOps Adoption
Many organizations approach digital transformation and DevOps adoption with the belief that simply by selecting and using the right tools, they will achieve higher levels of automation and gain massive efficiencies as a result. While DevOps adoption does require new tools and processes,...
Get Started with HashiCorp Vault
Ongoing threats of data breaches and cyber attacks remain top of mind for every team responsible for securing cloud workloads and applications, especially with the challenge of managing secrets including passwords, tokens, API keys, certificates, and more. Complexity is especially notab...
Open Source Software Security Risks and Best Practices
Enterprises are leveraging a variety of open source products including operating systems, code libraries, software, and applications for a range of business use cases. While using open source comes with cost, flexibility, and speed advantages, it can also pose some unique security chall...
What is Static Analysis Within CI/CD Pipelines?
Thanks to DevOps practices, enterprise IT is faster and more agile. Automation in the form of automated builds, tests, and releases plays a significant role in achieving those benefits and creates the foundation for Continuous Integration/Continuous Deployment (CI/CD) pipelines. However...
What is Chaos Engineering? Failure Becomes Reliability
In the IT world, failure is inevitable. A server might go down, an app may fail, etc. Does your team know what to do during a major outage? Do you know what instances may cause a larger systems failure? Chaos engineering, or chaos as a service, will help you fail responsibly.It almo...
10 Ingredients for DevOps Transformation with Mark Andersen
At Capital One, DevOps is about delivering high quality, working software, faster. This means software that is reliable, secure, usable, and performant while providing value and accomplishing those important end user goals. Everything is about speed of delivery and getting that feedback...
SQL Injection Lab: Think Like a Hacker
Security is IT’s top spending priority according to the 2017/2018 Computer Economics IT Spending & Staffing Benchmarks report*. Given the frequent changes and updates in vendor platforms, the pressure is on for IT teams who need to keep their infrastructures and data secure. As brea...
Women in Tech: Zamira Jaupaj, DevOps Engineer
In building an enterprise culture of cloud, DevOps skills complement the enterprise’s need to automate development, testing, deployment, and operations processes for their public cloud deployments. In this latest post in our Women in Tech series, we’ll be talking to Zamira Jaupaj, a Dev...