This course explores how and why organizations choose to adopt site reliability engineering (SRE) principles and practices, the changes they'll need to address, and the risks and challenges they'll face when adopting SRE. By the end of this course you'll have a clear understanding of the organizational impact of SRE, the patterns for adoption, how to conduct blameless post-mortems, and finally, how SRE can be scaled should it be adopted within your organization.
If you have any feedback relating to this course, please contact us at support@cloudacademy.com.
Learning Objectives
- Explore the various ways that organizations can adopt SRE
- Define the site reliability engineer job role
- Understand the cultural changes that come with SRE
- Learn about SRE for growing organizations
Intended Audience
- Anyone interested in learning about SRE and its fundamentals
- Software engineers interested in learning about how to use and apply SRE within an operations environment
- DevOps practitioners interested in understanding the role of SRE and how to consider using it within their own organization
Prerequisites
To get the most out of this course, you should have a basic understanding of DevOps, software development, and the software development lifecycle.
Welcome back! In this course, I'm going to discuss how and why organizations choose to adopt SRE principles and practices, the changes they'll need to address, as well as the risks and challenges faced when adopting SRE. By the end of this course you'll have a clear understanding of the organizational impact of SRE, the patterns for adoption, how to conduct blameless post-mortems, and finally, how SRE can be scaled should it be adopted within your organization. Right, let's begin!
To begin with, let's review why organizations should embrace SRE. Typically, they do so to establish site reliability and to improve their ability to scale economically. Organizations embrace SRE to provide increased service resilience. Nowadays, anyone and everyone can find out if and when your service goes down. Platforms such as Downdetector are constantly running. There is no place to hide when an outage occurs.
As more and more people conduct their business online, the impact and exposure of any outage increases. Social media can make outages go viral in a matter of minutes, which can cause irreversible reputation and brand damage, ultimately affecting business performance and the bottom line. Organizations embrace SRE to minimize loss of revenue. According to Gartner, the average cost of service downtime is $5,600 per minute. Because there are so many differences in how businesses operate, downtime can cost as little as $140,000 per hour at the low end, around $300,000 per hour on average, and as much as $540,000 per hour at the high end.
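As a quick sanity check on those figures, here's a minimal sketch of the arithmetic. The only input is the Gartner per-minute average quoted above; everything else is simple multiplication.

```python
# Rough downtime cost arithmetic based on the Gartner average quoted above.
COST_PER_MINUTE = 5_600  # USD per minute of downtime (Gartner average)

def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE) -> float:
    """Estimated cost of an outage of the given length, in USD."""
    return minutes * cost_per_minute

# One hour of downtime at the average rate is roughly $336,000,
# consistent with the ~$300,000 per hour average mentioned above.
print(f"${downtime_cost(60):,.0f} per hour")
```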
Downtime of services almost always results in some form of commercial loss, but there can be other, more tangible impacts. For example, if a benefits or pensions service goes offline, then those dependent on it will encounter basic living problems. Being offline can also result in financial penalties being applied. Some organizations embrace SRE just because it's cool. SRE, like a range of other modern frameworks and techniques, is going through the hype curve. You need to beware of the hype. If you're going to embrace SRE, then do it for the right reasons: because you want to improve your business performance.
Let's now consider the following question. What's driving you towards SRE? Your reasons should go beyond technical ones. Importantly, consider the business benefits of SRE adoption. Moving on, patterns for SRE adoption. Consider who in your organization currently provides SRE services or capabilities. Considering this question may help you to identify existing in-house SRE expertise. For example, consider the following. Who is responsible for uptime and availability? Who sets SLOs? Who is setting and watching SLIs? How is DevOps currently established, if at all? Who takes care of toil? Who sets direction around automation and tooling? And finally, is anyone talking about anti-fragility?
Typical SRE adoption options can be summarized as follows:
Consulting. Many organizations start with the consulting model as a method of SRE adoption. They contract in experts to provide guidance and advice to service delivery teams. Typically, there is no hands-on involvement in delivery, so there is no shared responsibility. And consultants don't normally do on call.
Embedded. The embedded model involves having SRE specialists more involved in what service delivery teams actually do, co-working on SRE and often development activities. Knowledge then transfers over to the team, and responsibility for reliability becomes shared. SRE specialists pick up initial on-call activities, but over time everyone takes their turn to provide on-call support.
Platform. In the platform model, the SRE team takes care of the underlying platform, cloud or on-premises, that the service runs on, along with typically owning the tooling stack that powers SRE automation. To onboard onto the platform, services typically have to standardize their tooling. The SRE team owns the production platform and sometimes other pre-prod environments, and often provides first-level support and on call for these aspects. The SRE team may seek to share responsibility to a certain level, e.g. platform components, but usually does not develop enough knowledge of the service to move to full shared responsibility.
Slice and dice. In this model the overall service is sliced and diced into different sections, with SRE owning some elements and the service delivery team owning others. A slice can include application and infrastructure components, often a key part of the overall service, for example payment processing. This model is similar to the platform model but allows SRE teams to develop more in-depth knowledge and understanding of the wider service, allowing some shared responsibility to develop. On call responsibility is still divided.
And finally, full SRE. Going full SRE involves all team members embracing SRE practices, everyone going on call, shared responsibility for site reliability from day one, and backlogs that prioritize reliability on par with new features. When adopting SRE it's important to set realistic goals and expectations. Not everything will be achievable, nor should it be. Focus on what is appropriate. Set an overall vision for SRE adoption, a so-called true north, which the organization should head towards. It's worthwhile taking some time to view the following Uber video on their adoption of SRE and the lessons they learned along the way: what worked and what didn't.
Let's move on and define the SRE job role. What are the responsibilities that an SRE engineer typically focuses on? Consider the following Google SRE job ad posted on the 11th of November 2019, which requires the applicants to be able to: one, engage in, and improve the whole lifecycle of services. Two, support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews. Three, maintain services once they are live by measuring and monitoring availability, latency and overall system health. Four, scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. And five, practice sustainable incident response and blameless post-mortems.
Incident response is an important part of SRE, and it needs to be managed so that it is required as rarely as possible. Being on call is a critical duty that operations and engineering teams must undertake. Incident response must be sustainable, as per the Google job description previously reviewed. Being on call requires you to be familiar with the SLO numbers that apply when a service goes down.
Consider a service which has a three-nines availability SLO. This allows roughly 43 minutes of downtime per month, so all issues within a given month have to be resolved inside that combined 43-minute budget. Keep in mind that this time includes issue identification, alerting, messaging, triage, and patching or fixing. You can see why setting appropriate and realistic SLOs for a service is so important. Consider the following. How often should SREs be expected to be on call? Google advocates a 25% on-call rule alongside the 50% toil limit rule.
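To make the arithmetic behind that 43-minute figure concrete, here's a minimal sketch (assuming a 30-day month) that converts an availability SLO into a monthly downtime budget:

```python
# Convert an availability SLO into an allowed monthly downtime budget.
MINUTES_PER_MONTH = 30 * 24 * 60  # assuming a 30-day month

def downtime_budget_minutes(slo: float, period_minutes: int = MINUTES_PER_MONTH) -> float:
    """Allowed downtime in minutes for a given availability SLO, e.g. 0.999 for three nines."""
    return (1 - slo) * period_minutes

print(round(downtime_budget_minutes(0.999), 1))   # 43.2 -> the ~43 minutes quoted above
print(round(downtime_budget_minutes(0.9999), 2))  # 4.32 -> four nines leaves barely four minutes
```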
An example SRE roster may include the following. To avoid a single point of failure, have at least two SREs on call. Providing ongoing 24/7 support will typically require at least eight SREs. SREs are scheduled to be on call for at least one week every month. Spreading on call across multiple sites can mean no night shifts; this is the follow-the-sun method. Being underloaded means having so many people on call that they rarely get involved in any issues and therefore do not gain the wisdom of production. When it comes to action time, you want to be as effective as possible.
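As an aside, the roster numbers above follow directly from the 25% on-call rule; here's a minimal sketch of that arithmetic (the two-person rotation is just the example from above, not a fixed requirement):

```python
import math

def min_team_size(on_call_at_once: int, max_on_call_fraction: float) -> int:
    """Smallest team that can sustain the rotation without anyone exceeding the on-call cap."""
    return math.ceil(on_call_at_once / max_on_call_fraction)

# Two SREs on call at any time, nobody on call more than 25% of the time:
print(min_team_size(2, 0.25))  # 8 -> matches the "at least eight SREs" guidance above
```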
Consider the following best practice checklist. One, allocated individuals. Giving organizational clarity around who is on call. Two, suitable devices. Providing a mechanism for receiving all relevant information. Three, alert delivery systems. Capturing and delivering the right information and any background context. Four, documented procedures. Minimizing the risk of on call response activities. And five, blameless post-mortems. Making sure that the same issue does not repeat.
To perform on call effectively, these are the kinds of things organizations must have in place. Minimizing risk is very important, as quite often SREs are called at unsociable hours and may not be thinking straight. Having documented recovery procedures makes fixing things easier. When and where possible, use automation to replace on-call call-outs. Self-healing services allow us to move away from traditional, operator-driven repair.
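As a toy illustration of the self-healing idea (this is not any particular product's behaviour; the health endpoint and restart command below are hypothetical), a simple watchdog loop can restart an unhealthy service instead of paging a human:

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"          # hypothetical health-check endpoint
RESTART_CMD = ["systemctl", "restart", "my-service"]  # hypothetical restart command

def healthy(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if the service answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False

while True:
    if not healthy():
        # Self-heal instead of paging an on-call engineer. The action should be
        # logged so it can still be reviewed in a blameless post-mortem.
        subprocess.run(RESTART_CMD, check=False)
    time.sleep(30)
```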
Tools such as Kubernetes and AWS auto-scaling groups provide self-healing capabilities out of the box. Automation avoids the issue of impaired judgment for called-out operators. And self-healing actions can be reviewed in blameless post-mortems. Let's now dive deeper into the concept of blameless post-mortems. The following quote articulates the concept that failure is expected to happen. "So, failure happens. This is a foregone conclusion when working with complex systems."
Organizations can, and do, try to prevent and minimize failures through testing, monitoring, and automation but things will still fail. Don't deny this and don't blame individuals when things do go wrong. Instead, look at the events that happened leading up to the failure and try and work out what happened without any blame. This is central to the concept of blameless post-mortems.
Ground rules should be in place to decide when a post-mortem needs to take place. Not every alert or outage will require one, especially if there is little or minimal user impact. However, blameless post-mortems should take place when any of the following conditions occur. One, user-visible downtime or degradation beyond a certain threshold, e.g. an SLO. Two, data loss of any kind. Three, on-call engineer intervention, for example a release rollback or rerouting of traffic, et cetera. Four, a resolution time above some threshold. And five, a monitoring failure, which usually implies manual incident discovery. The blameless post-mortem is all about establishing facts. There are usually several participants who look at the failure from different angles, and all viewpoints are to be captured equally.
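Ground rules like those can even be written down as a simple, automatable policy check. The sketch below is purely illustrative; the field names and thresholds are assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    user_visible_downtime_minutes: float
    data_loss: bool
    on_call_intervention: bool      # e.g. a release rollback or traffic rerouting
    resolution_time_minutes: float
    found_by_monitoring: bool       # False implies manual incident discovery

def needs_postmortem(incident: Incident,
                     downtime_threshold_minutes: float = 5.0,
                     resolution_threshold_minutes: float = 60.0) -> bool:
    """Apply the ground rules above: any single condition triggers a blameless post-mortem."""
    return (
        incident.user_visible_downtime_minutes > downtime_threshold_minutes
        or incident.data_loss
        or incident.on_call_intervention
        or incident.resolution_time_minutes > resolution_threshold_minutes
        or not incident.found_by_monitoring
    )
```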
Performing a blameless post-mortem may be considered a major cultural change, therefore emphasize that they are to take place without fear of punishment or retribution. Organizations need to make their employees feel safe to talk about failure. An engineer who thinks they're going to be punished is disincentivized from giving the details necessary to build a true picture of the failure that happened. This lack of understanding of how the incident occurred all but guarantees that it will repeat, if not with the original engineer then perhaps with another one in the future.
High-trust organizations encourage good information flow, cross-functional collaboration, shared responsibilities, learning from failures, and new ideas. Organizations should become more performance-based and therefore use and embrace failure as an opportunity to learn and improve. Borrowing from the agile movement, the following quote emphasizes the need to set the scene for open and honest discussions during Scrum retrospectives.
Scrum retrospectives are a key project feedback loop, and this directive can also be used at the start of any blameless post-mortem activity. Creating safety means people are allowed to do more with the production environment without fear of repercussion. To be safe, engineers also need to be accountable when things go wrong. If an engineer causes an incident, then the focus is less on blame and punishment and more on making the engineer accountable for their actions and on proactively making incidents of that kind much less likely.
There is a need to balance the demands of DevOps for more continuous delivery and deployment with an understanding that engineers are going to be accountable for their issues when things go wrong. More engineers working within production often means greater throughput and value realization, however that throughput needs to be reliable. If and when something goes wrong then SREs may pick up the incident but the engineer will be a key part of the whole post-mortem process.
Incidents and failure may be viewed through different lenses. One viewpoint is blameful, while the other is blameless. After considering both perspectives presented here, which one would you side with? A blameless post-mortem has outputs, which can include any of the following. One, details of the incident or failure, such as summary, impact, trigger, detection, resolution, and/or participants. Two, a list of follow-up actions to mitigate the chances of the same incident, or similar ones, happening again. Three, lessons learned from the incident. Four, a timeline of what happened. And five, any supporting information. This is the kind of information that needs to be put on record after a post-mortem has taken place. A lot of this information is generated at the time of the incident and therefore should be captured by those involved.
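Those outputs map naturally onto a simple record that can be filed for every post-mortem. The structure below is a sketch derived from the list above, not a prescribed template:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PostmortemRecord:
    summary: str
    impact: str
    trigger: str
    detection: str
    resolution: str
    participants: List[str]
    follow_up_actions: List[str] = field(default_factory=list)  # mitigate recurrence
    lessons_learned: List[str] = field(default_factory=list)
    timeline: List[str] = field(default_factory=list)           # captured during the incident
    supporting_info: List[str] = field(default_factory=list)    # links to logs, dashboards, etc.
```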
As well as human inputs there will be automation feedback and monitoring inputs too. Data from log files and from pipelines can provide hard evidence of what just happened and when. Consider the following case story provided by Sage Group. The company has focused on automating incident management and, in doing so, has learned to accept that incidents are expected to happen. They therefore try to take away the admin pain, allowing engineers to simply focus their energies on resolution.
Tooling can and does help with this, and the Sage story provides a good example. ServiceNow is a popular service management platform. Slack is an online communication platform. Zoom is a web-based conferencing product. Sage has invested time in integrating these tools together. A full history of what has happened during any incident is recorded too, which then feeds into their post-mortem process. Next up, let's talk about how SRE can be scaled within an organization.
As a quick reminder, Google didn't just invent SRE for the sake of it. Instead, they did it to solve a real-world problem: how do you handle problems in a massively distributed systems environment and cope with growth in services, users, geographies, and the number of requests sent through it, all at a mind-blowing scale? Rolling out SRE practices and principles within your own organization can be challenging; however, there are certain key success factors that will help you along your way, such as the following.
One, getting exec level support and buy-in. Two, allocated funding and budget to run the team. Three, good relationships across the delivery spectrum: engineers, testing, infrastructure, et cetera. And four, an organization that is growing. The first two items apply to any type of organizational change. There must be support and sponsorship at a very senior level. SRE involves the concept of shared ownership and shared responsibility. Having strong relationships between SRE and other teams is very important.
SRE is also key to organizations that are growing, either by user volume or transactional volume. An organization that is not growing will not face the kinds of challenges that SRE is trying to solve. This is why we talk about scaling SRE within organizations. An organization can experience growth in different dimensions. For example, consider the following. Platform growth. Large volumes of users, irregular data flows, and legacy-to-modern architectures. Scope growth. New products and/or services. And ticket growth. Volume of incidents, outages, requests, and/or toil.
SRE typically needs to scale because an organization changes across one or more of these dimensions. "Through engineering solutions SRE allows organizations to scale their services at a much greater rate than the scale of their organization." This quote provided by Google emphasizes the fact that as services grow we need to use better engineering approaches to support that growth. As customer volumes grow engineering solutions help us to scale services without the need to scale the organization itself.
What SRE approaches can be used to scale out platform growth? To answer this question, we have several automation capabilities which can be used individually or in combination to safely handle platform growth, such as: one, automation techniques like auto scaling, containerization, and/or clustering. Two, flexible platforms such as public and private clouds. Three, non-relational NoSQL databases, for example MongoDB. And four, as-a-service capabilities around build, deploy, test, and monitoring.
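As a simplified illustration of the first of those capabilities, auto scaling boils down to a feedback rule along the lines sketched below. The target and limits are illustrative assumptions; platforms such as Kubernetes and AWS auto-scaling groups implement this kind of logic for you.

```python
import math

def desired_replicas(current: int, cpu_utilisation: float,
                     target: float = 0.6, minimum: int = 2, maximum: int = 20) -> int:
    """Scale the replica count so average CPU utilisation moves towards the target."""
    proposed = math.ceil(current * cpu_utilisation / target)
    return max(minimum, min(maximum, proposed))

print(desired_replicas(current=4, cpu_utilisation=0.9))  # 6 -> scale out under load
print(desired_replicas(current=4, cpu_utilisation=0.3))  # 2 -> scale in when demand drops
```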
Scope growth can be handled by the following. SRE ownership of common tools and platforms. The platform SRE model, which other development teams then use. SRE expertise shifting left into development teams, the so-called embedded SRE model. And automating toil, which makes more SRE time available for development. Ticket growth can be handled through automation techniques.
As we saw earlier in the toil course, tickets are usually considered toil: low-value, repetitive work. Automating away toil is a key element of the SRE concept. All right, we've almost finished. Before this course ends, consider performing the following exercise. Challenge yourself to consider how SRE could be embraced and adopted within your own organization. What adoption approach would you advocate? How would you and your teams deal with being on call? How would you expect SRE to scale over time? And finally, sketch out a new organization chart showing clearly the changes introduced through your SRE adoption.
Finally, consider the following case story given by the Department for Work and Pensions. "A major focus of SREs is the aspiration to never see the same issue twice, often using automation as a resolution. We spend a large amount of time on reducing human labor, sharing knowledge between teams, and creating a blameless environment." Here we can see that SRE is clearly helping the Department for Work and Pensions to manage and maintain their important business services, and that their end goal is ensuring the reliability of their service to their customers.
Okay, that completes this course. In this course, you learned about the organizational impact of SRE, patterns for adoption, how to conduct blameless post-mortems, and how SRE can be scaled within an organization. Okay, close this course and I'll see you shortly in the next one.
Jeremy is a Content Lead Architect and DevOps SME here at Cloud Academy where he specializes in developing DevOps technical training documentation.
He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 25+ years. In recent times, Jeremy has been focused on DevOps, Cloud (AWS, Azure, GCP), Security, Kubernetes, and Machine Learning.
Jeremy holds professional certifications for AWS, Azure, GCP, Terraform, and Kubernetes (CKA, CKAD, CKS).