Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. This course introduces you to Site Reliability Engineering and takes you through its important features. It'll answer the fundamental question "What is Site Reliability Engineering?" before moving on to explaining the key differences between SRE and DevOps, and then finish by reviewing SRE principles and practices.
If you have any feedback relating to this course, please contact us at support@cloudacademy.com.
Learning Objectives
- Get acquainted with SRE and what it is
- Learn the six key principles of SRE
- Understand the differences between SRE and DevOps
Intended Audience
- Anyone interested in learning about SRE and its fundamentals
- Software Engineers interested in learning about how to use and apply SRE within an operations environment
- DevOps practitioners interested in understanding the role of SRE and how to consider using it within their own organization
Prerequisites
To get the most out of this learning path, you should have a basic understanding of DevOps, software development, and the software development lifecycle.
Resources
Link to the YouTube video referenced in this course: What's the Difference between DevOps and SRE?
Welcome back. In this course, I'm going to introduce you to Site Reliability Engineering and take you through its important features. I'll answer the fundamental question "What is Site Reliability Engineering?" I'll then move on to explaining the key differences between SRE and DevOps, and then finish by reviewing SRE principles and practices. Okay, let's begin!
So what is Site Reliability Engineering? Site Reliability Engineering, created by Google around 2003, is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. Now, to help with your understanding of Site Reliability Engineering, please consider a number of resources that Google has kindly published at the following location: https://landing.google.com/sre/.
On this landing page, you'll find a number of online resources and documentation that will help you to understand and broaden your knowledge of exactly what SRE is. Back to the question: what is Site Reliability Engineering? The key goal is to create ultra-scalable and highly reliable distributed software systems. Someone who acts as an SRE will spend 50% of their time doing ops-related work, such as issue resolution, being on call, and/or performing manual interventions.
SREs also spend 50% of their time on development tasks, such as building new features, scaling systems, or writing more automation. Even though Google was the first to really formalize the idea of Site Reliability Engineering, it has now spread well beyond Google. Many other organizations are now running large-scale services and embracing SRE to support them.
As this learning path continues, you'll hear about case stories from the following organizations. Another interesting view and perspective on SRE can be found online at the SRE Weekly website, located at https://sreweekly.com. This particular website focuses more on the operations side of SRE and provides insights into scalability, availability, incident response, and automation, all of which are very important to practicing Site Reliability Engineering.
Okay, let's move on and now consider the differences between SRE and DevOps. Before we get into our own discussion on the differences between DevOps and SRE, I want to draw your attention to the following YouTube-hosted video. This particular video, created by Google, provides interesting insights into how they distinguish between DevOps and SRE, and I highly recommend you watch it. If we were to consider SRE from a programming or object-oriented perspective, then you would consider SRE to be an implementation of DevOps.
So SRE is the specialization and DevOps is the generalization. DevOps itself is a set of practices, guidelines, and culture designed to break down silos in IT development, operations, architecture, networking, and security. Site Reliability Engineering, on the other hand, is considered a set of practices that Google has found to work, along with the beliefs that animate those practices, and a job role.
So in summary, SRE is specifically a job role, whereas DevOps goes beyond a job role: DevOps is everyone's job. To dive a little bit deeper into the key differences between SRE and DevOps, let's now consider how Google views DevOps in terms of five key pillars of success, and how each of them maps to SRE. One, reduce organizational silos. SRE shares ownership with developers to create shared responsibility, and SREs use the same tools that developers use, and vice versa. Two, accept failure as normal. SREs embrace risk, and SRE quantifies failure and availability in a prescriptive manner using Service Level Indicators (SLIs) and Service Level Objectives (SLOs). SRE also mandates blameless postmortems.
Three, implement gradual changes. SRE encourages developers and product owners to move quickly by introducing small changes, thus reducing the cost of failure. Four, leverage tooling and automation. SREs have a charter to automate menial tasks, often called toil. Again, SREs use the same tools that developers use, and vice versa. And five, measure everything. SRE defines prescriptive ways to measure values, and SRE fundamentally believes that systems operation is a software problem. Observability introduces the concept of measuring the health of a service.
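The earlier remark that SRE can be seen as an implementation of DevOps, together with the five pillars just listed, can be sketched in Python. This is only an illustration of the analogy: the class and method names are invented for this example, and the returned strings simply paraphrase the mapping described above.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The generalization: five pillars with no prescribed implementation."""

    @abstractmethod
    def reduce_organizational_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_changes(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """The specialization: one concrete way of satisfying each pillar."""

    def reduce_organizational_silos(self) -> str:
        return "share ownership with developers"

    def accept_failure_as_normal(self) -> str:
        return "SLIs, SLOs, and blameless postmortems"

    def implement_gradual_changes(self) -> str:
        return "small changes that reduce the cost of failure"

    def leverage_tooling_and_automation(self) -> str:
        return "automate toil"

    def measure_everything(self) -> str:
        return "prescriptive measurement of service health"


if __name__ == "__main__":
    sre = SRE()
    print(isinstance(sre, DevOps))  # True
    print(sre.leverage_tooling_and_automation())
```

The abstract base class cannot be instantiated on its own, which mirrors the point that DevOps describes what should happen while SRE prescribes how.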
Let's now move on and consider the key principles and practices of SRE, beginning with number one: operations is considered a software problem. The basic tenet of SRE is that doing operations well is a software problem, and SRE should therefore use software engineering approaches to solve that problem. Software engineering as a discipline focuses on designing and building rather than operating and maintaining. In terms of statistics, estimates suggest that anywhere between 40% and 90% of the actual total cost of ownership is incurred after launch. Number two, service levels. A Service Level Objective (SLO) is an availability target for a product or service. This is never 100%, and in SRE, services are managed to the SLO. So to be clear here, there are three important acronyms: SLI, SLO, and SLA. Let's go through each of these.
An SLI is a Service Level Indicator, an indicator of the level of service that you are providing, for example, an HTTP request success rate of, say, 99%. An SLO, or Service Level Objective, specifies a target level for the reliability of your service. SLOs need consequences if they are violated. An SLA, or Service Level Agreement, is a business contract that comes into effect when your users are unhappy and you have to compensate them in some fashion. Number three, toil. Any manual, mandated operational task shall be considered bad. If a task can be automated, then it should be automated.
Tasks can provide the wisdom of production that will inform better system design and behavior. SREs must have time to make tomorrow better than today. Number four, automation. Automate what is currently done manually. Decide what to automate and how to automate it. Take an engineering-based approach to problems rather than just toiling at them over and over. This should dominate what an SRE does. An SRE shouldn't automate a bad process; instead, fix the process first. Sometimes within Site Reliability Engineering, we talk about automating this year's job away. This, however, does not mean redundancy. The time saved by doing this will go into engineering better products and services.
Number five, reduce the cost of failure. Late problem discovery is expensive. Therefore, an SRE looks for ways to avoid this and to improve the MTTR, or mean time to repair. Smaller changes help with this, so consider using canary deployments. SREs embrace the DevOps lean concept of smaller batch sizes. Canary deployments reduce the risk of introducing a new software version into production by slowly rolling out the change to a small subset of users before rolling it out to everyone. Failure is an opportunity to improve.
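As a rough illustration of the canary idea just described, the following Python sketch assigns each user to a stable bucket and sends only a chosen percentage of users to the new version. The function names, the user ID format, and the hash-based bucketing scheme are all assumptions made for this example; real rollouts are usually handled by a load balancer or deployment tool rather than application code like this.

```python
import hashlib


def canary_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(user_id: str, canary_percent: int) -> str:
    """Return "canary" for roughly canary_percent of users, else "stable"."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if canary_bucket(user_id) < canary_percent else "stable"


if __name__ == "__main__":
    # Hypothetical user IDs, purely for demonstration.
    users = [f"user-{n}" for n in range(1000)]
    for percent in (1, 10, 50, 100):
        on_canary = sum(choose_version(u, percent) == "canary" for u in users)
        print(f"{percent}% rollout: {on_canary} of {len(users)} users on canary")
```

Because the bucket depends only on the user ID, a user who sees the canary at 10% keeps seeing it as the rollout widens to 50% and then 100%, which keeps the experience consistent while the change is gradually exposed to everyone.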
Number six, shared ownership. SREs share skill sets with product development teams. Boundaries between application development and production, Dev and Ops, should be removed. SREs shift left and provide the wisdom of production to development teams. SRE encourages more engineers to have experience of production deployments. Not least, no one team or individual shall become the Ops team.
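Returning briefly to principle two, the relationship between an SLI and an SLO can be made concrete with a small Python sketch. It computes an HTTP request success rate as the SLI and compares it with an SLO target. The function names and the request counts are hypothetical, and the 99% figure from the earlier example is reused here as the target purely for illustration.

```python
def success_rate(successful: int, total: int) -> float:
    """The SLI: the fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successful <= total:
        raise ValueError("successful must be between 0 and total")
    return successful / total


def meets_slo(sli: float, slo_target: float) -> bool:
    """Compare a measured SLI against the SLO target."""
    return sli >= slo_target


if __name__ == "__main__":
    slo_target = 0.99  # assumed target for this example
    # Hypothetical (successful, total) request counts for two windows.
    for successful, total in ((9_950, 10_000), (9_800, 10_000)):
        sli = success_rate(successful, total)
        status = "within SLO" if meets_slo(sli, slo_target) else "SLO violated"
        print(f"SLI {sli:.2%} vs SLO {slo_target:.0%}: {status}")
```

The SLA does not appear in the code because it is a business contract rather than a measurement; in practice its threshold would typically be set looser than the internal SLO.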
In the following case story from Bloomberg, consider the quote by the manager, which can be summarized as providing the following SRE benefits: product stability improvements through team collaboration, client cost savings due to fewer outages, and a reduced daily grind of managing services and infrastructure through the use of more automation. Utilizing SRE within your own organization can really help to boost staff morale. Consider the following quote from the lead site reliability engineer at Keener Security.
Okay, that completes this course. In this course, I reviewed the question "What is Site Reliability Engineering?" and gave a detailed answer. I then moved on to explaining the key differences between SRE and DevOps. I then finished by providing the six key SRE principles and practices. Okay, close this course, and I'll see you shortly in the next one.
Jeremy is a Content Lead Architect and DevOps SME here at Cloud Academy where he specializes in developing DevOps technical training documentation.
He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 25+ years. In recent times, Jeremy has been focused on DevOps, Cloud (AWS, Azure, GCP), Security, Kubernetes, and Machine Learning.
Jeremy holds professional certifications for AWS, Azure, GCP, Terraform, Kubernetes (CKA, CKAD, CKS).