In an IT environment, people and processes are as important as software. The concept of site reliability engineering, pioneered by Google, applies aspects of software engineering to operations with the goal of creating software systems that are highly scalable and reliable. In this post, we’ll take a closer look at the role of the Site Reliability Engineer and some of the best practices for this role, including runbooks, incident response reports, postmortem reports, and root cause analysis.
What does a Site Reliability Engineer do?
Ben Treynor, the founder of Google’s Site Reliability Team, defines site reliability engineering as “what happens when a software engineer is tasked with what used to be called operations.”
A Site Reliability Engineer is responsible for the availability, performance, monitoring, and incident response, among other things, of the platforms and services that our company runs and owns.
We must make sure that everything that goes to production complies with a set of general requirements such as diagrams, dependencies on other services, monitoring and logging plans, backups, and possible high availability setups.
Even when the software complies with all of the necessary requirements, issues such as uncaught exceptions, hardware degradation, networking problems, high resource usage, or slow responses from our services could happen at any time. We always need to be prepared and ready to act.
Our effectiveness will be measured as a function of mean time to recover (MTTR) and mean time to failure (MTTF). In other words, we must have our services up and running again as quickly as possible, and we must avoid any subsequent failure for as long as possible.
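To make these two measures concrete, here is a minimal sketch of how they could be computed from a list of incidents. The function names and the sample timestamps are assumptions made for illustration; a real team would pull these values from its incident tracker.

```python
from datetime import datetime, timedelta

def mean_time_to_recover(incidents):
    """Average time between a failure and the matching recovery."""
    total = sum((recovered - failed for failed, recovered in incidents), timedelta())
    return total / len(incidents)

def mean_time_to_failure(incidents):
    """Average uptime between one recovery and the next failure."""
    ordered = sorted(incidents)
    gaps = [
        next_failed - recovered
        for (_, recovered), (next_failed, _) in zip(ordered, ordered[1:])
    ]
    return sum(gaps, timedelta()) / len(gaps)

# Hypothetical incidents as (failed_at, recovered_at) pairs.
incidents = [
    (datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 3, 30)),    # 30 minutes down
    (datetime(2024, 1, 11, 3, 30), datetime(2024, 1, 11, 4, 30)), # 60 minutes down
    (datetime(2024, 1, 31, 4, 30), datetime(2024, 1, 31, 5, 0)),  # 30 minutes down
]

print(mean_time_to_recover(incidents))  # 0:40:00
print(mean_time_to_failure(incidents))  # 15 days, 0:00:00
```

A lower MTTR means we recover faster, and a higher MTTF means failures are further apart, so we want the first number to shrink and the second to grow.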
The importance of documenting incidents
Now that we have an idea of what a Site Reliability Engineer does, let’s look at a typical scenario that one might encounter.
It is 3 a.m. and as the on-call engineer, you receive an SMS alert that something is wrong with one of your platforms. Your company is losing money every minute that the platform is down, so you get up and connect to the company’s VPN and start to troubleshoot. This is not difficult if you know the platform or service that went down. It is relatively easy if you have been working in the company for a long time and have the knowledge to fix it. You will know where to start and what to do… right?
But what if you are new to the company, or new to a platform that is owned by someone else? What are you going to do?
What is a runbook?
One of the best resources for someone in this situation is a runbook.
Put simply, a runbook is a set of instructions on what to do or check when something goes wrong with a given service, application, or platform. Also known as an operational manual, a runbook is our guide for resolving an incident. For a site reliability engineer working on call, it can be the difference between resolving an issue in a few minutes and resolving it in a few hours. A runbook will guide us through the procedures necessary for getting our service up and running again quickly (or within a reasonable amount of time).
Ideally, a runbook for a piece of software is written by its developers and followed by anyone who is on call. This means that the runbook should exist before the new software goes to production. Of course, this is, as I mentioned, the ideal case.
Runbooks can be written to troubleshoot infrastructure, hosts, or any other services and platforms that are in use, so you can have a different runbook for each issue or service that requires a human to fix it. If a runbook can be converted into a script, let’s do that: automate it and make the service able to heal itself. Otherwise, we should document all of the steps required to handle an alert. Either way, we must still perform the required analysis and then implement changes to our service to prevent the issue from happening again.
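As a rough illustration of what converting a runbook into a script might look like, the sketch below runs remediation steps in order until a health check passes. Everything here is hypothetical: the `Step` class, the `run_runbook` function, and the simulated service state stand in for whatever checks and commands your own runbook contains.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Step:
    description: str
    action: Callable[[], None]  # hypothetical remediation, e.g. restarting a service

def run_runbook(is_healthy: Callable[[], bool], steps: List[Step]) -> Tuple[List[str], bool]:
    """Run steps in order until the service is healthy.

    Returns the descriptions of the steps that were executed and whether
    the service recovered, so the result can go into the incident report.
    """
    executed = []
    for step in steps:
        if is_healthy():
            break
        step.action()
        executed.append(step.description)
    return executed, is_healthy()

# Simulated service state, standing in for real checks and commands.
state = {"disk_full": True, "service_up": False}

def rotate_logs():
    state["disk_full"] = False

def restart_service():
    # In this simulation the restart only works once disk space is free.
    state["service_up"] = not state["disk_full"]

def fail_over():
    state["service_up"] = True

steps = [
    Step("Rotate logs to free disk space", rotate_logs),
    Step("Restart the service", restart_service),
    Step("Fail over to the standby host", fail_over),
]

executed, recovered = run_runbook(lambda: state["service_up"], steps)
print(executed)   # ['Rotate logs to free disk space', 'Restart the service']
print(recovered)  # True
```

If the script finishes without recovering the service, that is the point at which a human should be paged, with the list of executed steps attached.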
If you don’t have runbooks in your organization, you should at least be writing down every step performed in the process of troubleshooting an issue. Even if the steps taken did not eventually resolve the problem, it’s worthwhile to have a record of how you tried to solve the issue. In this way, you can learn from these mistakes down the road.
If your organization already has several runbooks, it’s likely that you will encounter a situation that was not considered at the beginning of the design or development, and you will have to create a new runbook for that case.
What is an incident response report?
Once an issue has been resolved, you’ll want to properly document the incident to ensure that it doesn’t happen again. An incident response report is meant to record everything that happened, from each step that you performed to all of the commands (both good and bad) that you executed. Did you follow a runbook or did you have to troubleshoot in a different way? What was the initial situation? Did we receive an alert or discover the issue another way? If we did receive an alert, was the message meaningful in terms of helping us troubleshoot the issue?
In addition to describing what happened, the incident response report should include all of the communications around resolving it. Who did we notify that the service was down? Who helped us? Who was affected by the incident? How serious was it, and for how long was the service down?
We need to document every single aspect of what we did to fix our service. This information will be used in our analysis to discover the root cause. When we have gathered all of this information and determined the root cause, we will be able to make or request the changes required to make our platform more reliable. This will help us improve our mean time to failure (MTTF) and mean time to recover (MTTR).
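One way to keep these reports consistent is to give them a fixed structure. The sketch below is only an assumption about how such a record could look; the class and field names are invented for illustration and mirror the questions raised above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class TimelineEntry:
    at: datetime
    action: str   # the command run or step taken
    helped: bool  # record the attempts that did not work, too

@dataclass
class IncidentReport:
    service: str
    severity: str
    detected_by: str                         # "alert" or another discovery path
    started_at: datetime
    resolved_at: datetime
    runbook_followed: Optional[str] = None   # None if we had to improvise
    notified: List[str] = field(default_factory=list)
    affected: List[str] = field(default_factory=list)
    timeline: List[TimelineEntry] = field(default_factory=list)

    def downtime(self) -> timedelta:
        return self.resolved_at - self.started_at

    def failed_attempts(self) -> List[str]:
        return [entry.action for entry in self.timeline if not entry.helped]

# Hypothetical incident used only to show the structure.
report = IncidentReport(
    service="checkout-api",
    severity="high",
    detected_by="alert",
    started_at=datetime(2024, 1, 1, 3, 0),
    resolved_at=datetime(2024, 1, 1, 3, 45),
    notified=["on-call manager"],
    affected=["web customers"],
    timeline=[
        TimelineEntry(datetime(2024, 1, 1, 3, 10), "restarted the web workers", False),
        TimelineEntry(datetime(2024, 1, 1, 3, 40), "freed disk space on the database host", True),
    ],
)

print(report.downtime())         # 0:45:00
print(report.failed_attempts())  # ['restarted the web workers']
```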
What is a postmortem report?
For a site reliability engineer, resolving the problem is only half of the job. We must make sure that it doesn’t happen again. Therefore, we will need to perform a root cause analysis. To properly perform this analysis we should have all of the information that would ideally be contained in an incident response report.
Like the incident response report, our postmortem report will contain a timeline of everything we did to fix the problem, a root cause analysis, corrective and preventative measures, and a section that describes the resolution and recovery of our service.
The root cause analysis can be a discussion or brainstorming session about what worked and what didn’t work when resolving the incident. The corrective and preventative measures will be tickets or tasks that we must perform to avoid or mitigate future incidents.
Finally, the resolution and recovery section will be filled out with technical information and possibly code snippets that we used while resolving the fault.
Monitoring and alerting
Our sample scenario above was prompted by an alert. However, the alert wouldn’t exist without a good monitoring process in place.
Monitoring and alerting are two essential processes for site reliability engineers. We must monitor every possible metric within our platform so that we have a precise understanding of our system’s health at all times. The monitoring plan must be created along with the system design, or with each service that we are going to support.
A common practice is to monitor specific metrics, set thresholds, and trigger alerts based on those thresholds. The lesson to learn here is to try to make software that can interpret the alerts and automatically heal our system, sending alerts only if human intervention is absolutely necessary. These alerts should contain a clear explanation of the issue, possible tasks to perform or try, and links to documentation and runbooks related to the service incurring the problem.
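The idea of healing first and alerting only when that fails can be sketched as follows. This is a minimal example under stated assumptions: `check_metric`, the `Alert` fields, and the runbook URL are hypothetical names, and `try_heal` stands in for whatever automated remediation exists for the service.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Alert:
    summary: str                  # clear explanation of the issue
    suggested_actions: List[str]  # possible tasks to perform or try
    runbook_url: str              # link to documentation for the service

def check_metric(
    name: str,
    value: float,
    threshold: float,
    try_heal: Callable[[], bool],
    runbook_url: str,
) -> Optional[Alert]:
    """Return an Alert only if the threshold is exceeded and healing failed."""
    if value <= threshold:
        return None  # within limits, nothing to do
    if try_heal():
        return None  # healed automatically, no human needed
    return Alert(
        summary=f"{name} is {value} (threshold {threshold}) and automatic healing failed",
        suggested_actions=["Follow the linked runbook", "Escalate to the service owner"],
        runbook_url=runbook_url,
    )

RUNBOOK = "https://wiki.example.com/runbooks/disk-usage"  # hypothetical URL

quiet = check_metric("disk_usage_percent", 95.0, 90.0, lambda: True, RUNBOOK)
paged = check_metric("disk_usage_percent", 95.0, 90.0, lambda: False, RUNBOOK)

print(quiet)              # None
print(paged.runbook_url)  # https://wiki.example.com/runbooks/disk-usage
```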
The book on Site Reliability Engineering defines three kinds of valid monitoring output:
- Alerts: Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
- Tickets: Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
- Logging: No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
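The three kinds of output described above (the SRE book calls them alerts, tickets, and logging) boil down to two questions: does a human need to act, and how soon? A small sketch of that decision, with hypothetical names and sample events, might look like this:

```python
from enum import Enum

class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # recorded for diagnostics only

def classify(needs_human: bool, immediately: bool) -> Output:
    """Map a monitoring event to one of the three kinds of output."""
    if needs_human and immediately:
        return Output.ALERT
    if needs_human:
        return Output.TICKET
    return Output.LOG

# Hypothetical events as (description, needs_human, immediately).
events = [
    ("Primary database is unreachable", True, True),
    ("TLS certificate expires in 20 days", True, False),
    ("Nightly backup completed", False, False),
]

for description, needs_human, immediately in events:
    print(classify(needs_human, immediately).value, "-", description)
```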
As you can see, we don’t need to receive alerts every time a threshold has been exceeded, but only when we have to take action to fix the situation. I would highly recommend avoiding email as your alert system. It just doesn’t work. Alert fatigue sets in: we tend to ignore email alerts because we receive too many, and the majority are not real emergencies. This leads us not only to ignore the unimportant alerts but also to miss the important ones completely. As a result, by the time we finally act, it may be too late.
We have all been in the situation where changes are sent to production without our knowledge or, worse, without following the guidelines set for deploying them. This is why a change management process is so important, and every developer must stick with the plan. Part of a site reliability engineer’s job is to set those rules, create the tools needed to automate all the processes, and facilitate the deployment and rollback of new services or changes to existing ones.
Part of the change management process is making sure that changes and any new services that will be deployed comply with a list of requirements. This should include:
- Monitoring plan
- Alert runbooks
- Owners list
- High availability strategies
- Deployment and rollback processes
- Data retention and backups
If something fails, we will have all the documentation needed in order to handle the situation in the best way possible.
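A requirements list like this one is easy to enforce automatically before a deployment. The sketch below assumes each change is described by a dictionary whose keys match the list above; the key names and the sample change are hypothetical.

```python
from typing import Dict, List

# Keys mirror the requirements list; the names themselves are assumptions.
REQUIRED = (
    "monitoring_plan",
    "alert_runbooks",
    "owners",
    "high_availability",
    "deploy_and_rollback",
    "retention_and_backups",
)

def missing_requirements(change: Dict[str, object]) -> List[str]:
    """Return the requirements that are absent or left empty for a change."""
    return [key for key in REQUIRED if not change.get(key)]

# Hypothetical change request with two gaps.
change = {
    "monitoring_plan": "docs/monitoring.md",
    "alert_runbooks": "docs/runbooks/",
    "owners": ["payments-team"],
    "high_availability": "",
    "deploy_and_rollback": "docs/deploy.md",
}

print(missing_requirements(change))  # ['high_availability', 'retention_and_backups']
```

A deployment tool could refuse to proceed while this list is not empty, which keeps the rule from depending on anyone's memory.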
“Automate what you can, document what you can’t and have the wisdom to know the difference.”
Documentation is very important. I have learned that the hard way. Over the past five years, many of the systems that I have supported lacked documentation, which made it especially difficult to resolve new issues. Following the best practices outlined above has been a major step toward a more reliable platform and has made the 3 a.m. wake-up call a thing of the past.