In an IT environment, people and processes are as important as software. The concept of site reliability engineering, pioneered by Google, applies aspects of software engineering to operations with the goal of creating software systems that are highly scalable and reliable. In this post, we’ll take a closer look at the role of the Site Reliability Engineer and some of the best practices for this role, including runbooks, incident response reports, postmortem reports, and root cause analysis.
What does a Site Reliability Engineer do?
Ben Treynor, the founder of Google’s Site Reliability Team defines site reliability engineering as “what happens when a software engineer is tasked with what used to be called operations.”
A Site Reliability Engineer is responsible for the availability, performance, monitoring, and incident response, among other things, of the platforms and services that our company runs and owns.
We must make sure that everything that goes to production complies with a set of general requirements like diagrams, dependencies of other services, monitoring and logging plans, backups and possible high availability setups.
Even when the software complies with all of the necessary requirements such as uncaught exceptions, hardware degradation, networking problems, high usage of resources, or slow responses from our services could happen at any time. We always need to be prepared and be ready to act.
Our effectiveness will be measured as a function of mean time to recover (MTTR) and mean time to failure (MTTF). In other words, we must have our services up and running again as quickly as possible, and we must avoid any subsequent failure for as long as possible.
The Importance of documenting incidents
Now that we have an idea of what a Site Reliability Engineer does, let’s look at a typical scenario that one might encounter.
It is 3 a.m. and as the on-call engineer, you receive an SMS alert that something is wrong with one of your platforms. Your company is losing money every minute that the platform is down, so you get up and connect to the company’s VPN and start to troubleshoot. This is not difficult if you know the platform or service that went down. It is relatively easy if you have been working in the company for a long time and have the knowledge to fix it. You will know where to start and what to do… right?
But what if you are new, or perhaps you are new to the platform, which is owned by someone else. What are you going to do?
What is a runbook?
One of the best resources for someone in this situation is a runbook.
Put simply, a runbook is a set of instructions of what to perform or check when something goes wrong with any given service, application or platform. Also known as an operational manual, a runbook is our guide for resolving an incident. For a site reliability engineer working on call, it can be the difference between resolving an issue in a few minutes or a few hours. A runbook will guide us through the procedures necessary for getting our service up and running again quickly (or within a reasonable amount of time).
Ideally, a runbook is written by developers, in the case of code, and followed by anyone that is on-call. This means that any runbook should exist before the new piece of software goes to production. Of course, this is, as I mentioned, the ideal case.
Runbooks can be written to troubleshoot infrastructures, hosts or any other services and platforms that are in use. Therefore, you can have a different runbook for any issue or service that requires a human to be fixed. For example, if our runbook can be converted into a script, let’s do that. Otherwise, we should document all of the steps required to handle an alert. In other words, let’s automate and make that serviceable to heal itself. However, we should also take into consideration that we must make the required analysis and then implement the changes to our service to prevent it happening again.
If you don’t have runbooks in your organization, you should at least be writing down every step performed in the process of troubleshooting an issue. Even if the steps taken did not eventually resolve the problem, it’s worthwhile to have a record of how you tried to solve the issue. In this way, you can learn from these mistakes down the road.
If your organization already has several runbooks, it’s likely that you will encounter a situation that was not considered at the beginning of the design or development, and you will have to create a new runbook for that case.
You can check a project I forked on Github with a runbook template.
What is an incident response report?
Once an issue has been resolved, you’ll want to properly document the incident to ensure that it doesn’t happen again. An incident response report is meant to record everything that happened, from each step that you performed to all of the commands (both good and bad) that you executed. Did you follow a runbook or did you have to troubleshoot in a different way? What was the initial situation? Did we receive an alert or discover the issue another way? If we did receive an alert, was the message meaningful in terms of helping us troubleshoot the issue?
In addition to describing what happened, the incident response report should include all of the communications around resolving it. Who did we notify that the service was down? Who helped us? Who was affected by the incident? How serious was it, and for how long was the service down?
We need to document every single aspect of what we did to fix our service. This information will be used in our analysis to discover the root cause. When we have gathered all of this information and the root cause is determined, we will be able to make or request the required changes to make our platform more reliable. This will help us improve our mean time to fail and mean time to recover.
What is a postmortem report?
For a site reliability engineer, resolving the problem is only half of the job. We must make sure that it doesn’t happen again. Therefore, we will need to perform a root cause analysis. To properly perform this analysis we should have all of the information that would ideally be contained in an incident response report.
Like the incident response report, our postmortem report will contain a timeline of everything we did to fix the problem, a root cause analysis, corrective and preventative measures, and a section that describes the resolution and recovery of our service.
The root cause analysis can be a discussion or brainstorming session about what worked and what didn’t work when resolving the incident. The corrective and preventative measures will be tickets or tasks that we must perform to avoid or mitigate future incidents.
Finally, the resolution and recovery section will be filled out with technical information and possibly code snippets that we used while resolving the fault.
Monitoring and alerting
Our sample scenario above was prompted by an alert. However, the alert wouldn’t exist without a good monitoring process in place.
Monitoring and alerting are two essential processes for site reliability engineers. We must monitor every possible metric within our platform so that we have a precise understanding of our system’s health at all times. The monitoring plan must be created along with the system design, or with each service that we are going to support.
A common practice is to monitor specific metrics, set thresholds, and trigger alerts based on those thresholds. The lesson to learn here is to try to make software that can interpret the alerts and automatically heal our system, sending alerts only if human intervention is absolutely necessary. These alerts should contain a clear explanation of the issue, possible tasks to perform or try, and links to documentation and runbooks related to the service incurring the problem.
The book on Site Reliability Engineering defines three kinds of valid monitoring output:
Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
As you can see, we don’t need to receive alerts every time a threshold has been exceeded, but only when we have to take action to fix the situation. I would highly recommend avoiding emails as your alert system. It just doesn’t work. According to the concept of alert fatigue, we tend to ignore email alerts because we receive too many, and the majority are not real emergencies. This will lead us to not only ignore non-important alerts but to miss the important ones completely. As a result, by the time we finally act, it may be too late.
We have been in the situation where changes are sent to production without our knowledge or worse, without following the guidelines set to deploy them. This is why a change management process is so important, and every developer must stick with the plan. Part of a site reliability engineer’s job is to set those rules, create the tools needed to automate all the processes, and facilitate the deployment and rollback of new services or changes to existing ones.
Part of the change management process is making sure that changes and any new services that will be deployed comply with a list of requirements. This should include:
- Monitoring plan
- Alerts runbooks
- Owners list
- High availability strategies
- Deployment and rollback processes
- Data retention and backups
If something fails, we will have all the documentation needed in order to handle the situation in the best way possible.
“Automate what you can, document what you can’t and have the wisdom to know the difference.”
Documentation is very important. I have learned that the hard way. Over the past five years, many of the systems that I have supported lacked documentation, which made it especially difficult to resolve new issues. Following the best practices outlined above for any site reliability engineer has been a major step forward in having a more reliable platform and has made the 3 a.m. wake up call a thing of the past.
Learn more about roles just like this on Cloud Roster.