In an IT environment, people and processes are as important as software. The concept of site reliability engineering, pioneered by Google, applies aspects of software engineering to operations with the goal of creating software systems that are highly scalable and reliable. In this post, we’ll take a closer look at the role of the Site Reliability Engineer and some of the best practices for this role, including runbooks, incident response reports, postmortem reports, and root cause analysis.
What does a Site Reliability Engineer do?
Ben Treynor, the founder of Google’s Site Reliability Team, defines site reliability engineering as “what happens when a software engineer is tasked with what used to be called operations.”
A Site Reliability Engineer is responsible for the availability, performance, monitoring, and incident response, among other things, of the platforms and services that our company runs and owns.
We must make sure that everything that goes to production complies with a set of general requirements, such as diagrams, dependencies on other services, monitoring and logging plans, backups, and possible high availability setups.
Even when the software complies with all of the necessary requirements, issues such as uncaught exceptions, hardware degradation, networking problems, high resource usage, or slow responses from our services could happen at any time. We always need to be prepared and ready to act.
Our effectiveness will be measured as a function of mean time to recover (MTTR) and mean time to failure (MTTF). In other words, we must have our services up and running again as quickly as possible, and we must avoid any subsequent failure for as long as possible.
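To make these two measures concrete, here is a minimal Python sketch that derives them from a list of outages. The `Outage` record and the sample timestamps are made up for illustration, and MTTF is taken here as the average uptime between the end of one outage and the start of the next, which is one common convention rather than the only one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """Hypothetical record of one outage: when it began and when service was restored."""
    started: datetime
    resolved: datetime


def mean_time_to_recover(outages: list[Outage]) -> timedelta:
    """Average duration from the start of an outage to its resolution."""
    total = sum((o.resolved - o.started for o in outages), timedelta())
    return total / len(outages)


def mean_time_to_failure(outages: list[Outage]) -> timedelta:
    """Average uptime between one recovery and the next failure."""
    ordered = sorted(outages, key=lambda o: o.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Made-up sample data for illustration only.
outages = [
    Outage(datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 3, 30)),
    Outage(datetime(2024, 1, 11, 3, 30), datetime(2024, 1, 11, 4, 0)),
    Outage(datetime(2024, 1, 21, 4, 0), datetime(2024, 1, 21, 5, 0)),
]

mttr = mean_time_to_recover(outages)  # 40 minutes for this sample
mttf = mean_time_to_failure(outages)  # 10 days for this sample
```

A shrinking `mttr` and a growing `mttf` over time are the trend we are aiming for.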
The importance of documenting incidents
Now that we have an idea of what a Site Reliability Engineer does, let’s look at a typical scenario that one might encounter.
It is 3 a.m. and as the on-call engineer, you receive an SMS alert that something is wrong with one of your platforms. Your company is losing money every minute that the platform is down, so you get up and connect to the company’s VPN and start to troubleshoot. This is not difficult if you know the platform or service that went down. It is relatively easy if you have been working in the company for a long time and have the knowledge to fix it. You will know where to start and what to do… right?
But what if you are new to the company, or new to a platform that is owned by someone else? What are you going to do?
What is a runbook?
One of the best resources for someone in this situation is a runbook.
Put simply, a runbook is a set of instructions on what to do or check when something goes wrong with a given service, application, or platform. Also known as an operational manual, a runbook is our guide for resolving an incident. For a site reliability engineer working on call, it can be the difference between resolving an issue in a few minutes or a few hours. A runbook will guide us through the procedures necessary for getting our service up and running again quickly (or within a reasonable amount of time).
Ideally, a runbook is written by developers, in the case of code, and followed by anyone that is on-call. This means that any runbook should exist before the new piece of software goes to production. Of course, this is, as I mentioned, the ideal case.
Runbooks can be written to troubleshoot infrastructure, hosts, or any other services and platforms that are in use, so you can have a different runbook for each issue or service that needs a human to fix it. Where a runbook can be converted to a script, let’s do that: in other words, let’s automate and make the service able to heal itself. Otherwise, we should document all of the steps required to handle an alert. Either way, we must still analyze the incident and then implement changes to our service to prevent it from happening again.
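As a rough illustration of turning a runbook into a script, the Python sketch below models a runbook as an ordered list of steps that are applied until a health check passes, and escalates to a human if none of them work. Everything here is hypothetical: the `Step` type, the `run_runbook` function, and the in-memory `service` dictionary stand in for real checks and real remediation commands.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One runbook step: a description plus the action that carries it out."""
    description: str
    action: Callable[[], None]


def run_runbook(is_healthy: Callable[[], bool], steps: list[Step]) -> list[str]:
    """Apply steps in order until the service is healthy; return a log of what was done."""
    log = []
    for step in steps:
        if is_healthy():
            break
        step.action()
        log.append(step.description)
    log.append("recovered" if is_healthy() else "escalate to on-call engineer")
    return log


# Hypothetical in-memory service used only to make the sketch runnable.
service = {"cache_full": True, "running": False}


def clear_cache() -> None:
    service["cache_full"] = False


def restart() -> None:
    # In this toy model a restart only succeeds once the cache is cleared.
    service["running"] = not service["cache_full"]


steps = [Step("clear cache", clear_cache), Step("restart service", restart)]
log = run_runbook(lambda: service["running"], steps)
```

The returned log doubles as a record of every step that was attempted, which is useful later when the incident is written up.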
If you don’t have runbooks in your organization, you should at least be writing down every step performed in the process of troubleshooting an issue. Even if the steps taken did not eventually resolve the problem, it’s worthwhile to have a record of how you tried to solve the issue. In this way, you can learn from these mistakes down the road.
If your organization already has several runbooks, it’s likely that you will encounter a situation that was not considered at the beginning of the design or development, and you will have to create a new runbook for that case.
You can check a project I forked on GitHub with a runbook template.
What is an incident response report?
Once an issue has been resolved, you’ll want to properly document the incident to ensure that it doesn’t happen again. An incident response report is meant to record everything that happened, from each step that you performed to all of the commands (both good and bad) that you executed. Did you follow a runbook or did you have to troubleshoot in a different way? What was the initial situation? Did we receive an alert or discover the issue another way? If we did receive an alert, was the message meaningful in terms of helping us troubleshoot the issue?
In addition to describing what happened, the incident response report should include all of the communications around resolving it. Who did we notify that the service was down? Who helped us? Who was affected by the incident? How serious was it, and for how long was the service down?
We need to document every single aspect of what we did to fix our service. This information will be used in our analysis to discover the root cause. When we have gathered all of this information and the root cause is determined, we will be able to make or request the required changes to make our platform more reliable. This will help us improve our mean time to failure and mean time to recover.
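One way to make sure none of these questions is forgotten is to capture the report in a fixed structure. The Python sketch below is only an assumption about what such a structure could look like: the `IncidentReport` class, its field names, and the sample values are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IncidentReport:
    """Hypothetical structure for an incident response report."""
    service: str
    detected_by: str                  # e.g. "alert" or "customer report"
    alert_was_meaningful: bool
    runbook_followed: Optional[str]   # name of the runbook, or None if there was none
    severity: str
    started: datetime
    resolved: datetime
    notified: list[str] = field(default_factory=list)
    helpers: list[str] = field(default_factory=list)
    affected: list[str] = field(default_factory=list)
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    def record(self, when: datetime, entry: str) -> None:
        """Add a step or command to the timeline, whether or not it helped."""
        self.timeline.append((when, entry))

    @property
    def downtime(self) -> timedelta:
        """How long the service was down."""
        return self.resolved - self.started


# Made-up incident for illustration only.
report = IncidentReport(
    service="checkout-api",
    detected_by="alert",
    alert_was_meaningful=True,
    runbook_followed=None,
    severity="high",
    started=datetime(2024, 1, 1, 3, 0),
    resolved=datetime(2024, 1, 1, 3, 45),
    notified=["support team"],
)
report.record(datetime(2024, 1, 1, 3, 10), "restarted worker: no effect")
report.record(datetime(2024, 1, 1, 3, 40), "rolled back last deploy: fixed")
```

Recording the commands that did not help alongside the ones that did keeps the timeline honest for the later root cause analysis.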
What is a postmortem report?
For a site reliability engineer, resolving the problem is only half of the job. We must make sure that it doesn’t happen again. Therefore, we will need to perform a root cause analysis. To properly perform this analysis we should have all of the information that would ideally be contained in an incident response report.
Like the incident response report, our postmortem report will contain a timeline of everything we did to fix the problem, a root cause analysis, corrective and preventative measures, and a section that describes the resolution and recovery of our service.
The root cause analysis can be a discussion or brainstorming session about what worked and what didn’t work when resolving the incident. The corrective and preventative measures will be tickets or tasks that we must perform to avoid or mitigate future incidents.
Finally, the resolution and recovery section will be filled out with technical information and possibly code snippets that we used while resolving the fault.
Monitoring and alerting
Our sample scenario above was prompted by an alert. However, the alert wouldn’t exist without a good monitoring process in place.
Monitoring and alerting are two essential processes for site reliability engineers. We must monitor every possible metric within our platform so that we have a precise understanding of our system’s health at all times. The monitoring plan must be created along with the system design, or with each service that we are going to support.
A common practice is to monitor specific metrics, set thresholds, and trigger alerts based on those thresholds. The lesson to learn here is to try to make software that can interpret the alerts and automatically heal our system, sending alerts only if human intervention is absolutely necessary. These alerts should contain a clear explanation of the issue, possible tasks to perform or try, and links to documentation and runbooks related to the service incurring the problem.
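The idea of healing first and alerting only when a human is needed can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Rule` type, the `evaluate` function, the metric name, and the example.com runbook URL are all hypothetical placeholders, not part of any real monitoring product.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """Hypothetical alert rule for one metric."""
    metric: str
    threshold: float
    runbook_url: str
    suggestion: str
    auto_heal: Optional[Callable[[], bool]] = None  # returns True if the fix worked


def evaluate(rule: Rule, value: float) -> Optional[dict]:
    """Return an alert payload only when a human is needed, otherwise None."""
    if value <= rule.threshold:
        return None  # within limits, nothing to do
    if rule.auto_heal is not None and rule.auto_heal():
        return None  # the system healed itself, no need to wake anyone
    return {
        "summary": f"{rule.metric} is {value}, above threshold {rule.threshold}",
        "try_first": rule.suggestion,
        "runbook": rule.runbook_url,
    }


rule = Rule(
    metric="disk_usage_percent",
    threshold=90.0,
    runbook_url="https://example.com/runbooks/disk-full",  # placeholder URL
    suggestion="Remove rotated logs older than 7 days",
)
alert = evaluate(rule, 95.0)
```

Because this rule has no `auto_heal` action, a reading of 95 produces a payload that carries the explanation, a first thing to try, and the runbook link.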
The book on Site Reliability Engineering defines three kinds of valid monitoring output:
- Alerts: Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
- Tickets: Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
- Logging: No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
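These three outputs amount to a small decision rule, sketched below in Python. The `Output` enum and the `classify` function are hypothetical names chosen for this example; the member names follow the book's terms for the three kinds of output (alerts, tickets, and logging).

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # recorded only for diagnostic or forensic purposes


def classify(needs_human: bool, urgent: bool) -> Output:
    """Pick the monitoring output for an event based on who must act and when."""
    if not needs_human:
        return Output.LOG
    return Output.ALERT if urgent else Output.TICKET
```

Only events classified as `Output.ALERT` should ever reach the person on call.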
As you can see, we don’t need to receive alerts every time a threshold has been exceeded, but only when we have to take action to fix the situation. I would highly recommend avoiding email as your alert system. It just doesn’t work. This is alert fatigue: we tend to ignore email alerts because we receive too many, and the majority are not real emergencies. This leads us not only to ignore unimportant alerts but also to miss the important ones completely. As a result, by the time we finally act, it may be too late.
We have all been in the situation where changes are sent to production without our knowledge or, worse, without following the guidelines set for deploying them. This is why a change management process is so important, and every developer must stick with the plan. Part of a site reliability engineer’s job is to set those rules, create the tools needed to automate all the processes, and facilitate the deployment and rollback of new services or changes to existing ones.
Part of the change management process is making sure that changes and any new services that will be deployed comply with a list of requirements. This should include:
- Monitoring plan
- Alert runbooks
- Owners list
- High availability strategies
- Deployment and rollback processes
- Data retention and backups
If something fails, we will have all the documentation needed in order to handle the situation in the best way possible.
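A checklist like this is easy to enforce automatically before a change is approved. The Python sketch below is one possible approach, assuming a change request is represented as a dictionary that maps each requirement to a link or short description; the `REQUIRED` list mirrors the items above, while `missing_requirements` and the sample `change` are hypothetical.

```python
REQUIRED = [
    "monitoring plan",
    "alert runbooks",
    "owners list",
    "high availability strategy",
    "deployment and rollback process",
    "data retention and backups",
]


def missing_requirements(change: dict[str, str]) -> list[str]:
    """Return the required items that the change request omits or leaves blank."""
    return [item for item in REQUIRED if not change.get(item, "").strip()]


# Made-up change request for illustration only.
change = {
    "monitoring plan": "docs/monitoring.md",
    "alert runbooks": "docs/runbooks/",
    "owners list": "  ",  # present but blank, so it still counts as missing
    "deployment and rollback process": "docs/deploy.md",
}
missing = missing_requirements(change)
```

A deployment tool could refuse to proceed while `missing` is not empty.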
“Automate what you can, document what you can’t and have the wisdom to know the difference.”
Documentation is very important. I have learned that the hard way. Over the past five years, many of the systems that I have supported lacked documentation, which made it especially difficult to resolve new issues. Following the best practices outlined above for any site reliability engineer has been a major step forward in having a more reliable platform and has made the 3 a.m. wake-up call a thing of the past.