This course provides you with an introduction to Blameless Postmortems. In Site Reliability Engineering (SRE), a Blameless Postmortem is a retrospective meeting whose goal is to recap and analyze a significant service failure. It provides an open forum where everyone can ask questions, share their experience, and gain a clear understanding of exactly what happened, and ultimately how to prevent or reduce similar incidents in the future.
If you have any comments or feedback, feel free to reach out to us at support@cloudacademy.com.
Learning Objectives
- Learn about Blameless Postmortems as used within Site Reliability Engineering (SRE)
- Understand the core vocabulary, principles, and practices involved
Intended Audience
- Anyone interested in learning about Site Reliability Engineering and its fundamentals
- DevOps practitioners who want to understand the role of a Site Reliability Engineer
Prerequisites
- A basic understanding of DevOps
- A basic understanding of the software development life cycle
In this section, I want to explain what a blameless postmortem is, why postmortems are important, and how they should be conducted. As stated previously, the second goal of DevOps is to accept failure as normal. Now, this is an important concept to understand. No matter how hard you try, failures will happen. Change and growth require risk. Faster growth requires higher risk. In order to reduce risk to near zero, you would need to reduce your rate of change to near zero.
Now, ask yourself: will my users be satisfied with a stable but quickly out-of-date system? Instead of trying to avoid failure at any cost, you can view it as an opportunity to grow. When things break, we learn how to fix them. When you break and fix something enough times, you gain a better understanding of how it works. This understanding helps minimize future issues and expedites the resolution process.
In site reliability engineering, this is accomplished by holding retrospectives, or blameless postmortems. A retrospective, or postmortem, is a meeting whose goal is to recap and analyze a significant service failure. It provides an open forum where everyone can ask questions, share their experience, and gain a clear understanding of exactly what happened. The goals of a postmortem are to identify the contributing factors, determine how those factors could have been mitigated, and come up with a list of action items that will prevent the same failure from happening again.
A blameless postmortem is one that focuses on dealing with the incident without trying to single out an individual or team for bad behavior. It assumes that everyone involved had good intentions and made the best choices they could with the information at hand. A postmortem should never become a witch hunt, looking for someone or something to blame. This will inevitably create a culture in which issues are swept under the rug, leading to greater risk for the organization.
So, when conducting a postmortem, you will want to ask the following questions (a sketch of how the answers might be recorded follows the list):
- When did the incident begin?
- When did the incident end?
- How were we notified that there was a problem? It is important to understand exactly how long the problem lasted and how long it took you to notice that there was a problem. You may need to adjust your monitoring and alert system if there is a significant gap between the two.
- Who was involved in responding?
- When did we begin to respond?
- What was our response? Carefully review what your response was. Was it quick enough? Did it follow established procedures? Was it sufficient?
- What was affected? It is important to document everything that was affected: which systems were down, which customers noticed, and whether any revenue was lost.
- Is there anything else that still needs to be done to recover? You might have made some temporary fixes that need to be replaced with a long-term solution, or there may be other after-effects that need to be dealt with in the future.
- What were all the things that contributed to this failure? Remember that in complicated systems there are usually many contributing factors to a failure, and not a single root cause. Identify all factors and document them. This should include writing up new bugs in your tracking system.
- How can we avoid similar problems in the future? It is critical to establish detailed action items. What specific changes are going to be made? What is the deadline for making those changes? Who specifically is going to be responsible for making them? Remember, you want to focus on solutions. Don't assign blame for past mistakes, but do assign responsibility for future improvement. Common types of changes include fixing bugs, adding new infrastructure, updating or building new tools, revising current policies, and creating new documentation or training.
- What went right?
- What went wrong?
- Where did we get lucky? You not only want to identify what went wrong, but also the things that helped you mitigate and recover from the issue. Sometimes you simply get lucky, and it's important to call that out as well.
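To make these questions concrete, here is a minimal sketch (in Python, using only the standard library) of how the answers might be captured as a structured incident record; all class and field names here are illustrative assumptions, not part of any particular incident-management tool:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class ActionItem:
    """A specific follow-up change with a named owner and a deadline."""
    description: str
    owner: str
    due_date: datetime


@dataclass
class Postmortem:
    """Captures the answers to the postmortem questions for one incident."""
    incident_started: datetime
    incident_detected: datetime   # when monitoring/alerting notified us
    incident_resolved: datetime
    responders: List[str]
    impact: str                   # systems down, customers affected, revenue lost
    contributing_factors: List[str]
    what_went_well: List[str]
    what_went_wrong: List[str]
    where_we_got_lucky: List[str]
    action_items: List[ActionItem] = field(default_factory=list)

    @property
    def time_to_detect(self) -> timedelta:
        # A large gap here suggests monitoring or alerting needs adjustment.
        return self.incident_detected - self.incident_started

    @property
    def time_to_resolve(self) -> timedelta:
        return self.incident_resolved - self.incident_started


# Example usage with made-up timestamps and details.
pm = Postmortem(
    incident_started=datetime(2023, 5, 1, 14, 0),
    incident_detected=datetime(2023, 5, 1, 14, 45),
    incident_resolved=datetime(2023, 5, 1, 16, 30),
    responders=["on-call SRE", "database team"],
    impact="Checkout service returned errors for roughly 25% of requests.",
    contributing_factors=["connection pool exhausted", "no alert on pool saturation"],
    what_went_well=["rollback procedure worked as documented"],
    what_went_wrong=["45-minute detection gap"],
    where_we_got_lucky=["incident occurred outside peak traffic hours"],
    action_items=[
        ActionItem("Add alert on connection pool saturation", "db-team", datetime(2023, 5, 15)),
    ],
)
print(f"Time to detect: {pm.time_to_detect}, time to resolve: {pm.time_to_resolve}")
```

Writing the answers down in a structured form like this makes it easier to compare incidents over time and to track whether action items actually get completed.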
You want an accurate understanding of all risks so that you can properly prioritize your fixes. Blameless postmortems encourage open and honest communication after a service failure. They acknowledge that system design is complicated and that human beings make mistakes. They also ensure that your team will learn from those mistakes and avoid repeating them in the future.
Daniel began his career as a Software Engineer, focusing mostly on web and mobile development. After twenty years of dealing with insufficient training and fragmented documentation, he decided to use his extensive experience to help the next generation of engineers.
Daniel has spent his most recent years designing and running technical classes for both Amazon and Microsoft. Today at Cloud Academy, he is working on building out an extensive Google Cloud training library.
When he isn’t working or tinkering in his home lab, Daniel enjoys BBQing, target shooting, and watching classic movies.