SRE Anti-Fragility and Learning from Failure


Anti-Fragility and Learning from Failure

This course looks at anti-fragility and how to learn from failure. Anti-fragility is all about understanding disorder and using it to your advantage. Learning from failure helps you to understand why things break, how to fix them, and how to prevent the same thing from breaking again, or at least minimize the chance of it. By the end of this course, you'll have a clear understanding of anti-fragility and learning from failure.


Learning Objectives

  • Understand how SRE, DevOps, and anti-fragility help to reduce risk and increase predictability
  • Learn how to reframe failure and mistakes so that you can learn from them
  • Learn about tools that can be used to reduce the risk of failure

Intended Audience

  • Anyone interested in learning about SRE and its fundamentals
  • Software Engineers interested in learning about how to use and apply SRE within an operations environment
  • DevOps practitioners interested in understanding the role of SRE and how to consider using it within their own organization


Prerequisites

To get the most out of this learning path, you should have a basic understanding of DevOps, software development, and the software development lifecycle.


Link to YouTube video referenced in the course:


Welcome back! In this course, I'm going to discuss anti-fragility and how to learn from failure. Anti-fragility is all about understanding disorder and using it to your advantage. Learning from failure helps you to understand why things break, how to fix them, and how to prevent the same thing from breaking again, or at least minimize the chance of it. By the end of this course, you'll have a clear understanding of anti-fragility and learning from failure.

Consider the following question. Failure is bad, but for whom? Most organizations view failure as a bad thing, something to be avoided at all costs and something that involves repercussions for those involved, creating a culture of fear of failure. Conversely, those same organizations often advocate "big bang" deployments with lots of risk and uncertainty, the kind of deployments that often invite failures.

Pause here briefly and consider your own organization. How does it view failure? With this in mind, what would need to happen for your organization to be more tolerant of failure, treating it instead as an opportunity to learn? Finally, what would need to happen to make it safer for your organization to fail and to avoid large-scale catastrophic failure in production? As Tony Robbins, a life coach, puts it, "There is no such thing as failure. There are only results. It's time to stop beating yourself up and start realizing that everything you do is a success or a learning experience."

It's worth remembering that in many cultures we are taught that failure is something to be ashamed of and embarrassed about as individuals. Reframing failure successfully is heavily dependent on a shared sense of purpose, shared goals, and team accountability. Winston Churchill once said, "Success is stumbling from failure to failure with no loss of enthusiasm." So again, why learn from failure? Learning from failure is important because it helps us to understand how things work together. When things break, we then know how to fix them. This helps minimize future issues and expedite the resolution process, which in turn helps our business to remain as profitable as possible.

Organizations often have to weigh acceptable risk against benefits. Is it worth investing in anti-fragility if the risk is low? There may be other areas where the risk of failure is high, so the benefits of anti-fragility are an easier sell there. The challenge is often new features versus anti-fragility. Here, it's important to prioritize according to the risks and benefits.

Quite often in technology, we're thrown into firefighting situations without any preparation or practice. Ideally, we should know what to do when things go wrong. This requires preparation and training and, as in the case of real fires, investigations afterwards, where we learn from what happened. Without preparation and training, every incident becomes a much riskier affair. There are still many organizations that play this high-risk game.

With SRE, DevOps, and anti-fragility, the goal is clearly to lower risk and to create more predictability.

The benefits of anti-fragility. Here, we're going to drill down into anti-fragility and, in particular, its benefits. Anti-fragility is all about understanding disorder and using it to your advantage, to be more resilient. When running large-scale distributed systems, understanding the potential disorder that can take place will help you to make those systems more robust and resilient.

To begin with, failure happens, so why not use it to your advantage? As Google states, "failure happens, there is no way around it so stop pointing fingers. Embracing failure will help improve MTTD and MTTR metrics. Proactively addressing failure leads to more robust systems." What MTTD and MTTR stand for is explained in the next slide. Let's head there now.

There are several important metrics that anti-fragility aims to improve. They are: one, MTTD, Mean Time to Detect; two, MTTR, Mean Time to Recover a Component; three, MTRS, Mean Time to Recover a Service; four, SLO, Service Level Objective; and five, RPO, Recovery Point Objective. I'll now review each of these individually in the following slides.

Starting with MTTD, Mean Time to Detect failures or incidents. By introducing failure, we optimize our monitoring, making it more likely that we will detect real incidents.

Next, MTTR, Mean Time to Recover a Component. Simulating component failure allows us to create automation that attempts to auto-recover. We can also build in more resilience to prevent the failure of single components.

Next, MTRS, Mean Time to Recover a Service. Chaos engineering approaches identify key interfaces and dependencies across services, pinpointing areas where more resilience may be required.

Next, SLO, or Service Level Objective. A fire drill in which, for example, a database is taken down may result in an SLO being broken.

Finally, RPO, or Recovery Point Objective. As defined by business continuity planning, it is the maximum targeted period in which data or transactions might be lost from an IT service due to a major incident. If the RPO is measured in minutes, or even in a few hours, then in practice off-site mirrored backups must be continuously maintained.

In this example, a daily offsite backup on tape will not be sufficient. In another example, introducing failure on a messaging queue may indicate excessive data loss outside of the stated RPO. More frequent backups of the queue data may be needed to actually meet the desired RPO. Let's now move on and understand the journey and process of moving from a no learning from failure environment to a learning from failure environment.
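As a brief aside before moving on, the metrics just reviewed can be made concrete with a short sketch. This is only an illustration under stated assumptions: the `Incident` record and its field names are hypothetical, MTTR is measured here from detection to recovery (some teams measure from the start of the fault instead), and the RPO check uses the simple worst-case rule that everything written since the last backup could be lost.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """One incident record; the field names are hypothetical."""
    started: datetime    # when the fault actually began
    detected: datetime   # when monitoring or a person noticed it
    recovered: datetime  # when the component was working again


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttd_minutes(incidents):
    """Mean Time to Detect: average gap between fault start and detection."""
    return mean_minutes([i.detected - i.started for i in incidents])


def mttr_minutes(incidents):
    """Mean Time to Recover, measured here from detection to recovery."""
    return mean_minutes([i.recovered - i.detected for i in incidents])


def meets_rpo(backup_interval, rpo):
    """Worst case, everything since the last backup is lost, so the
    interval between backups must not exceed the RPO."""
    return backup_interval <= rpo


t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=40)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=80)),
]
print(mttd_minutes(incidents))  # 15.0
print(mttr_minutes(incidents))  # 45.0

# A daily backup checked against an RPO of 15 minutes.
print(meets_rpo(timedelta(hours=24), timedelta(minutes=15)))  # False
```

The last line mirrors the tape backup example: a 24-hour backup interval cannot satisfy an RPO measured in minutes.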

Consider the following quote. "You're either a learning organization or you're losing to somebody who is..." This implies that if your organization is not learning from failure then someone else probably is, and they'll become more stable and resilient and will convince your customers to jump ship. Creating a culture where learning from failure becomes the norm will indeed help your organization to thrive rather than choke.

To enable culture change, leadership needs to get on board and support it. Culture change is always going to be tough, but is often the difference between life and death. Moving from failure fear to failure investment requires changes across the delivery spectrum. To help you along your journey consider the following four plays. The first play is about understanding the principle of continuous learning.

If you've already read "The Phoenix Project" and/or "The DevOps Handbook", then you'll already be familiar with the Three Ways framework for DevOps. The First Way talks about the principles of flow. The Second Way talks about the principles of feedback, and the Third Way talks about the principles of continuous learning. The Third Way, as just mentioned, encourages a culture that fosters two things: one, continual experimentation, taking risks, and learning from failure. And two, understanding that repetition and practice are the prerequisite to mastery.

To help you along with this, consider the following: allocate time for the improvement of daily work; create rituals that reward the team for taking risks; introduce faults into the system to increase resilience; and plan time for safe experimentation and innovation, for example, hackathons. Moving on to the second play, this introduces you to the Westrum Model. A study conducted by Dr. Ron Westrum showed how culture affects performance.

Westrum's study looked at how organizations respond to problems and opportunities. The types described, pathological, bureaucratic, and generative, are shaped by the preoccupations of the organization's leaders. In other words, team leaders shape the organization's culture by creating incentive structures that reward certain behaviors.

In the context of SRE, techniques that we can use to create and maintain a high-trust culture include: one, encouraging and creating boundary-spanning teams. Two, making quality, availability, and security everyone's responsibility, instead of just Ops. Three, holding blameless postmortems when incidents and outages occur to develop effective countermeasures and create global learning. And four, maximizing everyone's creativity to find novel solutions to problems. In the third play, we'll talk about fire drills.

Fire drills build on the concepts of business continuity planning (BCP) and disaster recovery (DR), which have been in practice for decades. Fire drills focus on walking through what happens when something goes wrong, examining and testing both technical and non-technical things. Fire drills are used to ensure a business can continue to operate during unforeseen events or failures, such as natural disasters or emergencies, and are often an audit requirement.

For example, many organizations conduct an annual data center failover test. The following list provides additional examples of what a fire drill can be: one, loss of facility, such as a data center or cloud region. Two, loss of technology, for example, a database crash. Three, loss of resources, such as a key member of the team leaving. And four, loss of critical third-party vendors, where perhaps a business fails or is acquired and dissolved. Again, note that the last two examples are non-tech examples.

Moving on to the fourth and final play, Chaos engineering. Chaos engineering, a term first pioneered by Netflix, goes beyond fire drills and is formalized as an engineering practice. Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions. Netflix is one of the pioneers of the anti-fragility movement, producing the very popular Simian Army suite of tools. This suite of tools can be used to test the reliability, security, or resiliency of AWS-hosted infrastructure and includes Chaos Monkey, which disables production instances at random; Latency Monkey, which simulates network delays; and Chaos Gorilla, which simulates taking down an Amazon data center, or AZ.
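To make the idea of disabling instances at random a little more tangible, here is a toy sketch. It is not the real Chaos Monkey and does not call any AWS API. It assumes an in-memory pool of hypothetical instance names and a hypothetical `min_healthy` threshold that decides whether the service still counts as available.

```python
import random


def disable_random_instance(instances, rng):
    """Mark one randomly chosen healthy instance as down and return its name.

    `instances` maps instance name -> healthy flag. Returns None when no
    instance is healthy.
    """
    healthy = sorted(name for name, ok in instances.items() if ok)
    if not healthy:
        return None
    victim = rng.choice(healthy)
    instances[victim] = False
    return victim


def service_available(instances, min_healthy=1):
    """The service counts as up while enough instances remain healthy."""
    return sum(instances.values()) >= min_healthy


pool = {"web-1": True, "web-2": True, "web-3": True}
rng = random.Random(42)  # seeded so the experiment is repeatable

victim = disable_random_instance(pool, rng)
still_up = service_available(pool, min_healthy=2)
print(f"disabled {victim}; service available: {still_up}")
```

With three instances and a threshold of two, the service survives one random loss. Disabling a second instance would take it below the threshold, which is exactly the kind of weakness such an experiment is meant to surface.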

If you're new to Chaos engineering and want to know how to get started, then consider the following steps for adoption: one, segregate the system into key components. Two, test the system without key components being available. Three, break the system in non-prod environments first. Four, introduce failure of key components in prod. Five, introduce database failure in prod. And six, introduce a total system failure in prod.
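The key property of these steps is their order: nothing is broken in prod until the earlier, safer steps have been completed. A minimal sketch of that gating follows. The stage labels paraphrase the six steps above, and the `next_stage` helper is hypothetical, not part of any chaos engineering tool.

```python
# The six adoption steps, in the order given above.
STAGES = [
    "segregate the system into key components",
    "test the system without key components available",
    "break the system in non-prod",
    "fail key components in prod",
    "fail the database in prod",
    "fail the total system in prod",
]


def next_stage(passed):
    """Return the next stage to attempt, given the set of stages that have
    already been completed successfully.

    The first incomplete stage is always returned, so a prod experiment is
    never offered while an earlier stage is still outstanding.
    """
    for stage in STAGES:
        if stage not in passed:
            return stage
    return None  # every stage is complete


print(next_stage(set()))
# segregate the system into key components
print(next_stage({STAGES[0], STAGES[1]}))
# break the system in non-prod
```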

Additionally, consider the following areas on which to focus your Chaos engineering: one, look at holistic logging. For example, what keeps the full service up? Two, identify dependencies. Three, improve error handling and recovery. And four, learn from real failures. Chaos engineering also helps us to minimize the so-called blast radius. Any outage should affect as little of the ecosystem as possible. Failure testing shows the current span of this radius.
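One way to reason about the blast radius before running an experiment is to walk a dependency map. The sketch below is an illustration only. The service names are made up, and it takes the worst-case assumption that a failure propagates to every direct and indirect caller, meaning no caller has fallback or error handling.

```python
from collections import deque


def blast_radius(dependents, failed):
    """Return every service affected when `failed` goes down.

    `dependents` maps a service to the services that call it directly.
    A breadth-first walk follows those edges to find indirect callers too.
    """
    affected = set()
    queue = deque([failed])
    while queue:
        service = queue.popleft()
        for caller in dependents.get(service, []):
            if caller not in affected and caller != failed:
                affected.add(caller)
                queue.append(caller)
    return affected


# Hypothetical dependency map: the database is called by orders and accounts,
# and both of those are called by checkout.
dependents = {
    "database": ["orders", "accounts"],
    "orders": ["checkout"],
    "accounts": ["checkout"],
    "checkout": [],
    "search": [],
}
print(sorted(blast_radius(dependents, "database")))
# ['accounts', 'checkout', 'orders']
print(sorted(blast_radius(dependents, "search")))
# []
```

Comparing this predicted set with what actually breaks during a failure test shows where error handling has already shrunk the radius and where it has not.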

When you perform Chaos engineering, what it's really trying to do is make you think about and design in automated recovery. Things to consider when designing automated recovery are: one, creating immutable infrastructure using infrastructure as code. Two, covering all system state and functional code with automated tests. Three, deploying holistic logging and monitoring systems to make services observable. Four, implementing smart alerting, triggers, and prescriptive analytics. Five, creating self-healing infrastructures and/or applications. And six, test! Then test again. Rinse and repeat continuously.

By their very nature, some tools and platforms have built-in self-healing mechanisms, which should be utilized. Consider tools like Kubernetes and AWS Auto Scaling, which can: detect impaired instances of servers or containers and recycle them; maintain infrastructure at a defined level; automatically scale infrastructure and/or applications up and down based on demand; execute maintenance commands on the fly, for example, re-indexing databases when queries are running slow; and integrate themselves directly into monitoring services.
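The first two capabilities, recycling impaired instances and maintaining infrastructure at a defined level, boil down to a reconcile loop. The following is a simplified sketch of that idea and not how Kubernetes or AWS Auto Scaling are actually implemented. The `reconcile` function, the pool layout, and the naming callback are all hypothetical, and the sketch only scales up to the desired count, never down.

```python
import itertools


def reconcile(instances, desired_count, new_name):
    """Run one pass of a self-healing loop.

    Drops impaired instances, then launches replacements until the pool is
    back at `desired_count`. `new_name` must return a fresh, unused name on
    each call. Returns the healed pool as a new dict.
    """
    healed = {name: True for name, healthy in instances.items() if healthy}
    while len(healed) < desired_count:
        healed[new_name()] = True
    return healed


counter = itertools.count(4)  # the next replacement will be web-4
pool = {"web-1": True, "web-2": False, "web-3": True}
pool = reconcile(pool, desired_count=3, new_name=lambda: f"web-{next(counter)}")
print(pool)  # {'web-1': True, 'web-3': True, 'web-4': True}
```

Running a pass like this on a schedule, driven by real health checks, is the essence of maintaining infrastructure at a defined level.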

Now, before this course finishes, consider watching the following YouTube video, which covers the concept of destructive testing and introduces a tool for inducing network failures. Okay, that completes this course. In this course, you learned about what anti-fragility is and how to learn from failure and embrace it.

Anti-fragility is all about understanding disorder and using it to your advantage, to be more resilient. When running large scale distributed systems, understanding the potential disorder that can take place helps you to design and make those systems much more robust and resilient.

Okay, close this course and I'll see you shortly in the next one.

About the Author

Jeremy is a Content Lead Architect and DevOps SME here at Cloud Academy where he specializes in developing DevOps technical training documentation.

He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 25+ years. In recent times, Jeremy has been focused on DevOps, Cloud (AWS, Azure, GCP), Security, Kubernetes, and Machine Learning.

Jeremy holds professional certifications for AWS, Azure, GCP, Terraform, and Kubernetes (CKA, CKAD, CKS).