What is Chaos Engineering? Failure Becomes Reliability

In the IT world, failure is inevitable. A server might go down, an app may fail, etc. Does your team know what to do during a major outage? Do you know what instances may cause a larger systems failure? Chaos engineering, or chaos as a service, will help you fail responsibly.

It almost sounds counterintuitive to think that failing is one of the best security and reliability measures, but this is what chaos engineering is all about. The simple idea behind it is to create chaotic scenarios to test the systems you have in place. Break things on purpose. It’s so punk!
What is Chaos Engineering?

The Origins of Chaos Engineering

In our blog, we have talked about Site Reliability Engineering before but Chaos engineering is a relatively new phenomenon. It all started with Netflix’s move to the AWS cloud in 2010. Netflix saw the cloud as vulnerable. They believed that no instance in the cloud could guarantee permanent uptime. So, they created Chaos Monkey. Chaos Monkey was designed to randomly disable production instances to ensure survivability during common types of failures.
Chaos Monkey wasn’t enough, though. Netflix wanted to create an entire virtual army of chaos, the Simian Army, which includes: Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, and Chaos Gorilla. I won’t go into each monkey’s function, but the idea is simple: Create chaos, guarantee reliability.

The Simian Army may be a fun tool, but it wasn’t always fun for customers. Some of the monkeys were responsible for customer-related problems. The chaos was too uncontrollable. Effectively managing failure like this requires controlled simulation. Thus, Netflix created Failure Injection Testing.

Failure Injection Testing (FIT) was designed to give developers a “blast radius” rather than unmanaged chaos. Mapping out specific places where the tests will occur eliminates the risks. These tests are supposed to be proactive, giving IT teams real experience in dealing with outages and other common problems. Without FIT, chaos as a service wouldn’t be a viable product for a mass audience. Netflix introduced the FIT practice in 2014 when Kolton Andrus was working at the company. Andrus later became the co-founder of Gremlin, a company that offers chaos as a service.

What is Chaos as a Service?

Chaos as a service isn’t exactly chaotic in its current state. The Simian Army may have caused real chaos, but its use as a service is far more controlled and logical. Essentially, if you could simulate chaos in your day-to-day life to maximize your personal efficiency, wouldn’t you?

Putting out fires is a term I constantly hear about in the world of IT. Networking fires, production fires, release fires, etc. Everything is so reactionary, but it doesn’t have to be. Simulation is the best way to learn how to manage a real-world situation. Think of chaos engineering as an experiment. If you’re performing an experiment, you have a hypothesis. Thus, if you don’t have a clue what will happen during a failure, it might not be the right time to use chaos engineering.

You should have some idea about what will happen after you run a chaos experiment. The original Chaos Monkey may have created mostly random chaos to test its systems, but this approach isn’t optimal. Teams should have some idea of what to expect. Having a detailed knowledge and expectation of your systems will make these experiments more effective. If you’re wrong, you will only better understand your systems and know what to fix.

The Benefits of Chaos Engineering

Now, chaos engineering may sound a lot like testing, but it’s not that simple. The primary difference between testing and chaos engineering is the scale and the results. Testing tools are usually simplistic in practice. You provide a testing tool with a condition, and it gives you a result. There’s only so much that can be learned this way. Chaos engineering creates an experimental scenario to not only test your systems but to test yourself and your team. You might discover far more than you asked for. By causing deliberate failures, IT teams will gain confidence that their systems can deal with failures before they occur in production. All complex cloud systems will eventually fail. Using chaos engineering will allow you to recognize what’s wrong with the system, what you can do to fix it, and how to better deal with failure in real time. Building the most effective system requires experimentation. Chaos engineering allows you to run specific scenarios that could happen at any time while a product or service is live. Running these scenarios allows you to measure specific aspects of a failure. Maybe the scenario returned the exact result you expected, maybe it resulted in something completely new. Either way, you’re able to improve your systems and provide the most reliable service to your customers.

Gaining insight into system problems also creates a better production environment. Everyone will know what to look for in the future, and what systems might be vulnerable. You can make changes in your cloud environments based on your results.

The Unpredictability of the Cloud

One of the biggest concerns about the cloud is its relative unpredictability. Netflix introduced chaos engineering to combat their concerns about the cloud. The services that cloud platforms rely on can be inconsistent and chaos engineering is the perfect way to manage this.

Containers, microservices, and distributed systems are becoming a staple for cloud computing. These tools are incredibly useful, but they must be properly maintained. Any cloud provider may be vulnerable to occasional downtime. How you deal with cloud-related problems shouldn’t be figured out through hypotheticals or during a real outage. Chaos engineering can be the unpredictability that the cloud brings. You can use it to discover what to do with your systems in a non-critical environment. Simulating failure allows IT teams to verify that cloud systems are behaving as expected. This kind of tool is invaluable.

Chaos Can Be Fun!

If there were a real zombie apocalypse, it would be nowhere near as enjoyable as a video game or film. The same goes for failure. Simulating failure can easily be turned into a fun activity. You can specify when the failure is going to happen and can prepare a game day around it. These simulated failures have no real consequence, so it’s a great way to channel your love for computing!

Start your cloud training journey with Cloud Academy. Check out more of my posts at Solutions Review here.

Avatar

Written by

Tyler Stearns

I'm the lead editor at Solutions Review's Cloud and Network Monitoring sites. In my writing, I bridge the gap between consumer and technical expert to help readers understand what they're looking for. My passions outside of enterprise technology include film, games, swimming in rivers (only rivers), mechanical keyboards, fun socks, ramen, and goats.

Related Posts

Avatar
Andrew Larkin
— August 7, 2019

Disadvantages of Cloud Computing

If you want to deliver digital services of any kind, you’ll need to estimate all types of resources, not the least of which are CPU, memory, storage, and network connectivity. Which resources you choose for your delivery —  cloud-based or local — is up to you. But you’ll definitely want...

Read more
  • AWS
  • Azure
  • Cloud Computing
  • Google Cloud Platform
Avatar
Adam Hawkins
— June 12, 2019

What is Kubernetes? An Introductory Overview

In part 1 of my webinar series on Kubernetes, I introduced Kubernetes at a high level with hands-on demos aiming to answer the question, "What is Kubernetes?" After polling our audience, we found that most of the webinar attendees had never used Kubernetes before, or had only been expos...

Read more
  • Cloud Computing
  • Kubernetes
Avatar
Scott Huntington
— March 25, 2019

How Does Cloud Computing Work?

Whether you're looking to become a cloud engineer or you're a manager wanting to gain more knowledge, learn the basics of how cloud computing works. Are you wondering about how cloud computing actually works? We can help explain the basic principles behind this technology. Cloud comput...

Read more
  • Cloud Computing
Avatar
Guy Hummel
— March 4, 2019

What is Ansible?

What is Ansible? Ansible is an open-source IT automation engine, which can remove drudgery from your work life, and will also dramatically improve the scalability, consistency, and reliability of your IT environment. We'll start to explore how to automate repetitive system administratio...

Read more
  • Ansible
  • Cloud Computing
Avatar
Cloud Academy Team
— February 11, 2019

What is Puppet? Get Started With Our Course

When it comes to building and configuring IT infrastructure, especially across dozens or even thousands of servers, developers need tools that automate and streamline this process. Enter Puppet, one of the leading DevOps tools for automating delivery and operation of software no matter ...

Read more
  • Cloud Computing
  • Puppet
Avatar
Andrew Larkin
— January 15, 2019

2018 Was a Big Year for Content at Cloud Academy

As Head of Content at Cloud Academy I work closely with our customers and my domain leads to prioritize quarterly content plans that will achieve the best outcomes for our customers. We started 2018 with two content objectives: To show customer teams how to use Cloud Services to solv...

Read more
  • AWS
  • Azure
  • Cloud Computing
  • Google Cloud Platform
Avatar
Cloud Academy Team
— December 21, 2018

2019 Cloud Computing Predictions

2018 was a banner year in cloud computing, with Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) all continuing to launch new and innovative services. We also saw growth among enterprises in the adoption of methodologies supporting the move toward cloud-native...

Read more
  • Cloud Computing
  • Cloud Predictions
Albert Qian
Albert Qian
— August 28, 2018

Introducing Assessment Cycles

Today, cloud technology platforms and best practices around them move faster than ever, resulting in a paradigm shift for how organizations onboard and train their employees. While assessing employee skills on an annual basis might have sufficed a decade ago, the reality is that organiz...

Read more
  • Cloud Computing
  • Product Feature
  • Skill Profiles
Stefano Bellasio
Stefano Bellasio
— July 31, 2018

Cloud Skills: Transforming Your Teams with Technology and Data

How building Cloud Academy helped us understand the challenges of transforming large teams, and how data and planning can help with your cloud transformation. When we started Cloud Academy a few years ago, our founding team knew that cloud was going to be a revolution for the IT indu...

Read more
  • Cloud Computing
  • Skill Profiles
Albert Qian
Albert Qian
— May 23, 2018

Announcing Skill Profiles Beta

Now that you’ve decided to invest in the cloud, one of your chief concerns might be maximizing your investment. With little time to align resources with your vision, how do you objectively know the capabilities of your teams? By partnering with hundreds of enterprise organizations, we’...

Read more
  • Cloud Computing
  • Product Feature
  • Skill Profiles
Avatar
Cloud Academy Team
— April 5, 2018

A New Paradigm for Cloud Training is Needed (and Other Insights We Can Democratize)

It’s no secret that cloud, its supporting technologies, and the capabilities it unlocks is disrupting IT. Whether you’re cloud-first, multi-cloud, or migrating workload by workload, every step up the ever-changing cloud capability curve depends on your people, your technology, and your ...

Read more
  • Cloud Computing
Avatar
Cloud Academy Team
— November 22, 2017

AWS re:Invent 2017: Themes and Tools Shaping Cloud Computing in 2018

As the sixth annual re:Invent approaches, it’s a good time to look back at how the industry has progressed over the past year. How have last year’s trends held up, and what new trends are on the horizon? Where is AWS investing with its products and services? How are enterprises respondi...

Read more
  • AWS
  • Cloud Adoption
  • Cloud Computing
  • re:Invent 2017