What is Chaos Engineering? Failure Becomes Reliability

In the IT world, failure is inevitable. A server might go down, an app may crash, and so on. Does your team know what to do during a major outage? Do you know which instances could trigger a larger system failure? Chaos engineering, or chaos as a service, will help you fail responsibly.

It almost sounds counterintuitive to think that failing is one of the best security and reliability measures, but this is what chaos engineering is all about. The simple idea behind it is to create chaotic scenarios to test the systems you have in place. Break things on purpose. It’s so punk!

The Origins of Chaos Engineering

We have talked about Site Reliability Engineering on our blog before, but chaos engineering is a relatively new phenomenon. It all started with Netflix’s move to the AWS cloud in 2010. Netflix saw the cloud as vulnerable: no single instance in the cloud could guarantee permanent uptime. So the company created Chaos Monkey, a tool designed to randomly disable production instances to make sure its services could survive common types of failure.
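
To make the idea concrete, here is a minimal sketch of random instance termination against a toy, in-memory fleet. It is not Netflix’s implementation and touches no real cloud API; the names `pick_victim`, `unleash_monkey`, `service_is_up`, and the `min_running` threshold are all assumptions made for illustration.

```python
import random


def pick_victim(instances, rng):
    """Pick one running instance at random, or None if nothing is running."""
    running = [name for name, state in instances.items() if state == "running"]
    return rng.choice(running) if running else None


def unleash_monkey(instances, rng):
    """Terminate one random running instance in the toy fleet; return its name."""
    victim = pick_victim(instances, rng)
    if victim is not None:
        instances[victim] = "terminated"
    return victim


def service_is_up(instances, min_running=2):
    """Assumption: the toy service survives while enough instances keep running."""
    running = sum(state == "running" for state in instances.values())
    return running >= min_running


if __name__ == "__main__":
    # Hypothetical fleet of three web servers.
    fleet = {"web-1": "running", "web-2": "running", "web-3": "running"}
    victim = unleash_monkey(fleet, random.Random())
    print(f"terminated {victim}; service up: {service_is_up(fleet)}")
```

In this toy model, a fleet with one spare instance survives a single random termination. That is the kind of property a tool like this is meant to check.
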
Chaos Monkey wasn’t enough, though. Netflix wanted to create an entire virtual army of chaos, the Simian Army, which includes: Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, and Chaos Gorilla. I won’t go into each monkey’s function, but the idea is simple: Create chaos, guarantee reliability.

The Simian Army may have been a fun tool, but it wasn’t always fun for customers. Some of the monkeys were responsible for customer-facing problems. The chaos was too uncontrolled. Effectively managing failure like this requires controlled simulation. Thus, Netflix created Failure Injection Testing.

Failure Injection Testing (FIT) was designed to give developers control over the “blast radius” of a failure rather than unleashing unmanaged chaos. Mapping out the specific places where the tests will occur limits the risk to everything else. These tests are supposed to be proactive, giving IT teams real experience in dealing with outages and other common problems. Without FIT, chaos as a service wouldn’t be a viable product for a mass audience. Netflix introduced the FIT practice in 2014, when Kolton Andrus was working at the company. Andrus later co-founded Gremlin, a company that offers chaos as a service.
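
As a rough illustration of a limited blast radius, the sketch below fails requests only when they fall inside an explicitly declared scope. It is not how FIT itself is built; `FailureScope`, `InjectedFailure`, `call_service`, and the service and user names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScope:
    """Hypothetical description of where an injected failure may land."""
    service: str
    allowed_users: frozenset

    def covers(self, service, user):
        """True only for the named service and the opted-in users."""
        return service == self.service and user in self.allowed_users


class InjectedFailure(Exception):
    """Raised in place of a real outage for requests inside the scope."""


def call_service(service, user, scope=None):
    """Toy request handler: fail only when the request is inside the scope."""
    if scope is not None and scope.covers(service, user):
        raise InjectedFailure(f"{service} failed on purpose for {user}")
    return f"{service} ok for {user}"


if __name__ == "__main__":
    scope = FailureScope("recommendations", frozenset({"test-user-1"}))
    for user in ("test-user-1", "real-user-9"):
        try:
            print(call_service("recommendations", user, scope))
        except InjectedFailure as err:
            print(f"injected: {err}")
```

Requests from users outside the scope, and requests to other services, behave normally. That is the point of declaring the scope before the test starts.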

What is Chaos as a Service?

Chaos as a service isn’t exactly chaotic in its current state. The Simian Army may have caused real chaos, but chaos delivered as a service is far more controlled and logical. Essentially, if you could simulate chaos in your day-to-day life to maximize your personal efficiency, wouldn’t you?

“Putting out fires” is a phrase I constantly hear in the world of IT. Networking fires, production fires, release fires, and so on. Everything is so reactive, but it doesn’t have to be. Simulation is the best way to learn how to manage a real-world situation. Think of chaos engineering as an experiment. If you’re performing an experiment, you need a hypothesis. So if you don’t have a clue what will happen during a failure, it might not be the right time to use chaos engineering.

You should have some idea about what will happen after you run a chaos experiment. The original Chaos Monkey may have created mostly random chaos to test Netflix’s systems, but this approach isn’t optimal. Teams should have some idea of what to expect. Detailed knowledge of your systems, and clear expectations about how they behave, will make these experiments more effective. If you’re wrong, you will still come away understanding your systems better and knowing what to fix.
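
One way to picture a hypothesis-driven experiment is the sketch below: measure a metric, inject a failure, measure again, and compare the drop against what you predicted. The `ToyService` class, the `run_experiment` helper, and the 5% tolerance in `max_drop` are assumptions for illustration, not part of any particular chaos tool.

```python
class ToyService:
    """Toy system: the success rate depends on how many replicas are healthy."""

    def __init__(self, replicas, needed):
        self.healthy = replicas
        self.needed = needed

    def success_rate(self):
        return min(1.0, self.healthy / self.needed)

    def kill(self, n=1):
        self.healthy = max(0, self.healthy - n)

    def restore(self, n=1):
        self.healthy += n


def run_experiment(measure, inject, rollback, max_drop=0.05):
    """Compare a metric before and during a failure against the hypothesis."""
    baseline = measure()
    inject()
    try:
        during = measure()
    finally:
        rollback()  # always undo the failure, even if measuring raises
    drop = (baseline - during) / baseline if baseline else 0.0
    return {"baseline": baseline, "during": during,
            "hypothesis_held": drop <= max_drop}


if __name__ == "__main__":
    # Hypothesis: losing one of three replicas costs at most 5% of successes.
    svc = ToyService(replicas=3, needed=2)
    print(run_experiment(svc.success_rate, svc.kill, svc.restore))
```

If the hypothesis does not hold, the report still tells you how far off you were, which is often the more useful result.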

The Benefits of Chaos Engineering

Now, chaos engineering may sound a lot like testing, but it’s not that simple. The primary difference between testing and chaos engineering is the scale and the results. Testing tools are usually simplistic in practice. You provide a testing tool with a condition, and it gives you a result. There’s only so much that can be learned this way. Chaos engineering creates an experimental scenario to test not only your systems but also yourself and your team. You might discover far more than you asked for.

By causing deliberate failures, IT teams will gain confidence that their systems can deal with failures before those failures occur in production. All complex cloud systems will eventually fail. Chaos engineering helps you recognize what’s wrong with the system, what you can do to fix it, and how to better deal with failure in real time.

Building the most effective system requires experimentation. Chaos engineering allows you to run specific scenarios that could happen at any time while a product or service is live. Running these scenarios allows you to measure specific aspects of a failure. Maybe the scenario returns the exact result you expected; maybe it results in something completely new. Either way, you’re able to improve your systems and provide a more reliable service to your customers.

Gaining insight into system problems also creates a better production environment. Everyone will know what to look for in the future, and what systems might be vulnerable. You can make changes in your cloud environments based on your results.

The Unpredictability of the Cloud

One of the biggest concerns about the cloud is its relative unpredictability. Netflix introduced chaos engineering to address its own concerns about the cloud. The services that cloud platforms rely on can be inconsistent, and chaos engineering is a practical way to manage this.

Containers, microservices, and distributed systems are becoming staples of cloud computing. These tools are incredibly useful, but they must be properly maintained. Any cloud provider may be vulnerable to occasional downtime. How you deal with cloud-related problems shouldn’t be figured out through hypotheticals or during a real outage. Chaos engineering can stand in for the unpredictability that the cloud brings. You can use it to discover how your systems respond in a non-critical environment. Simulating failure allows IT teams to verify that cloud systems are behaving as expected. This kind of tool is invaluable.

Chaos Can Be Fun!

If there were a real zombie apocalypse, it would be nowhere near as enjoyable as a video game or film. The same goes for failure. Simulating failure can easily be turned into a fun activity. You can specify when the failure is going to happen and prepare a game day around it. When they are properly controlled, these simulated failures carry little real consequence, so it’s a great way to channel your love for computing!


Written by

Tyler Stearns

I'm the lead editor at Solutions Review's Cloud and Network Monitoring sites. In my writing, I bridge the gap between consumer and technical expert to help readers understand what they're looking for. My passions outside of enterprise technology include film, games, swimming in rivers (only rivers), mechanical keyboards, fun socks, ramen, and goats.

