What is Chaos Engineering? Failure Becomes Reliability

In the IT world, failure is inevitable. A server might go down, an app may fail, etc. Does your team know what to do during a major outage? Do you know what instances may cause a larger systems failure? Chaos engineering, or chaos as a service, will help you fail responsibly.

It almost sounds counterintuitive to think that failing is one of the best security and reliability measures, but this is what chaos engineering is all about. The simple idea behind it is to create chaotic scenarios to test the systems you have in place. Break things on purpose. It’s so punk!
What is Chaos Engineering?

The Origins of Chaos Engineering

In our blog, we have talked about Site Reliability Engineering before but Chaos engineering is a relatively new phenomenon. It all started with Netflix’s move to the AWS cloud in 2010. Netflix saw the cloud as vulnerable. They believed that no instance in the cloud could guarantee permanent uptime. So, they created Chaos Monkey. Chaos Monkey was designed to randomly disable production instances to ensure survivability during common types of failures.
Chaos Monkey wasn’t enough, though. Netflix wanted to create an entire virtual army of chaos, the Simian Army, which includes: Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, and Chaos Gorilla. I won’t go into each monkey’s function, but the idea is simple: Create chaos, guarantee reliability.

The Simian Army may be a fun tool, but it wasn’t always fun for customers. Some of the monkeys were responsible for customer-related problems. The chaos was too uncontrollable. Effectively managing failure like this requires controlled simulation. Thus, Netflix created Failure Injection Testing.

Failure Injection Testing (FIT) was designed to give developers a “blast radius” rather than unmanaged chaos. Mapping out specific places where the tests will occur eliminates the risks. These tests are supposed to be proactive, giving IT teams real experience in dealing with outages and other common problems. Without FIT, chaos as a service wouldn’t be a viable product for a mass audience. Netflix introduced the FIT practice in 2014 when Kolton Andrus was working at the company. Andrus later became the co-founder of Gremlin, a company that offers chaos as a service.

What is Chaos as a Service?

Chaos as a service isn’t exactly chaotic in its current state. The Simian Army may have caused real chaos, but its use as a service is far more controlled and logical. Essentially, if you could simulate chaos in your day-to-day life to maximize your personal efficiency, wouldn’t you?

Putting out fires is a term I constantly hear about in the world of IT. Networking fires, production fires, release fires, etc. Everything is so reactionary, but it doesn’t have to be. Simulation is the best way to learn how to manage a real-world situation. Think of chaos engineering as an experiment. If you’re performing an experiment, you have a hypothesis. Thus, if you don’t have a clue what will happen during a failure, it might not be the right time to use chaos engineering.

You should have some idea about what will happen after you run a chaos experiment. The original Chaos Monkey may have created mostly random chaos to test its systems, but this approach isn’t optimal. Teams should have some idea of what to expect. Having a detailed knowledge and expectation of your systems will make these experiments more effective. If you’re wrong, you will only better understand your systems and know what to fix.

The Benefits of Chaos Engineering

Now, chaos engineering may sound a lot like testing, but it’s not that simple. The primary difference between testing and chaos engineering is the scale and the results. Testing tools are usually simplistic in practice. You provide a testing tool with a condition, and it gives you a result. There’s only so much that can be learned this way. Chaos engineering creates an experimental scenario to not only test your systems but to test yourself and your team. You might discover far more than you asked for. By causing deliberate failures, IT teams will gain confidence that their systems can deal with failures before they occur in production. All complex cloud systems will eventually fail. Using chaos engineering will allow you to recognize what’s wrong with the system, what you can do to fix it, and how to better deal with failure in real time. Building the most effective system requires experimentation. Chaos engineering allows you to run specific scenarios that could happen at any time while a product or service is live. Running these scenarios allows you to measure specific aspects of a failure. Maybe the scenario returned the exact result you expected, maybe it resulted in something completely new. Either way, you’re able to improve your systems and provide the most reliable service to your customers.

Gaining insight into system problems also creates a better production environment. Everyone will know what to look for in the future, and what systems might be vulnerable. You can make changes in your cloud environments based on your results.

The Unpredictability of the Cloud

One of the biggest concerns about the cloud is its relative unpredictability. Netflix introduced chaos engineering to combat their concerns about the cloud. The services that cloud platforms rely on can be inconsistent and chaos engineering is the perfect way to manage this.

Containers, microservices, and distributed systems are becoming a staple for cloud computing. These tools are incredibly useful, but they must be properly maintained. Any cloud provider may be vulnerable to occasional downtime. How you deal with cloud-related problems shouldn’t be figured out through hypotheticals or during a real outage. Chaos engineering can be the unpredictability that the cloud brings. You can use it to discover what to do with your systems in a non-critical environment. Simulating failure allows IT teams to verify that cloud systems are behaving as expected. This kind of tool is invaluable.

Chaos Can Be Fun!

If there were a real zombie apocalypse, it would be nowhere near as enjoyable as a video game or film. The same goes for failure. Simulating failure can easily be turned into a fun activity. You can specify when the failure is going to happen and can prepare a game day around it. These simulated failures have no real consequence, so it’s a great way to channel your love for computing!

Start your cloud training journey with Cloud Academy. Check out more of my posts at Solutions Review here.

Avatar

Written by

Tyler Stearns

I'm the lead editor at Solutions Review's Cloud and Network Monitoring sites. In my writing, I bridge the gap between consumer and technical expert to help readers understand what they're looking for. My passions outside of enterprise technology include film, games, swimming in rivers (only rivers), mechanical keyboards, fun socks, ramen, and goats.


Related Posts

Chester Avey
Chester Avey
— November 7, 2019

Cloud Computing Solutions: 7 Trends for the Future

The world of cloud computing is in a state of flux. Not long ago, the cloud was considered an emerging technology, known only to IT specialists. Today it is a part of everyday life – 96% of businesses use the cloud in one form or another, and this number only looks set to grow. Whether ...

Read more
  • Cloud Computing
  • internet of everything
  • multi-cloud
  • Security
  • SEO
Avatar
Cloud Academy Team
— October 23, 2019

Which Certifications Should I Get?

As we mentioned in an earlier post, the old AWS slogan, “Cloud is the new normal” is indeed a reality today. Really, cloud has been the new normal for a while now and getting credentials has become an increasingly effective way to quickly showcase your abilities to recruiters and compan...

Read more
  • AWS
  • Azure
  • Certifications
  • Cloud Computing
  • Google Cloud Platform
Avatar
Walter Stone
— October 10, 2019

8 Surprising Ways Cloud Computing Is Changing Education

Cloud computing: Empowering the education industry Over the years, the education industry has come a long way. Teaching and learning are no longer confined to textbooks and classrooms and now reaches computers and mobile devices. Today, learners are always connected — whether they are ...

Read more
  • Cloud Computing
  • education industry
Avatar
Michael Sheehy
— August 19, 2019

What Exactly Is a Cloud Architect and How Do You Become One?

One of the buzzwords surrounding the cloud that I'm sure you've heard is "Cloud Architect." In this article, I will outline my understanding of what a cloud architect does and I'll analyze the skills and certifications necessary to become one. I will also list some of the types of jobs ...

Read more
  • AWS
  • Cloud Computing
Avatar
Andrew Larkin
— August 7, 2019

Disadvantages of Cloud Computing

If you want to deliver digital services of any kind, you’ll need to estimate all types of resources, not the least of which are CPU, memory, storage, and network connectivity. Which resources you choose for your delivery —  cloud-based or local — is up to you. But you’ll definitely want...

Read more
  • AWS
  • Azure
  • Cloud Computing
  • Google Cloud Platform
Avatar
Adam Hawkins
— June 12, 2019

What is Kubernetes? An Introductory Overview

In part 1 of my webinar series on Kubernetes, I introduced Kubernetes at a high level with hands-on demos aiming to answer the question, "What is Kubernetes?" After polling our audience, we found that most of the webinar attendees had never used Kubernetes before, or had only been expos...

Read more
  • Cloud Computing
  • Kubernetes
Avatar
Scott Huntington
— March 25, 2019

How Does Cloud Computing Work?

Whether you're looking to become a cloud engineer or you're a manager wanting to gain more knowledge, learn the basics of how cloud computing works. Are you wondering about how cloud computing actually works? We can help explain the basic principles behind this technology. Cloud comput...

Read more
  • Cloud Computing
Avatar
Guy Hummel
— March 4, 2019

What is Ansible?

What is Ansible? Ansible is an open-source IT automation engine, which can remove drudgery from your work life, and will also dramatically improve the scalability, consistency, and reliability of your IT environment. We'll start to explore how to automate repetitive system administratio...

Read more
  • Ansible
  • Cloud Computing
Avatar
Cloud Academy Team
— February 11, 2019

What is Puppet? Get Started With Our Course

When it comes to building and configuring IT infrastructure, especially across dozens or even thousands of servers, developers need tools that automate and streamline this process. Enter Puppet, one of the leading DevOps tools for automating delivery and operation of software no matter ...

Read more
  • Cloud Computing
  • Puppet
Avatar
Andrew Larkin
— January 15, 2019

2018 Was a Big Year for Content at Cloud Academy

As Head of Content at Cloud Academy I work closely with our customers and my domain leads to prioritize quarterly content plans that will achieve the best outcomes for our customers. We started 2018 with two content objectives: To show customer teams how to use Cloud Services to solv...

Read more
  • AWS
  • Azure
  • Cloud Computing
  • Google Cloud Platform
Avatar
Cloud Academy Team
— December 21, 2018

2019 Cloud Computing Predictions

2018 was a banner year in cloud computing, with Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) all continuing to launch new and innovative services. We also saw growth among enterprises in the adoption of methodologies supporting the move toward cloud-native...

Read more
  • Cloud Computing
  • Cloud Predictions
Albert Qian
Albert Qian
— August 28, 2018

Introducing Assessment Cycles

Today, cloud technology platforms and best practices around them move faster than ever, resulting in a paradigm shift for how organizations onboard and train their employees. While assessing employee skills on an annual basis might have sufficed a decade ago, the reality is that organiz...

Read more
  • Cloud Computing
  • Product Feature
  • Skill Profiles