What is Chaos Engineering? Failure Becomes Reliability

In the IT world, failure is inevitable. A server might go down, an app might crash, a network link might drop. Does your team know what to do during a major outage? Do you know which small failures could cascade into a larger system failure? Chaos engineering, sometimes offered as chaos as a service, will help you fail responsibly.
It almost sounds counterintuitive to think that failing is one of the best security and reliability measures, but this is what chaos engineering is all about. The simple idea behind it is to create chaotic scenarios to test the systems you have in place. Break things on purpose. It’s so punk!

The Origins of Chaos Engineering

Chaos engineering is a relatively new phenomenon. It all started around 2010, during Netflix’s move to the AWS cloud. Netflix saw the cloud as vulnerable: the company believed that no instance in the cloud could guarantee permanent uptime. So, they created Chaos Monkey. Chaos Monkey was designed to randomly disable production instances to make sure the service could survive common types of failures.
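To make the idea concrete, here is a minimal sketch of random instance termination. It is not Netflix’s Chaos Monkey code: the fleet is a plain in-memory dictionary, and the names `terminate_random_instance`, `service_is_available`, and `fleet` are invented for illustration. No real cloud API is called.

```python
import random


def terminate_random_instance(instances, rng):
    """Pick one running instance at random and mark it as terminated.

    `instances` maps an instance name to its state ("running" or "terminated").
    Returns the name of the terminated instance, or None if nothing was running.
    """
    running = [name for name, state in instances.items() if state == "running"]
    if not running:
        return None
    victim = rng.choice(running)
    instances[victim] = "terminated"
    return victim


def service_is_available(instances, min_running=1):
    """The toy service counts as available while enough instances are running."""
    running = sum(1 for state in instances.values() if state == "running")
    return running >= min_running


# Hypothetical fleet of three web servers behind one service.
fleet = {"web-1": "running", "web-2": "running", "web-3": "running"}
rng = random.Random(42)  # seeded so the run is repeatable

victim = terminate_random_instance(fleet, rng)
print(f"terminated {victim}; service available: {service_is_available(fleet)}")
```

The point of the sketch is the check at the end: losing any single instance should not take the service down. If it does, you have found a weakness before a real outage found it for you.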
Chaos Monkey wasn’t enough, though. Netflix wanted to create an entire virtual army of chaos, the Simian Army, which includes: Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, and Chaos Gorilla. I won’t go into each monkey’s function, but the idea is simple: Create chaos, guarantee reliability.
The Simian Army may be a fun tool, but it wasn’t always fun for customers. Some of the monkeys were responsible for customer-related problems. The chaos was too uncontrollable. Effectively managing failure like this requires controlled simulation. Thus, Netflix created Failure Injection Testing.
Failure Injection Testing (FIT) was designed to give developers a defined “blast radius” rather than unmanaged chaos. Mapping out the specific places where the tests will occur limits the risk to customers. These tests are supposed to be proactive, giving IT teams real experience in dealing with outages and other common problems. Without FIT, chaos as a service wouldn’t be a viable product for a mass audience.
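A blast radius can be pictured as a filter in front of the fault injection. The sketch below is an assumption-laden illustration, not how FIT itself is implemented: the `chaos-opt-in` tag, the `max_fraction` cap, and the `select_targets` function are all hypothetical.

```python
import math
import random


def select_targets(instances, scope_tag, max_fraction, rng):
    """Choose which instances a fault is allowed to touch.

    Only instances carrying `scope_tag` are eligible, and at most
    `max_fraction` of those eligible instances are selected.
    """
    in_scope = sorted(name for name, tags in instances.items() if scope_tag in tags)
    limit = math.floor(len(in_scope) * max_fraction)
    return rng.sample(in_scope, limit)


# Hypothetical inventory: instance name -> set of tags.
instances = {
    "checkout-1": {"prod"},
    "search-1": {"prod", "chaos-opt-in"},
    "search-2": {"prod", "chaos-opt-in"},
    "search-3": {"prod", "chaos-opt-in"},
    "search-4": {"prod", "chaos-opt-in"},
}

rng = random.Random(7)  # seeded so the selection is repeatable
targets = select_targets(instances, "chaos-opt-in", max_fraction=0.25, rng=rng)
print(f"fault will be injected into: {targets}")
```

Here the checkout instance can never be selected, and only a quarter of the opted-in search instances are affected. Widening the experiment is a deliberate choice (a bigger fraction, another tag) rather than an accident.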
Netflix introduced the FIT practice in 2014 when Kolton Andrus was working at the company. Andrus later became the co-founder of Gremlin, a company that offers chaos as a service.

What is Chaos as a Service?

Chaos as a service isn’t exactly chaotic in its current state. The Simian Army may have caused real chaos, but chaos offered as a service is far more controlled and logical. If you could safely simulate chaos in your day-to-day life to make yourself more efficient, wouldn’t you?
“Putting out fires” is a phrase I constantly hear in the world of IT. Networking fires, production fires, release fires, etc. Everything is so reactive, but it doesn’t have to be. Simulation is one of the best ways to learn how to manage a real-world situation.
Think of chaos engineering as an experiment. If you’re performing an experiment, you have a hypothesis. Thus, if you don’t have a clue what will happen during a failure, it might not be the right time to use chaos engineering.
You should have some idea about what will happen after you run a chaos experiment. The original Chaos Monkey may have created mostly random chaos to test its systems, but this approach isn’t optimal. Detailed knowledge of your systems, and a clear expectation of how they will behave, will make these experiments more effective. Even if your expectation turns out to be wrong, you will come away understanding your systems better and knowing what to fix.
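The experiment framing can be sketched in a few lines: measure a baseline, inject a fault, measure again, roll back, and compare the result to the hypothesis. Everything in this example is a toy assumption: the `run_experiment` helper, the capacity model, and the numbers are invented for illustration and do not come from any particular chaos tool.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    hypothesis: str
    baseline: float
    observed: float
    hypothesis_held: bool


def run_experiment(hypothesis, measure, inject_fault, rollback, tolerance):
    """Run one experiment and report whether the metric stayed within tolerance."""
    baseline = measure()      # steady state before the fault
    inject_fault()
    try:
        observed = measure()  # behavior while the fault is active
    finally:
        rollback()            # always restore the system, even if measuring fails
    held = abs(observed - baseline) <= tolerance
    return ExperimentResult(hypothesis, baseline, observed, held)


# Toy system: each replica serves 100 requests/s and demand is 150 requests/s.
system = {"replicas": 3}
CAPACITY_PER_REPLICA = 100
DEMAND = 150


def success_rate():
    capacity = system["replicas"] * CAPACITY_PER_REPLICA
    return min(1.0, capacity / DEMAND)


def kill_one_replica():
    system["replicas"] -= 1


def restore_replicas():
    system["replicas"] = 3


result = run_experiment(
    hypothesis="Losing one replica does not lower the request success rate",
    measure=success_rate,
    inject_fault=kill_one_replica,
    rollback=restore_replicas,
    tolerance=0.01,
)
print(result)
```

With three replicas, the hypothesis holds because two replicas still cover the demand. Change the fault to remove two replicas and the same harness reports that the hypothesis failed, which is exactly the kind of result that tells you what to fix.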

The Benefits of Chaos Engineering

Now, chaos engineering may sound a lot like testing, but it’s not that simple. The primary difference between testing and chaos engineering is the scale and the results. Testing tools are usually simplistic in practice. You provide a testing tool with a condition, and it gives you a result. There’s only so much that can be learned this way. Chaos engineering creates an experimental scenario to not only test your systems but to test yourself and your team. You might discover far more than you asked for.
By causing deliberate failures, IT teams will gain confidence that their systems can deal with failures before they occur in production. All complex cloud systems will eventually fail. Using chaos engineering will allow you to recognize what’s wrong with the system, what you can do to fix it, and how to better deal with failure in real time.
Building the most effective system requires experimentation. Chaos engineering allows you to run specific scenarios that could happen at any time while a product or service is live. Running these scenarios allows you to measure specific aspects of a failure. Maybe the scenario returned the exact result you expected, or maybe it produced something completely new. Either way, you’re able to improve your systems and provide a more reliable service to your customers.
Gaining insight into system problems also creates a better production environment. Everyone will know what to look for in the future, and what systems might be vulnerable. You can make changes in your cloud environments based on your results.

The Unpredictability of the Cloud

One of the biggest concerns about the cloud is its relative unpredictability. Netflix introduced chaos engineering to address its own concerns about the cloud. The services that cloud platforms rely on can be inconsistent, and chaos engineering is a practical way to manage this.
Containers, microservices, and distributed systems are becoming staples of cloud computing. These tools are incredibly useful, but they must be properly maintained. Any cloud provider may be vulnerable to occasional downtime. How you deal with cloud-related problems shouldn’t be figured out through hypotheticals or during a real outage. Chaos engineering can stand in for the unpredictability that the cloud brings. You can use it to discover how your systems respond, and what to do about it, in a non-critical environment.
Simulating failure allows IT teams to verify that cloud systems are behaving as expected. This kind of tool is invaluable.

Chaos Can Be Fun!

If there were a real zombie apocalypse, it would be nowhere near as enjoyable as a video game or film. The same goes for failure. Simulating failure can easily be turned into a fun activity. You can specify when the failure is going to happen and prepare a game day around it. As long as the blast radius is kept under control, these simulated failures carry little real consequence, so it’s a great way to channel your love for computing!

Written by

I'm the lead editor at Solutions Review's Cloud and Network Monitoring sites. In my writing, I bridge the gap between consumer and technical expert to help readers understand what they're looking for. My passions outside of enterprise technology include film, games, swimming in rivers (only rivers), mechanical keyboards, fun socks, ramen, and goats.
