In the IT world, failure is inevitable. A server might go down, an app may fail, etc. Does your team know what to do during a major outage? Do you know which instances could trigger a larger system failure? Chaos engineering, or chaos as a service, will help you fail responsibly.
It almost sounds counterintuitive to think that failing is one of the best security and reliability measures, but this is what chaos engineering is all about. The simple idea behind it is to create chaotic scenarios to test the systems you have in place. Break things on purpose. It’s so punk!
The Origins of Chaos Engineering
We have talked about Site Reliability Engineering on our blog before, but chaos engineering is a relatively new phenomenon. It all started with Netflix’s move to the AWS cloud in 2010. Netflix saw the cloud as vulnerable: no single instance in the cloud could guarantee permanent uptime. So, they created Chaos Monkey, a tool designed to randomly disable production instances to ensure the service could survive common types of failures.
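To make the idea concrete, here is a minimal sketch of randomly disabling one instance and checking whether the service survives. It is a toy in-memory simulation, not Netflix’s actual Chaos Monkey code: the `Instance` class, the `web-N` names, and the `min_running` threshold are all assumptions made up for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for one production instance."""
    name: str
    running: bool = True


def disable_random_instance(fleet, rng):
    """Pick one running instance at random and mark it as stopped."""
    candidates = [i for i in fleet if i.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


def service_is_up(fleet, min_running=1):
    """The service survives while enough instances are still running."""
    return sum(i.running for i in fleet) >= min_running


fleet = [Instance(f"web-{n}") for n in range(3)]
rng = random.Random(42)  # seeded so the run is repeatable
victim = disable_random_instance(fleet, rng)
print(f"disabled {victim.name}; service up: {service_is_up(fleet)}")
```

With three instances and a threshold of one, losing any single instance leaves the service up. Raising `min_running` to 3 would make the same run report a failure, which is the kind of weakness this approach is meant to surface.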
Chaos Monkey wasn’t enough, though. Netflix wanted to create an entire virtual army of chaos, the Simian Army, which includes: Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, and Chaos Gorilla. I won’t go into each monkey’s function, but the idea is simple: Create chaos, guarantee reliability.
The Simian Army may have been a fun tool, but it wasn’t always fun for customers. Some of the monkeys caused customer-facing problems. The chaos was too hard to control. Effectively managing failure like this requires controlled simulation. Thus, Netflix created Failure Injection Testing.
Failure Injection Testing (FIT) was designed to give developers a defined “blast radius” rather than unmanaged chaos. Mapping out the specific places where the tests will occur limits the risk to customers. These tests are supposed to be proactive, giving IT teams real experience in dealing with outages and other common problems. Without FIT, chaos as a service wouldn’t be a viable product for a mass audience. Netflix introduced the FIT practice in 2014, when Kolton Andrus was working at the company. Andrus later co-founded Gremlin, a company that offers chaos as a service.
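One way to picture a blast radius is as a scope check that runs before any failure is injected. The sketch below is an assumption-laden illustration, not how FIT or Gremlin is implemented: the `BlastRadius` class, the `recommendations` service name, and the percentage-of-users rule are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    """Hypothetical scope for a failure injection."""
    service: str
    percent_of_users: int  # 0 to 100


def in_blast_radius(scope, service, user_id):
    """Return True if this request should receive the injected failure."""
    if service != scope.service:
        return False
    # Hash the user id into a stable bucket (0-99) so the same user
    # is always inside or always outside the experiment.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < scope.percent_of_users


def handle_request(scope, service, user_id):
    """Simulate a request: fail it only when it falls inside the scope."""
    if in_blast_radius(scope, service, user_id):
        return "injected failure"
    return "ok"


scope = BlastRadius(service="recommendations", percent_of_users=5)
users = [f"user-{n}" for n in range(1000)]
affected = [u for u in users
            if handle_request(scope, "recommendations", u) != "ok"]
print(f"{len(affected)} of {len(users)} users saw the injected failure")
```

Requests to any other service are never touched, and only a small slice of users on the targeted service sees the failure. Widening the experiment is then a deliberate change to `percent_of_users` rather than an accident.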
What is Chaos as a Service?
Chaos as a service isn’t exactly chaotic in its current state. The Simian Army may have caused real chaos, but chaos offered as a service is far more controlled and deliberate. Think of it this way: if you could simulate chaos in your day-to-day life to learn how to handle it better, wouldn’t you?
“Putting out fires” is a phrase I constantly hear in the world of IT. Networking fires, production fires, release fires, etc. Everything is so reactive, but it doesn’t have to be. Simulation is the best way to learn how to manage a real-world situation. Think of chaos engineering as an experiment. If you’re performing an experiment, you have a hypothesis. Thus, if you don’t have a clue what will happen during a failure, it might not be the right time to use chaos engineering.
You should have some idea about what will happen after you run a chaos experiment. The original Chaos Monkey may have created mostly random chaos to test its systems, but this approach isn’t optimal. Teams should know what to expect. Having detailed knowledge and clear expectations of your systems will make these experiments more effective. If you’re wrong, you will come away understanding your systems better and knowing what to fix.
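A hypothesis-driven experiment can be sketched as four parts: a stated hypothesis, a steady-state check, a failure to inject, and a rollback. The code below is a minimal illustration under those assumptions. The `Experiment` class, the `run_experiment` function, and the two-replica toy system are hypothetical and not taken from any particular chaos tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment."""
    hypothesis: str
    inject: Callable[[], None]        # introduce the failure
    rollback: Callable[[], None]      # undo the failure
    steady_state: Callable[[], bool]  # True while the system looks healthy


def run_experiment(exp):
    """Inject the failure, check the hypothesis, and always roll back."""
    if not exp.steady_state():
        return "aborted: system unhealthy before injection"
    exp.inject()
    try:
        held = exp.steady_state()
    finally:
        exp.rollback()
    return "hypothesis held" if held else "hypothesis refuted"


# Toy system: two replicas; requests succeed if any replica is up.
replicas = {"a": True, "b": True}

exp = Experiment(
    hypothesis="Requests still succeed when replica 'a' is down",
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
    steady_state=lambda: any(replicas.values()),
)
result = run_experiment(exp)
print(f"{exp.hypothesis}: {result}")
```

Checking the steady state before injecting anything keeps the experiment from running against a system that is already unhealthy, and the `finally` clause restores the system whether or not the hypothesis holds.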
The Benefits of Chaos Engineering
Now, chaos engineering may sound a lot like testing, but it’s not that simple. The primary difference between testing and chaos engineering is the scale and the results. Testing tools are usually simplistic in practice. You provide a testing tool with a condition, and it gives you a result. There’s only so much that can be learned this way. Chaos engineering creates an experimental scenario to not only test your systems but to test yourself and your team. You might discover far more than you asked for.
By causing deliberate failures, IT teams will gain confidence that their systems can deal with failures before they occur in production. All complex cloud systems will eventually fail. Using chaos engineering will allow you to recognize what’s wrong with the system, what you can do to fix it, and how to better deal with failure in real time.
Building the most effective system requires experimentation. Chaos engineering allows you to run specific scenarios that could happen at any time while a product or service is live. Running these scenarios allows you to measure specific aspects of a failure. Maybe the scenario returned the exact result you expected, maybe it resulted in something completely new. Either way, you’re able to improve your systems and provide the most reliable service to your customers.
Gaining insight into system problems also creates a better production environment. Everyone will know what to look for in the future, and what systems might be vulnerable. You can make changes in your cloud environments based on your results.
The Unpredictability of the Cloud
One of the biggest concerns about the cloud is its relative unpredictability. Netflix introduced chaos engineering to address its own concerns about the cloud. The services that cloud platforms rely on can be inconsistent, and chaos engineering is a practical way to prepare for this.
Containers, microservices, and distributed systems are becoming staples of cloud computing. These tools are incredibly useful, but they must be properly maintained. Any cloud provider may be vulnerable to occasional downtime. How you deal with cloud-related problems shouldn’t be figured out through hypotheticals or during a real outage. Chaos engineering can stand in for the unpredictability that the cloud brings. You can use it to discover how your systems respond in a non-critical environment. Simulating failure allows IT teams to verify that cloud systems are behaving as expected. This kind of tool is invaluable.
Chaos Can Be Fun!
If there were a real zombie apocalypse, it would be nowhere near as enjoyable as a video game or film. The same goes for failure. Simulating failure can easily be turned into a fun activity. You can specify when the failure is going to happen and can prepare a game day around it. These simulated failures have no real consequence, so it’s a great way to channel your love for computing!