How to Architect with a Design for Failure Approach

Let's make it fail


Testing against failures

The gold standard for high availability is five 9s, meaning uptime of 99.999%. That allows only about five and a quarter minutes of downtime over an entire year. Achieving this kind of reliability requires some advanced knowledge of the many tools AWS provides to build a robust infrastructure.
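The downtime figure follows directly from the percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not from the course):

```python
# Downtime allowed per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime a yearly availability target permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare three 9s, four 9s, and five 9s.
for target in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```

Each extra 9 cuts the yearly downtime budget by a factor of ten, which is why five 9s leaves so little room for manual recovery.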

In this course, expert Cloud Architect Kevin Felichko will show one of the many possible ways to create a high availability application, designing the whole infrastructure with a Design for Failure approach. You'll learn how to use AutoScaling, load balancing, and VPC to run a standard Ruby on Rails application on an EC2 instance, with data stored in an RDS-backed MySQL database and assets stored on S3. Kevin will also touch on some advanced topics, like using CloudFront for content delivery and distributing an application across multiple AWS regions.

Who should take this course

This is an intermediate/advanced course, so you will need some experience with EC2, S3, and RDS, and at least a basic knowledge of AutoScaling, ELB, VPC, Route 53, and CloudFront.

If you have thoughts or suggestions for this course, please contact Cloud Academy at support@cloudacademy.com.


In this, the sixth lesson of our series, we're going to test our architecture by simulating failure events. Each test will target a specific area of our design. Without any intervention on our part, the site should recover and be available to our users.

We are going to run three separate tests on our architecture. Each test will demonstrate our design goals: fault tolerance, no single point of failure, graceful degradation of services, and self-healing. The first test involves shutting down one of the EC2 instances that belongs to our auto scaling group. Our second test turns off auto scaling and shuts down all EC2 instances. The third and final test will force a failover of our RDS instance. Let's begin.

Our first test is to shut down a random EC2 instance that is governed by our auto scaling group. We start by heading to our EC2 dashboard and viewing the running instances. Next, we select any instance. From the Actions menu, we click Terminate and confirm that we actually want to perform this action. While the instance is terminating, we can open up a new tab and hit our website. The site loads without any issues. Jumping back to the EC2 dashboard shows that the instance has finally terminated. Refreshing our site shows that it's still operational. The auto scaling Instances tab for this group shows the instance's health status as unhealthy. After some time, the auto scaling group will fire up a new instance in the availability zone that lost the instance. Refreshing the EC2 dashboard shows a new instance has been launched.

Our site is still operational. Eventually this new instance will pass its health checks and the elastic load balancer will start sending traffic to it, demonstrating each one of our design goals.
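The first test was done by hand in the console, but the same action could be scripted. The following is a minimal sketch, not the course's own tooling: it assumes boto3-style `autoscaling` and `ec2` clients are passed in, and the function names and the example group name are hypothetical.

```python
import random


def pick_random_instance(autoscaling, group_name, rng=random):
    """Return the ID of a random in-service instance in the group, or None."""
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    groups = response["AutoScalingGroups"]
    if not groups:
        return None
    # Only consider instances the group currently counts as serving traffic.
    in_service = [
        instance["InstanceId"]
        for instance in groups[0]["Instances"]
        if instance["LifecycleState"] == "InService"
    ]
    return rng.choice(in_service) if in_service else None


def terminate_random_instance(autoscaling, ec2, group_name, rng=random):
    """Terminate one random in-service instance and return its ID (or None)."""
    instance_id = pick_random_instance(autoscaling, group_name, rng)
    if instance_id is not None:
        ec2.terminate_instances(InstanceIds=[instance_id])
    return instance_id


# Example wiring (needs the third-party boto3 package and AWS credentials;
# "rails-app-asg" is a made-up group name):
#   import boto3
#   terminate_random_instance(
#       boto3.client("autoscaling"), boto3.client("ec2"), "rails-app-asg")
```

Passing the clients in as arguments keeps the functions easy to exercise against stand-in objects before pointing them at a real account.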

The second test will demonstrate a complete failover of our primary site to the secondary S3 site. To accomplish this, we'll need to set the desired, min, and max options of our auto scaling group to zero. When saved, the group will begin terminating all running EC2 instances. Once they have terminated, we open a new tab and hit our site. We are greeted with our S3 bucket version automatically. Route 53, unable to reach the primary site, begins serving up the secondary site.

Next, we want to fall back to our primary site when it becomes available again. Back on the auto scaling group, we revert our desired, min, and max options to their original setting of three and save the changes. The auto scaling group will fire up new instances, which we can see from the EC2 dashboard. The load balancer still shows that zero of three instances are in service. If you recall, the ELB will take around five minutes to declare an instance healthy enough to send traffic to it. Fast-forwarding, each instance moves from out-of-service status to in service. Back in our other tab, we refresh the page to demonstrate that our primary site is back up and running.

For our third and final test, we will reboot the primary RDS instance. In the RDS dashboard, we can see that the primary instance is currently running in us-east-1a. To reboot the instance, we select it and head to the Instance Actions button. The drop-down will display a few options. We want the Reboot option. We need to confirm that we actually want to reboot. Before we do that, we need to select the Reboot with Failover box. When it is checked, the reboot is carried out through a failover to the standby. Left unchecked, the instance simply reboots in place and no failover takes place.
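The two console actions described above, changing the group's sizes and rebooting the database with failover, could also be scripted. This is a hedged sketch assuming boto3-style `autoscaling` and `rds` clients are passed in; the helper names are hypothetical, while `update_auto_scaling_group` and `reboot_db_instance` are the corresponding boto3 client calls.

```python
def set_group_size(autoscaling, group_name, size):
    """Set the min, max, and desired capacity of an Auto Scaling group to `size`."""
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=size,
        MaxSize=size,
        DesiredCapacity=size,
    )


def reboot_with_failover(rds, db_instance_id):
    """Reboot a Multi-AZ RDS instance, forcing a failover to the standby."""
    rds.reboot_db_instance(
        DBInstanceIdentifier=db_instance_id,
        ForceFailover=True,  # API counterpart of the "reboot with failover" box
    )


# Example wiring (needs the third-party boto3 package and AWS credentials;
# the group and instance names are made up):
#   import boto3
#   asg = boto3.client("autoscaling")
#   set_group_size(asg, "rails-app-asg", 0)  # second test: take the site down
#   set_group_size(asg, "rails-app-asg", 3)  # restore the original size
#   reboot_with_failover(boto3.client("rds"), "rails-app-db")  # third test
```

Forcing a failover only makes sense for an instance deployed across multiple availability zones, which is the setup this lesson's architecture uses.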

We confirm the reboot to continue. It will take a short amount of time for the failover to take effect. To save time, we'll fast-forward to after the failover has completed. We can see that the primary RDS instance is now running in us-east-1b. A quick peek at the instance log shows it took 34 seconds for the failover to complete. During this time, users of our site will experience a disruption, which we will address later. Our focus with RDS here is the self-healing aspect.
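Instead of watching the dashboard, the move from one availability zone to the other could be confirmed by polling. The sketch below assumes a boto3-style `rds` client is passed in; the function names, timeout, and polling interval are illustrative choices. The elapsed time it reports is only as precise as the polling interval, so it will not match the instance log exactly.

```python
import time


def current_az(rds, db_instance_id):
    """Return the availability zone the DB instance currently reports."""
    response = rds.describe_db_instances(DBInstanceIdentifier=db_instance_id)
    return response["DBInstances"][0]["AvailabilityZone"]


def wait_for_failover(rds, db_instance_id, original_az, timeout=300,
                      interval=5, sleep=time.sleep, clock=time.monotonic):
    """Poll until the instance reports a different AZ than `original_az`.

    Returns (new_az, seconds_waited). Raises TimeoutError if the AZ has
    not changed once `timeout` seconds have passed.
    """
    start = clock()
    while True:
        az = current_az(rds, db_instance_id)
        if az != original_az:
            return az, clock() - start
        if clock() - start >= timeout:
            raise TimeoutError(f"{db_instance_id} is still in {original_az}")
        sleep(interval)


# Example wiring (needs the third-party boto3 package and AWS credentials;
# "rails-app-db" is a made-up instance name):
#   import boto3
#   rds = boto3.client("rds")
#   before = current_az(rds, "rails-app-db")
#   ... trigger the reboot with failover ...
#   print(wait_for_failover(rds, "rails-app-db", before))
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.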

Before we move on to the next lesson, it's important to understand that there are a variety of tools available to help us test our architectures. An open source tool called Chaos Monkey was built specifically for AWS by Netflix as part of its Simian Army tools. When launched, Chaos Monkey will randomly shut down EC2 instances that belong to an auto scaling group. It can run on a schedule or run continuously, striking at random times. The tool is aimed at ensuring an architecture is capable of running under adverse conditions that might disrupt unprepared services and applications. In our next lesson, we will add one more layer to our architecture, using CloudFront for dynamic content, to overcome the small disruption in the user experience that we saw during the RDS interruption.
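To illustrate the idea of striking at random times within a schedule, here is a small sketch of the decision such a tool might make on each run. It is not Chaos Monkey's code or configuration: the function name, the weekday-only rule, and the 9:00 to 15:00 window are assumptions chosen for the example.

```python
import random
from datetime import datetime


def should_strike(now, probability, rng, start_hour=9, end_hour=15):
    """Decide whether to terminate an instance on this run.

    Strikes only on weekdays inside [start_hour, end_hour), and then only
    with the given probability, so failures are injected while people are
    around to respond.
    """
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    in_window = start_hour <= now.hour < end_hour
    return is_weekday and in_window and rng.random() < probability


# Example: a job that runs hourly and strikes on roughly one run in six.
rng = random.Random()
if should_strike(datetime.now(), probability=1 / 6, rng=rng):
    print("Would terminate a random instance in the group now")
else:
    print("No failure injected on this run")
```

Limiting the window to working hours is a design choice: the point of the exercise is to surface weaknesses while the team can observe and fix them.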

About the Author
Kevin Felichko
Solutions Architect

Kevin is a seasoned technologist with 15+ years of experience, mostly in software development. Recently, he has led several migrations from traditional data centers to AWS, resulting in over $100K a year in savings. His new projects take advantage of cloud computing from the start, which enables a faster time to market.

He enjoys sharing his experience and knowledge with others while constantly learning new things. He has been building elegant, high-performing software across many industries since high school. He currently writes apps in Node.js and iOS apps in Objective-C, and designs complex architectures for AWS deployments.

Kevin currently serves as Chief Technology Officer for PropertyRoom.com, where he leads a small, agile team.