Designing for Failure

This course covers the core learning objectives required to meet the 'Designing Network and Data Transfer solutions in AWS - Level 3' skill.

Learning Objectives:

  • Evaluate advanced techniques for detecting failure and assessing service recoverability
  • Create an automated, cost-effective back-up solution that supports business continuity across multiple AWS Regions
  • Evaluate different architectures to provide application and infrastructure availability in the event of an AWS service disruption

Multiple Availability Zones and regions can and will add complexity to your overall cloud networking architecture. However, the benefits outweigh the difficulties of the initial setup. Let's take a look at this architecture, which shows best practices for multi-AZ and multi-region deployments. In this example, you'll implement an Auto Scaling group that spans three AZs and deploy an RDS database with a synchronous standby replica. This is certainly complex. That said, if you look closely at the end result, you have a design with no single point of failure. Your application has multiple redundant servers, your load balancer is spread across multiple nodes, and your database and other services (in this case S3, DynamoDB, and NAT Gateways) are all redundant. This design will survive an event that takes out up to two Availability Zones.
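As a rough sketch, the three-AZ Auto Scaling group described above could be defined with boto3 along these lines. The subnet IDs, launch template name, and target group ARN are hypothetical placeholders, not values from the course:

```python
# Sketch of the three-AZ web tier described above (placeholder names/ARNs).
ASG_PARAMS = {
    "AutoScalingGroupName": "web-tier-asg",
    "LaunchTemplate": {"LaunchTemplateName": "web-tier-template", "Version": "$Latest"},
    "MinSize": 3,
    "MaxSize": 9,
    "DesiredCapacity": 3,
    # One subnet per Availability Zone: losing up to two AZs still leaves
    # at least one subnet with healthy capacity.
    "VPCZoneIdentifier": "subnet-aaa111,subnet-bbb222,subnet-ccc333",
    "TargetGroupARNs": [
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123"
    ],
    # Use the load balancer's health checks, not just EC2 status checks.
    "HealthCheckType": "ELB",
    "HealthCheckGracePeriod": 300,
}

def create_web_tier_asg(region="us-east-1"):
    """Create the Auto Scaling group (requires AWS credentials)."""
    import boto3  # deferred so the sketch can be read without AWS configured
    client = boto3.client("autoscaling", region_name=region)
    return client.create_auto_scaling_group(**ASG_PARAMS)
```

Setting `HealthCheckType` to `ELB` means an instance that fails the load balancer's health check is replaced, not just deregistered.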

The only way to cause a complete outage here would be an entire regional outage or an AWS service-specific outage, the latter being more likely. Keep in mind that cost is a factor to consider. Since we have multiple copies of everything here, we certainly get a benefit, but if this is too expensive for your needs, you may want to consider using backups instead. If cost isn't a factor but high availability is, consider a multi-region deployment. If we look at this diagram and consider the exact same application we had before, we would now be in a great position to survive an entire regional outage. For most use cases, this is overkill. But if you have software with very high availability requirements, say, a national banking web application, you might want to consider this design. In this case, you will use cloud-native multi-regional services such as Route 53 and CloudFront. Using health checks, Route 53 can detect an outage and quickly reroute traffic from your primary region to your secondary.
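The Route 53 failover described above could be sketched as a pair of failover records, one per region. The hosted zone, health check ID, load balancer DNS names, and alias zone IDs below are hypothetical placeholders:

```python
# Sketch: Route 53 DNS failover between two regions for app.example.com.
# All IDs and DNS names are hypothetical placeholders.
ROUTE53_FAILOVER_BATCH = {
    "Comment": "Fail over app.example.com from us-east-1 to us-west-2",
    "Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "A",
                "SetIdentifier": "primary-us-east-1",
                "Failover": "PRIMARY",
                # Route 53 polls this health check; while it fails,
                # queries are answered with the SECONDARY record instead.
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                "AliasTarget": {
                    "HostedZoneId": "ZPRIMARYALBZONE",  # placeholder ALB zone ID
                    "DNSName": "primary-alb.us-east-1.elb.amazonaws.com",
                    "EvaluateTargetHealth": True,
                },
            },
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "A",
                "SetIdentifier": "standby-us-west-2",
                "Failover": "SECONDARY",
                "AliasTarget": {
                    "HostedZoneId": "ZSTANDBYALBZONE",  # placeholder ALB zone ID
                    "DNSName": "standby-alb.us-west-2.elb.amazonaws.com",
                    "EvaluateTargetHealth": True,
                },
            },
        },
    ],
}

def apply_failover_records(hosted_zone_id):
    """Apply the failover record set (requires AWS credentials)."""
    import boto3  # deferred so the sketch can be read without AWS configured
    return boto3.client("route53").change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch=ROUTE53_FAILOVER_BATCH
    )
```

Because DNS answers are cached, keep TTLs low on failover records so clients pick up the secondary region quickly.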

In each region, you'll have your typical software architecture of load balancers, application servers, and a replica of your production database. CloudFront can help with caching static assets and delivering them as close to your customers as possible, so that they don't even notice that your application has switched to a different region. Keep in mind that you can still be affected by AWS service outages. But other than that, this is going to be your top-of-the-line option for disaster recovery. Considering that AWS has a global footprint, where do you deploy your primary servers? If you're using two regions, the logical approach is to choose regions that are separated by a significant distance while remaining in the same jurisdiction: ideally the same country, and if that's not possible, at a minimum the same continent. But once you have chosen your regions, which one is going to be your primary and which one your standby? The answer: find out where most of your customers are located and make that your primary.

Yes, we have services like CloudFront that can move content globally very quickly. That said, you will most likely have dynamic content that needs to be pulled from a database and customized for a specific user. In this case, you would want your database and backend servers to be able to deliver this dynamically generated content very quickly. If you have the budget to support a full active-active deployment, that is, a service that is fully operational in two or more regions, then you can use Route 53's geolocation routing to send users to the location closest to where they are. In the event that an entire region fails, your users may notice a bit of latency as Route 53 routes them over to a different location, but the important thing is that your application is still running. So far, I've mentioned Route 53 as a way to automatically reroute users to a working copy of your application in the event of a failure. But what about failures within a single region?
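An active-active setup with geolocation routing could look like the sketch below: one record per continent plus a default record for everyone else. The endpoint names and health check IDs are hypothetical placeholders:

```python
# Sketch: active-active routing with Route 53 geolocation records.
# Endpoint DNS names and health check IDs are hypothetical placeholders.
GEO_ROUTING_BATCH = {
    "Comment": "Send users to the closest active region",
    "Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "SetIdentifier": "north-america",
                "GeoLocation": {"ContinentCode": "NA"},
                # A health check lets Route 53 skip this region if it's down.
                "HealthCheckId": "aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb",
                "ResourceRecords": [{"Value": "us.app.example.com"}],
            },
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "SetIdentifier": "europe",
                "GeoLocation": {"ContinentCode": "EU"},
                "HealthCheckId": "cccccccc-4444-5555-6666-dddddddddddd",
                "ResourceRecords": [{"Value": "eu.app.example.com"}],
            },
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "SetIdentifier": "default",
                # Catch-all for locations not matched above.
                "GeoLocation": {"CountryCode": "*"},
                "ResourceRecords": [{"Value": "us.app.example.com"}],
            },
        },
    ],
}
```

The default (`CountryCode: "*"`) record matters: without it, users whose location doesn't match any geolocation record get no answer at all.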

If your application runs in only a single region, consider these options. Load balancer health checks: their purpose is to detect any EC2 instances that are not responding and prevent your load balancer from sending traffic to those instances. This can also trigger an Auto Scaling group to spin up new instances, perhaps in a different AZ if the main one isn't available. A load balancer provides efficient health checking, so please consider this option. Multi-AZ RDS databases: RDS is already designed with this feature, so why not take full advantage? All you have to do is specify Multi-AZ during your initial setup, and you'll end up with this architecture: a synchronous standby replica that is ready to take over as your primary as soon as something goes wrong with your current primary database.
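The two single-region safeguards above can be sketched together: a target group health check for the load balancer, and an RDS instance created with Multi-AZ enabled. Names, sizes, and the VPC ID are illustrative assumptions, and the `/healthz` path is a hypothetical application endpoint:

```python
# Sketch of the single-region safeguards: an ELB target group health check
# and a Multi-AZ RDS instance. All names and sizes are illustrative only.
TARGET_GROUP_HEALTH_CHECK = {
    "HealthCheckProtocol": "HTTP",
    "HealthCheckPath": "/healthz",   # hypothetical app health endpoint
    "HealthCheckIntervalSeconds": 15,
    "HealthyThresholdCount": 2,      # passes needed to mark an instance healthy
    "UnhealthyThresholdCount": 3,    # failures before traffic stops flowing to it
}

RDS_MULTI_AZ_PARAMS = {
    "DBInstanceIdentifier": "app-db",
    "Engine": "postgres",
    "DBInstanceClass": "db.m6g.large",
    "AllocatedStorage": 100,
    # MultiAZ=True provisions a synchronous standby in another AZ and
    # fails over to it automatically if the primary has a problem.
    "MultiAZ": True,
    "MasterUsername": "appadmin",
    "ManageMasterUserPassword": True,  # let RDS store the password in Secrets Manager
}

def provision(region="us-east-1"):
    """Create the target group and database (requires AWS credentials)."""
    import boto3  # deferred so the sketch can be read without AWS configured
    elbv2 = boto3.client("elbv2", region_name=region)
    rds = boto3.client("rds", region_name=region)
    elbv2.create_target_group(
        Name="web",
        Protocol="HTTP",
        Port=80,
        VpcId="vpc-0123456789abcdef0",  # placeholder VPC ID
        **TARGET_GROUP_HEALTH_CHECK,
    )
    rds.create_db_instance(**RDS_MULTI_AZ_PARAMS)
```

With these thresholds, an instance is pulled out of rotation after roughly 45 seconds of failed checks, while the Multi-AZ failover happens inside RDS with no application changes needed beyond reconnecting to the same endpoint.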


About the Author
Carlos Rivas
Sr. AWS Content Creator
Learning Paths

Software Development has been my craft for over 2 decades. In recent years, I was introduced to the world of "Infrastructure as Code" and Cloud Computing.
I loved it! -- it re-sparked my interest in staying on the cutting edge of technology.

Colleagues regard me as a mentor and leader in my areas of expertise and also as the person to call when production servers crash and we need the App back online quickly.

My primary skills are:
★ Software Development (Java, PHP, Python, and others)
★ Cloud Computing Design and Implementation
★ DevOps: Continuous Delivery and Integration