Designing for Failure
Designing for high availability, fault tolerance and cost efficiency
This section of the Solutions Architect Associate learning path introduces you to the high availability concepts and services relevant to the SAA-C03 exam. By the end of this section, you will be familiar with the design options available and know how to select and apply AWS services to meet specific availability scenarios relevant to the Solutions Architect Associate exam.
- Learn the fundamentals of high availability, fault tolerance, and backup and disaster recovery
- Understand how a variety of Amazon services such as S3, Snowball, and Storage Gateway can be used for backup purposes
- Learn how to implement high availability practices in Amazon RDS, Amazon Aurora, and DynamoDB
Designing for failure is one of the most common architectural principles that we should all follow when building and deploying solutions on AWS. The phrase ‘Everything fails all of the time’ is a quote from Dr. Werner Vogels, VP and CTO of Amazon, and he has repeated it often enough to reinforce its importance as a design principle. The mindset is to accept that failure will happen at some point within your architecture when running in the cloud, and to be prepared for it.
We don’t know when, how or why something might fail, but we can prepare for failure and put steps in place to ensure we recover from it quickly and effectively. In effect, we design for failure. So how do we do this? Let’s start with a simple scenario and see how we can improve it using the AWS global infrastructure as we go through different stages of design, all to limit the impact of failure.
Let’s assume you have created a website which talks to a database on the back-end. This needs to be deployed in AWS, and so you create a new VPC in a single region, with a public subnet and a private subnet within the same availability zone. You provision a single RDS database in your private subnet, and a single EC2 instance in your public subnet acting as your web server. A very simple configuration, but it’s peppered with issues from a ‘design for failure’ perspective.
If an incident affected the availability of that single Availability Zone, us-east-1a in this example, then your entire infrastructure would be impacted and inaccessible. Design for failure has clearly not been taken into account here!
Firstly, when deploying a new solution I would always suggest you use as much of the geographical infrastructure as possible to help you implement levels of high availability and remove single points of failure from your architectural designs. This starts with having a grasp of the AWS global infrastructure, primarily focusing on Availability Zones (AZs) and Regions.
Availability Zones are where the physical data centers of AWS are found. This is where the actual compute, storage, network, and database resources are hosted that we as consumers provision virtually within our Virtual Private Clouds (VPCs). A common misconception is that a single Availability Zone is equal to a single data center. This is not the case: multiple data centers located close together can form a single Availability Zone.
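To make the problem concrete, here is a minimal sketch that checks a design for tiers confined to a single AZ and works out how much of it an AZ outage would take down. The inventory format and the resource names (`web-1`, `db-1`) are assumptions made for illustration; they are not part of any AWS API.

```python
from collections import defaultdict

# Hypothetical inventory of the starting design: everything sits in us-east-1a.
SINGLE_AZ_DESIGN = [
    {"name": "web-1", "tier": "web", "az": "us-east-1a"},
    {"name": "db-1", "tier": "database", "az": "us-east-1a"},
]


def single_az_tiers(resources):
    """Return the tiers whose resources all live in one AZ (single points of failure)."""
    azs_by_tier = defaultdict(set)
    for resource in resources:
        azs_by_tier[resource["tier"]].add(resource["az"])
    return sorted(tier for tier, azs in azs_by_tier.items() if len(azs) < 2)


def fraction_lost(resources, failed_az):
    """Return the share of resources that become unavailable if failed_az goes down."""
    lost = [r for r in resources if r["az"] == failed_az]
    return len(lost) / len(resources)


print(single_az_tiers(SINGLE_AZ_DESIGN))              # ['database', 'web']
print(fraction_lost(SINGLE_AZ_DESIGN, "us-east-1a"))  # 1.0
```

Both tiers are flagged, and losing us-east-1a removes 100% of the resources, which is exactly the weakness described above.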
Each AZ will always have at least two other AZs that are geographically located within the same area, usually a city, linked by highly resilient and very low latency private fiber optic connections. However, each AZ is isolated from the others using separate power and network connectivity, which minimizes the impact on other AZs should a single AZ fail. These low latency links between AZs are used by many AWS services to replicate data for high availability and resilience purposes. Often, there are three, four, five, or more AZs linked together via these low latency connections. This localized geographical grouping of multiple AZs is defined as an AWS Region.
So a Region is a collection of Availability Zones that are geographically located close to one another. This generally means AZs within the same city. AWS has deployed Regions across the globe to allow its worldwide customer base to take advantage of low latency connections. Every Region acts independently of the others, and each contains at least three Availability Zones.
So with this flexibility of being able to deploy your resources across different geographical locations, it makes logical sense to adopt as many of these as business and operational requirements dictate. If your solution or service doesn’t need to span across multiple regions, then you should certainly consider the option of deploying your resources across multiple availability zones.
So let’s circle back to the basic design in our scenario to see how we can improve it. In our original design our VPC was using a single AZ for both subnets. This means that if a disaster occurred which impacted the AZ we are using in our VPC, it would impact 100% of our resources, making them inaccessible. So, let’s re-architect this using multiple AZs.
In this newly designed VPC we are now using more than one AZ for both the private and public subnets: us-east-1a and us-east-1b. By doing so we have also deployed additional resources. An additional web server has been provisioned in the public subnet in us-east-1b, along with an Application Load Balancer to manage the traffic destined for the target group of web servers. Application Load Balancers provide a flexible feature set, including advanced routing and visibility features aimed at application architectures, and are commonly used to balance HTTP or HTTPS traffic. To learn more about load balancers, please see our existing course on the topic.
You will also notice that we have introduced a standby RDS instance through the use of a Multi-AZ RDS deployment. When Multi-AZ is configured, a secondary RDS instance, known as a standby replica, is deployed in a different Availability Zone within the same Region as the primary instance. Its sole purpose is to provide a failover option should the primary RDS instance become unavailable. To learn more about using RDS in a Multi-AZ configuration, please refer to our existing course on the topic. With these small configuration changes to make use of an additional AZ, we have added high availability to our design in case of an AZ failure.
Let’s suppose a natural disaster occurred and us-east-1a became unavailable for a period of time. What would happen? You would lose half of your infrastructure, including one EC2 web server and your primary RDS instance, but your solution would still be operational because you planned for such an event. The Application Load Balancer would continue to receive traffic and deliver it to the remaining web server in the target group in us-east-1b. Similarly, the failure of the primary RDS instance would trigger an automated failover that promotes the standby instance to be the new primary RDS database, which would continue to serve both write and read requests. This is a very high level example of ‘designing for failure’. Having the foresight to implement a solution that recovers from a major outage has ensured you remain operational and continue to serve your customers.
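As a rough illustration of how small this change is, the sketch below builds the request for a Multi-AZ RDS instance. The parameter names are those of the RDS `CreateDBInstance` API as exposed by boto3; the engine, instance class, storage size, username and identifier are placeholder assumptions, and `rds_client` is assumed to be a `boto3.client("rds")` object created elsewhere.

```python
def multi_az_db_params(identifier):
    """Build keyword arguments for an RDS instance with a standby in another AZ."""
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": "mysql",                 # assumed engine
        "DBInstanceClass": "db.t3.micro",  # assumed instance size
        "AllocatedStorage": 20,            # GiB, assumed
        "MasterUsername": "admin",         # placeholder username
        "ManageMasterUserPassword": True,  # let RDS store the password in Secrets Manager
        "MultiAZ": True,                   # the setting that adds the standby
    }


def create_multi_az_db(rds_client, identifier):
    """Send the request; rds_client is assumed to be boto3.client("rds", ...)."""
    return rds_client.create_db_instance(**multi_az_db_params(identifier))


print(multi_az_db_params("website-db")["MultiAZ"])  # True
```

A real deployment would normally also pass a DB subnet group covering the private subnets in both AZs; that is left out here to keep the sketch short.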
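The outage walk-through above can be modelled in a few lines. This is a simplified simulation under stated assumptions, not how the load balancer or RDS are implemented: the `WebServer` and `Database` classes and the server names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WebServer:
    name: str
    az: str


@dataclass
class Database:
    primary_az: str
    standby_az: str


def healthy_targets(servers, failed_az):
    """Targets the load balancer can still route to after an AZ outage."""
    return [s for s in servers if s.az != failed_az]


def fail_over(db, failed_az):
    """Promote the standby if the primary's AZ is down; otherwise change nothing."""
    if db.primary_az == failed_az:
        # Simplification: the roles just swap; rebuilding a new standby is not modelled.
        return Database(primary_az=db.standby_az, standby_az=db.primary_az)
    return db


servers = [WebServer("web-1", "us-east-1a"), WebServer("web-2", "us-east-1b")]
db = Database(primary_az="us-east-1a", standby_az="us-east-1b")

remaining = healthy_targets(servers, "us-east-1a")
db_after = fail_over(db, "us-east-1a")
print([s.name for s in remaining])  # ['web-2']
print(db_after.primary_az)          # us-east-1b
```

Half of the web tier is gone, but one target and a writable database remain, so the site keeps serving requests.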
Now of course failure doesn’t just occur from a global infrastructure perspective. We can experience outages and failures at all different levels of a solution. For example, you might experience an incident with a single EC2 instance. Perhaps, in our scenario, your website started to gain more attention and the traffic load increased until even two EC2 instances, one in each AZ, could not meet the demand. This drain on your EC2 resources led to poor performance and eventually an outage, which is similar to the effect of a denial of service attack. To rectify this, we could add further features to help with high availability, such as Auto Scaling groups for our EC2 instances, to ensure we maintain enough capacity to serve traffic requests and to recover from instance failures.
However, in this course I want to maintain focus on the global infrastructure elements of the design. With this in mind, we need to realize that failures don’t just occur at the AZ level. They can occur at the Regional level too, creating a much wider problem! If there was a regional failure of some sort, for example a service disruption to both EC2 and RDS across the us-east-1 Region, then this current design doesn’t provide enough flexibility to recover. You might be thinking, ‘yeah but, how likely is that going to happen?’ In reality, the answer is not very likely at all, but it can happen and it HAS happened. AWS published a summary report covering the Amazon EC2 and Amazon RDS Service Disruption in the US East Region, which happened a number of years ago.
Obviously, architecting for regional failover carries a lot more cost, management overhead and design complexity, as you may need to duplicate many of your resources, maintain data synchronization and data transfer, and deal with potential latency issues too. So when deciding how far to take ‘design for failure’ at the global infrastructure level, you need to determine the business impact of your solution becoming unavailable. This includes understanding the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) of your solution: the RTO is the maximum acceptable time to restore service after an outage, and the RPO is the maximum acceptable amount of data loss, measured in time.
For example, would it cost your business tens of thousands of dollars if you incurred downtime for just a few seconds, effectively giving you a very short RTO? If so, then operating in a single Region might not be a great idea, and you will likely want to architect a solution with regional failover capabilities. However, if you had a much longer RTO and could tolerate the downtime caused by a regional failure, then a multi-AZ design within a single Region might be a better solution and meet the needs of your business more cost-effectively. It’s a case of weighing up the cost, benefit and risk of your deployment design.
In our simple scenario of a web layer run by EC2 instances and a database layer using RDS, there are a couple of points we would need to consider when deploying a multi-Region solution for DR purposes. One is the design consideration of using Amazon Route 53 and Amazon CloudFront, both of which would sit in front of our Application Load Balancers.
From a database perspective you would also need to consider data replication, and how you implement it would depend on your RTO and RPO. If these values are longer rather than shorter, then one of the most cost-effective methods of managing this replication is a snapshot and restore approach. In this instance, you would perform the following steps.
Firstly, you would implement a schedule for creating snapshots of your primary RDS database. This can be automated using other services such as AWS Lambda and Amazon EventBridge (formerly CloudWatch Events).
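To give a feel for how an Auto Scaling group could react to the load described here, the sketch below sizes the group in proportion to how far a metric is from its target. This is a simplified proportional rule written for illustration, not the exact algorithm AWS uses, and the CPU figures and group limits are assumptions.

```python
import math


def desired_capacity(current_instances, current_cpu, target_cpu, min_size, max_size):
    """Scale the instance count in proportion to load, within the group's limits."""
    wanted = math.ceil(current_instances * current_cpu / target_cpu)
    return max(min_size, min(max_size, wanted))


# Two instances running at 90% CPU against a 50% target.
print(desired_capacity(2, 90.0, 50.0, min_size=2, max_size=6))  # 4
```

Keeping the minimum size at two, with the group spread over both AZs, means a failed instance is replaced and the design still survives the loss of one AZ.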
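One way to weigh up cost against risk is to compare the expected yearly cost of each design: what it costs to run plus what its expected downtime costs the business. The sketch below does that; every figure in it is a made-up assumption for illustration, not AWS pricing or a measured outage rate.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    annual_run_cost: float        # extra yearly cost of running this design
    expected_outage_hours: float  # expected downtime per year with this design


def expected_annual_cost(option, downtime_cost_per_hour):
    """Running cost plus the expected cost of downtime over a year."""
    return option.annual_run_cost + option.expected_outage_hours * downtime_cost_per_hour


def cheapest(options, downtime_cost_per_hour):
    """Pick the option with the lowest expected annual cost."""
    return min(options, key=lambda o: expected_annual_cost(o, downtime_cost_per_hour))


# Hypothetical numbers only.
options = [
    Option("multi-AZ, single Region", annual_run_cost=20_000, expected_outage_hours=4.0),
    Option("multi-Region failover", annual_run_cost=60_000, expected_outage_hours=0.5),
]

print(cheapest(options, downtime_cost_per_hour=1_000).name)   # multi-AZ, single Region
print(cheapest(options, downtime_cost_per_hour=50_000).name)  # multi-Region failover
```

With cheap downtime the single-Region design wins (24,000 against 60,500), while at 50,000 per hour the multi-Region design is far cheaper overall (85,000 against 220,000).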
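As an illustration of the role DNS plays here, the sketch below mimics a failover routing decision: answer with the primary Region’s load balancer while its health check passes, otherwise answer with the DR Region’s. It is a toy model rather than Route 53 itself, and both host names and the choice of us-west-2 as the DR Region are hypothetical.

```python
# Hypothetical load balancer host names for the two Regions.
PRIMARY_ENDPOINT = "alb-primary.us-east-1.example.com"
DR_ENDPOINT = "alb-dr.us-west-2.example.com"


def resolve(primary_healthy):
    """Return the endpoint a failover DNS policy would hand out to clients."""
    return PRIMARY_ENDPOINT if primary_healthy else DR_ENDPOINT


print(resolve(primary_healthy=True))   # alb-primary.us-east-1.example.com
print(resolve(primary_healthy=False))  # alb-dr.us-west-2.example.com
```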
Next, you would need to copy these snapshots to your designated DR Region.
In the event of a disaster, you would be able to get operational again in the new Region by restoring a database from the most recent copied snapshot. However, as I say, this is only appropriate if your business RTO and RPO allow for it, as copying these snapshots could take a number of hours.
Should you have a much shorter RTO and RPO, then this approach might not be feasible and you would need to adopt a different one. In this case, it is recommended that you use the AWS Database Migration Service (AWS DMS). To learn more about this service, please see our existing course on the topic.
If you’d like to learn how to carry out continuous, ongoing replication of your RDS database, the AWS documentation covers this in detail.
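The snapshot, copy and restore steps could be sketched as follows. The method names are boto3 RDS client calls (`create_db_snapshot`, `copy_db_snapshot`, `restore_db_instance_from_db_snapshot`, and the `db_snapshot_available` waiter). The function names, identifiers and account ID are hypothetical, and `source_rds` and `dr_rds` are assumed to be `boto3.client("rds")` objects for the source Region and the DR Region.

```python
def snapshot_arn(region, account_id, snapshot_id):
    """ARN of a manual RDS snapshot, needed when copying across Regions."""
    return f"arn:aws:rds:{region}:{account_id}:snapshot:{snapshot_id}"


def snapshot_and_copy(source_rds, dr_rds, *, db_instance_id, snapshot_id,
                      source_region, account_id):
    """Create a manual snapshot in the source Region and copy it to the DR Region."""
    # Step 1: take a manual snapshot of the primary database.
    source_rds.create_db_snapshot(
        DBInstanceIdentifier=db_instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )
    # The snapshot has to be available before it can be copied.
    source_rds.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier=snapshot_id
    )
    # Step 2: the copy is requested in the DR Region and points back at the source.
    dr_rds.copy_db_snapshot(
        SourceDBSnapshotIdentifier=snapshot_arn(source_region, account_id, snapshot_id),
        TargetDBSnapshotIdentifier=snapshot_id,
        SourceRegion=source_region,
    )


def restore_in_dr(dr_rds, *, snapshot_id, new_db_instance_id):
    """During a disaster, build a new database in the DR Region from the copy."""
    dr_rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=new_db_instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )
```

Run on a schedule, for example from a Lambda function, `snapshot_and_copy` keeps a recent snapshot in the DR Region. Encrypted snapshots would also need a KMS key in the destination Region, which this sketch does not handle.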
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.
Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.