Designing for Failure
Designing for high availability, fault tolerance and cost efficiency
This section of the Solution Architect Associate learning path introduces you to the High Availability concepts and services relevant to the SAA-C03 exam. By the end of this section, you will be familiar with the design options available and know how to select and apply AWS services to meet specific availability scenarios relevant to the Solution Architect Associate exam.
- Learn the fundamentals of high availability, fault tolerance, and backup and disaster recovery
- Understand how a variety of Amazon services such as S3, Snowball, and Storage Gateway can be used for backup purposes
- Learn how to implement high availability practices in Amazon RDS, Amazon Aurora, and DynamoDB
A company typically decides on an acceptable business continuity plan based on the financial impact to the business when systems are unavailable. So the company determines the financial impact by considering many factors, such as the loss of business and the damage to its reputation due to downtime or the lack of systems availability.
Now, the common metrics for business continuity are the Recovery Time Objective and the Recovery Point Objective. So let's just delve into these two concepts.
So the Recovery Time Objective, or RTO, is the time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement. So for example, if a disaster occurs at 12:00 p.m. (noon) and the RTO is eight hours, then the disaster recovery process should restore the business process to the acceptable service level by 8:00 p.m.
Now the Recovery Point Objective, or RPO, is the acceptable amount of data loss measured in time. So just to confuse everyone, it is also a time value, but the two are quite different concepts. So for example, if that disaster occurred at 12:00 p.m. (noon) and the RPO is one hour, the system should recover all data that was in the system before 11:00 a.m. So the data loss will span only the one hour between 11:00 a.m. and 12:00 p.m.
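To make the two examples concrete, here is a minimal Python sketch of the arithmetic. The function names and the calendar date are assumptions made up for illustration; only the noon disaster, the eight-hour RTO, and the one-hour RPO come from the examples above.

```python
from datetime import datetime, timedelta


def recovery_deadline(disaster_time: datetime, rto: timedelta) -> datetime:
    """Latest time by which the business process should be restored."""
    return disaster_time + rto


def oldest_acceptable_recovery_point(disaster_time: datetime, rpo: timedelta) -> datetime:
    """Data written before this time must still be recoverable."""
    return disaster_time - rpo


# The date is arbitrary (an assumption); only the time of day matters here.
disaster = datetime(2024, 1, 1, 12, 0)

# RTO of eight hours: service should be back by 8:00 p.m.
print(recovery_deadline(disaster, timedelta(hours=8)))

# RPO of one hour: everything written before 11:00 a.m. should survive.
print(oldest_acceptable_recovery_point(disaster, timedelta(hours=1)))
```

Notice that the RTO looks forward from the disaster and the RPO looks backward from it, which is one way to keep the two apart.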
So they're quite different, aren't they? The Recovery Point Objective asks how far back the last recoverable point in our data can be, that is, how much data loss we can absorb if there is an outage. For a highly transactional business, that's going to be extremely low. I mean, even an hour of data loss, if you're dealing with customer transactions, is not gonna be acceptable to a transactional business.
And that's gonna impact how we design our systems to be as highly available and as fault tolerant as possible for a transactional business like that. Another scenario might be that the business can absorb some data loss, but it does need to have the systems up and running again as soon as possible. So the RTO might be the priority. Part of your business continuity planning needs to be defining what the priority is: is it the Recovery Time Objective, i.e. how quickly we can get the system back up and running again so it can answer queries and requests and be fully functional, or is it the Recovery Point Objective, where we must be able to go back to the last possible point in time without any data loss?
So there are a number of different scenarios that we can apply in AWS to help meet our RPOs and RTOs. The first one is what we call backup and restore. With backup and restore, data is stored as a virtual tape library using AWS Storage Gateway or another network appliance of a similar nature.
We can use AWS Import/Export to shift large archives when setting up archives for a backup and restore scenario. Then, in the event of a disaster, archives are recovered from Amazon S3 and restored as if we were using a virtual tape.
Now we need to select the appropriate tools and methods to back up our data into AWS. There are three things to keep in mind. First, ensure that you have an appropriate retention policy for this data. So how long are we going to keep these virtual tape archives for? Is it six months? Is it a year? Is it five years? What are the commercial and compliance requirements?
The second is to ensure that the appropriate security measures are in place for this data, including encryption and access policies. So can we guarantee that where it's being stored is gonna be suitably secure? And third, we need to make sure that we regularly test the recovery of the data and the restoration of the system.
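As a rough illustration of the retention question, here is a small Python sketch that works out which archives have aged past a retention period. The archive names, dates, and the 180-day figure are all hypothetical, and this is plain date arithmetic rather than any AWS API.

```python
from datetime import date, timedelta


def expired_archives(archives: dict[str, date], retention_days: int, today: date) -> list[str]:
    """Return the names of archives created before the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(name for name, created in archives.items() if created < cutoff)


# Hypothetical archive catalogue: name -> creation date.
catalogue = {
    "backup-2023-01": date(2023, 1, 1),
    "backup-2023-12": date(2023, 12, 1),
}

# With roughly six months of retention, only the January archive is past its date.
print(expired_archives(catalogue, retention_days=180, today=date(2024, 1, 1)))
```

In practice the retention period would come from your commercial and compliance requirements, not from a number picked in code.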
Alright, so the second potential design is what we call pilot light. In pilot light, data is mirrored and the environment is scripted as a template, which can be built out and scaled in the unlikely event of a disaster. There are a few steps that we need to go through to make pilot light work.
First, we set up Amazon EC2 instances to replicate or mirror our data. Second, we ensure that we have all supporting custom software packages available in AWS. That can be quite an operational overhead, to ensure that we have all of the latest custom software packages that we need for our environment available in AWS. Third, we need to create and maintain Amazon Machine Images of key servers where fast recovery is required. Fourth, we need to regularly run these servers, test them, and apply any software updates and configuration changes to ensure that they match what our production environment currently is in the event of a disaster. And fifth, we need to consider automating the provisioning of AWS services as much as possible with AWS CloudFormation.
So here's what that looks like in our recovery phase. In the unlikely event of a disaster, to recover the remainder of our environment around our pilot light, we can start our systems from the Amazon Machine Images on the appropriate instance types. And for our dynamic data servers, we can resize them to handle production volumes as needed or add capacity accordingly.
So basically horizontal scaling is often the most cost effective and scalable approach to add capacity to the pilot light system. As an example, we can add more web servers at peak times during the day. However, we can also choose larger Amazon EC2 instance types and thus scale vertically for applications such as our relational databases and file storage, for example. And any required DNS updates can be done in parallel.
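Here is a minimal sketch of the horizontal scaling idea: working out how many web servers a given load needs. The request rates and per-instance capacity are made-up numbers, and `instances_needed` is a hypothetical helper, not an AWS Auto Scaling call.

```python
import math


def instances_needed(requests_per_second: float, capacity_per_instance: float, minimum: int = 1) -> int:
    """Number of identical instances needed to serve the load, never below the minimum."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    return max(minimum, math.ceil(requests_per_second / capacity_per_instance))


# Assumed figures: each web server handles 100 requests per second.
print(instances_needed(50, 100))    # quiet period: the pilot light minimum is enough
print(instances_needed(950, 100))   # peak period: scale out to ten servers
```

Scaling vertically would instead mean raising the capacity of each instance, which is the approach mentioned above for relational databases and file storage.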
Okay, so the third scenario we can implement is what we call warm standby. A warm standby is, as the name says, essentially ready to go, with all key services running in the most minimal possible way. Here are our key steps for preparation.
So, first we set up our Amazon EC2 instances to replicate or mirror data. Second, we create and maintain Amazon Machine Images as required. Third, we run our application using a minimum footprint of Amazon EC2 instances or AWS infrastructure, so it's basically the bare minimum that we can get by with. And fourth, we patch and update software and configuration files in line with our live environment. So we're essentially running a smaller version of our full production environment.
Then during our recovery phase, in the case of failure of the production system, the standby environment will be scaled up for production load. And the DNS records will be changed to route all traffic to the AWS environment.
Now our fourth potential scenario is what we call multi-site. With multi-site, we set up our AWS environment to duplicate our production environment. So essentially we've got a mirror of our production running in AWS. First, we set up DNS weighting, or a similar traffic routing technology if we're not using Route 53, to distribute incoming requests to both sites. We also configure automated failover to reroute traffic away from the affected site in the event of an outage.
Now in our recovery phase, traffic is cut over to the AWS infrastructure by updating the DNS record in Route 53, and all traffic and supporting data queries are supported by the AWS infrastructure. Our multi-site scenario is usually the preferred one where time is a priority. If Recovery Time and Recovery Point are our priorities and cost is not the main constraint, then this would be the ideal scenario.
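To picture how weighted routing with failover behaves, here is a small Python simulation. It is only a sketch of the idea: the site names and weights are assumptions, and `pick_site` is a hypothetical function, not the Route 53 API.

```python
import random


def pick_site(weights: dict[str, int], healthy: set[str], rng: random.Random) -> str:
    """Pick a site in proportion to its weight, skipping sites that are unhealthy."""
    candidates = {site: w for site, w in weights.items() if site in healthy and w > 0}
    if not candidates:
        raise RuntimeError("no healthy site available")
    sites = list(candidates)
    return rng.choices(sites, weights=[candidates[s] for s in sites], k=1)[0]


# Assumed split: 70% of requests to the on-premises site, 30% to AWS.
weights = {"on_prem": 70, "aws": 30}
rng = random.Random(42)

# Normal operation: both sites receive traffic.
normal = [pick_site(weights, {"on_prem", "aws"}, rng) for _ in range(1000)]
print(normal.count("on_prem"), normal.count("aws"))

# Outage at the on-premises site: every request is rerouted to AWS.
outage = [pick_site(weights, {"aws"}, rng) for _ in range(1000)]
print(outage.count("aws"))
```

In a real deployment the health information would come from health checks rather than a hard-coded set.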
Okay, so one key thing when we're running any of our scenarios is to ensure that we test the recovery of our data. Once we've restored our primary site to a working state, we then need to restore normal service, which is often referred to as a failback process. Depending on your DR strategy, this typically means reversing the flow of data replication so that any data updates received while the primary site was down can be replicated back without loss of data.
Here are the steps for backup and restore. First, we freeze the data changes on the DR site. Second, we take a backup. Third, we restore the backup to the primary site. Fourth, we repoint users to the primary site. And fifth, we unfreeze the changes. With pilot light, warm standby, and multi-site, first we establish reverse mirroring and replication from the DR site back to the primary site. Second, once the primary site has caught up with the changes, we freeze data changes to the DR site. Third, we repoint users to the primary site. And then finally we unfreeze the changes.
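The failback order for pilot light, warm standby, and multi-site can be sketched as a toy simulation. The `Site` class and `fail_back` function are hypothetical, and the sketch assumes the primary's records are a prefix of the DR site's records, so "catching up" just means copying the tail.

```python
class Site:
    """A toy data site that holds an ordered list of records."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: list[str] = []
        self.frozen = False

    def write(self, record: str) -> None:
        if self.frozen:
            raise RuntimeError(f"{self.name} is frozen")
        self.records.append(record)


def fail_back(primary: Site, dr: Site) -> Site:
    """Fail back from the DR site and return the site users should now use."""
    # Step 1: reverse replication, so the primary catches up with the DR site.
    primary.records.extend(dr.records[len(primary.records):])
    # Step 2: freeze data changes on the DR site once the primary has caught up.
    dr.frozen = True
    # Step 3: repoint users to the primary site.
    active = primary
    # Step 4: unfreeze the changes.
    dr.frozen = False
    return active


primary, dr = Site("primary"), Site("dr")
primary.records = ["order-1", "order-2"]             # state when the primary went down
dr.records = ["order-1", "order-2", "order-3"]       # the DR site kept taking writes
active = fail_back(primary, dr)
print(active.name, primary.records)
```

The point of freezing before repointing is that no update can land on the DR site after the final catch-up and be left behind.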
Now most of those scenarios involve some sort of replication of data, so let's just talk through some of the considerations on that. When you replicate data to a remote location, you really need to think through a number of factors. First, the distance between the sites: larger distances are typically subject to more latency or jitter.
What is the available bandwidth? The breadth and variability of the interconnections is going to be important. If that bandwidth doesn't support high burst activity, then it's not gonna suit some replication models. What is the data rate required by your applications? The data rate should be lower than the available bandwidth. And what is the replication technology that you plan to use? The replication technology should be parallel so that it can use the network effectively.
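The data-rate check is simple arithmetic, and a short Python sketch can show it. The link speed, change rate, data volume, and the 20% headroom reserved for bursts are all assumed figures, and both function names are hypothetical.

```python
def can_sustain_replication(change_rate_mbps: float, bandwidth_mbps: float, headroom: float = 0.2) -> bool:
    """True if the data rate fits in the bandwidth left after reserving headroom for bursts."""
    usable = bandwidth_mbps * (1 - headroom)
    return change_rate_mbps < usable


def hours_to_transfer(data_gb: float, bandwidth_mbps: float) -> float:
    """Hours needed to move data_gb gigabytes over a bandwidth_mbps megabit-per-second link."""
    megabits = data_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    return megabits / bandwidth_mbps / 3600


# Assumed link: 100 Mbps between the two sites.
print(can_sustain_replication(change_rate_mbps=50, bandwidth_mbps=100))   # fits with headroom
print(can_sustain_replication(change_rate_mbps=90, bandwidth_mbps=100))   # too close to the limit
print(hours_to_transfer(data_gb=450, bandwidth_mbps=100))                 # time for an initial 450 GB copy
```

The transfer-time figure assumes the link is fully and steadily used, which latency and jitter over longer distances will usually prevent.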
So let's just look through a couple of the replication options we have, and these can be a bit confusing. So let's just take this step by step. Okay, there's two types of replication, synchronous replication and asynchronous replication. These two can be very confusing when you're sitting in an exam trying to remember which one is which so let's just step through this and hope to give you some tips for how to remember it.
With synchronous replication, data is atomically updated in multiple locations. So this puts a dependency on network performance and availability. When deployed in Multi-AZ mode, Amazon RDS uses synchronous replication to duplicate data to a second Availability Zone. This ensures that data is not lost if the primary Availability Zone becomes unavailable.
Now, the other type of replication is asynchronous replication. With asynchronous replication, data is not atomically updated in multiple locations. It is transferred as network performance and availability allow, and the application continues to write data that might not be fully replicated yet. Many database systems support asynchronous data replication. The database replica can be located remotely, and the replica does not have to be completely synchronized with the primary database server. That's acceptable in many scenarios, for example, as a backup source or for reporting and read-only use cases.
In addition to database systems, you can also extend asynchronous replication to network file systems and data volumes. Right, very good. Let's do a quick summary of the four options we have for disaster recovery so you're well prepped for your exam.
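One way to remember the difference is to look at what is lost if the primary fails right now. The following Python sketch is a toy model under that assumption; the `Primary` class is hypothetical and does not represent how Amazon RDS or any particular database implements replication.

```python
class Primary:
    """A toy primary database with a single replica."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.data: list[str] = []
        self.replica: list[str] = []
        self.pending: list[str] = []  # writes accepted but not yet shipped to the replica

    def write(self, record: str) -> None:
        self.data.append(record)
        if self.synchronous:
            # Synchronous: the write reaches the replica before it is acknowledged.
            self.replica.append(record)
        else:
            # Asynchronous: the write is acknowledged now and shipped later.
            self.pending.append(record)

    def ship_pending(self) -> None:
        """Transfer queued writes to the replica, as the network allows."""
        self.replica.extend(self.pending)
        self.pending.clear()


def records_lost_on_failure(primary: Primary) -> int:
    """Writes that exist on the primary but not yet on the replica."""
    return len(primary.data) - len(primary.replica)


sync_db = Primary(synchronous=True)
for record in ["a", "b", "c"]:
    sync_db.write(record)
print(records_lost_on_failure(sync_db))    # nothing is waiting to be replicated

async_db = Primary(synchronous=False)
for record in ["a", "b", "c"]:
    async_db.write(record)
async_db.ship_pending()
for record in ["d", "e"]:
    async_db.write(record)
print(records_lost_on_failure(async_db))   # the two most recent writes are not on the replica yet
```

The trade-off is that the synchronous write has to wait on the network for every record, which is the dependency on network performance mentioned above.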
So backup and restore, like using AWS as a virtual tape library, is likely to have the highest Recovery Time Objective because we need to factor in the time it would take us to access or download backup archives. Our Recovery Point Objective is most likely to be quite high as well, because if we're only doing daily backups, it could be up to 24 hours.
With pilot light we've got that minimal version of our environment running on AWS, which can be expanded to full-size when needed. We've got potentially a lower Recovery Time Objective than we would for backup and restore, but we need to factor in that we still may need to install applications or patches onto our AMIs before we have a fully operational system.
Our Recovery Point Objective is going to be since the last snapshot, so it's going to be reasonably low. For warm standby, we've got that scaled-down version of a fully functional environment always running. So our Recovery Time Objective is likely to be lower than pilot light, as some of our services are always running.
Our Recovery Point Objective is ideally going to be quite low, since it will be since our last data write if it's a master-slave multi-AZ database. Even if it's asynchronous only, it's still going to give us quite a good Recovery Point Objective. And the benefit of having a warm standby environment is that we can actually use it for dev/test, for one-off projects, for skunkworks, etc. Multi-site is that fully operational version of our environment running off-site or in another region. It's likely to give us our lowest Recovery Time Objective: if we're using active/active failover, it could be a matter of seconds. Our Recovery Point Objective, likewise, depends on our choice of data replication, but it's gonna be since our last asynchronous or synchronous DB write. And using Route 53 for active/active failover, it's gonna give us a very aggressive, short Recovery Point Objective and Recovery Time Objective.
The consideration with that is that the cost is going to be proportionately higher than the other three options we have, and we need to factor in that there will be some ongoing maintenance required to keep that kind of environment running.
The benefit is that we have a way of regularly testing our DR strategy. We also have a way of doing blue-green deployments, and it gives us a lot more diversity in our IT infrastructure.
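To tie the four options together, here is a minimal Python sketch that picks the cheapest option meeting a required RTO and RPO. The hour figures and cost ranks are illustrative assumptions chosen only to respect the ordering described in this summary (the 24-hour RPO for daily backups is the one figure taken from it); they are not AWS numbers, and `cheapest_strategy` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_hours: float      # assumed typical recovery time
    rpo_hours: float      # assumed typical data-loss window
    relative_cost: int    # 1 = cheapest, 4 = most expensive


# Illustrative figures only: they follow the relative ordering in the summary.
STRATEGIES = [
    Strategy("backup and restore", rto_hours=24.0, rpo_hours=24.0, relative_cost=1),
    Strategy("pilot light", rto_hours=4.0, rpo_hours=1.0, relative_cost=2),
    Strategy("warm standby", rto_hours=1.0, rpo_hours=0.25, relative_cost=3),
    Strategy("multi-site", rto_hours=0.05, rpo_hours=0.01, relative_cost=4),
]


def cheapest_strategy(required_rto_hours: float, required_rpo_hours: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy that meets both objectives, or None if none does."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_hours <= required_rto_hours and s.rpo_hours <= required_rpo_hours
    ]
    return min(candidates, key=lambda s: s.relative_cost) if candidates else None


choice = cheapest_strategy(required_rto_hours=8, required_rpo_hours=1)
print(choice.name if choice else "no option meets these objectives")
```

With real numbers from your own business continuity plan, the same comparison shows why tighter objectives push you toward the more expensive options.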
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.
Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.