Designing Resilient Architectures


How to design high availability and fault tolerant architectures
Designing for disaster recovery / business continuity
2h 21m

Designing Resilient Architectures. In this module, we explore the concepts of business continuity and disaster recovery, the AWS Well-Architected Framework, and the AWS services that, when used together, help us design resilient, fault-tolerant architectures.

We will first introduce the concepts of high availability and fault tolerance and show how we go about designing highly available, fault-tolerant solutions on AWS. We will learn about the AWS Well-Architected Framework and how it can help us make design decisions that deliver the best outcome for end users. Next, we will introduce and explain the concept of business continuity and how AWS services can be used to plan and implement a disaster recovery plan.

We will then learn to recognize and explain the core AWS services that, when used together, can reduce single points of failure and improve scalability in a multi-tier solution. Auto Scaling is a proven way to add resilience by enabling an application to scale up and down to meet demand. In a hands-on lab we create and work with Auto Scaling groups to add elasticity and durability. Amazon Simple Queue Service (SQS) increases resilience by acting as a messaging service between other services and applications, thereby decoupling layers and reducing dependency on state. Amazon CloudWatch is a core component of maintaining a resilient architecture; essentially it is the eyes and ears of your environment, so we next learn to apply the Amazon CloudWatch service in a hands-on environment.

We then learn to apply the Amazon CloudFront CDN service to add resilience to a static website that is served out of Amazon S3. Amazon CloudFront is tightly integrated with other AWS services such as Amazon S3, AWS WAF and Amazon GuardDuty, making Amazon CloudFront an important component in increasing the resilience of your solution.


- [Instructor] A company typically decides on an acceptable business continuity plan based on the financial impact to the business when systems are unavailable. The company determines the financial impact by considering many factors, such as the loss of business and the damage to its reputation due to downtime or the lack of systems availability. Now, the common metrics for business continuity are the recovery time objective and the recovery point objective. Let's just delve into these two concepts.

Recovery time objective, or RTO, is the time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement. For example, if a disaster occurs at 12 o'clock, lunchtime, and the RTO is eight hours, then the disaster recovery process should restore the business process to the acceptable service level by 8:00 p.m.

Now, the recovery point objective, or RPO, is the acceptable amount of data loss measured in time. So, just to confuse everyone, it is also a time value, but the two are quite different concepts. For example, if that disaster occurred at 12 o'clock, around lunchtime, and the RPO is one hour, the system should recover all data that was in the system before 11:00 a.m. So, the data loss will span only one hour, between 11:00 a.m. and 12:00 p.m.

They're quite different, aren't they? The recovery point objective asks, how much recent data can we afford to lose if there is an outage? And for a highly transactional business, that's going to be extremely low. I mean, even having an hour of data loss if you're dealing with customer transactions is not going to be acceptable to a transactional business. And that's going to impact how we design our systems to be as highly available and as fault tolerant as possible for a transactional business like that.

Another scenario might be that the business can absorb some data loss but does need to have the systems up and running again as soon as possible, so the RTO might be the priority. Part of your business continuity planning needs to define what the priority is. Is it the recovery time objective, how quickly we can get the system back up and running again so it can answer queries and requests and be fully functional? Or is it the recovery point objective, that we must be able to go back to the last possible point in time without any data loss?
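The lunchtime example above can be worked through as simple date arithmetic. This is just a small sketch: the function name and the calendar date are made up for illustration, and only the 12 o'clock disaster time, the eight-hour RTO and the one-hour RPO come from the example.

```python
from datetime import datetime, timedelta


def recovery_targets(disaster_time, rto, rpo):
    """Return (restore deadline, oldest acceptable recovery point)."""
    restore_deadline = disaster_time + rto   # service must be back by this time
    recovery_point = disaster_time - rpo     # data written before this must survive
    return restore_deadline, recovery_point


# The calendar date is an arbitrary assumption; only the time of day matters here.
disaster = datetime(2024, 1, 15, 12, 0)
deadline, point = recovery_targets(disaster, timedelta(hours=8), timedelta(hours=1))

print("Restore service by:", deadline.strftime("%H:%M"))          # 20:00
print("Recover all data written before:", point.strftime("%H:%M"))  # 11:00
```

The RTO is added to the disaster time and the RPO is subtracted from it, which is one way to keep the two apart: the RTO looks forward to when service returns, the RPO looks backward to the last data we can still recover.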

So there are a number of different scenarios that we can apply in AWS to help meet our RPOs and RTOs, and the first one is what we call backup and restore. Now, with backup and restore, data is stored as a virtual tape library using AWS Storage Gateway or another network appliance of a similar nature. We can use AWS Import/Export to shift large archives when setting up a backup and restore scenario. Then, in a disaster, archives are recovered from Amazon S3 and restored as if we were using a virtual tape. Now, we need to select the appropriate tools and methods to back up our data into AWS. Three things to keep in mind. First, ensure that you have an appropriate retention policy for this data. So, how long are we going to keep these virtual tape archives for? Is it six months, is it a year, is it five years? What are the commercial and compliance requirements? The second is to ensure that the appropriate security measures are in place for this data, including encryption and access policies. Can we guarantee that where it's been stored is going to be suitably secure? And third, we need to make sure that we regularly test the recovery of the data and the restoration of the system.

The second potential design is what we call pilot light. In pilot light, data is mirrored and the environment is scripted as a template, which can be built out and scaled in the unlikely event of a disaster. There are a few steps that we need to go through to make pilot light work. First, we set up our Amazon EC2 instances to replicate or mirror our data. Second, we ensure that we have all supporting custom software packages available in AWS. That can be quite an operational overhead, to ensure that we have all of the latest and greatest custom software packages that we need for our environment available in AWS. Third, we need to create and maintain Amazon Machine Images (AMIs) of the key servers we use where fast recovery is required. Fourth, we need to regularly run these servers, test them and apply any software updates and configuration changes, to ensure that they match what our production environment currently is in the event of a disaster. And fifth, we need to consider automating the provisioning of AWS services as much as possible with AWS CloudFormation.

What does that look like in our recovery phase? In the unlikely event of a disaster, to recover the remainder of our environment around our pilot light, we can start our systems from the Amazon Machine Images on the appropriate instance types. For our dynamic data servers, we can resize them to handle production volumes as needed, or add capacity accordingly. Basically, horizontal scaling is often the most cost-effective and scalable approach to add capacity to the pilot light system. As an example, we can add more web servers at peak times during the day. However, we can also choose larger Amazon EC2 instance types and thus scale vertically for applications such as our relational databases and file storage, for example. And any required DNS updates can be done in parallel.

The third scenario we can implement is what we call warm standby. Warm standby is, as it says, essentially ready to go, with all key services running in the most minimal possible way. Our key steps for preparation are these. First, we set up our Amazon EC2 instances to replicate or mirror data. Second, we create and maintain Amazon Machine Images as required. Third, we run our application using a minimum footprint of Amazon EC2 instances or AWS infrastructure. It's basically the bare minimum that we can get by with. And fourth, we patch and update software and configuration files in line with our live environment. We're essentially running a smaller version of our full production environment. Then during our recovery phase, in the case of failure of the production system, the standby environment will be scaled up for production load and the DNS records will be changed to route all traffic to the AWS environment.

Now, our fourth potential scenario is what we call multi-site. With multi-site, we set up our AWS environment to duplicate our production environment. Essentially, we've got a mirror of our production running in AWS. Firstly, we set up DNS weighting, or a similar traffic routing technology if we're not using Amazon Route 53, to distribute incoming requests to both sites. We also configure automated failover to reroute traffic away from the affected site in the event of an outage. Now, in our recovery phase, traffic is cut over to the AWS infrastructure by updating the DNS record in Route 53, and all traffic and supporting data queries are supported by the AWS infrastructure.
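To make the idea of DNS weighting with automated failover a little more concrete, here is a small simulation in plain Python. It is only a sketch of the concept, not how Amazon Route 53 is implemented or configured. The site names, the 70/30 weights and the `pick_site` function are all assumptions made up for illustration.

```python
import random


def pick_site(weights, healthy, rng=random):
    """Pick a site in proportion to its weight, skipping unhealthy sites."""
    candidates = {
        site: weight
        for site, weight in weights.items()
        if healthy.get(site, False) and weight > 0
    }
    if not candidates:
        raise RuntimeError("no healthy site available")
    sites = list(candidates)
    return rng.choices(sites, weights=[candidates[s] for s in sites], k=1)[0]


# Hypothetical weights: most traffic to the primary site, the rest to AWS.
weights = {"primary": 70, "aws": 30}
rng = random.Random(0)  # seeded only so the simulation is repeatable

# Normal operation: both sites pass their health checks and share the traffic.
normal = [pick_site(weights, {"primary": True, "aws": True}, rng) for _ in range(1000)]

# Outage: the primary fails its health check, so every request goes to AWS.
outage = [pick_site(weights, {"primary": False, "aws": True}, rng) for _ in range(1000)]

print("share sent to AWS normally:", normal.count("aws") / len(normal))
print("sites used during outage:", set(outage))
```

The point to notice is that the weights never change during the outage. The health check alone takes the failed site out of rotation, which is what lets the failover happen without anyone editing records by hand.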

Now, the multi-site scenario is usually the preferred one where recovery time and recovery point are the priorities and cost is not the main constraint. In that case, it would be the ideal scenario. One key thing when we're running any of our scenarios is to ensure that we test the recovered data. Once we've restored our primary site to a working state, we then need to restore normal service, which is often referred to as a failback process. Depending on your DR strategy, this typically means reversing the flow of data replication so that any data updates received while the primary site was down can be replicated back without loss of data.

Here are the steps for backup and restore. First, we freeze the data changes on the DR site. Second, we take a backup. Third, we restore the backup to the primary site. Fourth, we re-point users to the primary site, and fifth, we unfreeze the changes. With pilot light, warm standby and multi-site, first, we establish reverse mirroring and replication from the DR site back to the primary site, once the primary site has caught up with the changes. Second, we freeze data changes to the DR site, and then third, we re-point users to the primary site. And then finally, we unfreeze the changes.

Now, most of those scenarios involve some sort of replication of data, so let's just talk through some of the considerations on that. When you replicate data to a remote location, you really need to think through a number of factors. First, the distance between the sites. Larger distances typically are subject to more latency or jitter. What is the available bandwidth? The breadth and variability of the interconnections is going to be important. If that bandwidth doesn't support high burst activity, then it's not going to suit some replication models. And what is the data rate required by your applications? The data rate should be lower than the available bandwidth. And what is the replication technology that you plan to use? The replication technology should be parallel so that it can use the network effectively.
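The rule that the data rate should be lower than the available bandwidth can be checked with some quick arithmetic. The sketch below is illustrative only: the function names, the 80% headroom figure and the example rates are assumptions, not AWS guidance.

```python
def can_sustain(change_rate_mbps, bandwidth_mbps, headroom=0.8):
    """True if the change rate fits within a fraction of the link bandwidth.

    The 0.8 headroom is an assumed safety margin to leave room for bursts.
    """
    return change_rate_mbps <= bandwidth_mbps * headroom


def backlog_megabits(change_rate_mbps, bandwidth_mbps, duration_s):
    """Unreplicated data (in megabits) that piles up while changes outpace the link."""
    surplus_mbps = max(0.0, change_rate_mbps - bandwidth_mbps)
    return surplus_mbps * duration_s


# Hypothetical figures: a 100 Mbps link between the two sites.
print(can_sustain(40, 100))            # True: plenty of room
print(can_sustain(90, 100))            # False: too close to the limit for bursts
print(backlog_megabits(120, 100, 60))  # 1200.0 megabits behind after a one-minute burst
```

Any backlog that builds up during a burst is data the remote site does not have yet, so with asynchronous replication it feeds directly into the recovery point you can actually achieve.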

Let's just look through a couple of the replication options we have. These can be a bit confusing, so let's just take this step by step. There are two types of replication: synchronous replication and asynchronous replication. These two can be very confusing when you're sitting in the exam trying to remember which one is which. So, let's just step through this and hopefully give you some tips for how to remember it. With synchronous replication, data is atomically updated in multiple locations. This puts a dependency on network performance and availability. When deployed in Multi-AZ mode, Amazon RDS uses synchronous replication to duplicate data to a second Availability Zone. This ensures that data is not lost if the primary Availability Zone becomes unavailable.

The other type of replication is asynchronous replication. With asynchronous replication, data is not atomically updated in multiple locations. It is transferred as network performance and availability allow, and the application continues to write data that might not be fully replicated yet. Many database systems support asynchronous data replication. The database replica can be located remotely, and the replica does not have to be completely synchronized with the primary database server. That's acceptable in many scenarios, for example, as a backup source or for reporting and read-only use cases. In addition to database systems, you can also extend asynchronous replication to network file systems and data volumes.
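If it helps to remember which is which, here is a toy model of the two modes in plain Python. It is a sketch only: the `Primary` and `Replica` classes are invented for illustration and do not represent Amazon RDS or any real database engine.

```python
class Replica:
    def __init__(self):
        self.data = {}


class Primary:
    def __init__(self, replica, synchronous):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = []  # writes accepted locally but not yet replicated

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Synchronous: the write only completes once the replica has it too.
            self.replica.data[key] = value
        else:
            # Asynchronous: the write completes now and is shipped later.
            self.pending.append((key, value))

    def ship_pending(self):
        """Send queued writes to the replica, as the network allows."""
        while self.pending:
            key, value = self.pending.pop(0)
            self.replica.data[key] = value


sync_replica, async_replica = Replica(), Replica()
sync_primary = Primary(sync_replica, synchronous=True)
async_primary = Primary(async_replica, synchronous=False)

sync_primary.write("order-1", "paid")
async_primary.write("order-1", "paid")

print(sync_replica.data)   # {'order-1': 'paid'}  already on the replica
print(async_replica.data)  # {}  would be lost if the primary failed right now

async_primary.ship_pending()
print(async_replica.data)  # {'order-1': 'paid'}  caught up
```

The window in which the asynchronous replica is empty is the data you stand to lose in a failure, which is why the choice of replication mode shows up directly in the recovery point objective.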

Let's look at some of the AWS tools that we can use across these scenarios. The first one is AWS Import/Export. AWS Import/Export accelerates moving large amounts of data in and out of AWS by using portable storage devices for transport. AWS Import/Export bypasses the internet and transfers your data directly onto and off of storage devices by means of high-speed internal networks at Amazon. For datasets of large size, AWS Import/Export is often faster than internet transfer and more cost-effective than upgrading your connectivity. You can use AWS Import/Export to migrate data in and out of Amazon S3 buckets and Amazon Glacier vaults, or into Amazon EBS snapshots. In backup and recovery modes, it's a perfect way of being able to move data off-site and back on-site quickly when you need to. And AWS Import/Export Snowball is a fantastic device that you literally get shipped to you, you put your data onto it and then you ship it back.

Another tool is AWS Storage Gateway. AWS Storage Gateway is a service that connects an on-premises software appliance with cloud-based storage to provide seamless and highly secure integration between your on-premises IT environment and the storage infrastructure of AWS. AWS Storage Gateway supports three different configurations. First, gateway-cached volumes, where you can store your primary data in Amazon S3 and retain your frequently accessed data locally. Gateway-cached volumes provide substantial cost savings on primary storage, they minimize the need to scale your storage on-premises, and they retain low-latency access to your frequently accessed data. The second option is gateway-stored volumes. That's good where you need low-latency access to your entire dataset: you can configure your gateway to store your primary data locally and asynchronously back up point-in-time snapshots of this data to Amazon S3. Gateway-stored volumes provide durable and inexpensive off-site backups that you can recover locally or from Amazon EC2 if, for example, you need replacement capacity for disaster recovery. Now, the third option with Storage Gateway is virtual tape libraries, or gateway-VTL. With gateway-VTL, you can have an almost limitless collection of virtual tapes that are stored in the virtual tape library. It feels and looks like a tape library to you and your users. All three of these options can be mapped as iSCSI devices, so it's seamless to the end user. It can be set up from the AWS console, and with gateway-VTL you can also archive those virtual tapes to Amazon Glacier. All three are very effective for backup and recovery in disaster recovery scenarios.

Right, very good. Let's do a quick summary of the four options we have for disaster recovery, so you're well prepped for your exam. Backup and restore is like using AWS as a virtual tape library. It's likely to have the highest recovery time objective, because we need to factor in the time it will take us to access or download backup archives. Our recovery point objective is most likely to be quite high as well, because if we're only doing daily backups, it could be up to 24 hours.

With pilot light, we've got that minimal version of our environment running on AWS, which can be expanded to full size when needed. We've got potentially a lower recovery time objective than we would for backup and restore, but we need to factor in that we still may need to install applications or patches onto our AMIs before we have a fully operational system. Our recovery point objective is going to be since the last snapshot, so it's going to be reasonably low.

For warm standby, we've got that scaled-down version of a fully functional environment always running. So, our recovery time objective is likely to be lower than pilot light, as some of our services are always running. Our recovery point objective is ideally going to be quite low, since it will be since our last data write if it's a master-slave Multi-AZ database. Even if it's asynchronous only, it's still going to give us quite a good recovery point objective. And the benefit of having a warm standby environment is that we can actually use it for dev/test, for one-off projects, for skunkworks and so on.

Multi-site is that fully operational version of our environment running off-site or in another region, and it's likely to give us our lowest recovery time objective. If we're using active-active failover, it could be a matter of seconds. Our recovery point objective likewise depends on the type of data replication that we choose, but it's going to be since our last asynchronous or synchronous DB write. Using Amazon Route 53 for active-active failover, it's going to give us a very aggressive, short recovery point objective and recovery time objective. The consideration with that is that the cost is going to be proportionately higher than the other three options we have, and we need to factor in that there will be some ongoing maintenance required to keep that kind of environment running. The benefit is that we have a way of regularly testing our DR strategy, we also have a way of doing blue-green deployments, and it gives us a lot more diversity in our IT infrastructure.
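One way to revise the summary above is to treat the four options as a small lookup table and pick the cheapest one that meets a given RTO and RPO. In the sketch below, only the relative ordering and the "up to 24 hours" figure for daily backups come from the discussion above. Every other number, along with the `Strategy` class and `cheapest_strategy` function, is a placeholder made up for illustration, not a figure published by AWS.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    rto: timedelta   # placeholder: typical time to restore service
    rpo: timedelta   # placeholder: typical window of data loss
    cost_rank: int   # 1 = cheapest to run, 4 = most expensive


# Placeholder values chosen only to preserve the relative ordering.
STRATEGIES = [
    Strategy("backup and restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("pilot light", timedelta(hours=4), timedelta(hours=1), 2),
    Strategy("warm standby", timedelta(minutes=30), timedelta(minutes=5), 3),
    Strategy("multi-site", timedelta(minutes=1), timedelta(minutes=1), 4),
]


def cheapest_strategy(required_rto, required_rpo):
    """Return the lowest-cost strategy that meets both targets, or None."""
    fits = [s for s in STRATEGIES if s.rto <= required_rto and s.rpo <= required_rpo]
    return min(fits, key=lambda s: s.cost_rank) if fits else None


choice = cheapest_strategy(timedelta(hours=8), timedelta(hours=1))
print(choice.name)  # pilot light
```

With an eight-hour RTO and a one-hour RPO, backup and restore is ruled out by its data loss window, so pilot light is the cheapest option left. Tightening either target pushes you toward warm standby or multi-site, and toward the higher cost that goes with them.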

About the Author

Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.