Designing for disaster recovery / business continuity
3h 8m

Course Description

The AWS exam guide outlines that 60% of the Solutions Architect–Associate exam questions could be on the topic of designing highly-available, fault-tolerant, cost-efficient, scalable systems. This course teaches you to recognize and explain the core architecture principles of high availability, fault tolerance, and cost optimization. We then step through the core AWS components that can enable highly available solutions when used together so you can recognize and explain how to design and monitor highly available, cost efficient, fault tolerant, scalable systems.

Course Objectives

  • Identify and recognize cloud architecture considerations such as functional components and effective designs
  • Define best practices for planning, designing, and monitoring in the cloud
  • Develop to client specifications, including pricing and cost
  • Evaluate architectural trade-off decisions when building for the cloud
  • Apply best practices for elasticity and scalability concepts to your builds
  • Integrate with existing development environments

Intended Audience

This course is for anyone preparing for the Solutions Architect–Associate for AWS certification exam. We assume you have some existing knowledge and familiarity with AWS, and are specifically looking to get ready to take the certification exam.


Prerequisites

Basic knowledge of core AWS functionality. If you haven't already completed it, we recommend our Fundamentals of AWS Learning Path. We also recommend completing the other courses, quizzes, and labs in the Solutions Architect–Associate for AWS certification learning path.

This Course Includes:

  • 11 video lectures
  • Detailed overview of the AWS services that enable high availability, cost efficiency, fault tolerance, and scalability
  • A focus on designing systems in preparation for the certification exam

What You'll Learn

Lecture Group: Designing for high availability, fault tolerance, and cost efficiency
What you'll learn: How to combine AWS services to create highly available, cost-efficient, fault-tolerant systems.

Lecture Group: Designing for business continuity
What you'll learn: How to recognize and explain Recovery Time Objectives and Recovery Point Objectives, and how to recognize and implement AWS solution designs to meet common RTO/RPO objectives.

Lecture Group: Ten AWS services that enable high availability
What you'll learn: Regions and Availability Zones, VPCs, ELB, SQS, EC2, Route 53, EIP, CloudWatch, and Auto Scaling.

If you have thoughts or suggestions for this course, please contact Cloud Academy at


Let's talk about disaster recovery. A company typically decides on an acceptable business continuity plan based on the financial impact to the business when systems aren't available. The company determines that financial impact by considering many factors, such as the loss of business and the damage to its reputation due to downtime or a lack of system availability.
The common metrics for business continuity are the recovery time objective and the recovery point objective, so let's delve into these two concepts. The recovery time objective, or RTO, is the time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement. For example, if a disaster occurs at 12 o'clock, lunchtime, and the RTO is eight hours, then the disaster recovery process should restore the business process to the acceptable service level by 8pm.
The recovery point objective, or RPO, is the acceptable amount of data loss measured in time. So, just to confuse everyone, it is also a time value, but it's quite a different concept. For example, if that disaster occurred at 12 o'clock, around lunchtime, and the RPO is one hour, the system should recover all data that was in the system before 11am, so the data loss will span only the one hour between 11am and 12pm. They're quite different, aren't they? The recovery point objective asks how much recent data we can afford to lose if there is an outage. For a highly transactional business, that's going to be extremely low. Even an hour of data loss, if you're dealing with customer transactions, is not going to be acceptable, and that's going to impact how we design our systems to be as highly available and as fault tolerant as possible for a business like that.
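To make the two metrics concrete, here is a minimal sketch of the arithmetic in the lunchtime example above. The function name `recovery_targets` and the calendar date are assumptions for illustration; only the noon disaster, the eight-hour RTO, and the one-hour RPO come from the example.

```python
from datetime import datetime, timedelta


def recovery_targets(disaster_at, rto, rpo):
    """Return (restore_deadline, recovery_point) for a disruption.

    restore_deadline: latest time the service should be back (disaster + RTO).
    recovery_point:   data written before this time must be recoverable
                      (disaster - RPO).
    """
    restore_deadline = disaster_at + rto
    recovery_point = disaster_at - rpo
    return restore_deadline, recovery_point


# Hypothetical date; the time of day matches the example (12 noon).
disaster = datetime(2024, 1, 15, 12, 0)
deadline, point = recovery_targets(
    disaster, rto=timedelta(hours=8), rpo=timedelta(hours=1)
)

print(deadline.strftime("%H:%M"))  # 20:00, i.e. 8pm
print(point.strftime("%H:%M"))     # 11:00, i.e. 11am
```

The RTO looks forward from the disaster and the RPO looks backward from it, which is one way to keep the two apart in the exam.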
In another scenario, it might be that the business can absorb some data loss but does need to have the systems up and running again as soon as possible, so the RTO might be the priority. Part of your business continuity planning needs to be defining what the priority is. Is it the recovery time objective, how quickly we can get the system back up and running again so it can answer queries and requests and be fully functional? Or is it the recovery point objective, that we must be able to go back to the last possible point in time without any data loss?
There are a number of different scenarios that we can apply in AWS to help meet our RPOs and RTOs, and the first one is what we call backup and restore. With backup and restore, data is stored as a virtual tape library using AWS Storage Gateway or another network appliance of a similar nature. We can use AWS Import/Export to move large archives when setting up a backup and restore scenario. Then, in a disaster, archives are recovered from Amazon S3 and restored as if we were using a virtual tape.
We need to select the appropriate tools and methods to back up data into AWS, and there are three things to keep in mind. First, ensure that you have an appropriate retention policy for this data. How long are we going to keep these virtual tape archives? Is it six months? A year? Five years, to meet commercial and compliance requirements? Second, ensure that the appropriate security measures are in place for this data, including encryption and access policies. Can we guarantee that where it's being stored is going to be suitably secure? And third, make sure that we regularly test the recovery of the data and the restoration of the system.
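As an illustration of the first point, a retention policy can be expressed as a simple age check over an archive inventory. This is a sketch only: the archive names, dates, and the `expired_archives` helper are hypothetical, and in practice the rule would more likely be enforced by the storage service's own lifecycle settings than by a script.

```python
from datetime import date, timedelta


def expired_archives(archives, today, retention_days):
    """Return names of archives created before the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return [name for name, created in archives.items() if created < cutoff]


# Hypothetical inventory of virtual tape archives and their creation dates.
archives = {
    "tape-2023-01": date(2023, 1, 1),
    "tape-2023-09": date(2023, 9, 1),
    "tape-2024-01": date(2024, 1, 1),
}

# Assumed policy: keep archives for one year.
to_delete = expired_archives(archives, today=date(2024, 3, 1), retention_days=365)
print(to_delete)  # ['tape-2023-01']
```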
All right, the second potential design is what we call pilot light. In pilot light, data is mirrored and the environment is scripted as a template, which can be built out and scaled in the unlikely event of a disaster. There are a few steps that we need to go through to make pilot light work. First, we set up Amazon EC2 instances to replicate or mirror our data. Second, we ensure that we have all supporting custom software packages available in AWS. That can be quite an operational overhead, making sure that the latest custom software packages we need for our environment are all available in AWS. Third, we need to create and maintain Amazon Machine Images of the key servers where fast recovery is required. Fourth, we need to regularly run these servers, test them, and apply any software updates and configuration changes so that they match our current production environment in the event of a disaster. And fifth, we need to consider automating the provisioning of AWS services as much as possible with CloudFormation.
So what does that look like in our recovery phase? In the unlikely event of a disaster, to recover the remainder of our environment around our pilot light, we can start our systems from the Amazon Machine Images on the appropriate instance types. For our dynamic data servers, we can resize them to handle production volumes as needed, or add capacity accordingly. Horizontal scaling is often the most cost-effective and scalable approach to adding capacity to the pilot light system. As an example, we can add more web servers at peak times during the day. However, we can also choose larger Amazon EC2 instance types, and thus scale vertically, for applications such as our relational databases and file storage. Any required DNS updates can be done in parallel.
Okay, so the third scenario we can implement is what we call warm standby.
Here are our key steps for preparation in a warm standby, which is, as it says, essentially ready to go, with all key services running in the most minimal possible way. First, we set up our Amazon EC2 instances to replicate or mirror data. Second, we create and maintain Amazon Machine Images as required. Third, we run our application using a minimal footprint of Amazon EC2 instances or AWS infrastructure, so it's basically the bare minimum that we can get by with. And fourth, we patch and update software and configuration files in line with our live environment. So we're essentially running a smaller version of our full production environment. Then, during our recovery phase, in the case of failure of the production system, the standby environment is scaled up for production load, and the DNS records are changed to route all traffic to the AWS environment.
The fourth potential scenario is what we call multi-site. With multi-site, we set up our AWS environment to duplicate our production environment, so essentially we've got a mirror of our production running in AWS. First, we set up DNS weighting in Route 53, or a similar traffic routing technology if we're not using Route 53, to distribute incoming requests to both sites. We also configure automated failover to reroute traffic away from the affected site in the event of an outage. In our recovery phase, traffic is cut over to the AWS infrastructure by updating the DNS record in Route 53, and all traffic and supporting data queries are then served by the AWS infrastructure. The multi-site scenario is usually the preferred one where recovery time and recovery point are our priorities and cost is not the main constraint.
Okay, so one key thing when we're running any of these scenarios is to ensure that we test the recovered data.
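The DNS weighting and automated failover described for multi-site can be pictured with a small local simulation. This is not the Route 53 API: the `pick_site` function, the site names, and the 70/30 weights are all assumptions, chosen only to show how weighted routing behaves when a health check takes one site out of rotation.

```python
import random


def pick_site(weights, healthy, rng):
    """Choose a site in proportion to its weight, skipping unhealthy sites."""
    candidates = {
        site: weight
        for site, weight in weights.items()
        if healthy.get(site, False) and weight > 0
    }
    if not candidates:
        raise RuntimeError("no healthy site available")
    sites = list(candidates)
    return rng.choices(sites, weights=[candidates[s] for s in sites], k=1)[0]


# Hypothetical weights: 70% of requests to the primary site, 30% to AWS.
weights = {"on-premises": 70, "aws": 30}
rng = random.Random(0)  # seeded so repeated runs give the same sequence

# Normal operation: both sites pass their health checks.
normal = [
    pick_site(weights, {"on-premises": True, "aws": True}, rng)
    for _ in range(1000)
]

# Outage at the primary site: every request is routed to AWS.
failover = [
    pick_site(weights, {"on-premises": False, "aws": True}, rng)
    for _ in range(1000)
]

print(sorted(set(normal)))  # ['aws', 'on-premises']
print(set(failover))        # {'aws'}
```

Because both sites already take live traffic, failover here is just a change in the routing weights, which is why multi-site can offer the shortest recovery time.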
Once we've restored our primary site to a working state, we then need to restore normal service, which is often referred to as the failback process. Depending on your DR strategy, this typically means reversing the flow of data replication so that any data updates received while the primary site was down can be replicated back without loss of data. Here are the steps for backup and restore. First, we freeze data changes on the DR site. Second, we take a backup. Third, we restore the backup to the primary site. Fourth, we re-point users to the primary site. And fifth, we unfreeze the changes. With pilot light, warm standby, and multi-site, first, we establish reverse mirroring and replication from the DR site back to the primary site, once the primary site has caught up with the changes. Second, we freeze data changes to the DR site. Third, we re-point users to the primary site. And finally, we unfreeze the changes.
Now, most of those scenarios involve some sort of data replication, so let's talk through some of the considerations. When you replicate data to a remote location, you need to think through a number of factors. First, the distance between the sites: larger distances are typically subject to more latency or jitter. Second, what is the available bandwidth? The breadth and variability of the interconnections is going to be important. If that bandwidth doesn't support high-burst activity, then it's not going to suit some replication models. Third, what is the data rate required by your applications? The data rate should be lower than the available bandwidth. And fourth, what replication technology do you plan to use? The replication technology should be parallel so that it can use the network effectively.
So let's look through the replication options we have. This can be a bit confusing, so let's take it step by step.
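The point that the data rate should stay below the available bandwidth can be checked with simple arithmetic. The sketch below uses made-up figures (an 80 Mbps steady write rate, a 150 Mbps burst, a 100 Mbps link), and both function names are hypothetical; it ignores latency and protocol overhead.

```python
def replication_backlog_megabits(data_rate_mbps, bandwidth_mbps, seconds):
    """Megabits still waiting to be replicated after `seconds` of writes."""
    return max(0, data_rate_mbps - bandwidth_mbps) * seconds


def drain_seconds(backlog_megabits, data_rate_mbps, bandwidth_mbps):
    """Seconds to clear a backlog while writes continue at data_rate_mbps.

    Returns None when the link has no spare capacity, meaning the replica
    never catches up.
    """
    spare = bandwidth_mbps - data_rate_mbps
    if spare <= 0:
        return None
    return backlog_megabits / spare


# Steady state: 80 Mbps of writes over a 100 Mbps link, no backlog builds up.
steady = replication_backlog_megabits(80, 100, seconds=3600)

# A one-minute burst at 150 Mbps on the same link leaves a backlog.
burst = replication_backlog_megabits(150, 100, seconds=60)

print(steady, burst, drain_seconds(burst, 80, 100))  # 0 3000 150.0
```

While a backlog exists, the replica lags the primary, so a link that cannot absorb bursts directly worsens the achievable recovery point.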
Okay, there are two types of replication: synchronous replication and asynchronous replication. These two can be very confusing when you're sitting in the exam trying to remember which one is which, so let's step through this, and hopefully give you some tips for how to remember it. With synchronous replication, data is atomically updated in multiple locations, which puts a dependency on network performance and availability. When deployed in Multi-AZ mode, Amazon RDS uses synchronous replication to duplicate data to a second Availability Zone. This ensures that data is not lost if the primary Availability Zone becomes unavailable.
The other type is asynchronous replication. With asynchronous replication, data is not atomically updated in multiple locations. It is transferred as network performance and availability allow, and the application continues to write data that might not be fully replicated yet. Many database systems support asynchronous data replication. The database replica can be located remotely, and the replica does not have to be completely synchronized with the primary database server. That's acceptable in many scenarios, for example as a backup source or for reporting and read-only use cases. In addition to database systems, you can also extend asynchronous replication to network file systems and data volumes.
All right, here are some of the AWS tools that we can use across these scenarios. The first one is AWS Import/Export. AWS Import/Export accelerates moving large amounts of data into and out of AWS by using portable storage devices for transport. It bypasses the internet and transfers your data directly onto and off of storage devices by means of Amazon's high-speed internal network. For large data sets, AWS Import/Export is often faster than internet transfer and more cost effective than upgrading your connectivity.
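A rough calculation shows why shipping storage devices can beat an internet transfer for large data sets. The figures below are assumptions for illustration: a 100 TB data set, decimal terabytes, 80% usable link utilization, and a hypothetical ten-day shipping turnaround. None of them are AWS-published numbers.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_tb, link_mbps, utilization=0.8):
    """Days needed to push data_tb terabytes over a link_mbps connection.

    utilization is the assumed fraction of the link the transfer can use.
    """
    bits = data_tb * 1e12 * 8
    usable_bits_per_second = link_mbps * 1e6 * utilization
    return bits / usable_bits_per_second / SECONDS_PER_DAY


def shipping_is_faster(data_tb, link_mbps, shipping_days):
    """Compare an assumed shipping turnaround with the network transfer time."""
    return shipping_days < transfer_days(data_tb, link_mbps)


print(round(transfer_days(100, 100), 1))   # 115.7 days on a 100 Mbps link
print(round(transfer_days(100, 1000), 1))  # 11.6 days on a 1 Gbps link
print(shipping_is_faster(100, 100, shipping_days=10))  # True
```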
You can use AWS Import/Export to migrate data into and out of Amazon S3 buckets and Amazon Glacier vaults, or into Amazon EBS snapshots. So in backup and recovery modes, it's a perfect way of being able to move data offsite and back onsite quickly when you need to. AWS Import/Export Snowball is a fantastic device that literally gets shipped to you. You put the data onto it, and then you ship it back.
Another tool is AWS Storage Gateway. AWS Storage Gateway is a service that connects an on-premises software appliance with cloud-based storage to provide seamless and highly secure integration between your on-premises IT environment and the storage infrastructure of AWS. AWS Storage Gateway supports three different configurations. The first is gateway-cached volumes, where you store your primary data in Amazon S3 and retain your frequently accessed data locally. Gateway-cached volumes provide substantial cost savings on primary storage, minimize the need to scale your storage on-premises, and retain low-latency access to your frequently accessed data. The second option is gateway-stored volumes. That's good where you need low-latency access to your entire data set: you configure your gateway to store your primary data locally and asynchronously back up point-in-time snapshots of this data to Amazon S3. Gateway-stored volumes provide durable and inexpensive offsite backups that you can recover locally or from Amazon EC2 if, for example, you need replacement capacity for disaster recovery. The third option with Storage Gateway is virtual tape libraries, or gateway-VTL. With gateway-VTL, you can have an almost limitless collection of virtual tapes stored in a virtual tape library, so it looks and feels like a tape library to you and your users. All three of these options can be mounted as iSCSI drives, so it's seamless to the end user.
It can be set up from the AWS console, and with gateway-VTL virtual tape libraries, you can also archive those tapes to Amazon Glacier. So all three are very effective for backup and recovery in disaster recovery scenarios.
Right. Very good. Let's do a quick summary of the four options we have for disaster recovery so you're well prepped for your exam. First, backup and restore, which is like using AWS as a virtual tape library. It's likely to have the highest recovery time objective, because we need to factor in the time it would take us to access or download backup archives. Our recovery point objective is most likely to be quite high as well, because if we're only doing daily backups, it could be up to 24 hours.
With pilot light, we've got that minimal version of our environment running on AWS, which can be expanded to full size when needed. We've got a potentially lower recovery time objective than we would for backup and restore, but we need to factor in that we may still need to install applications or patches onto our AMIs before we have a fully operational system. Our recovery point objective is going to be the time since the last snapshot, so it's going to be reasonably low.
For warm standby, we've got that scaled-down version of a fully functional environment always running, so our recovery time objective is likely to be lower than pilot light, as some of our services are always running. Our recovery point objective is ideally going to be quite low, since it will be the time since our last data write if it's a master-slave Multi-AZ database. Even if it's asynchronous only, it's still going to give us quite a good recovery point objective. And the benefit of having a warm standby environment is that we can actually use it for dev and test, for one-off projects, for skunkworks, et cetera.
And multi-site is that fully operational version of an environment running offsite or in another region. It's likely to give us our lowest recovery time objective: if we're using active-active failover, it could be a matter of seconds. Our recovery point objective likewise depends on the data replication we choose, but it's going to be the time since our last asynchronous or synchronous database write. Using Route 53 for active-active failover, it's going to give us a very aggressive, short recovery point objective and recovery time objective. The consideration is that the cost is going to be proportionately higher than the other three options, and we need to factor in that there will be some ongoing maintenance required to keep that kind of environment running. The benefit is that we have a way of regularly testing our DR strategy. We also have a way of doing blue-green deployments, and it gives us a lot more diversity in our IT infrastructure.
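The summary above can be turned into a small decision helper: pick the cheapest option whose RTO and RPO fit the business targets. The hour values and cost ranks in this sketch are illustrative assumptions that only preserve the relative ordering described in the summary; they are not AWS figures, and real values depend entirely on how each environment is built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_hours: float    # assumed typical recovery time
    rpo_hours: float    # assumed typical data-loss window
    relative_cost: int  # assumed rank, 1 = cheapest


# Illustrative numbers only; the ordering follows the summary above.
STRATEGIES = [
    Strategy("backup and restore", rto_hours=24.0, rpo_hours=24.0, relative_cost=1),
    Strategy("pilot light", rto_hours=4.0, rpo_hours=1.0, relative_cost=2),
    Strategy("warm standby", rto_hours=1.0, rpo_hours=0.25, relative_cost=3),
    Strategy("multi-site", rto_hours=0.05, rpo_hours=0.01, relative_cost=4),
]


def cheapest_strategy(rto_target_hours, rpo_target_hours):
    """Return the lowest-cost strategy meeting both targets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_hours <= rto_target_hours and s.rpo_hours <= rpo_target_hours
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.relative_cost)


print(cheapest_strategy(8, 1).name)      # pilot light
print(cheapest_strategy(0.1, 0.1).name)  # multi-site
print(cheapest_strategy(48, 48).name)    # backup and restore
```

Under these assumptions, tightening either target pushes the choice toward the more expensive options, which is the trade-off the exam questions tend to probe.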

About the Author

Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing public cloud services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.