Disaster Recovery


Course Summary

In addition to the many services covered on the AWS Certified Cloud Practitioner exam, you should be familiar with concepts and best practices designed to help AWS users succeed with cloud computing, and understand how AWS structures its services across the globe.

This course begins with a lecture covering the different types of AWS global infrastructure, which includes regions, availability zones, edge locations, and regional edge caches. What we’re talking about here is AWS data center hardware, and how it is organized around the world. Understanding how AWS organizes its infrastructure, how AWS infrastructure works, and how to use it to your benefit is essential AWS knowledge.

Next, we discuss the AWS Well-Architected Framework, a set of best practices established by experienced AWS solution architects. To be clear - knowledge of how to technically configure well-architected solutions is outside the scope of the AWS Certified Cloud Practitioner exam. However, you should be familiar with the fundamental best practices of cloud architecture, which we will introduce in this course.

Finally, we discuss basic techniques for disaster recovery. There are well-established methods for restoring AWS services, in the unlikely event of an outage. This course will not discuss the step-by-step process of disaster recovery, which is addressed in other courses. This course will provide an overview of each different method, and how each one balances the competing business needs of high availability and cost optimization.

Learning Objectives

  • Understand how the different components of AWS global infrastructure work, and how they can impact AWS cloud solutions
  • List and describe the five pillars of the AWS Well-Architected Framework
  • Summarize the standard disaster recovery methods, and how a business would select a method based on its service needs

Intended Audience

This course is designed for:

  • Anyone preparing for the AWS Certified Cloud Practitioner exam
  • Managers, sales professionals, and other non-technical roles


Before taking this course, you should have a general understanding of basic cloud computing concepts.


If you have thoughts or suggestions for this course, please contact Cloud Academy at


Okay, let's talk about disaster recovery. A company typically decides on an acceptable business continuity plan based on the financial impact to the business when systems are unavailable. The company determines that financial impact by considering many factors, such as the loss of business and the damage to its reputation due to downtime or a lack of systems availability. The two common metrics for business continuity are the Recovery Time Objective and the Recovery Point Objective.

So let's delve into these two concepts. The Recovery Time Objective, or RTO, is the time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement. For example, if a disaster occurs at 12 o'clock, lunchtime, and the RTO is eight hours, then the disaster recovery process should restore the business process to the acceptable service level by 8:00 p.m.

Now, the Recovery Point Objective, or RPO, is the acceptable amount of data loss, measured in time. So, just to confuse everyone, it is also a time value, but the two are quite different concepts. For example, if the disaster occurs at 12 o'clock, around lunchtime, and the RPO is one hour, the system should recover all data that was in the system before 11:00 a.m. So the data loss will span only one hour, between 11:00 a.m. and 12:00 p.m.

There are a number of different scenarios that we can apply in AWS to help meet our RPOs and RTOs. The first one is what we call Backup & Restore.
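To make the RTO and RPO arithmetic from the two examples concrete, here is a minimal Python sketch. The function names and the calendar date are illustrative assumptions; only the times of day and the durations come from the lecture.

```python
from datetime import datetime, timedelta


def recovery_deadline(disaster_time: datetime, rto: timedelta) -> datetime:
    """Latest time by which service should be restored to meet the RTO."""
    return disaster_time + rto


def earliest_acceptable_recovery_point(disaster_time: datetime, rpo: timedelta) -> datetime:
    """Oldest point in time the recovered data may reflect while still meeting the RPO."""
    return disaster_time - rpo


# Hypothetical date; the 12:00 disaster time matches the lecture's example.
disaster = datetime(2024, 1, 15, 12, 0)

deadline = recovery_deadline(disaster, timedelta(hours=8))
recovery_point = earliest_acceptable_recovery_point(disaster, timedelta(hours=1))

print(deadline.strftime("%H:%M"))        # 20:00, i.e. 8:00 p.m.
print(recovery_point.strftime("%H:%M"))  # 11:00, i.e. 11:00 a.m.
```

The RTO is added to the disaster time because it looks forward to when service is restored, while the RPO is subtracted because it looks backward to the last data that must survive.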

Now with Backup & Restore, data is stored as a virtual tape library, using AWS Storage Gateway or another network appliance of a similar nature. We can use AWS Import/Export to shift large archives, or when setting up archives for a Backup & Restore scenario. Then, in the event of a disaster, archives are recovered from Amazon S3 and restored as if we were using a virtual tape.

All right, so the second potential design is what we call Pilot Light. In Pilot Light, data is mirrored, and the environment is scripted as a template, which can be built out and scaled in the unlikely event of a disaster. In our recovery phase, to recover the remainder of our environment around our Pilot Light, we can start our systems from the Amazon Machine Images (AMIs) on the appropriate instance types. And for our dynamic data servers, we can resize them to handle production volumes as needed, or add capacity accordingly.

Okay, so the third scenario we can implement is what we call Warm Standby. In our preparation for Warm Standby, the environment is, as the name says, essentially ready to go, with all key services running in the most minimal possible way, so we're essentially running a smaller version of our full production environment. Then during our recovery phase, in the case of failure of the production system, the standby environment will be scaled up for production load, and the DNS records will be changed to route all traffic to the AWS environment.

Now our fourth potential scenario is what we call Multi-Site. With Multi-Site, we set up our AWS environment to duplicate our production environment. So essentially we've got a mirror of production running in AWS. Now, in our recovery phase, traffic is cut over to the AWS infrastructure by updating the DNS record in Route 53, and all traffic and supporting data queries are supported by the AWS infrastructure. Our Multi-Site scenario is usually the preferred one where Recovery Time Objective and Recovery Point Objective are priorities and cost is not the main constraint; in that case, it would be the ideal scenario.
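As a rough illustration of the DNS cutover step, the sketch below builds the change batch structure that the Route 53 API accepts for updating a record. The helper name, the record name, and the IP address are hypothetical placeholders, and this is only one possible way to perform the cutover, not necessarily the approach used in the course.

```python
def build_failover_change_batch(record_name: str, dr_ip_address: str, ttl_seconds: int = 60) -> dict:
    """Build a Route 53 change batch that points an A record at the DR environment."""
    return {
        "Comment": "Cut traffic over to the AWS disaster recovery environment",
        "Changes": [
            {
                # UPSERT creates the record if it is missing, or updates it if it exists.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl_seconds,
                    "ResourceRecords": [{"Value": dr_ip_address}],
                },
            }
        ],
    }


# Hypothetical record name and documentation-range IP address.
batch = build_failover_change_batch("www.example.com.", "203.0.113.10")
print(batch["Changes"][0]["ResourceRecordSet"]["Name"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as the `ChangeBatch` argument of `boto3.client("route53").change_resource_record_sets(...)`, together with the `HostedZoneId` of your own hosted zone. A short TTL is assumed here so that clients pick up the new address quickly.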

Okay, so one key thing to ensure when we are running any of our scenarios is that we test the recovered data. Once we've restored our primary site to a working state, we then need to restore normal service, which is often referred to as a failback process. Depending on your DR strategy, this typically means reversing the flow of data replication, so that any data updates received while the primary site was down can be replicated back without loss of data.

Right, very good. Let's do a quick summary of the four options we have for disaster recovery so you're well-prepped for your exam.

Backup & Restore uses AWS like a virtual tape library. It's likely to have the highest Recovery Time Objective, because we need to factor in the time it would take us to access or download backup archives. Our Recovery Point Objective is most likely to be quite high as well, because if we're only doing daily backups, it could be up to 24 hours.

With Pilot Light, we've got that minimal version of our environment running on AWS, which can be expanded to full size when needed. We've got potentially a lower Recovery Time Objective than we would for Backup & Restore, but we need to factor in that we still may need to install applications or patches onto our AMIs before we have a fully operational system. Our Recovery Point Objective is measured from the last snapshot, so it's going to be reasonably low.

For Warm Standby, we've got that scaled-down version of a fully functional environment always running. So our Recovery Time Objective is likely to be lower than Pilot Light, as some of our services are always running. Our Recovery Point Objective is ideally going to be quite low, since it will be measured from our last data write if it's a master/slave Multi-AZ database. Even if it's asynchronous only, it's still going to give us quite a good Recovery Point Objective. And the benefit of having a Warm Standby environment is that we can actually use it for dev/test, for one-off projects, or for skunkworks, et cetera.

And Multi-Site is that fully operational version of our environment, running off-site or in another region. It's likely to give us our lowest Recovery Time Objective; if we're using active/active failover, it could be a matter of seconds. Likewise with our Recovery Point Objective: it depends on the choice of data replication, but it's going to be measured from our last asynchronous or synchronous DB write. And using Route 53 for active/active failover, it's going to give us a very aggressive, short Recovery Point Objective and Recovery Time Objective. The consideration with that is that the cost is going to be higher, proportionately, than the other three options we have, and we need to factor in that there will be some ongoing maintenance required to keep that kind of environment running. The benefit is that we have a way of regularly testing our DR strategy. We also have a way of doing blue-green deployments, and it gives us a lot more diversity in our IT infrastructure.
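The trade-off between recovery objectives and cost can be sketched as a simple selection rule: pick the cheapest of the four options whose recovery characteristics still meet the business requirement. In the Python sketch below, the hour values are purely illustrative assumptions chosen to reflect the relative ordering described in the lecture (apart from the 24-hour RPO for daily backups); they are not AWS figures, and real values depend entirely on how each environment is built.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto_hours: float  # illustrative assumption, not an AWS figure
    typical_rpo_hours: float  # illustrative assumption, not an AWS figure


# Listed from lowest to highest relative cost (a common rule of thumb).
STRATEGIES = [
    Strategy("Backup & Restore", typical_rto_hours=24.0, typical_rpo_hours=24.0),
    Strategy("Pilot Light", typical_rto_hours=4.0, typical_rpo_hours=1.0),
    Strategy("Warm Standby", typical_rto_hours=1.0, typical_rpo_hours=0.25),
    Strategy("Multi-Site", typical_rto_hours=0.05, typical_rpo_hours=0.05),
]


def cheapest_strategy(required_rto_hours: float, required_rpo_hours: float) -> Optional[Strategy]:
    """Return the first (cheapest) strategy that meets both objectives, or None."""
    for strategy in STRATEGIES:
        if (strategy.typical_rto_hours <= required_rto_hours
                and strategy.typical_rpo_hours <= required_rpo_hours):
            return strategy
    return None


# An eight-hour RTO and one-hour RPO, as in the earlier lecture examples.
choice = cheapest_strategy(required_rto_hours=8, required_rpo_hours=1)
print(choice.name if choice else "No listed strategy meets these objectives")
```

Because the list is ordered by relative cost, the first match is the least expensive option that satisfies both objectives, which mirrors how a business would balance availability against cost optimization.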

About the Author

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.