Disaster Recovery Scenarios in AWS

** Not all content covered in the course introduction has been added to the course at this time. Additional content is scheduled to be added to this course in the future. **

In this section of the AWS Certified: SAP on AWS Specialty learning path, we introduce you to strategies for configuring high availability and disaster recovery for SAP workloads on AWS.

Learning Objectives

  • Understand how to configure high availability with Amazon RDS
  • Identify backup and disaster recovery strategies using the AWS Cloud
  • Describe various approaches for business continuity and disaster recovery for SAP workloads on AWS


The AWS Certified: SAP on AWS Specialty certification has been designed for anyone who has experience managing and operating SAP workloads. Ideally you’ll also have some exposure to the design and implementation of SAP workloads on AWS, including migrating these workloads from on-premises environments. Many exam questions will require a solutions architect level of knowledge for many AWS services. All of the AWS Cloud concepts introduced in this course will be explained and reinforced from the ground up.


Disaster Recovery Scenarios in AWS. There are four predefined common practices for disaster recovery using AWS. They are ordered from the highest recovery time objective to the lowest. The recovery time objective (RTO) measures the total time that it takes for your system to recover in the event of a failure. The four practices are called backup and restore, pilot light, low-capacity standby, and active-active. Let's discuss each of them.

The first is backup and restore. In this situation, all backups are stored in Amazon S3, directly or indirectly. In the event of a failure, you will need to obtain your backups from S3, use CloudFormation to automate deployment of the infrastructure, restore your systems from the backed-up AMIs, snapshots, and other backup resources, and finally, switch to the new system by adjusting the DNS records accordingly. The recovery time objective for this approach is as long as it takes to bring up the replacement infrastructure and restore the systems from backups, and it is therefore the longest of the four. The recovery point objective (RPO) for this approach is the time since the last data backup.
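To make the two objectives concrete, here is a minimal Python sketch of how they could be estimated for backup and restore. The function names, timestamps, and step durations are all hypothetical and chosen only for illustration. It also assumes the restore steps run one after another, which may not hold in a real runbook.

```python
from datetime import datetime, timedelta


def recovery_point(last_backup: datetime, failure: datetime) -> timedelta:
    """Data-loss window: time elapsed between the last backup and the failure."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


def recovery_time(step_durations: dict[str, timedelta]) -> timedelta:
    """Total recovery time, assuming the restore steps run sequentially."""
    return sum(step_durations.values(), timedelta())


# Hypothetical figures, for illustration only.
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 11, 30)
steps = {
    "fetch backups from S3": timedelta(minutes=20),
    "deploy infrastructure with CloudFormation": timedelta(minutes=30),
    "restore from AMIs and snapshots": timedelta(minutes=90),
    "update DNS records": timedelta(minutes=5),
}

rpo = recovery_point(last_backup, failure)
rto = recovery_time(steps)
print(rpo)  # 9:30:00
print(rto)  # 2:25:00
```

With these made-up numbers, a failure at 11:30 loses everything written since the 02:00 backup, and the system is back after roughly two and a half hours. Backing up more often shrinks the first figure but does nothing for the second.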

Next is pilot light disaster recovery. In this case, you pre-provision database replication and EC2 instances, but not to full capacity, and not all resources are actually running. Route 53 will be responsible for triggering the disaster recovery event, which will automatically scale the capacity and then switch traffic from one system to the other. In the event of a failure, you will automatically start all pre-provisioned instances, scale your systems to production capacity, and then switch to the new system by adjusting the DNS records accordingly.
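As a rough illustration of the DNS switch described above, the following Python sketch builds a Route 53 change batch containing a primary and a secondary failover record. The domain name, IP addresses, and health check ID are placeholders, and the helper function is hypothetical. The sketch only constructs the request body and covers only the DNS side of the failover; starting and scaling the pre-provisioned instances would have to be automated separately.

```python
def failover_change_batch(name, primary_ip, standby_ip, health_check_id, ttl=60):
    """Build a Route 53 ChangeBatch with a PRIMARY and a SECONDARY failover record."""

    def record(set_id, role, ip, check_id=None):
        rrset = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,   # distinguishes records sharing a name
            "Failover": role,          # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        if check_id is not None:
            # The health check decides when traffic leaves the primary.
            rrset["HealthCheckId"] = check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Pilot light failover pair",
        "Changes": [
            record("primary", "PRIMARY", primary_ip, health_check_id),
            record("standby", "SECONDARY", standby_ip),
        ],
    }


# Placeholder values, not taken from the course.
batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.20",
    "hypothetical-health-check-id",
)

# With boto3 installed and credentials configured, the batch could be applied
# along these lines (the hosted zone ID is a placeholder):
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="ZEXAMPLE", ChangeBatch=batch)
```

A short TTL is used here on the assumption that clients should pick up the secondary address quickly once the health check on the primary fails.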

This approach is cost-effective because it entails provisioning fewer resources around the clock. The setup requires that you replicate your database to the replacement region or availability zone. It also requires that you prepare all resources for automatic startup and scale-up. Finally, you need to ensure capacity by using Reserved Instances in order to guarantee that any EC2 instances are deployable when needed. The recovery time objective is as long as it takes to trigger the disaster recovery event, automatically scale up the replacement system, and finally, move the traffic.

The recovery point objective depends on the replication type used for the database instance.

Third out of four patterns for disaster recovery is the fully working low-capacity standby. If you take the pilot light implementation to the next level, you will end up with a low-capacity standby system. You basically have two separate running production environments: the primary environment and the low-capacity environment.

You can use Route 53 to dispatch a small percentage of production traffic as a way to implement a continuous testing mechanism for the replacement production system. The advantage is that the replacement system can actually handle some production traffic at any time. There is also a bit of cost saving in that you're not running a full footprint on the replacement system, but a smaller-footprint system, until capacity needs to be scaled up when a disaster recovery event is triggered. The setup will require the same steps as pilot light, with the exception that all components are now running but are not scaled to production levels.
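One way to send a small percentage of traffic to the standby is Route 53 weighted routing, where each record receives a share of queries equal to its weight divided by the sum of all weights. The Python sketch below builds two weighted records and computes the resulting shares. The names, addresses, and the 95/5 split are assumptions made for illustration, and the helper functions are hypothetical.

```python
def weighted_records(name, targets, ttl=60):
    """Build weighted A records; targets maps set identifier -> (ip, weight)."""
    records = []
    for set_id, (ip, weight) in targets.items():
        if not 0 <= weight <= 255:
            raise ValueError("Route 53 weights must be between 0 and 255")
        records.append({
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,
            "Weight": weight,
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        })
    return records


def traffic_share(targets):
    """Expected fraction of DNS queries answered with each target."""
    total = sum(weight for _, weight in targets.values())
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return {set_id: weight / total for set_id, (_, weight) in targets.items()}


# Placeholder targets: 95% to the primary, 5% trickled to the standby.
targets = {
    "primary": ("192.0.2.10", 95),
    "standby": ("198.51.100.20", 5),
}

records = weighted_records("app.example.com.", targets)
shares = traffic_share(targets)
for set_id, share in shares.items():
    print(f"{set_id}: {share:.0%}")
```

The shares are statistical rather than exact, since DNS answers are cached by resolvers for the duration of the TTL.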

It is best practice in this case to implement continuous testing of the replacement system by trickling a statistical subset of production traffic to it. In the event of a failure, you can immediately fail over critical loads by adjusting DNS records accordingly and expiring the time-to-live (TTL) on any cached items. Then, you auto-scale the system to handle production-level traffic. The recovery time objective for critical loads is as long as it takes to trigger the disaster recovery event and fail over. This is usually the fastest recovery time objective available. For all other workloads, the recovery time objective will be as long as it takes to scale the resources properly and for the cutover to take place.
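The time to detect the failure and fail over can be approximated with simple arithmetic. The Python sketch below assumes that detection takes one health check interval per consecutive failed check, and that clients keep using a cached DNS answer for at most one TTL. The function is hypothetical and the figures are examples, not measurements. It ignores scaling time and any client that caches DNS answers beyond the TTL.

```python
def estimated_dns_failover_seconds(check_interval, failure_threshold, record_ttl):
    """Rough estimate of the time until clients resolve to the standby."""
    if min(check_interval, failure_threshold, record_ttl) < 0:
        raise ValueError("all inputs must be non-negative")
    # Time for enough consecutive health checks to fail.
    detection = check_interval * failure_threshold
    # Cached answers can keep pointing at the failed system for up to one TTL.
    return detection + record_ttl


# Example inputs: 30-second and 10-second check intervals, three failed
# checks before the endpoint counts as unhealthy, and a 60-second TTL.
standard = estimated_dns_failover_seconds(30, 3, 60)
fast = estimated_dns_failover_seconds(10, 3, 60)
print(standard)  # 150
print(fast)      # 90
```

Under these assumptions, shortening the health check interval or the record TTL are the two levers for reducing the failover portion of the recovery time objective.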

The recovery point objective depends on the replication type used for your database instances.

Closing our disaster recovery schemes is the active-active implementation. This last approach to disaster recovery is to have two fully functional production systems in separate geographic locations, or regions if possible. The advantage is that this approach has the least possible downtime. However, it is the most expensive, because you're basically running two copies of your production environment.

The setup is similar to the low-capacity standby, but all systems are fully scaled and ready for production traffic. In the event of a failure, workloads immediately fail over to the replacement system as soon as the disaster recovery event is triggered. The recovery time objective is as long as it takes to detect the disaster recovery event and fail over the traffic. The recovery point objective depends on the type of database replication that you implement. This concludes our discussion of disaster recovery scenarios.


About the Author

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.