Disaster Recovery Scenarios in AWS

In this course, we discuss planning for data recovery, including disaster recovery of SAP workloads in AWS. We present and discuss some of the design principles and best practices gathered by AWS customers, AWS experts, and SAP specialists running SAP workloads on AWS.

Learning Objectives

We introduce best practices for business continuity and disaster recovery related to SAP workloads on AWS. The recommendations are aligned with the Reliability pillar of the Well-Architected Framework and focus on planning for data protection and recovery of SAP solutions implemented using AWS services.

Intended Audience

This course is intended for SAP architects and SAP operators who deploy and maintain SAP workloads on AWS. This course also aligns with the objectives of the AWS Certified: SAP on AWS Specialty (PAS-C01) exam.


Prerequisites

To get the most from this course, you will need to meet the requirements for the AWS Solutions Architect Associate or AWS SysOps Associate certifications, or have equivalent experience. This includes understanding the function, anatomy, and operation of core AWS services that are relevant to SAP implementations, such as:

  • The AWS global infrastructure, Amazon VPCs, Amazon EC2, EBS, EFS, S3, Glacier, IAM, CloudWatch, CloudTrail, the AWS CLI, Amazon Route 53
  • The Well-Architected Framework

It is also assumed that you are familiar with SAP software workloads and their implementation. SAP is well known for enterprise resource planning (ERP) applications, including SAP Business Suite, SAP NetWeaver, SAP S/4HANA solutions, and supporting products.


Disaster Recovery Scenarios in AWS. There are four predefined common practices for disaster recovery using AWS. They are ordered from the highest recovery time objective (RTO) to the lowest. The recovery time objective measures the total time that it takes for your system to recover in the event of a failure. The four practices are called backup and restore, pilot light, low-capacity standby, and active-active. Let's discuss each of them.

The first is backup and restore. In this situation, all backups are stored in Amazon S3, directly or indirectly. In the event of a failure, you will need to obtain your backups from S3, use CloudFormation to automate deployment of the infrastructure, restore your systems from the backed-up AMIs, snapshots, and other backup resources, and finally, switch to the new system by adjusting the DNS records accordingly. The recovery time objective for this approach is as long as it takes to bring up the replacement infrastructure and restore the systems from backups, and it is therefore the longest of the four. The recovery point objective (RPO) is the time since the last data backup.
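To make the recovery point objective for backup and restore concrete, here is a minimal sketch of how the data-loss window could be measured. The function name, the six-hour schedule, and the failure time are all hypothetical values chosen for illustration; they are not from the course.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Return the time between the latest backup taken at or before
    the failure and the failure itself."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        raise ValueError("no backup was taken before the failure")
    return failure_time - max(usable)


# Hypothetical schedule: one backup every 6 hours on a single day.
backups = [datetime(2024, 1, 1, hour) for hour in (0, 6, 12, 18)]
# Hypothetical failure at 16:30, between the 12:00 and 18:00 backups.
failure = datetime(2024, 1, 1, 16, 30)

lost = data_loss_window(backups, failure)
print(lost)  # 4:30:00
```

With a fixed backup interval, the worst case approaches the full interval (just under six hours in this example), which is why the backup frequency effectively sets the RPO for this pattern.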

Next is the pilot light disaster recovery. In this case, you pre-provision database replication and EC2 instances, but not at full capacity, and not all resources are actually running. Route 53 is responsible for triggering the disaster recovery event, which automatically scales the capacity and then switches traffic from one system to the other. In the event of a failure, you automatically start all pre-provisioned instances, scale your systems to production capacity, and then switch to the new system by adjusting the DNS records accordingly.

This approach is cost-effective because it entails provisioning fewer resources around the clock. The setup requires that you replicate your database to the replacement region or availability zone. It also requires that you prepare all resources for automatic startup and scale-up. Finally, you need to ensure capacity by using Reserved Instances in order to guarantee that any EC2 instances are deployable when needed. The recovery time objective is as long as it takes to trigger the disaster recovery event, automatically scale up the replacement system, and finally move the traffic.
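One way the DNS switch for a pilot light setup might be expressed is with Route 53 failover routing, where a health check on the primary record decides when the secondary record is served. The sketch below only builds the change batch as a plain dictionary; the helper name, the domain, the documentation-range IP addresses, and the health check ID are placeholders, not values from the course. In practice a structure like this would be passed to the Route 53 `ChangeResourceRecordSets` API (for example through the boto3 `change_resource_record_sets` call), and starting and scaling the pre-provisioned instances would need separate automation.

```python
import json


def failover_change_batch(name, primary_ip, standby_ip, health_check_id, ttl=60):
    """Build a Route 53 change batch with a PRIMARY and a SECONDARY
    failover A record for the same DNS name."""

    def record(role, ip, set_id, extra=None):
        rrset = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,   # distinguishes records sharing a name
            "Failover": role,          # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                # a short TTL keeps cached answers brief
            "ResourceRecords": [{"Value": ip}],
        }
        rrset.update(extra or {})
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Pilot light failover records (illustrative)",
        "Changes": [
            # The health check on the primary drives the failover decision.
            record("PRIMARY", primary_ip, "primary",
                   {"HealthCheckId": health_check_id}),
            record("SECONDARY", standby_ip, "pilot-light"),
        ],
    }


# All values below are placeholders for illustration.
batch = failover_change_batch(
    name="sap.example.com.",
    primary_ip="192.0.2.10",
    standby_ip="198.51.100.10",
    health_check_id="hypothetical-health-check-id",
)
print(json.dumps(batch, indent=2))
```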

The recovery point objective depends on the replication type used for the database instance.

The third of the four patterns for disaster recovery is the fully working low-capacity standby. If you take the pilot light implementation to the next level, you will end up with a low-capacity standby system. You basically have two separate running production environments: the primary environment and the low-capacity environment.

You can use Route 53 to dispatch a small percentage of production traffic as a way to implement a continuous testing mechanism for the replacement production system. The advantage is that the replacement system can actually handle some production traffic at any time. There is also a bit of cost saving in that you're not running a full footprint on the replacement system, but a smaller-footprint system, until capacity needs to be scaled up when a disaster recovery event is triggered. The setup requires the same steps as pilot light, with the exception that all components are now running but are not scaled to production levels.
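Sending a small percentage of traffic to the standby is commonly done with Route 53 weighted routing, where each record receives a share of queries equal to its weight divided by the sum of all weights for that name. The sketch below works through that arithmetic; the function name and the 95/5 split are assumptions for illustration, and the all-zero-weights case is deliberately not modeled.

```python
def traffic_shares(weights):
    """Return each record's expected share of DNS queries under
    weighted routing: weight / sum of all weights."""
    for name, weight in weights.items():
        # Route 53 accepts integer weights from 0 to 255.
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {name!r} must be between 0 and 255")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("this sketch does not model all-zero weights")
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical split: 5% of queries go to the low-capacity standby.
shares = traffic_shares({"primary": 95, "standby": 5})
print(shares["standby"])  # 0.05
```

Because the split is applied to DNS answers rather than to individual requests, the share actually observed at the standby can differ from this figure depending on resolver caching.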

It is a best practice in this case to implement continuous testing of the replacement system by trickling a statistical subset of production traffic to it. In the event of a failure, you can immediately fail over critical loads by adjusting DNS records accordingly and expiring any time-to-live (TTL) values on cached items. Then, you auto-scale the system to handle production-level traffic. The recovery time objective for critical loads is as long as it takes to trigger the disaster recovery event and fail over. This is usually the fastest recovery time objective available. For all other workloads, the recovery time objective is as long as it takes to scale the resources properly and for the cutover to take place.

The recovery point objective depends on the replication type used for your database instances.

Closing our disaster recovery schemes is the active-active implementation. This last approach to disaster recovery is to have two fully functional production systems in separate geographic locations or regions, if possible. The advantage is that this approach has the least possible downtime. However, it does have a cost associated with it because you're basically running two copies of your production environment.

The setup is similar to the low-capacity standby, but all systems are fully scaled and ready for production traffic. In the event of a failure, workloads immediately fail over to the replacement system as soon as the disaster recovery event is triggered. The recovery time objective is as long as it takes to detect the disaster recovery event and fail over the traffic. The recovery point objective depends on the type of database replication that you implement. This concludes our discussion of disaster recovery scenarios.
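As a rough way to compare the four patterns, the recovery time of each can be modeled as the sum of the steps it still has to perform after a failure. The sketch below does that with made-up step durations; every number, and the mapping of steps to patterns, is an assumption for illustration only and should be replaced with measurements from your own recovery tests. The low-capacity standby entry models full-capacity recovery; its critical loads fail over without waiting for the scale step.

```python
# Hypothetical step durations in minutes. These are placeholders, not benchmarks.
STEP_MINUTES = {
    "detect": 5,      # detect the failure and trigger the DR event
    "provision": 45,  # deploy replacement infrastructure
    "restore": 120,   # restore systems from backups
    "start": 10,      # start pre-provisioned instances
    "scale": 20,      # scale up to production capacity
    "dns": 2,         # switch traffic by adjusting DNS records
}

# Steps each pattern must still perform after a failure (assumed mapping).
PATTERN_STEPS = {
    "backup and restore": ["detect", "provision", "restore", "dns"],
    "pilot light": ["detect", "start", "scale", "dns"],
    "low-capacity standby": ["detect", "scale", "dns"],
    "active-active": ["detect", "dns"],
}


def estimated_rto(pattern):
    """Sum the assumed durations of the steps a pattern requires."""
    return sum(STEP_MINUTES[step] for step in PATTERN_STEPS[pattern])


for pattern in PATTERN_STEPS:
    print(f"{pattern}: about {estimated_rto(pattern)} minutes")
```

Under these assumptions the estimates fall in the same order the lecture presents the patterns, from the longest recovery time to the shortest.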


About the Author
Jorge Negrón
AWS Content Architect

Experienced in the architecture and delivery of cloud-based solutions, the development and delivery of technical training, defining requirements and use cases, and validating architectures for results. Excellent leadership, communication, and presentation skills with attention to detail. Hands-on administration and development experience, with the ability to mentor and train on current and emerging technologies (Cloud, ML, IoT, Microservices, Big Data & Analytics).