This course looks at how you can use Recovery Time Objective (RTO) and Recovery Point Objective (RPO) to determine an AWS disaster recovery strategy. RTO and RPO both fall under the Reliability pillar of the AWS Well-Architected Framework.
- The difference between RTO and RPO
- How to classify your RTO and RPO
- An understanding of 4 recovery strategies:
  - Backup & Restore
  - Pilot Light
  - Warm Standby
  - Multi-Site Active/Active
- Those in an AWS site reliability engineer (SRE) role
- Anyone who has responsibility for, and input into, maintaining an effective business continuity and/or disaster recovery strategy for their AWS environment
- You should have a basic understanding of the AWS global architecture and understand the concepts and importance of building high availability into your infrastructure.
If you want to reduce your RTO and RPO beyond what the ‘Backup & Restore’ or ‘Pilot Light’ strategies can offer when planning for a regional failover, then ‘Warm Standby’ is your next best choice.
This strategy effectively builds upon Pilot Light, and as a result it is more complex and more expensive. The change with Warm Standby is that you have a scaled-down version of your primary region up, running, and operational in your designated DR region, which reduces your RTO further than Pilot Light can. Having another region that is effectively always running, albeit scaled down to the minimum resources required to run your workloads, means you already have resources in place to start processing incoming requests immediately, which was not possible with ‘Pilot Light’.
Other than that point, and the increased cost of ‘Warm Standby’ due to the additional running resources, there aren’t really any other distinguishing features between the two strategies. The same AWS services used for ‘Backup & Restore’ and ‘Pilot Light’ can be used in exactly the same way to replicate data across regions in a multi-region DR strategy.
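As a rough way to picture how the four strategies differ, the sketch below records what each one keeps running in the DR region and picks the cheapest strategy that can serve requests the moment failover happens. The `Strategy` class, its field names, and the `cheapest_serving_immediately` helper are illustrative assumptions, not part of any AWS API; the ordering from least to most costly follows the course's description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    core_infra_running: bool  # e.g. ELB, VPC, database replica live in the DR region
    app_tiers: str            # "none", "scaled-down", or "full"


# Listed from least to most costly (illustrative model only).
STRATEGIES = [
    Strategy("Backup & Restore", core_infra_running=False, app_tiers="none"),
    Strategy("Pilot Light", core_infra_running=True, app_tiers="none"),
    Strategy("Warm Standby", core_infra_running=True, app_tiers="scaled-down"),
    Strategy("Multi-Site Active/Active", core_infra_running=True, app_tiers="full"),
]


def cheapest_serving_immediately(strategies):
    """Return the first (cheapest) strategy whose web/app tiers already run in DR."""
    for strategy in strategies:
        if strategy.app_tiers != "none":
            return strategy
    return None


choice = cheapest_serving_immediately(STRATEGIES)
print(choice.name)
```

In this model the only difference between Pilot Light and Warm Standby is the `app_tiers` field, which mirrors the point made above.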
So let’s run through another example to show the recovery process of an environment in which there is a regional failure.
Our primary region, on the left, hosts our web application infrastructure. Requests arrive via Route 53, which can direct traffic to the ELB in either our primary region or our DR region as required. The ELB in the primary region distributes traffic across load-balanced EC2 instances in 2 Availability Zones, managed by an Auto Scaling group. These instances then pass traffic to the application servers, again in their own Auto Scaling group, and finally down to the database layer running Aurora. The data is stored in an Aurora global database using a shared cluster volume, and is asynchronously replicated to an Aurora Replica in the DR region.
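To make the Route 53 piece of this architecture more concrete, here is a minimal sketch of a pair of failover alias records, one pointing at the primary ELB and one at the DR ELB. The domain name, ELB DNS names, and hosted zone IDs are hypothetical placeholders, and the helper functions are our own; only the dictionary shape follows the Route 53 `ChangeBatch` format.

```python
def failover_record(name, role, elb_dns_name, elb_hosted_zone_id):
    """Build one Route 53 alias record change for failover routing.

    role must be "PRIMARY" or "SECONDARY".
    """
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-elb",
            "Failover": role,
            "AliasTarget": {
                "HostedZoneId": elb_hosted_zone_id,
                "DNSName": elb_dns_name,
                # Let Route 53 use the health of the ELB when routing.
                "EvaluateTargetHealth": True,
            },
        },
    }


def failover_change_batch(name, primary_elb, dr_elb):
    """Pair a PRIMARY and a SECONDARY record for the same DNS name."""
    return {
        "Comment": "Warm Standby failover routing (illustrative)",
        "Changes": [
            failover_record(name, "PRIMARY", *primary_elb),
            failover_record(name, "SECONDARY", *dr_elb),
        ],
    }


# Hypothetical values, for illustration only.
batch = failover_change_batch(
    "app.example.com.",
    ("primary-elb.example.com.", "ZPRIMARYEXAMPLE"),
    ("dr-elb.example.com.", "ZDREXAMPLE"),
)
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
```

A dictionary of this shape could be passed as the `ChangeBatch` argument to the boto3 Route 53 client's `change_resource_record_sets` call, together with the ID of your own hosted zone.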
If we take a look at our DR region, we can see that our key resources are running, as would be the case for a ‘Pilot Light’ strategy, including the ELB, VPC, and Aurora Replica. However, we also have resources running at the tier 1 web server layer and at the tier 2 application layer. Notice, however, that this is a scaled-down version of the infrastructure in our primary region: effectively, we have a single-Availability-Zone version of our primary multi-Availability-Zone deployment running as active. We also have ‘Write forwarding’ enabled on the secondary Aurora Replica, which allows it to forward SQL statements that perform write operations to the primary cluster. When these are received, the primary cluster updates the source data store and then sends the required updates to all secondary replicas in the global cluster, in all applicable regions.
So what’s next? Well, next we have our failure: our primary region is taken down. What happens? From an RPO point of view, we’ve had continual asynchronous replication between our primary and DR regions, so from a data standpoint we are in good shape in the DR region. Route 53 health checks will report the primary ELB as non-responsive, so the failover to the DR ELB can be initiated either automatically or manually. We already have our 3-tier infrastructure in place, thanks to ‘Warm Standby’. Our web servers can immediately start to process requests, our application layer can respond to them, and our Aurora DB will be promoted and start serving the application layer.
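The health-check-driven failover described here can be sketched as a small simulation. This is a simplified model rather than Route 53's actual implementation: the function names are hypothetical, the failure threshold and check interval are assumed example values, and once traffic moves to the DR region the model keeps it there (failback is treated as a separate, manual step).

```python
def active_region(health_results, failure_threshold=3):
    """Return which region should serve traffic after a series of health checks.

    health_results is an iterable of booleans (True = primary ELB responded).
    Traffic fails over once `failure_threshold` consecutive checks have failed.
    """
    consecutive_failures = 0
    for healthy in health_results:
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return "dr"
    return "primary"


def detection_seconds(check_interval_seconds, failure_threshold=3):
    """Rough lower bound on how long it takes to notice the outage."""
    return check_interval_seconds * failure_threshold


# A brief blip does not trigger failover; a sustained outage does.
print(active_region([True, False, False, True]))
print(active_region([True, False, False, False]))
print(detection_seconds(30))
```

The detection time matters because it counts towards your RTO: even with a Warm Standby environment ready to serve, no traffic moves until the outage has been noticed.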
So as you can see, there are no tier 1 or tier 2 resources to provision and spin up before requests from the ELB can be processed; they are already there, so your RTO is reduced. However, you will need to rely on your Auto Scaling groups to scale out your infrastructure as your workload increases. While these additional resources are being provisioned, your customers may experience some connectivity issues, because there may not be enough resources to process their requests until the DR region is fully operational at the required capacity.
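To get a feel for the capacity gap during scale-out, the following back-of-the-envelope sketch estimates how many instances the DR region needs and what share of requests the scaled-down fleet can handle in the meantime. The request rates and per-instance throughput are made-up example figures, and the model assumes load is spread evenly across identical instances.

```python
import math


def instances_needed(requests_per_second, rps_per_instance):
    """Number of instances required to serve the full workload."""
    return math.ceil(requests_per_second / rps_per_instance)


def served_fraction(running_instances, requests_per_second, rps_per_instance):
    """Share of incoming requests the currently running instances can serve."""
    if requests_per_second <= 0:
        return 1.0
    capacity = running_instances * rps_per_instance
    return min(1.0, capacity / requests_per_second)


# Example figures (assumptions): 1000 requests/s of production traffic,
# 100 requests/s per instance, 2 instances running in the warm standby region.
needed = instances_needed(1000, 100)
fraction = served_fraction(2, 1000, 100)
print(needed, fraction)
```

With these example numbers, the Auto Scaling group would need to grow from 2 to 10 instances, and until it does, only a fraction of requests can be served. Sizing the warm standby fleet is therefore a trade-off between running cost and the experience of your customers immediately after failover.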
You can of course mitigate this issue by going one step beyond ‘Warm Standby’ and implementing the most costly of all the DR strategies, Multi-Site Active/Active.
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 150+ courses relating to cloud, reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.
Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016, Stuart was awarded ‘Expert of the Year Award 2015’ by Experts Exchange for sharing his knowledge of cloud services with the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.