This course looks at how you can use the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) to determine an AWS disaster recovery strategy. RTO and RPO both fall under the Reliability pillar of the AWS Well-Architected Framework.
Learning Objectives
- The difference between RTO and RPO
- How to classify your RTO and RPO
- An understanding of the 4 recovery strategies:
  - Backup & Restore
  - Pilot Light
  - Warm Standby
  - Multi-Site Active/Active
Intended Audience
- AWS site reliability engineers (SREs)
- Anyone who is responsible for, or has input into, maintaining an effective business continuity and/or disaster recovery strategy for an AWS environment
Prerequisites
- You should have a basic understanding of the AWS global architecture and understand the concepts and importance of building high availability into your infrastructure.
In this final lecture, I want to recap some of the key points taken from the previous lectures.
I began by looking at the definitions of both RTO and RPO, which were as follows:
- RTO (Recovery Time Objective): the maximum amount of time a service can remain unavailable before the outage is classed as damaging to the business.
- RPO (Recovery Point Objective): the maximum amount of time for which data could be lost for a service.
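To make the two definitions concrete, here is a minimal Python sketch that measures the downtime and the data-loss window of a single incident and compares them with the targets. The `Incident` class, the timestamps, and the 4-hour and 1-hour targets are hypothetical values chosen purely for illustration; they are not figures from the course.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical timeline of one outage (illustrative values only)."""
    last_recoverable_point: datetime  # newest backup or replicated write
    failure_time: datetime            # when the service went down
    restored_time: datetime           # when the service was usable again

    def downtime(self) -> timedelta:
        # Compared against the RTO: how long the service was unavailable.
        return self.restored_time - self.failure_time

    def data_loss_window(self) -> timedelta:
        # Compared against the RPO: how much recent data was lost.
        return self.failure_time - self.last_recoverable_point


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> bool:
    """True only if both downtime and data loss stay within the targets."""
    return incident.downtime() <= rto and incident.data_loss_window() <= rpo


incident = Incident(
    last_recoverable_point=datetime(2024, 1, 1, 9, 30),
    failure_time=datetime(2024, 1, 1, 10, 0),
    restored_time=datetime(2024, 1, 1, 13, 0),
)
print(incident.downtime())          # 3:00:00
print(incident.data_loss_window())  # 0:30:00
print(meets_objectives(incident, rto=timedelta(hours=4), rpo=timedelta(hours=1)))  # True
```

In this example the service was down for 3 hours and lost 30 minutes of data, so it would meet a 4-hour RTO and a 1-hour RPO, but it would miss a 2-hour RTO.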
I then highlighted the four recovery strategies: Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active/Active.
I then covered how you should approach defining your RTO and RPO metrics, and in that lecture I explained that:
- Defining your RTO and RPO is an essential part of your disaster recovery and business continuity planning
- RTO and RPO should be defined for each of your applications individually
- The lower the values are, the more complex the architecture will need to be to support them and, in turn, the more it will cost you as a business to implement
- Key questions you should ask to understand your RTO and RPO values include:
  - What impact would the loss of an application have on the business?
  - What are the repercussions of this loss?
  - What would the financial impact be?
  - Are there any SLAs that need to be upheld?
  - What dependencies are in place on the application?
  - Are you bound by any external regulatory requirements?
- AWS Resilience Hub is an AWS service that acts as a central location to help you manage, define, and validate the resilience of the applications you deploy on your AWS infrastructure
- There is no simple rule or metric for determining what your RPO and RTO should be; it all depends on factors internal to your business
Following that lecture, I began to dive deeper into the individual recovery strategies, starting with Backup & Restore:
- The 1st tier of the 4 recovery strategies
- This method provides the longest RTO and RPO values and requires the most effort to recover
- Backup & Restore generally assumes your RTO will be 24 hours or less, with an RPO measured in hours
- The cheapest of the 4 recovery strategies
- Point-in-time recovery gives you additional flexibility across some database and storage services
- AWS Backup can help you manage your backups by acting as a central hub to control them across your environment and across multiple regions
- Upon regional failure, recovery can be achieved by restoring resources from backups in a new region
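As a rough illustration of how an RPO measured in hours might translate into a backup schedule, the sketch below builds a backup plan document with a single rule that also copies each recovery point to a vault in another region. The field names follow the `BackupPlan` structure accepted by the AWS Backup `CreateBackupPlan` API as I understand it, so check them against the current API reference before relying on them. The function name, plan name, vault name, account ID, and ARN are all hypothetical, and the sketch assumes the RPO divides 24 evenly so that a simple cron expression is enough.

```python
import json


def build_backup_plan(plan_name: str, rpo_hours: int, source_vault: str,
                      dr_vault_arn: str, retention_days: int = 35) -> dict:
    """Build a backup plan document whose schedule is derived from the RPO.

    Assumption: rpo_hours is a divisor of 24 (1, 2, 3, 4, 6, 8, 12 or 24).
    """
    if rpo_hours < 1 or 24 % rpo_hours != 0:
        raise ValueError("rpo_hours must be a divisor of 24 in this sketch")
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": f"every-{rpo_hours}h",
                "TargetBackupVaultName": source_vault,
                # Starting a backup every rpo_hours hours keeps recovery
                # points roughly that far apart; time taken to complete
                # each backup is not accounted for here.
                "ScheduleExpression": f"cron(0 0/{rpo_hours} * * ? *)",
                "Lifecycle": {"DeleteAfterDays": retention_days},
                # Copy each recovery point to a vault in the DR region so
                # it is still available if the source region fails.
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": retention_days},
                    }
                ],
            }
        ],
    }


plan = build_backup_plan(
    plan_name="example-plan",        # hypothetical
    rpo_hours=6,
    source_vault="primary-vault",    # hypothetical
    dr_vault_arn="arn:aws:backup:eu-west-1:111122223333:backup-vault:dr-vault",  # hypothetical
)
print(json.dumps(plan, indent=2))
```

The block only builds and prints the document; nothing is sent to AWS. With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("backup").create_backup_plan(BackupPlan=plan)`.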
Next, we have Pilot Light:
- The 2nd tier in complexity and cost, following Backup & Restore
- The main difference between Backup & Restore and Pilot Light is the introduction of data replication between the source and disaster recovery regions to help you reduce your RPO
- Includes critical core infrastructure running in the DR region, which is considered ‘always on’
- Continuously replicating data from your databases to the DR region is a great way to achieve a very low RPO
- Any changes made to core infrastructure in one region need to be deployed in the DR region too. Using AWS CloudFormation can help with managing these deployments and changes.
- Upon regional failure, recovery can be achieved by launching application servers from pre-configured images in the DR region. Data stores will already be available due to continuous replication.
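With continuous replication, the data you would lose in a failure is roughly whatever has not yet reached the DR region, so replication lag is a useful stand-in for your achieved RPO. The sketch below, under that assumption, checks a handful of lag measurements against an RPO target. The function names, the sample values, and the 30-second target are hypothetical; in practice the samples would come from whatever lag metric your database service exposes.

```python
from datetime import timedelta
from typing import Sequence


def worst_observed_lag(lag_seconds: Sequence[float]) -> timedelta:
    """Largest replication lag seen in the samples.

    Data written within this window may not have reached the DR region
    yet, so it approximates the worst-case data loss.
    """
    if not lag_seconds:
        raise ValueError("at least one lag sample is required")
    return timedelta(seconds=max(lag_seconds))


def rpo_at_risk(lag_seconds: Sequence[float], rpo: timedelta) -> bool:
    """True if the worst observed lag exceeds the RPO target."""
    return worst_observed_lag(lag_seconds) > rpo


samples = [0.8, 1.2, 4.5, 0.9]  # hypothetical lag measurements, in seconds
print(rpo_at_risk(samples, timedelta(seconds=30)))  # False
print(rpo_at_risk(samples, timedelta(seconds=2)))   # True
```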
I then moved on to Warm Standby:
- The 3rd tier in complexity and cost, following Pilot Light
- Similar to Pilot Light; however, Warm Standby has a scaled-down version of your primary region up, running, and operational in your designated DR region
- Reduces RTO when compared to Pilot Light
- Able to process incoming requests immediately after a failure using the scaled-down resources in the designated DR region
- Use Auto Scaling to scale out the required resources to meet the needs of the workload
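The scale-out step comes down to simple arithmetic: the DR region already runs a fraction of production capacity, and after a failure it has to grow to the full size. Here is a minimal sketch of that calculation. The function names, the 12-instance fleet, and the 25% standby fraction are hypothetical examples, not recommendations.

```python
import math


def standby_capacity(production_instances: int, standby_fraction: float) -> int:
    """Instances kept running in the DR region before any failure (at least one)."""
    if production_instances < 1 or not 0 < standby_fraction <= 1:
        raise ValueError("need at least one instance and a fraction in (0, 1]")
    return max(1, math.ceil(production_instances * standby_fraction))


def scale_out_needed(production_instances: int, standby_fraction: float) -> int:
    """Extra instances to add in the DR region to reach full production size."""
    return production_instances - standby_capacity(production_instances, standby_fraction)


print(standby_capacity(12, 0.25))  # 3
print(scale_out_needed(12, 0.25))  # 9
```

During a failover, the full production figure is the value you would set as the desired capacity of the Auto Scaling group in the DR region, for example through the boto3 `autoscaling` client's `set_desired_capacity` call.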
And then finally, Multi-Site Active/Active:
- The 4th tier in complexity and cost, following Warm Standby
- Offers you the lowest RTO and RPO when it comes to defining your DR strategy
- With Multi-Site Active/Active you are effectively deploying your infrastructure across multiple regions at full scale
- There is no designated DR region
- Your customers can access your applications and services from any region they require
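Because the four tiers rise in cost and complexity as RTO and RPO fall, one way to think about the choice is to pick the lowest tier whose typical recovery figures still meet your targets. The sketch below encodes that idea. Only the 24-hour RTO for Backup & Restore comes from this course; every other threshold is a placeholder assumption that you would replace with figures measured for your own workload, and the `Strategy` class and `cheapest_strategy` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    tier: int               # 1 = cheapest and simplest, 4 = most expensive
    typical_rto: timedelta  # ASSUMED placeholder unless noted below
    typical_rpo: timedelta  # ASSUMED placeholder


STRATEGIES = [
    # The 24-hour RTO is from the course; all other values are placeholders.
    Strategy("Backup & Restore", 1, timedelta(hours=24), timedelta(hours=12)),
    Strategy("Pilot Light", 2, timedelta(hours=1), timedelta(minutes=15)),
    Strategy("Warm Standby", 3, timedelta(minutes=15), timedelta(minutes=5)),
    Strategy("Multi-Site Active/Active", 4, timedelta(minutes=1), timedelta(seconds=5)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[Strategy]:
    """Return the lowest tier whose typical RTO and RPO meet the targets."""
    for strategy in sorted(STRATEGIES, key=lambda s: s.tier):
        if strategy.typical_rto <= rto and strategy.typical_rpo <= rpo:
            return strategy
    return None  # no tier in this table is fast enough


choice = cheapest_strategy(rto=timedelta(hours=2), rpo=timedelta(minutes=30))
print(choice.name if choice else "no match")  # Pilot Light
```

A table like this is only a starting point. As noted earlier, the right values depend on factors internal to your business, and each application should be assessed individually.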
That brings me to the end of this lecture and to the end of this course. You should now have a greater understanding of how to manage RTO and RPO for AWS disaster recovery, and of some of the strategies that you could use.
Feedback on our courses here at Cloud Academy is valuable to both us as trainers and any students looking to take the same course in the future. If you have any feedback, positive or negative, it would be greatly appreciated if you could contact support@cloudacademy.com.
Thank you for your time, and good luck with your continued learning of cloud computing.
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 150+ cloud-related courses, reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.
Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.