Designing for Failure
Managing RTO and RPO for AWS Disaster Recovery
Designing for high availability, fault tolerance and cost efficiency
High Availability in RDS
High Availability in Amazon Aurora
High Availability in DynamoDB
This section of the Solution Architect Associate learning path introduces you to the High Availability concepts and services relevant to the SAA-C03 exam. By the end of this section, you will be familiar with the design options available and know how to select and apply AWS services to meet specific availability scenarios relevant to the Solution Architect Associate exam.
- Learn the fundamentals of high availability, fault tolerance, and backup and disaster recovery
- Understand how a variety of Amazon services such as S3, Snowball, and Storage Gateway can be used for backup purposes
- Learn how to implement high availability practices in Amazon RDS, Amazon Aurora, and DynamoDB
In this lecture, I want to provide some guidance on how to review your cloud architecture and applications to determine and classify the correct recovery time objective (RTO) and recovery point objective (RPO). This can be a difficult task if you are unfamiliar with defining disaster recovery metrics, but it is nevertheless an essential part of your disaster recovery and business continuity planning, allowing your organization to survive in the face of adversity.
An important aspect of defining these metrics is that you need to assess each of your applications individually, as not all of them will require the same RTO and RPO. Some of your applications will no doubt require much lower values than others, and this is important to determine: the lower the number, the more complex the architecture will need to be to support it, and in turn the more it will cost you as a business to implement.
There are a number of questions that you could ask the business and application owners to help you understand the required RTO and RPO for the individual applications and networks that you need operational should an incident occur. Let’s take a look at a few of them.
From an RTO perspective, the AWS global infrastructure is always there, ready for use, providing the core underlying network architecture (networking, routing, DNS, and so on) that you would normally have to manage and configure yourself in a typical on-premises data centre. Despite this, you still need to ask the business:
If a specific application was lost in a disaster, what impact would it have on the business, and if lost for an extended period of time what are the repercussions?
What is the cost of the loss, from both a financial and sometimes more importantly a reputation point of view? If financial, what is the total cost per hour of impact?
Is there a service level agreement with the customers who use the application or service that needs to be maintained?
Do any other services or applications in your business depend on the application? If so, what are the implications for them should the application fail?
From a governance perspective, is the application bound by any external regulatory requirements that may affect how quickly the application needs to be back up and running?
The answers to these questions will help you establish whether the associated RTO should be measured in seconds, minutes, hours, or days, or indeed anything in between or beyond that scope.
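As an illustrative sketch only, the answers to those questions could be folded into a rough first-pass RTO classification. The tier names, cost thresholds, and function below are hypothetical assumptions for demonstration, not AWS or industry standards:

```python
# Hypothetical first-pass RTO classification. The thresholds and tier
# names are illustrative only; real values come from your own business
# impact analysis.

def classify_rto_tier(cost_per_hour: float, has_customer_sla: bool,
                      is_regulated: bool) -> str:
    """Map business-impact answers to a rough RTO tier."""
    if is_regulated or cost_per_hour >= 100_000:
        return "seconds"   # likely needs an active-active, multi-Region design
    if has_customer_sla or cost_per_hour >= 10_000:
        return "minutes"   # e.g. a warm standby approach
    if cost_per_hour >= 1_000:
        return "hours"     # e.g. pilot light
    return "days"          # backup and restore may be sufficient

print(classify_rto_tier(250_000, True, False))  # -> seconds
print(classify_rto_tier(500, False, False))     # -> days
```

Even a crude model like this forces the business to put numbers against the questions above, which is the real point of the exercise.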
The answers to these questions can also be applied to measuring the RPO, as they will help determine how your backup strategy is used to recover data within a specified time frame. However, when looking at RPO you should also identify whether the data you need to recover can easily be recreated, and whether recreating it could be quicker than restoring it. Another important factor when working with data, backups, and RPO is understanding the rate of change of the data you need to recover. If the data changes once a month during a batch job, then monthly backups will suffice; if the data changes hourly, then multiple backups a day might be required. Again, this is largely dependent on some of the other questions we have already run through.
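The relationship between RPO and rate of change can be sketched as a small helper. This is a hypothetical function, assuming each backup is taken immediately after the data changes; the name and logic are illustrative, not a prescribed formula:

```python
# Illustrative sketch: derive the longest acceptable backup interval from
# a target RPO and the data's rate of change. Assumes backups run just
# after each change, so data changing less often than the RPO window only
# needs one backup per change cycle (e.g. the monthly batch-job case).
from datetime import timedelta

def backup_interval(rpo: timedelta, change_interval: timedelta) -> timedelta:
    """Longest backup interval that still meets the RPO."""
    return max(rpo, change_interval)

# Monthly batch job with a 1-hour RPO: monthly backups suffice.
print(backup_interval(timedelta(hours=1), timedelta(days=30)))
# Hourly changes with a 15-minute RPO: back up after every hourly change.
print(backup_interval(timedelta(minutes=15), timedelta(hours=1)))
```

In practice the trigger (scheduled vs. change-driven backups) matters as much as the interval, but the sketch captures the point made above: the required backup frequency follows from both the RPO and how often the data actually changes.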
You should also be aware of AWS Resilience Hub, which acts as a central location to help you define, manage, and validate the resilience of the applications you deploy on your AWS infrastructure. More information on this service can be found here:
So, in a nutshell, there is no simple metric or rule to determine what your RPO and RTO should be for a web server, a mobile application, a database, or a monitoring and logging solution. It all depends on factors internal to your business, but it does need to be evaluated before your DR strategy can be designed and implemented.
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 150+ cloud-related courses reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.
Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.