This course covers the core learning objectives to meet the requirements of the 'Designing for disaster recovery & high availability in AWS - Level 2' skill:
- Analyze the amount of resources required to implement a fault-tolerant architecture across multiple AWS Availability Zones
- Evaluate an effective AWS disaster recovery strategy to meet specific business requirements
- Understand the SLAs for AWS services to ensure the high availability of a given AWS solution
- Analyze which AWS services can be leveraged to implement a decoupled solution
The Pilot Light method of recovery is the second tier in complexity, following Backup and Restore, when it comes to your recovery strategy. The main difference between Backup and Restore and Pilot Light is the introduction of data replication between your primary and disaster recovery (DR) regions to help you reduce your RPO, in addition to having critical core infrastructure running in the DR region.
The benefit of this is that data in your DR region is ready and available a lot faster, thanks to the replication. The core infrastructure is also already in place and operational, allowing you to scale out your required resources faster, especially when using AWS CloudFormation to deploy your environment. Continuously replicating data from your databases to the DR region is a great way to achieve a very low RPO, especially when it is combined with the database point-in-time recovery functionality that is available.
As I just mentioned, having the ability to implement continuous asynchronous data replication between your primary region and your disaster recovery region greatly reduces the RPO of your data stores. AWS offers a number of different services to help you manage this element of your DR strategy. Many of these focus on data stores, such as:
- Amazon S3 using cross-region replication: Automatically replicates objects between buckets in your source and DR regions as they are written. You can also use S3 Batch Replication to replicate existing objects in your source bucket to your DR region bucket.
- Amazon RDS cross-region Read Replicas: Allow your source RDS database to have a replica of the database in a separate region. This read replica is initially created from an automated DB snapshot of the source database, allowing you to quickly recover in your DR region.
- Amazon Aurora Global Database: Provides low-latency reads from a database that spans more than one AWS region and, as a result, allows you to recover from an entire regional outage.
- Amazon DynamoDB global tables: Also provide multi-regional deployments, in this case for your DynamoDB tables, with the added benefit that you do not have to manage your own replication. Instead, replication is provided by default as a managed service.
- Amazon DocumentDB global clusters: A global cluster consists of a single primary region data store, which replicates data to up to five secondary clusters in different regions. This is ideal for building a multi-regional DR solution when working with DocumentDB.
- Global Datastore for Amazon ElastiCache for Redis: Gives you the ability to work with cross-regional replica clusters for low-latency reads and for recovery in the event of a regional disaster.
The replication latency of each of these services is minimal. As a result, your data will be replicated to your DR region almost as soon as it is written to your primary data source in your source region. This significantly reduces your RPO in the event of a regional disaster.
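To make the effect on RPO concrete, here is a minimal Python sketch that compares the data-loss window under continuous replication with the window under periodic backups. The timestamps, the 2-second replication lag, and the function name `data_loss_window` are illustrative assumptions, not figures published for any AWS service.

```python
from datetime import datetime, timedelta


def data_loss_window(failure_time: datetime, last_copy_time: datetime) -> timedelta:
    """Return the span of writes that never reached the DR region.

    last_copy_time is the moment of the most recent write (or backup)
    known to be safely stored in the DR region.
    """
    # A copy cannot be newer than the failure, so clamp at zero.
    return max(failure_time - last_copy_time, timedelta(0))


# Hypothetical scenario: the primary region fails at midday.
failure = datetime(2024, 1, 1, 12, 0, 0)

# Backup and Restore: assume the last backup was taken at midnight.
backup_loss = data_loss_window(failure, datetime(2024, 1, 1, 0, 0, 0))

# Pilot Light: assume replication was running 2 seconds behind.
replica_loss = data_loss_window(failure, failure - timedelta(seconds=2))

print(f"Backup and Restore loses up to {backup_loss}")
print(f"Pilot Light loses up to {replica_loss}")
```

Under these assumed numbers the replicated copy is only seconds behind the primary, while the backup is hours behind, which is the gap that replication is meant to close.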
The failover process of each of these services differs slightly, so you must be aware of what the process is and how long it takes for your secondary failover clusters, or global databases, to be promoted to the new primary. This will, of course, affect your RTO.
So, it’s great to have your replicated data in your disaster recovery region when a failure occurs, but you still need a significant amount of other resources to return your application or solution to a fully operational state that can manage the expected workload.
When using a Pilot Light approach across multiple regions, you need to ensure your solutions are deployed with this in mind. Any changes made to core infrastructure in one region need to be deployed in the DR region as well. Using AWS CloudFormation can help with the management of these deployments and the changes required.
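One way to picture the need to keep both regions in step is a simple drift check. The sketch below is plain Python and does not call any AWS API. The setting names (`AppVersion`, `InstanceType`, `DesiredCapacity`) and the function `find_drift` are hypothetical, standing in for whatever parameters your own templates expose.

```python
from typing import Any, Dict, Iterable, Tuple


def find_drift(primary: Dict[str, Any], dr: Dict[str, Any],
               ignore: Iterable[str] = ()) -> Dict[str, Tuple[Any, Any]]:
    """Return settings whose values differ between the two regions.

    Each entry maps a setting name to (primary value, DR value).
    A setting missing from one side is reported with None on that side.
    """
    ignored = set(ignore)
    drift = {}
    for key in sorted(set(primary) | set(dr)):
        if key in ignored:
            continue
        if primary.get(key) != dr.get(key):
            drift[key] = (primary.get(key), dr.get(key))
    return drift


# Hypothetical settings for the same stack in each region.
primary_settings = {"AppVersion": "2.4.1", "InstanceType": "m5.large", "DesiredCapacity": 6}
dr_settings = {"AppVersion": "2.4.0", "InstanceType": "m5.large", "DesiredCapacity": 0}

# Capacity is expected to differ in Pilot Light, so it is excluded.
drift = find_drift(primary_settings, dr_settings, ignore=["DesiredCapacity"])
print(drift)
```

Capacity is ignored on purpose: in a Pilot Light setup the DR region is meant to run with fewer resources, so only differences such as an out-of-date application version should count as drift.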
To help us better understand how Pilot Light recovery would work, here is an example of a three-tier web application deployment, with a front-end server tier as tier 1, application servers as tier 2, and finally a database tier as tier 3.
So let’s assume the primary region (on the left) is serving customers through a web application which is accessed via Route 53 for DNS. Route 53 is associated with two external load balancers, one for each region. You will notice that the ELB for our primary region is ‘active’, and the connection to the ELB in the DR region is ‘inactive’ for production traffic.
The ELB in the primary region then distributes traffic across load-balanced EC2 instances in two Availability Zones using an Auto Scaling group. These instances then pass traffic to the application servers, again in their own Auto Scaling group, and finally down to the database tier, which uses Aurora. The data is stored in an Aurora global database using a shared cluster volume, and it is asynchronously replicated to an Aurora Replica in the DR region.
So, the minimal infrastructure that we have up and running in the DR region is the VPC, the ELB, the Auto Scaling groups, and the Aurora Replica. This is considered ‘always on’ and can easily be managed and deployed using Infrastructure as Code tools, such as CloudFormation, to ensure it matches the primary configuration.
In the event of a disaster that causes the primary region to fail, the following actions should be taken to bring your environment back to a recovered state:
- Based on the Route 53 health check failing for the primary ELB, the secondary ELB can be activated to process incoming requests. This can be an automated process or a manual changeover.
- Your application servers across tier 1 and tier 2 can be provisioned using pre-configured AMIs loaded with all of the application code required to run your environment, allowing you to receive and process requests.
- Your Aurora Replica is promoted to become the primary data store.
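The recovery actions above can be sketched as a small runbook in Python. This is only an outline under stated assumptions: the three action functions are hypothetical placeholders that always succeed, no AWS API is called, and the order simply follows the list above rather than a sequence prescribed by AWS.

```python
from typing import Callable, List, Optional, Tuple

# A step is a display name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run the steps in order and stop at the first one that fails.

    Returns the names of the completed steps and the name of the
    failed step (None when every step succeeded).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical placeholders; real versions would call your own tooling.
def redirect_traffic_to_dr_elb() -> bool:
    return True


def provision_app_servers_from_amis() -> bool:
    return True


def promote_aurora_replica() -> bool:
    return True


steps: List[Step] = [
    ("redirect traffic to DR ELB", redirect_traffic_to_dr_elb),
    ("provision tier 1 and tier 2 servers", provision_app_servers_from_amis),
    ("promote Aurora Replica", promote_aurora_replica),
]

completed, failed = run_failover(steps)
print("completed:", completed)
print("failed step:", failed)
```

Stopping at the first failed step keeps the outcome easy to reason about during an incident: the operator sees exactly which actions finished and where to resume.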
So with a Pilot Light recovery strategy, replication is used to reduce your RPO, and core infrastructure is kept operational in the DR region while other resources, such as your application servers, are ‘switched off’ and ready to be provisioned as and when required. The time it takes to provision some of these resources, such as your application servers, which can then be scaled using Auto Scaling groups, can lengthen your RTO. If you require a shorter RTO, then your next option would be to use the Warm Standby recovery strategy.
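As a rough illustration of how provisioning time feeds into RTO, the sketch below adds up the recovery phases and compares the total with a target. Every duration, the 15-minute target, and the phase names are assumptions chosen for the example. Summing the phases also assumes they run one after another, which gives a conservative upper bound if some of them can overlap.

```python
from datetime import timedelta
from typing import Dict


def estimate_rto(phases: Dict[str, timedelta]) -> timedelta:
    """Total recovery time if every phase runs sequentially."""
    return sum(phases.values(), timedelta(0))


def slowest_phase(phases: Dict[str, timedelta]) -> str:
    """Name of the phase that contributes the most time."""
    return max(phases, key=lambda name: phases[name])


# Hypothetical phase durations for a Pilot Light recovery.
pilot_light = {
    "detect_failure": timedelta(minutes=2),
    "switch_dns": timedelta(minutes=1),
    "promote_database": timedelta(minutes=5),
    "provision_app_servers": timedelta(minutes=15),
}

rto_target = timedelta(minutes=15)  # assumed business requirement
estimated = estimate_rto(pilot_light)
meets_target = estimated <= rto_target

print(f"Estimated RTO: {estimated}, target: {rto_target}")
print(f"Meets target: {meets_target}")
print(f"Largest contributor: {slowest_phase(pilot_light)}")
```

With these assumed figures, server provisioning dominates the estimate. Keeping those servers running at reduced capacity in the DR region removes most of that phase.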
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.
Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016, Stuart was awarded the ‘Expert of the Year Award 2015’ from Experts Exchange for sharing his knowledge of cloud services with the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.