“You can’t predict a disaster, but you can be prepared for one!” Disaster recovery is one of the biggest challenges for infrastructure. Amazon Web Services allows us to tackle this challenge easily and ensure business continuity. In this post, we’ll take a look at what disaster recovery means, compare traditional disaster recovery with disaster recovery in the cloud, and explore essential AWS services for your disaster recovery plan.
What is Disaster Recovery?
There are several disaster scenarios that can impact your infrastructure. These include natural disasters such as an earthquake or fire, as well as those caused by people, such as human error, unauthorized access to data, or malicious attacks.
“Any event that has a negative impact on a company’s business continuity or finances could be termed a disaster.”
In any case, it is crucial to have a tested disaster recovery plan ready. A disaster recovery plan will ensure that our application stays online no matter the circumstances. Ideally, it ensures that users will experience zero, or at worst, minimal issues while using your application.
If we’re talking about on-premise data centers, a disaster recovery plan is expensive to implement and maintain. Often, such plans are insufficiently tested or poorly documented. As such, they are inadequate for protecting resources. More often than not, even companies with a good disaster recovery plan aren’t capable of executing it because it was never tested in a real environment. As a result, users cannot access the application and the company suffers significant losses.
Let’s take a closer look at some of the important terminology associated with disaster recovery:
Business Continuity. All of our applications require Business Continuity. Business Continuity ensures that an organization’s critical business functions continue to operate or recover quickly despite serious incidents.
Disaster Recovery. Disaster Recovery (DR) enables recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.
RPO and RTO. Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the two most important parts of a good DR plan for our workflow. Recovery Point Objective (RPO) is the maximum targeted period in which data might be lost from an IT service due to a major incident. Recovery Time Objective (RTO) is the targeted time period within which a business process must be restored after a disaster or disruption to service.
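To make the two objectives concrete, here is a minimal Python sketch that measures the data loss window of a simulated incident and compares it with the targets. The function names, backup times, and target values are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

def data_loss_window(backup_times, failure_time):
    """Return the time between the last usable backup and the failure."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        raise ValueError("no backup completed before the failure")
    return failure_time - max(usable)

def meets_objectives(loss_window, downtime, rpo, rto):
    """True when both the data loss window and the downtime stay within target."""
    return loss_window <= rpo and downtime <= rto

# Hypothetical incident: backups every 6 hours, failure at 15:30, 2 hours of downtime.
backups = [datetime(2024, 1, 1, hour) for hour in (0, 6, 12)]
failure = datetime(2024, 1, 1, 15, 30)
loss = data_loss_window(backups, failure)
ok = meets_objectives(loss, timedelta(hours=2),
                      rpo=timedelta(hours=4), rto=timedelta(hours=3))
print(f"Data loss window: {loss}, objectives met: {ok}")
```

In this example the last backup before the failure was taken at 12:00, so three and a half hours of data would be lost, which still fits the assumed four-hour RPO.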
Traditional Disaster Recovery plan (on-premise)
A traditional on-premise Disaster Recovery plan often includes a fully duplicated infrastructure that is physically separate from the infrastructure that contains our production. In this case, an additional financial investment is required to cover expenses related to hardware and for maintenance and testing. When it comes to on-premise data centers, physical access to the infrastructure is often overlooked.
These are the requirements for an on-premise data center disaster recovery infrastructure:
- Facilities to house the infrastructure, including power and cooling.
- Security to ensure the physical protection of assets.
- Suitable capacity to scale the environment.
- Support for repairing, replacing, and refreshing the infrastructure.
- Contractual agreements with an internet service provider (ISP) to provide internet connectivity that can sustain bandwidth utilization for the environment under a full load.
- Network infrastructure such as firewalls, routers, switches, and load balancers.
- Enough server capacity to run all mission-critical services. This includes storage appliances for the supporting data, and servers to run applications and backend services such as user authentication, Domain Name System (DNS), Dynamic Host Configuration Protocol (DHCP), monitoring, and alerting.
Obviously, this kind of disaster recovery plan requires large investments in building disaster recovery sites or data centers (CAPEX). In addition, storage, backup, archival and retrieval tools, and processes (OPEX) are also expensive. And, all of these processes, especially installing new equipment, take time.
An on-premise disaster recovery plan can be challenging to document, test, and verify, especially if you have multiple clients on a single infrastructure. In this scenario, all clients on this infrastructure will experience problems with performance even if only one client’s data is corrupted.
To understand how cloud storage fits in with DR, and the different considerations when preparing to design a solution to back up your on-premise data to AWS, a great course to start with is Using AWS Storage for On-Premise Backup & Disaster Recovery.
Disaster Recovery plan on AWS
There are many advantages to implementing a disaster recovery plan on AWS.
Financially, we will only need to invest a small amount in advance (CAPEX), and we won’t have to worry about the physical expenses for resources (for example, hardware delivery) that we would have in an on-premise data center.
AWS enables high flexibility, as we don’t need to perform a failover of the entire site in case only one part of our application isn’t working properly. Scaling is fast and easy. Most importantly, AWS allows a “pay as you use” (OPEX) model, so we don’t have to spend a lot in advance.
Also, AWS services allow us to fully automate our disaster recovery plan. This results in much easier testing, maintenance, and documentation of the DR plan itself.
This table shows the AWS service equivalents to infrastructure inside an on-premise data center.
|On-premise data center infrastructure|AWS infrastructure|
|Web/app servers|EC2/Auto Scaling|
|AD/authentication|AD failover nodes|
|Data centers|Availability Zones|
Essential AWS Services for Disaster Recovery
While planning and preparing a DR plan, we’ll need to think about the AWS services we can use. We also need to understand how our selected services support data migration and durable storage. These are some of the key features and services that you should consider when creating your Disaster Recovery plan:
AWS Regions and Availability Zones – The AWS Cloud infrastructure is built around Regions and Availability Zones (“AZs”). A Region is a physical location in the world that has multiple Availability Zones. Availability Zones consist of one or more discrete data centers, each with redundant power, networking, and connectivity housed in separate facilities. These AZs allow you to operate production applications and databases that are more highly available, fault-tolerant, and scalable than would be possible from a single data center.
Amazon S3 – Provides a highly durable storage infrastructure designed for mission-critical and primary data storage. Objects are redundantly stored on multiple devices across multiple facilities within a region and are designed to provide a durability of 99.999999999% (11 9s).
Amazon Glacier – Provides extremely low-cost storage for data archiving and backup. Objects are optimized for infrequent access, for which retrieval times of several hours are adequate.
Amazon EBS – Provides the ability to create point-in-time snapshots of data volumes. You can use the snapshots as the starting point for new Amazon EBS volumes. And, you can protect your data for long-term durability because snapshots are stored within Amazon S3.
AWS Import/Export – Accelerates moving large amounts of data into and out of AWS by using portable storage devices for transport. The AWS Import/Export service bypasses the internet and transfers your data directly onto and off of storage devices using Amazon’s high-speed internal network.
AWS Storage Gateway is a service that connects an on-premise software appliance with cloud-based storage. This provides seamless, highly secure integration between your on-premise IT environment and the AWS storage infrastructure.
Amazon EC2 – Provides resizable compute capacity in the cloud. In the context of DR, the ability to rapidly create virtual machines that you can control is critical.
Amazon EC2 VM Import Connector enables you to import virtual machine images from your existing environment to Amazon EC2 instances.
Amazon Route 53 is a highly available and scalable Domain Name System (DNS) web service.
Elastic Load Balancing automatically distributes incoming application traffic across multiple Amazon EC2 instances.
Amazon VPC allows you to provision a private, isolated section of the AWS cloud. Here, you can launch AWS resources in a virtual network that you define.
AWS Direct Connect makes it easy to set up a dedicated network connection from your premises to AWS.
Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud.
AWS CloudFormation gives developers and systems administrators an easy way to create a collection of related AWS resources and provision them in an orderly and predictable fashion. You can create templates for your environments and deploy associated collections of resources (called a stack) as needed.
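As an illustration of the template idea, the following Python sketch builds a very small CloudFormation template as a dictionary and serializes it to JSON. The logical ID `BackupBucket`, the description, and the choice of a single versioned S3 bucket are assumptions made for this example; a real DR stack would describe far more resources.

```python
import json

def build_backup_template(bucket_logical_id="BackupBucket"):
    """Return a minimal CloudFormation template (as a dict) with one versioned S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative DR backup bucket (sketch only)",
        "Resources": {
            bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # Versioning keeps earlier copies of overwritten backup objects.
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
        "Outputs": {
            # Ref on an S3 bucket resource resolves to the bucket name.
            "BucketName": {"Value": {"Ref": bucket_logical_id}},
        },
    }

template_json = json.dumps(build_backup_template(), indent=2)
print(template_json)
```

Keeping the template in version control alongside the application means the same stack can be recreated on demand, which is what makes automated DR testing practical.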
Disaster Recovery Scenarios with AWS
There are several strategies that we can use for disaster recovery of our on-premise data center using AWS infrastructure:
- Backup and Restore
- Pilot Light
- Warm Standby
- Multi-Site
Backup and Restore
The Backup and Restore scenario is an entry-level form of disaster recovery on AWS. This approach is the most suitable one in the event that you don’t have a DR plan.
In on-premise data centers, data backup would be stored on tape. Obviously, it will take time to recover data from tapes in the event of a disaster. For Backup and Restore scenarios using AWS services, we can store our data on Amazon S3 storage, making it immediately available if a disaster occurs. If we have a large amount of data that needs to be stored on Amazon S3, ideally we would use AWS Import/Export or even AWS Snowball to get our data onto S3 as soon as possible.
AWS Storage Gateway enables snapshots of your on-premise data volumes to be transparently copied into Amazon S3 for backup. You can subsequently create local volumes or Amazon EBS volumes from these snapshots.
The Backup and Restore plan is suitable for less business-critical applications. This is also an extremely cost-effective scenario and the one most often used when we need backup storage. If we use a compression and de-duplication tool, we can further decrease our expenses here. For this scenario, RTO will be as long as it takes to bring up infrastructure and restore the system from backups. RPO will be the time since the last backup.
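The RTO and RPO of a Backup and Restore plan can be estimated with simple arithmetic. The sketch below is only a rough model: it assumes RTO is the provisioning time plus the time to copy the backup back at a constant throughput, and the data size, throughput, and provisioning time are hypothetical numbers.

```python
from datetime import timedelta

def estimate_restore_rto(data_gb, throughput_mb_per_s, provisioning):
    """Estimate RTO as provisioning time plus the time needed to copy the backup back."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    transfer_seconds = (data_gb * 1000) / throughput_mb_per_s  # 1 GB taken as 1000 MB
    return provisioning + timedelta(seconds=transfer_seconds)

def worst_case_rpo(backup_interval):
    """With periodic backups, the worst case is a failure just before the next backup."""
    return backup_interval

# Hypothetical workload: 500 GB restored at 100 MB/s, 30 minutes to bring up servers.
rto = estimate_restore_rto(data_gb=500, throughput_mb_per_s=100,
                           provisioning=timedelta(minutes=30))
rpo = worst_case_rpo(timedelta(hours=24))
print(f"Estimated RTO: {rto}, worst-case RPO: {rpo}")
```

Even this simple model shows why the scenario suits less critical systems: with a daily backup, up to a full day of data can be lost, and the restore itself takes close to two hours.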
Pilot Light
The term “Pilot Light” is often used to describe a DR scenario where a minimal version of an environment is always running in the cloud. This scenario is similar to a Backup and Restore scenario. For example, with AWS you can maintain a Pilot Light by configuring and running the most critical core elements of your system in AWS. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core.
A Pilot Light scenario is suitable for solutions that require a lower RTO and RPO. This scenario is a mid-range cost DR solution.
Warm Standby
A Warm Standby scenario is an expansion of the Pilot Light scenario in which some services are always up and running. As we design a DR plan, we need to identify the crucial parts of our on-premise infrastructure and then duplicate them in AWS. In most cases, we’re talking about web and app servers running on a minimum-sized fleet. Once a disaster occurs, the infrastructure on AWS takes over the traffic and scales up into a fully functional production environment, keeping RPO and RTO minimal.
The Warm Standby scenario is more expensive than Backup and Restore and Pilot Light because in this case, our infrastructure is up and running on AWS. This is a suitable solution for core business-critical functions and in cases where RTO and RPO need to be measured in minutes.
Multi-Site
The Multi-Site scenario is a solution in which the infrastructure is fully up and running on AWS as well as in the on-premise data center. By using a weighted routing policy in Amazon Route 53 DNS, part of the traffic is routed to the AWS infrastructure, while the rest is routed to the on-premise infrastructure.
Data is replicated or mirrored to the AWS infrastructure.
In a disaster event, all traffic will be redirected to the AWS infrastructure. This scenario is also the most expensive option, and it presents the last step toward full migration to an AWS infrastructure. Here, RTO and RPO are very low, and this scenario is intended for critical applications that demand minimal or no downtime.
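The traffic split and the failover step can be sketched as data. The Python example below builds the kind of change batch that Route 53 expects for weighted A records; it makes no AWS calls. The domain name, IP addresses (taken from documentation ranges), set identifiers, and the 70/30 split are all hypothetical. A dictionary like this could, for instance, be passed as the `ChangeBatch` argument of boto3's `change_resource_record_sets`, though that step is not shown here.

```python
def weighted_record(name, set_id, ip_address, weight, ttl=60):
    """Build one UPSERT change for a weighted A record."""
    if not 0 <= weight <= 255:
        raise ValueError("Route 53 weights must be between 0 and 255")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,  # distinguishes records sharing a name and type
            "Weight": weight,
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip_address}],
        },
    }

def build_change_batch(name, on_prem_ip, aws_ip, on_prem_weight, aws_weight):
    """Build a change batch that splits traffic between the two sites."""
    return {
        "Comment": "Multi-Site weighted routing (illustrative)",
        "Changes": [
            weighted_record(name, "on-premise", on_prem_ip, on_prem_weight),
            weighted_record(name, "aws", aws_ip, aws_weight),
        ],
    }

def traffic_share(batch):
    """Return each site's expected share of traffic (weight / sum of weights)."""
    records = [change["ResourceRecordSet"] for change in batch["Changes"]]
    total = sum(record["Weight"] for record in records)
    return {record["SetIdentifier"]: record["Weight"] / total for record in records}

# Normal operation: hypothetical 70/30 split between on-premise and AWS.
normal = build_change_batch("app.example.com.", "203.0.113.10", "198.51.100.20", 70, 30)
# Disaster: weight 0 stops routing to on-premise, so AWS receives all traffic.
failover = build_change_batch("app.example.com.", "203.0.113.10", "198.51.100.20", 0, 100)
print(traffic_share(normal), traffic_share(failover))
```

A short TTL is used here so that resolvers pick up the changed weights quickly after a failover; the right value depends on how much DNS query volume you are willing to pay for.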
There are many options and scenarios for Disaster Recovery planning on AWS.
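One way to choose between the four scenarios is to pick the cheapest one that still meets the required RTO. The sketch below encodes that rule in Python. The ordering by cost follows the descriptions in this post, but the RTO figure attached to each scenario is a hypothetical placeholder, not AWS guidance; replace it with values measured in your own DR tests.

```python
from datetime import timedelta

# Ordered from cheapest to most expensive, as described in this post.
# The achievable RTO values are hypothetical placeholders for illustration.
STRATEGIES = [
    ("Backup and Restore", timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=4)),
    ("Warm Standby", timedelta(minutes=30)),
    ("Multi-Site", timedelta(minutes=1)),
]

def cheapest_strategy(required_rto):
    """Return the first (cheapest) strategy whose assumed RTO meets the requirement."""
    for name, achievable_rto in STRATEGIES:
        if achievable_rto <= required_rto:
            return name
    return None  # no listed strategy is fast enough

print(cheapest_strategy(timedelta(hours=8)))
```

The same comparison can be extended to RPO by adding a second assumed value per scenario and requiring both targets to be met.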
The scope of possibilities has been expanded further with AWS’s announcement of its strategic partnership with VMware. Thanks to this partnership, users can expand their on-premise infrastructure (virtualized using VMware tools) to AWS, and create a DR plan via resources provided by AWS using VMware tools that they are already accustomed to using.
Don’t allow any kind of disaster to take you by surprise. Be proactive and create the DR plan that best suits your needs.