“You can’t predict a disaster, but you can be prepared for one!” Disaster recovery is one of the biggest challenges for infrastructure. Amazon Web Services allows us to easily tackle this challenge and ensure business continuity. In this post, we’ll take a look at what disaster recovery means, compare traditional disaster recovery versus that in the cloud, and explore essential AWS services for your disaster recovery plan.
What is Disaster Recovery?
There are several disaster scenarios that can impact your infrastructure. These include natural disasters such as an earthquake or fire, as well as those caused by human error such as unauthorized access to data, or malicious attacks.
“Any event that has a negative impact on a company’s business continuity or finances could be termed a disaster.”
In any case, it is crucial to have a tested disaster recovery plan ready. A disaster recovery plan will ensure that our application stays online no matter the circumstances.Ideally, it ensures that users will experience zero, or at worst, minimal issues while using your application.
If we’re talking about on-premise centers, a disaster recovery plan is expensive to maintain and implement. Often, such plans are insufficiently tested or poorly documented. As such, it’s adequate for protecting resources. More often than not, companies with a good disaster recovery plan aren’t capable of conducting it because it was never tested in a real environment. As a result, users cannot access the application and the company suffers significant losses.
Let’s take a closer look at some of the important terminology associated with disaster recovery:
Business Continuity. All of our applications require Business Continuity. Business Continuity ensures that an organization’s critical business functions continue to operate or recover quickly despite serious incidents.
Disaster Recovery. Disaster Recovery (DR) enables recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.
RPO and RTO. Recover Point Objective (RPO) and Recovery Time Objective (RTO) are the two most important parts of a good DR plan for our workflow. Recover Point Objective (RPO) is the maximum targeted period in which data might be lost from an IT service due to a major incident. Recovery Time Objective (RTO) is a targeted time period after which a business process must be restored after a disaster or disruption to service.
Traditional Disaster Recovery plan (on-premise)
A traditional on-premise Disaster Recovery plan often includes a fully duplicated infrastructure that is physically separate from the infrastructure that contains our production. In this case, an additional financial investment is required to cover expenses related to hardware and for maintenance and testing. When it comes to on-premise data centers, physical access to the infrastructure is often overlooked.
These are the security requirements for an on-premise data center disaster recovery infrastructure:
- Facilities to house the infrastructure, including power and cooling.
- Security to ensure the physical protection of assets.
- Suitable capacity to scale the environment.
- Support for repairing, replacing, and refreshing the infrastructure.
- Contractual agreements with an internet service provider (ISP) to provide internet connectivity that can sustain bandwidth utilization for the environment under a full load.
- Network infrastructure such as firewalls, routers, switches, and load balancers.
- Enough server capacity to run all mission-critical services. This includes storage appliances for the supporting data, and servers to run applications and backend services such as user authentication, Domain Name System (DNS), Dynamic Host Configuration Protocol (DHCP), monitoring, and alerting.
Obviously, this kind of disaster recovery plan requires large investments in building disaster recovery sites or data centers (CAPEX). In addition, storage, backup, archival and retrieval tools, and processes (OPEX) are also expensive. And, all of these processes, especially installing new equipment, take time.
An on-premise disaster recovery plan can be challenging to document, test, and verify, especially if you have multiple clients on a single infrastructure. In this scenario, all clients on this infrastructure will experience problems with performance even if only one client’s data is corrupted.
To understand how cloud storage fits in with DR and the different considerations when preparing to design a solution to backup your on-premise data to AWS, a great course to start with is Using AWS Storage for On-Premise Backup & Disaster Recovery.
Disaster Recovery plan on AWS
There are many advantages to implementing a disaster recovery plan on AWS.
Financially, we will only need to invest a small amount in advance (CAPEX), and we won’t have to worry about the physical expenses for resources (for example, hardware delivery) that we would have in on an “on-premise” data center.
AWS enables high flexibility, as we don’t need to perform a failover of the entire site in case only one part of our application isn’t working properly. Scaling is fast and easy. Most importantly, AWS allows a “pay as you use” (OPEX) model, so we don’t have to spend a lot in advance.
Also, AWS services allow us to fully automate our disaster recovery plan. This results in much easier testing, maintenance, and documentation of the DR plan itself.
This table shows the AWS service equivalents to infrastructure inside an on-premise data center.
|On-premise data center infrastructure||AWS Infrastructure|
|Web/app servers||EC2/Auto Scaling|
|AD/authentication||AD failover nodes|
|Dana centers||Availability Zones|
Essential AWS Services for Disaster Recovery
While planning and preparing a DR plan, we’ll need to think about the AWS services we can use. Also, we need to understand our selected services support data migration and durable storage. These are some of the key features and services that you should consider when creating your Disaster Recovery plan:
AWS Regions and Availability Zones – The AWS Cloud infrastructure is built around Regions and Availability Zones (“AZs”). A Region is a physical location in the world that has multiple Availability Zones. Availability Zones consist of one or more discrete data centers, each with redundant power, networking, and connectivity housed in separate facilities. These AZs allow you to operate production applications and databases that are more highly available, fault-tolerant, and scalable than would be possible from a single data center.
Amazon S3 – Provides a highly durable storage infrastructure designed for mission-critical and primary data storage. Objects are redundantly stored on multiple devices across multiple facilities within a region and are designed to provide a durability of 99.999999999% (11 9s).
Amazon Glacier – Provides extremely low-cost storage for data archiving and backup. Objects are optimized for infrequent access, for which retrieval times of several hours are adequate.
Amazon EBS – Provides the ability to create point-in-time snapshots of data volumes. You can use the snapshots as the starting point for new Amazon EBS volumes. And, you can protect your data for long-term durability because snapshots are stored within Amazon S3.
AWS Import/Export – Accelerates moving large amounts of data into and out of AWS by using portable storage devices for transport. The AWS Import/Export service bypasses the internet and transfers your data directly onto and off of storage devices using Amazon’s high-speed internal network.
AWS Storage Gateway is a service that connects an on-premise software appliance with cloud-based storage. This provides seamless, highly secure integration between your on-premise IT environment and the AWS storage infrastructure.
Amazon EC2 – Provides resizable compute capacity in the cloud. In the context of DR, the ability to rapidly create virtual machines that you can control is critical.
Amazon EC2 VM Import Connector enables you to import virtual machine images from your existing environment to Amazon EC2 instances.
Amazon Route 53 is a highly available and scalable Domain Name System (DNS) web service.
Elastic Load Balancing automatically distributes incoming application traffic across multiple Amazon EC2 instances.
Amazon VPC allows you to provision a private, isolated section of the AWS cloud. Here, you can launch AWS resources in a virtual network that you define.
Amazon Direct Connect makes it easy to set up a dedicated network connection from your premises to AWS.
Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud.
AWS CloudFormation gives developers and systems administrators an easy way to create a collection of related AWS resources and provision them in an orderly and predictable fashion. You can create templates for your environments and deploy associated collections of resources (called a stack) as needed.
Disaster Recovery Scenarios with AWS
There are several strategies that we can use for disaster recovery of our on-premise data center using AWS infrastructure:
- Backup and Restore
- Pilot Light
- Warm Standby
Backup and Restore
The Backup and Restore a scenario is an entry-level form of disaster recovery on AWS. This approach is the most suitable one in the event that you don’t have a DR plan.
In on-premise data centers, data backup would be stored on tape. Obviously, it will take time to recover data from tapes in the event of a disaster. For Backup and Restore scenarios using AWS services, we can store our data on Amazon S3 storage, making them immediately available if a disaster occurs. If we have a large amount of data that needs to be stored on Amazon S3, ideally we would use AWS Export/Import or even AWS Snowball to store our data on S3 as soon as possible.
AWS Storage Gateway enables snapshots of your on-premise data volumes to be transparently copied into Amazon S3 for backup. You can subsequently create local volumes or Amazon EBS volumes from these snapshots.
The Backup and Restore plan is suitable for lower level business-critical applications. This is also an extremely cost-effective scenario and one that is most often used when we need backup storage. If we use a compression and de-duplication tool, we can further decrease our expenses here. For this scenario, RTO will be as long as it takes to bring up infrastructure and restore the system from backups. RPO will be the time since the last backup.
The term “Pilot Light” is often used to describe a DR scenario where a minimal version of an environment is always running in the cloud. This scenario is similar to a Backup and Restore scenario. For example, with AWS you can maintain a Pilot Light by configuring and running the most critical core elements of your system in AWS. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core.
A Pilot Light scenario is suitable for solutions that require a lower RTO and RPO. This scenario is a mid-range cost DR solution.
A Warm Standby scenario is an expansion of the Pilot Light scenario where some services are always up and running. As we plan a DR plan, we need to identify crucial points of our on-premise infrastructure and then duplicate it inside the AWS. In most cases, we’re talking about web and app servers running on a minimum-sized fleet. Once a disaster occurs, infrastructure located on AWS takes over the traffic and performs its scaling and converting to a fully functional production environment with minimal RPO and RTO.
The Warm Standby scenario is more expensive than Backup and Restore and Pilot Light because in this case, our infrastructure is up and running on AWS. This is a suitable solution for core business-critical functions and in cases where RTO and RPO need to be measured in minutes.
The Multi-Site scenario is a solution for an infrastructure that is up and running completely on AWS as well as on an “on-premise” data center. By using the weighted route policy on Amazon Route 53 DNS, part of the traffic is redirected to the AWS infrastructure, while the other part is redirected to the on-premise infrastructure.
Data is replicated or mirrored to the AWS infrastructure.
In a disaster event, all traffic will be redirected to the AWS infrastructure. This scenario is also the most expensive option, and it presents the last step toward full migration to an AWS infrastructure. Here, RTO and RPO are very low, and this scenario is intended for critical applications that demand minimal or no downtime.
There are many options and scenarios for Disaster Recovery planning on AWS.
The scope of possibilities has been expanded further with AWS’ announcement of its strategic partnership with VMware. Thanks to this partnership, users can expand their on-premise infrastructure (virtualized using VMware tools) to AWS, and create a DR plan via resources provided by AWS using VMware tools that they are already accustomed to using.
Don’t allow any kind of disaster to take you by surprise. Be proactive and create the DR plan that best suits your needs.
AWS Internet of Things (IoT): The 3 Services You Need to Know
The Internet of Things (IoT) embeds technology into any physical thing to enable never-before-seen levels of connectivity. IoT is revolutionizing industries and creating many new market opportunities. Cloud services play an important role in enabling deployment of IoT solutions that min...
Which Certifications Should I Get?
As we mentioned in an earlier post, the old AWS slogan, “Cloud is the new normal” is indeed a reality today. Really, cloud has been the new normal for a while now and getting credentials has become an increasingly effective way to quickly showcase your abilities to recruiters and compan...
How to Go Serverless Like a Pro
So, no servers? Yeah, I checked and there are definitely no servers. Well...the cloud service providers do need servers to host and run the code, but we don’t have to worry about it. Which operating system to use, how and when to run the instances, the scalability, and all the arch...
AWS Security: Bastion Hosts, NAT instances and VPC Peering
Effective security requires close control over your data and resources. Bastion hosts, NAT instances, and VPC peering can help you secure your AWS infrastructure. Welcome to part four of my AWS Security overview. In part three, we looked at network security at the subnet level. This ti...
Top 13 Amazon Virtual Private Cloud (VPC) Best Practices
Amazon Virtual Private Cloud (VPC) brings a host of advantages to the table, including static private IP addresses, Elastic Network Interfaces, secure bastion host setup, DHCP options, Advanced Network Access Control, predictable internal IP ranges, VPN connectivity, movement of interna...
Big Changes to the AWS Certification Exams
With AWS re:Invent 2019 just around the corner, we can expect some early announcements to trickle through with upcoming features and services. However, AWS has just announced some big changes to their certification exams. So what’s changing and what’s new? There is a brand NEW ...
New on Cloud Academy: ITIL® 4, Microsoft 365 Tenant, Jenkins, TOGAF® 9.1, and more
At Cloud Academy, we're always striving to make improvements to our training platform. Based on your feedback, we released some new features to help make it easier for you to continue studying. These new features allow you to: Remove content from “Continue Studying” section Disc...
AWS Security Groups: Instance Level Security
Instance security requires that you fully understand AWS security groups, along with patching responsibility, key pairs, and various tenancy options. As a precursor to this post, you should have a thorough understanding of the AWS Shared Responsibility Model before moving onto discussi...
Cloud Migration Risks & Benefits
If you’re like most businesses, you already have at least one workload running in the cloud. However, that doesn’t mean that cloud migration is right for everyone. While cloud environments are generally scalable, reliable, and highly available, those won’t be the only considerations dri...
Real-Time Application Monitoring with Amazon Kinesis
Amazon Kinesis is a real-time data streaming service that makes it easy to collect, process, and analyze data so you can get quick insights and react as fast as possible to new information. With Amazon Kinesis you can ingest real-time data such as application logs, website clickstre...
Google Cloud Functions vs. AWS Lambda: The Fight for Serverless Cloud Domination
Serverless computing: What is it and why is it important? A quick background The general concept of serverless computing was introduced to the market by Amazon Web Services (AWS) around 2014 with the release of AWS Lambda. As we know, cloud computing has made it possible for users to ...
Google Vision vs. Amazon Rekognition: A Vendor-Neutral Comparison
Google Cloud Vision and Amazon Rekognition offer a broad spectrum of solutions, some of which are comparable in terms of functional details, quality, performance, and costs. This post is a fact-based comparative analysis on Google Vision vs. Amazon Rekognition and will focus on the tech...