“You can’t predict a disaster, but you can be prepared for one!” Disaster recovery is one of the biggest challenges for infrastructure. Amazon Web Services allows us to easily tackle this challenge and ensure business continuity. In this post, we’ll take a look at what disaster recovery means, compare traditional disaster recovery versus that in the cloud, and explore essential AWS services for your disaster recovery plan.
What is Disaster Recovery?
There are several disaster scenarios that can impact your infrastructure. These include natural disasters such as an earthquake or fire, as well as those caused by human error such as unauthorized access to data, or malicious attacks.
“Any event that has a negative impact on a company’s business continuity or finances could be termed a disaster.”
In any case, it is crucial to have a tested disaster recovery plan ready. A disaster recovery plan will ensure that our application stays online no matter the circumstances.Ideally, it ensures that users will experience zero, or at worst, minimal issues while using your application.
If we’re talking about on-premise centers, a disaster recovery plan is expensive to maintain and implement. Often, such plans are insufficiently tested or poorly documented. As such, it’s adequate for protecting resources. More often than not, companies with a good disaster recovery plan aren’t capable of conducting it because it was never tested in a real environment. As a result, users cannot access the application and the company suffers significant losses.
Let’s take a closer look at some of the important terminology associated with disaster recovery:
Business Continuity. All of our applications require Business Continuity. Business Continuity ensures that an organization’s critical business functions continue to operate or recover quickly despite serious incidents.
Disaster Recovery. Disaster Recovery (DR) enables recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.
RPO and RTO. Recover Point Objective (RPO) and Recovery Time Objective (RTO) are the two most important parts of a good DR plan for our workflow. Recover Point Objective (RPO) is the maximum targeted period in which data might be lost from an IT service due to a major incident. Recovery Time Objective (RTO) is a targeted time period after which a business process must be restored after a disaster or disruption to service.
Traditional Disaster Recovery plan (on-premise)
A traditional on-premise Disaster Recovery plan often includes a fully duplicated infrastructure that is physically separate from the infrastructure that contains our production. In this case, an additional financial investment is required to cover expenses related to hardware and for maintenance and testing. When it comes to on-premise data centers, physical access to the infrastructure is often overlooked.
These are the security requirements for an on-premise data center disaster recovery infrastructure:
- Facilities to house the infrastructure, including power and cooling.
- Security to ensure the physical protection of assets.
- Suitable capacity to scale the environment.
- Support for repairing, replacing, and refreshing the infrastructure.
- Contractual agreements with an internet service provider (ISP) to provide internet connectivity that can sustain bandwidth utilization for the environment under a full load.
- Network infrastructure such as firewalls, routers, switches, and load balancers.
- Enough server capacity to run all mission-critical services. This includes storage appliances for the supporting data, and servers to run applications and backend services such as user authentication, Domain Name System (DNS), Dynamic Host Configuration Protocol (DHCP), monitoring, and alerting.
Obviously, this kind of disaster recovery plan requires large investments in building disaster recovery sites or data centers (CAPEX). In addition, storage, backup, archival and retrieval tools, and processes (OPEX) are also expensive. And, all of these processes, especially installing new equipment, take time.
An on-premise disaster recovery plan can be challenging to document, test, and verify, especially if you have multiple clients on a single infrastructure. In this scenario, all clients on this infrastructure will experience problems with performance even if only one client’s data is corrupted.
To understand how cloud storage fits in with DR and the different considerations when preparing to design a solution to backup your on-premise data to AWS, a great course to start with is Using AWS Storage for On-Premise Backup & Disaster Recovery.
Disaster Recovery plan on AWS
There are many advantages to implementing a disaster recovery plan on AWS.
Financially, we will only need to invest a small amount in advance (CAPEX), and we won’t have to worry about the physical expenses for resources (for example, hardware delivery) that we would have in on an “on-premise” data center.
AWS enables high flexibility, as we don’t need to perform a failover of the entire site in case only one part of our application isn’t working properly. Scaling is fast and easy. Most importantly, AWS allows a “pay as you use” (OPEX) model, so we don’t have to spend a lot in advance.
Also, AWS services allow us to fully automate our disaster recovery plan. This results in much easier testing, maintenance, and documentation of the DR plan itself.
This table shows the AWS service equivalents to infrastructure inside an on-premise data center.
|On-premise data center infrastructure||AWS Infrastructure|
|Web/app servers||EC2/Auto Scaling|
|AD/authentication||AD failover nodes|
|Dana centers||Availability Zones|
Essential AWS Services for Disaster Recovery
While planning and preparing a DR plan, we’ll need to think about the AWS services we can use. Also, we need to understand our selected services support data migration and durable storage. These are some of the key features and services that you should consider when creating your Disaster Recovery plan:
AWS Regions and Availability Zones – The AWS Cloud infrastructure is built around Regions and Availability Zones (“AZs”). A Region is a physical location in the world that has multiple Availability Zones. Availability Zones consist of one or more discrete data centers, each with redundant power, networking, and connectivity housed in separate facilities. These AZs allow you to operate production applications and databases that are more highly available, fault-tolerant, and scalable than would be possible from a single data center.
Amazon S3 – Provides a highly durable storage infrastructure designed for mission-critical and primary data storage. Objects are redundantly stored on multiple devices across multiple facilities within a region and are designed to provide a durability of 99.999999999% (11 9s).
Amazon Glacier – Provides extremely low-cost storage for data archiving and backup. Objects are optimized for infrequent access, for which retrieval times of several hours are adequate.
Amazon EBS – Provides the ability to create point-in-time snapshots of data volumes. You can use the snapshots as the starting point for new Amazon EBS volumes. And, you can protect your data for long-term durability because snapshots are stored within Amazon S3.
AWS Import/Export – Accelerates moving large amounts of data into and out of AWS by using portable storage devices for transport. The AWS Import/Export service bypasses the internet and transfers your data directly onto and off of storage devices using Amazon’s high-speed internal network.
AWS Storage Gateway is a service that connects an on-premise software appliance with cloud-based storage. This provides seamless, highly secure integration between your on-premise IT environment and the AWS storage infrastructure.
Amazon EC2 – Provides resizable compute capacity in the cloud. In the context of DR, the ability to rapidly create virtual machines that you can control is critical.
Amazon EC2 VM Import Connector enables you to import virtual machine images from your existing environment to Amazon EC2 instances.
Amazon Route 53 is a highly available and scalable Domain Name System (DNS) web service.
Elastic Load Balancing automatically distributes incoming application traffic across multiple Amazon EC2 instances.
Amazon VPC allows you to provision a private, isolated section of the AWS cloud. Here, you can launch AWS resources in a virtual network that you define.
Amazon Direct Connect makes it easy to set up a dedicated network connection from your premises to AWS.
Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud.
AWS CloudFormation gives developers and systems administrators an easy way to create a collection of related AWS resources and provision them in an orderly and predictable fashion. You can create templates for your environments and deploy associated collections of resources (called a stack) as needed.
Disaster Recovery Scenarios with AWS
There are several strategies that we can use for disaster recovery of our on-premise data center using AWS infrastructure:
- Backup and Restore
- Pilot Light
- Warm Standby
Backup and Restore
The Backup and Restore a scenario is an entry-level form of disaster recovery on AWS. This approach is the most suitable one in the event that you don’t have a DR plan.
In on-premise data centers, data backup would be stored on tape. Obviously, it will take time to recover data from tapes in the event of a disaster. For Backup and Restore scenarios using AWS services, we can store our data on Amazon S3 storage, making them immediately available if a disaster occurs. If we have a large amount of data that needs to be stored on Amazon S3, ideally we would use AWS Export/Import or even AWS Snowball to store our data on S3 as soon as possible.
AWS Storage Gateway enables snapshots of your on-premise data volumes to be transparently copied into Amazon S3 for backup. You can subsequently create local volumes or Amazon EBS volumes from these snapshots.
The Backup and Restore plan is suitable for lower level business-critical applications. This is also an extremely cost-effective scenario and one that is most often used when we need backup storage. If we use a compression and de-duplication tool, we can further decrease our expenses here. For this scenario, RTO will be as long as it takes to bring up infrastructure and restore the system from backups. RPO will be the time since the last backup.
The term “Pilot Light” is often used to describe a DR scenario where a minimal version of an environment is always running in the cloud. This scenario is similar to a Backup and Restore scenario. For example, with AWS you can maintain a Pilot Light by configuring and running the most critical core elements of your system in AWS. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core.
A Pilot Light scenario is suitable for solutions that require a lower RTO and RPO. This scenario is a mid-range cost DR solution.
A Warm Standby scenario is an expansion of the Pilot Light scenario where some services are always up and running. As we plan a DR plan, we need to identify crucial points of our on-premise infrastructure and then duplicate it inside the AWS. In most cases, we’re talking about web and app servers running on a minimum-sized fleet. Once a disaster occurs, infrastructure located on AWS takes over the traffic and performs its scaling and converting to a fully functional production environment with minimal RPO and RTO.
The Warm Standby scenario is more expensive than Backup and Restore and Pilot Light because in this case, our infrastructure is up and running on AWS. This is a suitable solution for core business-critical functions and in cases where RTO and RPO need to be measured in minutes.
The Multi-Site scenario is a solution for an infrastructure that is up and running completely on AWS as well as on an “on-premise” data center. By using the weighted route policy on Amazon Route 53 DNS, part of the traffic is redirected to the AWS infrastructure, while the other part is redirected to the on-premise infrastructure.
Data is replicated or mirrored to the AWS infrastructure.
In a disaster event, all traffic will be redirected to the AWS infrastructure. This scenario is also the most expensive option, and it presents the last step toward full migration to an AWS infrastructure. Here, RTO and RPO are very low, and this scenario is intended for critical applications that demand minimal or no downtime.
There are many options and scenarios for Disaster Recovery planning on AWS.
The scope of possibilities has been expanded further with AWS’ announcement of its strategic partnership with VMware. Thanks to this partnership, users can expand their on-premise infrastructure (virtualized using VMware tools) to AWS, and create a DR plan via resources provided by AWS using VMware tools that they are already accustomed to using.
Don’t allow any kind of disaster to take you by surprise. Be proactive and create the DR plan that best suits your needs.
Cloud Migration Risks & Benefits
If you’re like most businesses, you already have at least one workload running in the cloud. However, that doesn’t mean that cloud migration is right for everyone. While cloud environments are generally scalable, reliable, and highly available, those won’t be the only considerations dri...
Real-Time Application Monitoring with Amazon Kinesis
Amazon Kinesis is a real-time data streaming service that makes it easy to collect, process, and analyze data so you can get quick insights and react as fast as possible to new information. With Amazon Kinesis you can ingest real-time data such as application logs, website clickstre...
Google Cloud Functions vs. AWS Lambda: The Fight for Serverless Cloud Domination
Serverless computing: What is it and why is it important? A quick background The general concept of serverless computing was introduced to the market by Amazon Web Services (AWS) around 2014 with the release of AWS Lambda. As we know, cloud computing has made it possible for users to ...
Google Vision vs. Amazon Rekognition: A Vendor-Neutral Comparison
Google Cloud Vision and Amazon Rekognition offer a broad spectrum of solutions, some of which are comparable in terms of functional details, quality, performance, and costs. This post is a fact-based comparative analysis on Google Vision vs. Amazon Rekognition and will focus on the tech...
New on Cloud Academy: CISSP, AWS, Azure, & DevOps Labs, Python for Beginners, and more…
As Hurricane Dorian intensifies, it looks like Floridians across the entire state might have to hunker down for another big one. If you've gone through a hurricane, you know that preparing for one is no joke. You'll need a survival kit with plenty of water, flashlights, batteries, and n...
Amazon Route 53: Why You Should Consider DNS Migration
What Amazon Route 53 brings to the DNS table Amazon Route 53 is a highly available and scalable Domain Name System (DNS) service offered by AWS. It is named by the TCP or UDP port 53, which is where DNS server requests are addressed. Like any DNS service, Route 53 handles domain regist...
How to Unlock Complimentary Access to Cloud Academy
Are you looking to get trained or certified on AWS, Azure, Google Cloud Platform, DevOps, Cloud Security, Python, Java, or another technical skill? Then you'll want to mark your calendars for August 23, 2019. Starting Friday at 12:00 a.m. PDT (3:00 a.m. EDT), Cloud Academy is offering c...
What Exactly Is a Cloud Architect and How Do You Become One?
One of the buzzwords surrounding the cloud that I'm sure you've heard is "Cloud Architect." In this article, I will outline my understanding of what a cloud architect does and I'll analyze the skills and certifications necessary to become one. I will also list some of the types of jobs ...
Boto: Using Python to Automate AWS Services
Boto allows you to write scripts to automate things like starting AWS EC2 instances Boto is a Python package that provides programmatic connectivity to Amazon Web Services (AWS). AWS offers a range of services for dynamically scaling servers including the core compute service, Elastic...
Content Roadmap: AZ-500, ITIL 4, MS-100, Google Cloud Associate Engineer, and More
Last month, Cloud Academy joined forces with QA, the UK’s largest B2B skills provider, and it put us in an excellent position to solve a massive skills gap problem. As a result of this collaboration, you will see our training library grow with additions from QA’s massive catalog of 500+...
DevSecOps: How to Secure DevOps Environments
Security has been a friction point when discussing DevOps. This stems from the assumption that DevOps teams move too fast to handle security concerns. This makes sense if Information Security (InfoSec) is separate from the DevOps value stream, or if development velocity exceeds the band...
Test Your Cloud Knowledge on AWS, Azure, or Google Cloud Platform
Cloud skills are in demand | In today's digital era, employers are constantly seeking skilled professionals with working knowledge of AWS, Azure, and Google Cloud Platform. According to the 2019 Trends in Cloud Transformation report by 451 Research: Business and IT transformations re...