Disaster Recovery and Monitoring

This course covers the core learning objectives needed to meet the requirements of the 'Designing Network and Data Transfer solutions in AWS - Level 3' skill.

Learning Objectives:

  • Evaluate advanced techniques to detect failure and support service recoverability
  • Create an automated, cost-effective back-up solution that supports business continuity across multiple AWS Regions
  • Evaluate different architectures to provide application and infrastructure availability in the event of an AWS service disruption

As you probably know, the starting point of any disaster recovery plan is to have a plan in the first place. Let's get a bit more specific about what that actually means. Define clear recovery objectives: avoid arbitrary ones such as "last week's backup", which isn't specific enough. Make sure your objectives meet the requirements of the business. Understand the impact of your plan: is it okay if you lose all of today's application logs?
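One way to make a recovery objective specific is to compare the worst-case data-loss window of a backup schedule against a target recovery point objective (RPO). The sketch below is only an illustration: the function names are hypothetical, and it assumes backups start at a fixed interval and capture the state as of the moment they start.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on lost data under the stated assumptions.

    If a failure hits just before the next backup finishes, the newest
    usable backup reflects the state from one full interval plus one
    backup duration ago.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """Return True if the schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    # A daily backup cannot satisfy a one-hour RPO.
    print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=1)))
    # An hourly backup that takes ten minutes fits a four-hour RPO.
    print(meets_rpo(timedelta(hours=1), rpo=timedelta(hours=4),
                    backup_duration=timedelta(minutes=10)))
```

Writing the objective as a number like this makes it easy to see whether losing a full day of application logs is actually within what the business agreed to.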

Be realistic about the expectations of your plan. Zero data loss and near-zero recovery time are very expensive and not very realistic. Remember, it's called a failure for a reason. Once you have your plan, you know what you're aiming for in terms of disaster recovery, so you can choose an appropriate strategy. Your first consideration is whether a single-region strategy is okay, or whether you would prefer to be able to recover your application entirely in a different region.

For most use cases, a single region will suffice. Is a backup strategy enough? Consider the time it will take to recover from a backup file. Perhaps a pilot light scenario is better. In this scenario, you would have a minimal set of resources already operational, for example, an Auto Scaling group at zero or one instance, ready to be scaled to full size at a moment's notice. A warm standby model is when you have a fully operational but scaled-down version of your application always running and ready to become your primary in the event of a region or AZ failing.

Finally, an active/active strategy is when you are not limited by budget and you just run two or more copies of your application at full capacity in two or more regions. This offers the fastest recovery time and the lowest possible recovery point objective. Keep in mind, you can still lose data in this scenario, depending on how quickly your health checks and alarms detect the failure and handle the failover to the other region.
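The pilot light scenario described above, where an Auto Scaling group is parked at zero or one instance, could be brought to full size with something like the following sketch. The function name and its arguments are hypothetical. It assumes you pass in a boto3 Auto Scaling client, whose `update_auto_scaling_group` call accepts the parameters shown.

```python
def scale_up_pilot_light(autoscaling_client, group_name: str,
                         full_size: int, max_size: int) -> dict:
    """Raise a parked Auto Scaling group to full capacity in the recovery region.

    `autoscaling_client` is assumed to be a boto3 "autoscaling" client
    created for the recovery region.
    """
    if not 0 < full_size <= max_size:
        raise ValueError("full_size must be positive and no larger than max_size")

    params = {
        "AutoScalingGroupName": group_name,
        # Raising MinSize as well keeps the group from scaling back in
        # while the recovery region is acting as primary.
        "MinSize": full_size,
        "MaxSize": max_size,
        "DesiredCapacity": full_size,
    }
    autoscaling_client.update_auto_scaling_group(**params)
    return params
```

With boto3 installed and credentials configured, the client could be created with `boto3.client("autoscaling", region_name=...)`, using your recovery region, and passed in as the first argument. A real failover procedure would also need to redirect traffic and promote data stores, which this sketch leaves out.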

Regardless of which DR strategy you choose, you should have a procedure to recover your data and servers back to operational mode in the recovery region.

The number one tool to help you maintain high availability is monitoring, that is, being able to detect failures or system degradation long before your application is completely impaired. These are some of the tools that can help you achieve this: use alarms so that no outage goes unnoticed. Make sure your alarm thresholds are generous enough to consider not only complete failures but also response times, so that you detect system degradation early on. Collect as many useful metrics as possible. Be sure to monitor all your compute components end-to-end, not just your front-end servers. Very important: collect business metrics, not only technical ones, for example, order count, new customers, customer leads, and so on. All this information can help you detect potential issues long before they become a complete outage to your business. CloudWatch alarms along with EC2 detailed monitoring can help create action triggers based on EC2 instance health. CloudWatch can also assist with a dashboard to visualize your entire fleet, giving you a quick glance at how things are behaving.
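As an illustration of an alarm that watches response times rather than only complete failures, the sketch below builds the parameters for a CloudWatch `put_metric_alarm` call. It assumes the application sits behind an Application Load Balancer, which the course does not state. The function name, the 95th percentile statistic, and the period values are illustrative choices, not recommendations.

```python
def latency_alarm_params(alarm_name: str, load_balancer: str,
                         threshold_seconds: float, sns_topic_arn: str) -> dict:
    """Build put_metric_alarm parameters for a slow-response alarm.

    `load_balancer` is the ALB dimension value (the "app/<name>/<id>" part
    of its ARN); `sns_topic_arn` is where notifications are sent.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        # Percentile statistics surface degradation that an average can hide.
        "ExtendedStatistic": "p95",
        "Period": 60,                # evaluate one-minute data points
        "EvaluationPeriods": 3,      # require three slow minutes in a row
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        # Treat missing data as a problem so a silent outage also alarms.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }
```

The resulting dictionary could be passed to a boto3 CloudWatch client as `cloudwatch.put_metric_alarm(**params)`. Requiring several consecutive breaching periods is one way to avoid paging on a single slow minute.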

Custom CloudWatch metrics can be created and collected. These can be technical, but they can also be business related and specific to your needs. They can also be combined with alarms; for example, a lower-than-normal web visitor count can signal that something non-technical may be affecting your website, such as a marketing channel dropping off or lower SEO rankings. You can also create a CloudWatch dashboard specific to non-technical metrics that you can collect and aggregate. This is great to share with your non-technical colleagues, who may have greater insight into what could be happening when anomalies in this data occur.
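A custom business metric such as an order count could be published along these lines. This is a minimal sketch: the namespace, metric name, and `Store` dimension are made-up examples, and the client is assumed to be a boto3 CloudWatch client, whose `put_metric_data` call takes a namespace and a list of data points.

```python
from datetime import datetime, timezone


def order_count_datum(order_count: int, store: str) -> dict:
    """Build one CloudWatch data point for a hypothetical OrderCount metric."""
    return {
        "MetricName": "OrderCount",
        "Dimensions": [{"Name": "Store", "Value": store}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(order_count),
        "Unit": "Count",
    }


def publish_business_metric(cloudwatch_client, namespace: str, datum: dict) -> None:
    """Send a single data point to a custom namespace."""
    # Namespaces starting with "AWS/" are reserved for AWS services.
    if namespace.startswith("AWS/"):
        raise ValueError("custom namespaces must not start with 'AWS/'")
    cloudwatch_client.put_metric_data(Namespace=namespace, MetricData=[datum])
```

An alarm on such a metric would typically use a "less than" comparison, since the warning sign here is a count that drops below normal rather than one that spikes.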


About the Author
Carlos Rivas
Sr. AWS Content Creator

Software Development has been my craft for over 2 decades. In recent years, I was introduced to the world of "Infrastructure as Code" and Cloud Computing.
I loved it! -- it re-sparked my interest in staying on the cutting edge of technology.

Colleagues regard me as a mentor and leader in my areas of expertise and also as the person to call when production servers crash and we need the App back online quickly.

My primary skills are:
★ Software Development ( Java, PHP, Python and others )
★ Cloud Computing Design and Implementation
★ DevOps: Continuous Delivery and Integration