Managing Failover with Self-Healing Resources

This course covers the core learning objective needed to meet the requirements of the 'Designing Network and Data Transfer solutions in AWS - Level 3' skill.

Learning Objectives:

  • Evaluate advanced techniques to detect failure and ensure service recoverability
  • Create an automated, cost-effective backup solution that supports business continuity across multiple AWS Regions
  • Evaluate different architectures to provide application and infrastructure availability in the event of an AWS service disruption

When creating and deploying services, the ability to restart and resume these services is crucial. And if the services can do this on their own, even better. This is typically a part of most modern software architectures such as microservices. Services talk to one another using messaging queues or other mechanisms, and they don't assume that the listening party is always on. Before talking about your options here, let's discuss some pitfalls to avoid. Avoid deploying each service to just a single container or instance. Avoid applications that don't allow you to run multiple copies because the copies would start conflicting with one another. This is typically true for very old legacy applications. Finally, avoid applications and designs that require manual intervention when something goes wrong and a restart occurs. Automation plays a significant role when a failure occurs, because it lets you take action quickly and switch to an alternative solution. Let's look at some options that can help your application fail over on its own.

DynamoDB is a service that is automatically deployed to multiple Availability Zones. So, in the event of an AZ failing, DynamoDB should be able to recover on its own. RDS, on the other hand, needs to have Multi-AZ enabled. If enabled, it will automatically promote the standby copy to primary as soon as a failure is detected. Load balancing in combination with Auto Scaling can help spread your workload evenly across multiple Availability Zones. In addition, by using load balancer health checks, Auto Scaling can trigger a scaling event in one or more of your Availability Zones in the event of one going offline. This combination of services is very powerful and can be used even if you're just running a single service on a single EC2 instance. This helps keep your service running as long as the Region has at least one operational Availability Zone. In a similar fashion, load balancing can assist in managing the health state of your EKS pods or ECS tasks. From your perspective, you just have to make sure you're using as many AZs as possible in your network design and compute deployment.
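The single-instance case mentioned above can be sketched as a set of Auto Scaling group parameters. This is a minimal sketch, not a definitive setup: the function name, group name, subnet IDs, target group ARN, and launch template name are all hypothetical placeholders. The dictionary keys follow the parameters accepted by boto3's `create_auto_scaling_group`, but nothing is sent to AWS here.

```python
def build_self_healing_asg_params(name, subnet_ids, target_group_arn, launch_template_name):
    """Return keyword arguments shaped for boto3's create_auto_scaling_group.

    All argument values are supplied by the caller; the ones used below are
    hypothetical placeholders, not real resources.
    """
    if len(subnet_ids) < 2:
        # One subnet means one AZ, which defeats the purpose of the group.
        raise ValueError("use subnets in at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        # Exactly one instance: if it fails, the group launches a replacement.
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        # Comma-separated subnets, one per AZ, so the replacement can land
        # in any AZ that is still operational.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "TargetGroupARNs": [target_group_arn],
        # Use the load balancer's health check, not only the EC2 status check.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 120,
    }


params = build_self_healing_asg_params(
    name="single-service-asg",  # hypothetical
    subnet_ids=["subnet-aaa", "subnet-bbb", "subnet-ccc"],  # hypothetical
    target_group_arn="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/example/abc123",  # hypothetical
    launch_template_name="single-service-template",  # hypothetical
)
```

With credentials and real resource names in place, these parameters could be passed along as `boto3.client("autoscaling").create_auto_scaling_group(**params)`. Setting the minimum, maximum, and desired capacity to 1 keeps the cost of a single instance while still letting Auto Scaling replace it in another AZ.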

A service like ECS, for example, will not take advantage of available AZs unless it's properly configured to do so. Combining all the above with services such as Route 53 and AWS Global Accelerator, you can start to expand your application's availability to a global scale. In fact, Route 53 has built-in failover strategies with health checks in order to reroute traffic to another endpoint, possibly in another Region, if the main one stops responding. With Global Accelerator, you can have a static IP for your endpoint that can be pointed to a different resource if the main one stops responding. Creating cross-Region read replicas that can be promoted to primary is also a good database strategy to consider. Another, less common strategy is to use Route 53 to fail over to your on-premises data center in the event of a failure, which is something to consider as well. Lambda execution can and should be monitored by Amazon EventBridge.
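As a rough illustration of the Route 53 failover strategy described above, the sketch below builds a change batch with a primary and a secondary record. The helper name, domain, IP addresses (taken from documentation-only ranges), and health check ID are hypothetical. The dictionary layout follows the `ChangeBatch` argument of boto3's `change_resource_record_sets`, and nothing is sent to AWS.

```python
def failover_record(name, role, ip_address, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover routing record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        # SetIdentifier tells apart records that share a name and type.
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        # A short TTL lets clients pick up a failover sooner.
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        # Route 53 answers with the secondary when this check is unhealthy.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Comment": "Primary endpoint with failover to a second Region",
    "Changes": [
        failover_record(
            "app.example.com.", "PRIMARY", "203.0.113.10",
            health_check_id="hypothetical-health-check-id",
        ),
        failover_record("app.example.com.", "SECONDARY", "198.51.100.20"),
    ],
}
```

The secondary address could just as well belong to an on-premises data center. Applying the batch would look like `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=change_batch)`, with the hosted zone ID supplied by you.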

Based on the response from the Lambda function, a custom event can be triggered to fix the issue, send a notification, or retry the failed workload. Also, Lambda functions make a great combination with Step Functions to add an additional layer of retry logic, if needed. Step Functions should be your number one choice when you know a process is likely to fail and require several attempts to complete. In fact, Step Functions has built-in retry logic that you can configure. This is very useful when dealing with resources that are mostly disconnected from the network, such as IoT devices, mobile devices, satellite and naval devices, and so on. A failure, in this case, can be retried several times until completed.

EFS, or Elastic File System, is a great example of a tool that can be used for self-healing apps. If your website crashes, for example, a new copy could be started on a different server. As soon as the new copy comes online, it can reconnect to EFS, find the files exactly as they were left by the previous running copy, and resume operations. This is possible because EFS stores your data redundantly across multiple Availability Zones within a Region, so the file system remains available even if the server, or an entire AZ, is lost.
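The built-in retry logic in Step Functions mentioned above is configured through a `Retry` block in the state machine definition. The following is a minimal sketch under stated assumptions: the state names, the Lambda ARN, and the numeric settings are hypothetical, while the field names (`ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`) come from the Amazon States Language. The second helper only computes the wait times that such a configuration implies.

```python
import json


def build_retrying_state_machine(lambda_arn, max_attempts=5,
                                 interval_seconds=2, backoff_rate=2.0):
    """Return a state machine definition with one retried Lambda task."""
    return {
        "Comment": "Retry a Lambda task that is expected to fail occasionally",
        "StartAt": "ProcessWorkload",
        "States": {
            "ProcessWorkload": {
                "Type": "Task",
                "Resource": lambda_arn,
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": interval_seconds,
                    "MaxAttempts": max_attempts,
                    "BackoffRate": backoff_rate,
                }],
                # Once retries are exhausted, route to a failure state.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Fail",
                "Error": "WorkloadFailed",
                "Cause": "Retries exhausted",
            },
        },
    }


def retry_delays(max_attempts, interval_seconds, backoff_rate):
    """Seconds waited before each retry: the interval grows by backoff_rate."""
    return [interval_seconds * backoff_rate ** attempt
            for attempt in range(max_attempts)]


definition = build_retrying_state_machine(
    "arn:aws:lambda:us-east-1:111122223333:function:example"  # hypothetical
)
definition_json = json.dumps(definition, indent=2)
delays = retry_delays(5, 2, 2.0)
```

The JSON string is the form a state machine definition takes when it is created. For a device that reconnects only occasionally, a longer interval and a higher attempt count may suit better than the values assumed here.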


About the Author
Carlos Rivas
Sr. AWS Content Creator

Software Development has been my craft for over 2 decades. In recent years, I was introduced to the world of "Infrastructure as Code" and Cloud Computing.
I loved it! -- it re-sparked my interest in staying on the cutting edge of technology.

Colleagues regard me as a mentor and leader in my areas of expertise and also as the person to call when production servers crash and we need the App back online quickly.

My primary skills are:
★ Software Development ( Java, PHP, Python and others )
★ Cloud Computing Design and Implementation
★ DevOps: Continuous Delivery and Integration