Testing Resilience

This course covers the core learning objectives needed to meet the requirements of the 'Designing Network and Data Transfer solutions in AWS - Level 3' skill

Learning Objectives:

  • Evaluate advanced techniques to detect for failure and service recoverability
  • Create an automated, cost-effective back-up solution that supports business continuity across multiple AWS Regions
  • Evaluate different architectures to provide application and infrastructure availability in the event of an AWS service disruption

By now, your workload is designed to handle everything you can throw at it: AZ failures, server failures, network issues, everything. But are you sure those things will work as designed? In this lecture, let's consider best practices around testing and troubleshooting. Before we dive in, let me explain the difference between a playbook and a runbook. A playbook is a series of steps to take while managing failures and responding to incidents, while a runbook is a series of steps to achieve a specific outcome. So, while writing a playbook, you are proactively planning for something that may or may not happen. Keep in mind that a playbook will always be a living document: each time something new happens, fix the issue, then quickly go back to your playbook and document it while it's still fresh in your memory. Here are three tips for when it's time to implement your troubleshooting playbooks.

When deploying a new workload, it's best to create a playbook ahead of time to account for common issues that will need to be diagnosed or fixed quickly. Know ahead of time where your information resides, specifically logs and anything that can help you locate a problem quickly. Also, retain your logs. It's really bad if a problem has been going on for a week and you only have 24 hours' worth of log files. If you can't keep logs due to space constraints or budget, create a metric and keep a long history of that metric. So, how exactly do we implement a playbook for incident response? Playbooks are best implemented as code, whether as a Python, Linux shell, or PowerShell script, depending on your needs and platform. Playbooks also need to be properly documented because in the heat of the moment, during a failure in the middle of the night, clear and specific instructions can help guide the operator, even if the operator is the same person who wrote the documentation.
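To make "playbook as code" concrete, here is a minimal sketch of the idea: each diagnostic step is a plain function that returns a result, and a runner executes the steps in order, stopping at the first failure so the operator can see exactly where diagnosis broke down. The step name `check_disk_space` is a hypothetical example, not part of any AWS API.

```python
"""Minimal sketch of a troubleshooting playbook implemented as code."""
import shutil

def check_disk_space(min_free_gb=1.0):
    # Example step: confirm the host isn't simply out of disk
    # before digging into anything more exotic.
    free_gb = shutil.disk_usage("/").free / 1024 ** 3
    return {"step": "check_disk_space",
            "ok": free_gb >= min_free_gb,
            "detail": f"{free_gb:.1f} GiB free"}

def run_playbook(steps):
    """Run diagnostic steps in order; stop at the first failing step."""
    results = []
    for step in steps:
        result = step()
        results.append(result)
        if not result["ok"]:
            break  # the failing step tells the operator where to look
    return results

# Usage: results = run_playbook([check_disk_space, ...])
```

Because each step documents itself in its return value, the output of a run doubles as the incident notes you'll want during the post-mortem.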

These are some of the tools available to you that can help automate tasks. These tools are also 100% managed by AWS, so they are already fully redundant and fault-tolerant. Systems Manager Run Command lets you send a series of commands to your servers, including those scripts we just mentioned. Systems Manager in general provides great tools to patch, update, and maintain a fleet of EC2 instances. Lambda functions in combination with EventBridge can be very powerful for executing scheduled tasks, or tasks that respond automatically to certain events. Even better, combine EventBridge and Lambda with CloudWatch alarms: now you can be proactive and set specific thresholds so that, when they are exceeded, tasks are launched to start responding to the incident and notify on-call personnel of the event. A good post-incident analysis is an amazing tool to have in place because it's an opportunity to propose solutions and prevent the failure in the future, or at least manage it better.
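As a sketch of that EventBridge-plus-Lambda pattern, here is a hypothetical handler for a "CloudWatch Alarm State Change" event delivered by an EventBridge rule. The field names follow the documented shape of that event; the SNS topic environment variable in the comment is a made-up placeholder.

```python
"""Sketch of a Lambda handler for CloudWatch Alarm State Change events."""

def summarize_alarm(event):
    """Pull the fields an on-call responder needs from the event."""
    detail = event["detail"]
    return {
        "alarm": detail["alarmName"],
        "state": detail["state"]["value"],        # e.g. ALARM, OK
        "was": detail["previousState"]["value"],
        "reason": detail["state"]["reason"],
    }

def handler(event, context):
    summary = summarize_alarm(event)
    if summary["state"] == "ALARM":
        # In a real function you'd notify on-call staff here, e.g.:
        #   import boto3, json, os
        #   boto3.client("sns").publish(
        #       TopicArn=os.environ["ONCALL_TOPIC_ARN"],  # placeholder
        #       Subject=f"Alarm: {summary['alarm']}",
        #       Message=json.dumps(summary))
        pass
    return summary
```

Keeping the parsing in its own function means you can unit-test the handler locally with a sample event, long before wiring it to a live rule.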

This analysis should also identify potential weaknesses in your software architecture, which leads to improvements. Remember what we said about playbooks? The post-mortem analysis is precisely when you'll take notes about what happened, pinpoint contributing factors, and document the steps taken to solve the issue. This will make your playbook super useful whenever this same problem pops up again in the future. So, always remember: fix the failure, then set some time aside for a post-mortem analysis. Have you ever seen those super popular website launches where the server crashes on launch day simply because the owners didn't expect such massive amounts of traffic? You can actually prevent this with proper load testing. Here's how you implement load testing. Ideally, you created all your infrastructure using IaC (Infrastructure as Code), so you can easily deploy an identical copy of your production environment. We'll use this new copy of your production environment to run a load testing tool.

Apache JMeter is a great one to simulate thousands of hits to your website simultaneously. This will force all your Auto Scaling groups and load balancers to respond to the event and adjust accordingly. Another thing that gets exposed in this case is any service quota limits that may be in effect. Your Auto Scaling group may be set to deploy 100 or more servers of a particular type; however, if you forgot to ask AWS to raise your service quotas for that instance type, this will fail, which is great. This is why we perform testing, after all. After the load test finishes, you'll have an idea of how much traffic your infrastructure can handle, but please wait a few minutes before destroying your load testing environment to also ensure that all your Auto Scaling groups scale back down accordingly. Can you do this directly in production? Sure, but be careful to increment the test workload slowly so as to minimize the chance of an outage or of disrupting real users. A very harsh way to ensure that your failure recovery is working correctly is to intentionally crash one or more of your production servers. This is known as Chaos Engineering.
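JMeter is the right tool for serious load tests, but the ramp-up idea above is easy to illustrate: fire waves of traffic at increasing concurrency so you can watch latency degrade and stop before real users are affected. This is a minimal sketch; the `request_fn` you plug in (for example, an `urllib.request.urlopen` call against a URL you own) is your responsibility.

```python
"""Sketch of a gradual ramp-up load driver (not a JMeter replacement)."""
import time
from concurrent.futures import ThreadPoolExecutor

def run_wave(request_fn, concurrency, requests_per_worker=10):
    """Fire one wave of traffic; return per-request latencies in seconds."""
    def worker():
        latencies = []
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            request_fn()  # e.g. urlopen against your test environment
            latencies.append(time.perf_counter() - start)
        return latencies

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker) for _ in range(concurrency)]
        return [lat for f in futures for lat in f.result()]

def ramp(request_fn, levels=(10, 50, 100)):
    """Step the load up gradually, reporting median latency per level."""
    for concurrency in levels:
        lats = sorted(run_wave(request_fn, concurrency))
        p50 = lats[len(lats) // 2]
        print(f"concurrency={concurrency} p50={p50 * 1000:.1f} ms")
```

Stepping through `levels` slowly is exactly the "increment the test workload" advice above: you get a data point at each level and an easy abort point in between.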

AWS provides a tool specifically for this purpose. AWS Fault Injection Simulator, or FIS, is a service that enables you to perform fault injection experiments on your AWS workloads. Fault injection is based on the principles of Chaos Engineering. This means intentionally stressing an application by creating disruptive events, such as taking down servers, so that you can observe how your application responds. With this information, you can improve performance and resilience overall. To use FIS, you set up and run experiments that help you create the real-world conditions needed to uncover application issues that can be difficult to find otherwise. FIS provides templates for disruptions, plus controls and limits for running experiments safely, such as stop conditions that roll back or halt the experiment automatically. Keep in mind that this is not the kind of tool you want to try out in production first; get familiar with it by deploying it in a lower environment and evaluating its behavior there. Only then take it to production.
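To make the experiment-template idea concrete, here is a sketch of a FIS experiment template that stops one tagged EC2 instance and restarts it after five minutes. The account ID, role ARN, alarm ARN, and tag values are placeholders; the stop condition references a CloudWatch alarm so the experiment halts automatically if your error rate spikes while it runs.

```json
{
  "description": "Stop one tagged instance; bring it back after 5 minutes",
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"
    }
  ],
  "targets": {
    "oneInstance": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Env": "staging" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stopInstance": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT5M" },
      "targets": { "Instances": "oneInstance" }
    }
  }
}
```

Note the `"Env": "staging"` tag: scoping targets by tag in a lower environment is exactly the "get familiar with it before production" advice above.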

A Game Day is a great opportunity to test those playbooks and runbooks, with all the procedures that need to be executed during a failure. You will be simulating a failure to test systems, processes, and how the team responds to it all. The main goal is to develop a sort of muscle memory, if you will, so that when a failure does occur in the real world, all parties involved will know exactly what to do. Game Days often take place in production, so all necessary precautions need to be in place to ensure there is no impact on availability to customers. This includes backups, rollback procedures, transitioning databases to the standby AZ or Region, and any manual or automated process to revert everything back when the simulation ends. A few things to remember: documented procedures need constant revision and rehearsal; be sure to include key business decision makers during Game Days, not just the technical team, because they will need to make quick decisions when a real failure occurs; and any deviation from the current playbooks is fine, as long as the document is updated right away.


About the Author
Carlos Rivas
Sr. AWS Content Creator

Software Development has been my craft for over 2 decades. In recent years, I was introduced to the world of "Infrastructure as Code" and Cloud Computing.
I loved it! -- it re-sparked my interest in staying on the cutting edge of technology.

Colleagues regard me as a mentor and leader in my areas of expertise and also as the person to call when production servers crash and we need the App back online quickly.

My primary skills are:
★ Software Development ( Java, PHP, Python and others )
★ Cloud Computing Design and Implementation
★ DevOps: Continuous Delivery and Integration