Managing for Failure in AWS

In this course, we will cover a few strategies used to handle failures in AWS and how to recover from disasters.

Learning Objectives

  • Fault isolation
  • Testing reliability
  • Disaster recovery
  • Designing for auto-recovery

Intended Audience

  • Those who are already familiar with AWS infrastructure and its main components
  • DevOps engineers will also benefit from this course by expanding their knowledge in the area of infrastructure resilience, auto-recovery, and testing


Prerequisites

  • EC2 operations
  • General AWS networking knowledge
  • Familiarity with Auto Scaling
  • Cloud storage solutions

Picture this: You've finished your cloud computing studies and now you're ready for the real world, except it's a little bit different from what you did in practice. Every instance, every networking component, and every service that you use for production workloads needs to be redundant. The days of deploying a web server on a single t2.micro instance are over. Now, you need to assume that everything will fail all the time, and design your cloud workloads to recover on their own when a component stops working. It's time to manage and plan for failures.

Hello and welcome. My name is Carlos Rivas, and I'm a senior AWS content creator here at Cloud Academy. Feel free to reach out if you have any questions using the details shown on the screen, or you can always get in touch with us by email, and one of our cloud experts will reply to your question. So, who should attend this course?

Well, this course is a bit advanced, and ideally, you're already familiar with AWS infrastructure and its main components. The knowledge in this course will help you identify resources that you may need to back up or double up for redundancy. DevOps engineers will also benefit from this course by expanding their knowledge in the areas of infrastructure resilience, auto-recovery, and testing. By the end of this course, you should have a good understanding of a few strategies used to handle failures and how to recover from disasters. Some of the key points we'll be covering in this course include: backups and how to use them to recover from failure; fault isolation, to make sure one disaster doesn't create another; testing for reliability, that is, intentionally crashing one or more components of your application; disaster recovery; and designing for auto-recovery.

Some useful knowledge to have before starting includes: EC2 operations, general AWS networking knowledge, familiarity with Auto Scaling, and cloud storage solutions. Feedback on our courses here at Cloud Academy is valuable to both us as trainers and any students looking to take the same course in the future. If you have any feedback, positive or negative, it would be greatly appreciated if you could contact us. All right, let's get started.


About the Author
Carlos Rivas
Sr. AWS Content Creator

Software development has been my craft for over two decades. In recent years, I was introduced to the world of "Infrastructure as Code" and cloud computing.
I loved it! It re-sparked my interest in staying on the cutting edge of technology.

Colleagues regard me as a mentor and leader in my areas of expertise, and also as the person to call when production servers crash and we need the app back online quickly.

My primary skills are:
★ Software Development (Java, PHP, Python, and others)
★ Cloud Computing Design and Implementation
★ DevOps: Continuous Delivery and Integration

