Planning and Practicing Failure



What happens once your software is actually running in production? Ensuring that it stays up-and-running is important. And depending on what the system does, and how much traffic it needs to handle, that may not be particularly easy.

There are systems that allow developers to run their code without needing to think about it. Platform-as-a-service options like Google’s App Engine go a long way toward reducing and, in some companies, removing operations work. However, not every system can or will run on such platforms, which means that having qualified operations engineers is important.

The role of an operations engineer is continually evolving, which isn’t a surprise since technology never stops changing.

So, if the job falls on you to keep a system up-and-running, where do you start? What needs to happen? These are the questions this course aims to answer.

In this course, we take a look at some of the tasks that operations engineers need to address. I use the term operations engineer as an umbrella to cover a wide variety of job titles. Titles such as ops engineer, operations engineer, site reliability engineer, and devops engineer, among others, all fall under this umbrella.

Regardless of the title, the responsibilities involve keeping a system up and running, with little or no downtime. And that’s a tough thing to do because there are a lot of moving parts.

If you’re just starting out, and are interested in one of those roles, then the fundamentals in this course may be just what you need. These fundamentals will prepare you for more advanced courses on specific cloud providers and their certifications.

Topics such as high availability are often covered in advanced courses, but they tend to be specific to a cloud provider. This course will help you learn the basics without needing to know a specific cloud provider.

If this all sounds interesting, check it out! :)

Course Objectives

By the end of this course, you'll be able to:

  • Identify some of the aspects of being an ops engineer
  • Define why availability is important to ops
  • Define why scalability is important to ops
  • Identify some of the security concerns
  • Define why monitoring is important
  • Define why practicing failure is important

Intended Audience

This is a beginner-level course for anyone who wants to learn, though it will probably be easier if you have either:

  • Development experience
  • Operations experience

What You'll Learn

Lecture: What you'll learn
Intro: What will be covered in this course
Intro to Operational Concerns: What sort of things do operations engineers need to focus on?
Availability: What does availability mean in the context of a web application?
High Availability: How do we make systems more available than the underlying platform?
Scalability: What is scalability and why is it important?
Security: What security issues do ops engineers need to address?
Infrastructure as code: What is IaC and why is it important?
Monitoring: What things need to be monitored?
System Performance: Where are the bottlenecks?
Planning and Practicing Failure: How can you practice failure?
Summary: A review of the course



Welcome back to Introduction to Operations. I'm Ben Lambert and I'll be your instructor for this lesson.

In this lesson, we're going to talk about what it means to plan and practice failure. What would happen if one of your web servers went down? Or an entire data center went down? Do you have a plan for what will happen? If so, have you tested it? Failure is a part of technology. Components and systems will fail. Some will be under our control and others won't, and we need to not only accept that failure will happen, but embrace it. We need to identify where failures are likely to happen and make sure that our systems are more resilient. It's what we've talked about throughout this course.

We talked about how to ensure that systems are available even if there is a failure in some component. For example, if one of your web servers were to fail, would some sort of auto-scaling mechanism replace it? Or if a zone were to become unreachable (remember, zones are synonymous with data centers), would your system go down or would it recover? If you want to know how well your system tolerates failures, you need to test for it.
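To make the zone question concrete, here is a minimal sketch in Python that checks on paper whether a fleet would still have enough servers if any single zone became unreachable. The `Instance` type, the zone names, and the `min_healthy` threshold are all hypothetical and chosen only for illustration; a real check would read the inventory from your own platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical inventory record: a server name and the zone it runs in.
    name: str
    zone: str


def survives_zone_outage(instances, failed_zone, min_healthy):
    """Return True if enough instances remain after one zone becomes unreachable."""
    remaining = [i for i in instances if i.zone != failed_zone]
    return len(remaining) >= min_healthy


def weakest_zones(instances, min_healthy):
    """List the zones whose loss would drop the fleet below min_healthy."""
    zones = sorted({i.zone for i in instances})
    return [z for z in zones if not survives_zone_outage(instances, z, min_healthy)]


# Illustrative fleet: two servers in one zone, one in another.
fleet = [
    Instance("web-1", "zone-a"),
    Instance("web-2", "zone-a"),
    Instance("web-3", "zone-b"),
]
print(weakest_zones(fleet, min_healthy=2))
```

A check like this only tells you what should happen. It does not replace actually causing the failure and observing the result.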

So, how do we test for failure? The best way is to cause a controlled failure and see what happens. You can start out in an environment that mirrors production and eventually move on to see what happens in production. Netflix created a tool called Chaos Monkey that does just that: it runs in AWS and shuts down components. You don't need to use that tool. You can use something similar or even manually disable systems. Either way, causing failures and seeing what happens when the system breaks is incredibly valuable. If your system holds up when components break, then you can have a higher degree of confidence in it. If it doesn't, that's still a good outcome, because you'll learn how to make the system more resilient by seeing what happens when things break.
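The idea of a controlled failure can be sketched with a small in-memory simulation. This is not how Chaos Monkey is implemented, and it touches no real infrastructure. The `Server`, `LoadBalancer`, and `run_experiment` names are hypothetical: the experiment disables a few randomly chosen servers and then measures what fraction of requests are still served.

```python
import random


class Server:
    """Simulated server that refuses requests once it is marked unhealthy."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class LoadBalancer:
    """Round-robin balancer that skips servers that fail to respond."""

    def __init__(self, servers):
        self.servers = servers
        self._next = 0

    def route(self, request):
        # Try each server at most once before giving up.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            try:
                return server.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy servers available")


def run_experiment(servers, kill_count, requests=20, seed=0):
    """Disable kill_count random servers, then return the fraction of requests served."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    for victim in rng.sample(servers, kill_count):
        victim.healthy = False
    balancer = LoadBalancer(servers)
    served = 0
    for n in range(requests):
        try:
            balancer.route(f"req-{n}")
            served += 1
        except RuntimeError:
            pass
    return served / requests


pool = [Server(f"web-{i}") for i in range(3)]
print(run_experiment(pool, kill_count=1))
```

With one of three servers disabled, the balancer routes around the failure and every request is served. Raising `kill_count` to the size of the pool shows the opposite outcome, which is the kind of result you want to discover in a test rather than in production.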

So, you'll want to create some sort of schedule to regularly test that your system can tolerate failure without interruption to your users. Systems grow, they change, they evolve. If your systems handle failure well today, that doesn't mean they will in a month or a year. We talked about creating available systems earlier in the course and using scalability to ensure that they'll hold up to the traffic. So, if you're going to invest all of that effort in building a system that's resilient, take the extra step and prove it by testing for failures.

We won't go any further in depth here; we'll save that for a more advanced course. However, keep in mind that if we plan for failures with our architecture, we should prove it is at least as resilient as we expect it to be. In our next lesson, we're going to summarize the things we've covered throughout this course.

So, if you're ready to wrap up the course, let's dive into our final lesson.

About the Author

Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.