The Challenges of Ops



What happens once your software is actually running in production? Ensuring that it stays up-and-running is important. And depending on what the system does, and how much traffic it needs to handle, that may not be particularly easy.

There are systems that allow developers to run their code without needing to think about operations. Platform-as-a-service options like Google’s App Engine go a long way toward reducing and, in some companies, removing operations work. However, not every system can or will run on such platforms, which means that having qualified operations engineers is important.

The role of an operations engineer is continually evolving, which isn’t a surprise since change in technology never slows down.

So, if the job falls on you to keep a system up-and-running, where do you start? What needs to happen? These are the questions this course aims to answer.

In this course, we take a look at some of the tasks that operations engineers need to address. I use the term operations engineer as an umbrella to cover a wide variety of job titles. Titles such as ops engineer, operations engineer, site reliability engineer, and DevOps engineer, among others, all fall under this umbrella.

Regardless of the title, the responsibilities involve keeping a system up-and-running, with little or no downtime. And that’s a tough thing to do because there are a lot of moving parts.

If you’re just starting out, and are interested in one of those roles, then the fundamentals in this course may be just what you need. These fundamentals will prepare you for more advanced courses on specific cloud providers and their certifications.

Topics such as high availability are often covered in advanced courses; however, they tend to be specific to a cloud provider. This course will help you learn the basics without needing to know a specific cloud provider.

If this all sounds interesting, check it out! :)

Course Objectives

By the end of this course, you'll be able to:

  • Identify some of the aspects of being an ops engineer
  • Define why availability is important to ops
  • Define why scalability is important to ops
  • Identify some of the security concerns
  • Define why monitoring is important
  • Define why practicing failure is important

Intended Audience

This is a beginner-level course for anyone who wants to learn. It will probably be easier if you have either:

  • Development experience
  • Operations experience

Optional Pre-Requisites

What You'll Learn

Lecture | What you'll learn
Intro | What will be covered in this course
Intro to Operational Concerns | What sort of things do operations engineers need to focus on?
Availability | What does availability mean in the context of a web application?
High Availability | How do we make systems more available than the underlying platform?
Scalability | What is scalability and why is it important?
Security | What security issues do ops engineers need to address?
Infrastructure as code | What is IaC and why is it important?
Monitoring | What things need to be monitored?
System Performance | Where are the bottlenecks?
Planning and Practicing Failure | How can you practice failure?
Summary | A review of the course



Welcome back to Introduction to Operations. I'm Ben Lambert and I'll be your instructor for this lesson.

In this lesson, we'll talk about the role of operations, often called ops, and what it is an operations engineer does.

If you watched the course on continuous delivery, you saw that we had a couple of methods for getting software deployed into a production environment with minimal downtime.

But what happens after you've deployed to production?
How do you ensure your systems stay up and running?
How do you keep them secure?
How do you make them perform quickly?
How do you manage the changes in your environment in a sustainable way?

These are some of the tasks that operations engineers are responsible for handling, either in part or in full, and they are also the topics of the coming lessons. We'll go into more depth on each of them as the course moves on.

The role of an operations engineer has been changing over the years, and with the wide-scale adoption of cloud infrastructure it continues to evolve to meet the demands of the industry. As technology advances and Internet connections continue to get faster, people are consuming ever larger amounts of content.
This demand for more content brings with it a host of challenges that need to be addressed, such as keeping systems highly available.

Have you ever got home from a long day, turned on the television, fired up your favorite streaming device, maybe an Apple TV, Chromecast, Roku or something similar, clicked on the Netflix app, and received an error message indicating that for some reason the system is temporarily down? If you answered yes, how did you feel about it?

If you were to search on Twitter for people who are experiencing Netflix outages, you'd see that some people are rather passionate about how their life is over because of a brief service interruption. Netflix has a system that's so resilient that these sorts of rare outages cause people to have very strong feelings about them. Systems like Netflix, Etsy, and Dropbox, among others, have spoiled us as data consumers.
So high availability has become a part of the collective consciousness. As users of systems, we expect these systems to be available to us whenever we want to use them and that goes doubly for systems that we pay for.

Another challenge that comes along with the increased demand for content is data security. What was the most recent data leak you recall hearing about? Have you gotten to the point where you don't notice them anymore because they're so frequent? Have you ever had your data included in one of the many leaks? Or what about identity theft, have you ever experienced that? If so, how long did it take to get that cleared up?
When we use different systems, we're trusting our data to these companies and to the people that run those systems. So the more sensitive the data, the more security we expect around that data.

System performance is another challenge, and it's something that can be difficult to manage since the perception of speed can be subjective. A website that loads in two seconds may seem quick to some, but slow to others. Have you ever clicked on a link, waited a few seconds for the page to load, and then nothing happened? How much longer do you wait? We've been spoiled by so many good examples of high-performance systems that we're losing our web patience. There are a lot of components to a modern web-based application and every component offers some potential for added latency.
Because perceived performance is subjective, operations engineers need to base the work that they do on solid metrics or they're going to be spinning their wheels. Establishing a baseline for system and application performance is crucial to improving that performance.
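To make the idea of a baseline a little more concrete, here is a minimal sketch in Python. It assumes you have already collected response times in milliseconds from somewhere (access logs, a load test, a monitoring tool). The function names, the choice of median, 95th and 99th percentiles, and the 10% tolerance are illustrative assumptions, not something prescribed by this course.

```python
import statistics


def latency_baseline(samples_ms):
    """Summarize response-time samples (milliseconds) into a simple baseline.

    Hypothetical helper for illustration; the chosen percentiles are an assumption.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "count": len(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "p95_ms": cuts[94],  # 95th percentile
        "p99_ms": cuts[98],  # 99th percentile
        "max_ms": max(samples_ms),
    }


def is_regression(baseline, current, tolerance=0.10):
    """Return True if the current p95 exceeds the baseline p95 by more than `tolerance`.

    The 10% default is an arbitrary example threshold.
    """
    return current["p95_ms"] > baseline["p95_ms"] * (1 + tolerance)
```

With something like this in place, "the site feels slow" becomes a comparison between two sets of numbers: a summary taken when the system was known to be healthy and a summary taken now. Percentiles are used here rather than an average because a handful of very slow requests can hide behind a reasonable-looking mean.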

Based on these three issues alone, availability, security, and performance, you may have noticed that operations engineers have their work cut out for them, and that's true. Operations engineers have to solve some difficult problems. Some have readily available solutions, and others will be new and possibly unique to a specific problem domain. No matter what the current evolution of the role is, and no matter what job title they go by, be it ops, ops engineer, site reliability engineer, DevOps engineer or something else, these responsibilities, among others, fall to them. So for this course, we're not gonna get caught up in titles and instead we'll talk about the work that needs to be done to operate modern applications at scale.

In the next lesson, we're gonna dive into availability. It's a fairly broad topic with a lot to cover. So if you're ready, let's dive in!

About the Author

Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.
