1h 21m

What happens once your software is actually running in production? Keeping it up and running is important, and depending on what the system does and how much traffic it needs to handle, that may not be particularly easy.

There are systems that allow developers to run their code without needing to think about operations. Platform-as-a-service options like Google’s App Engine go a long way toward reducing and, in some companies, removing operations work. However, not every system can or will run on such platforms, which means that having qualified operations engineers is important.

The role of an operations engineer is continually evolving, which isn’t a surprise since changes in technology never slow down.

So, if the job falls on you to keep a system up-and-running, where do you start? What needs to happen? These are the questions this Course aims to answer.

In this Course, we take a look at some of the tasks that operations engineers need to address. I use the term operations engineer as an umbrella to cover a wide variety of job titles. Titles such as ops engineer, operations engineer, site reliability engineer, and devops engineer, among others, all fall under this umbrella.

Regardless of the name of the title, the responsibilities involve keeping a system up-and-running, with little or no downtime. And that’s a tough thing to do because there are a lot of moving parts.

If you’re just starting out, and are interested in one of those roles, then the fundamentals in this Course may be just what you need. These fundamentals will prepare you for more advanced Courses on specific cloud providers and their certifications.

Topics such as high availability are often covered in advanced Courses; however, they tend to be specific to a cloud provider. This Course will help you learn the basics without needing to know a specific cloud provider.

If this all sounds interesting, check it out! :)

Course Objectives

By the end of this Course, you'll be able to:

  • Identify some of the aspects of being an ops engineer
  • Define why availability is important to ops
  • Define why scalability is important to ops
  • Identify some of the security concerns
  • Define why monitoring is important
  • Define why practicing failure is important

Intended Audience

This is a beginner-level Course for anyone who wants to learn. It will probably be easier if you have either:

  • Development experience
  • Operations experience

What You'll Learn

Lecture | What you'll learn
Intro | What will be covered in this Course
Intro to Operational Concerns | What sort of things do operations engineers need to focus on?
Availability | What does availability mean in the context of a web application?
High Availability | How do we make systems more available than the underlying platform?
Scalability | What is scalability and why is it important?
Security | What security issues do ops engineers need to address?
Infrastructure as code | What is IaC and why is it important?
Monitoring | What things need to be monitored?
System Performance | Where are the bottlenecks?
Planning and Practicing Failure | How can you practice failure?
Summary | A review of the Course



Welcome back to Introduction to Operations. I'm Ben Lambert and I'll be your instructor for this lesson.

In this lesson we're going to summarize what we've covered throughout the course. We've covered a lot and it's easy for important information to get lost or forgotten. What are the key points that we should take away from this course?

First, availability is about having a system up and running and usable by its user base when they need it. While keeping a system up and running is a technical issue, the required uptime is a business decision. Not all systems are equal, and they don't all require the same level of architecture.
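To make the uptime decision concrete, here is a minimal sketch that converts an availability target into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration, not something taken from the course.

```python
# Hypothetical helper: translate an availability target into a downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Return the minutes of downtime per year that a target permits."""
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9% availability allows roughly 8.76 hours of downtime per year, while 99.99% allows under an hour. Numbers like these are what the business side has to weigh against the cost of the extra architecture.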

Next, to help systems remain available under high load, you'll need to scale your system. You can scale it up or you can scale it out. Scaling up tends to be a bit slower to do and will reach a hardware cap at some point. However, it is a viable option for a lot of systems that grow slowly. Scaling out tends to be a bit faster to do and allows for a higher degree of scalability; however, your application needs to support it. If your servers store data locally that needs to be shared, then your application needs to be refactored.
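The problem with locally stored data can be sketched in a few lines. This is a toy model, not a real web stack: the `Server` class, the round-robin balancer, and the plain dictionary standing in for a shared session store are all hypothetical names chosen for illustration.

```python
from itertools import cycle


class Server:
    """Toy app server. Without a shared store it keeps sessions in its own memory."""

    def __init__(self, session_store=None):
        # Assumption: a plain dict stands in for an external shared store.
        self.sessions = {} if session_store is None else session_store

    def login(self, user):
        self.sessions[user] = True

    def is_logged_in(self, user):
        return self.sessions.get(user, False)


def login_then_check(servers, user):
    """Send the login to one server and the follow-up request to the next one."""
    balancer = cycle(servers)  # simple round-robin load balancing
    next(balancer).login(user)
    return next(balancer).is_logged_in(user)


# Two servers that each keep their own local session data.
local = [Server(), Server()]

# Two servers that share one session store.
shared_store = {}
shared = [Server(shared_store), Server(shared_store)]

if __name__ == "__main__":
    print("local state:", login_then_check(local, "alice"))
    print("shared state:", login_then_check(shared, "alice"))
```

With local state, the second server has never seen the login, so the follow-up request fails. Moving the session data into a store that every server can reach is the kind of refactoring that scaling out requires.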

Next, security should be a role shared by all. It's not just for security engineers. When it comes to preventing a distributed denial-of-service attack, and security in general, you should aim to reduce the attack surface. Also, your system needs to be scalable in order to handle any increased load that does make it through. Keeping systems up to date with patches is important as well, since many targeted attacks rely on unpatched systems.

Next up, managing complex infrastructure can be challenging. However, having your infrastructure in code helps make managing it more consistent, reproducible, and faster, all while keeping it under version control.

Next, we have monitoring. It's important to know what happens with your application and your infrastructure. Monitor everything it makes sense to monitor and try to get all the data into one location.

Next, performance tuning takes effort. Don't spin your wheels. Know what the acceptable performance threshold is and use the data from your monitoring to determine if your changes are useful.
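As a rough sketch of using monitoring data to judge a change, the example below compares 95th-percentile latency before and after a tuning attempt against an agreed threshold. The sample values, the 250 ms threshold, and the function names are assumptions for illustration only.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_threshold(latencies_ms, threshold_ms, pct=95):
    """True if the chosen percentile latency is at or below the threshold."""
    return percentile(latencies_ms, pct) <= threshold_ms


# Hypothetical latency samples (milliseconds) pulled from monitoring.
before_change = [120, 80, 100, 90, 300]
after_change = [110, 70, 95, 85, 200]
THRESHOLD_MS = 250  # assumed acceptable performance threshold

if __name__ == "__main__":
    print("before:", meets_threshold(before_change, THRESHOLD_MS))
    print("after:", meets_threshold(after_change, THRESHOLD_MS))
```

Once the measured percentile is under the threshold, further tuning is optional; that is the point at which you can stop spinning your wheels.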

And finally, plan and test for failures. It's one of the best ways to ensure that your system is going to remain available in the face of component outages.

And this wraps up our introduction to operations. I hope this has been useful to you. I know I've enjoyed creating it. If you have any questions, please feel free to reach out to me on the community forums. My fellow instructors and I hang out there and we love talking with you guys. I hope to hear from you.

For Cloud Academy, I'm Ben Lambert. Thanks for watching!

About the Author

Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.

Covered Topics