Introduction to Operations
What happens once your software is actually running in production? Ensuring that it stays up-and-running is important. And depending on what the system does, and how much traffic it needs to handle, that may not be particularly easy.
There are systems that will allow developers to run their code and not need to think about it. Platform-as-a-service options like Google’s App Engine go a long way toward reducing and, in some companies, removing operations work. However, not every system can or will run on such platforms, which means that having qualified operations engineers is important.
The role of an operations engineer is continually evolving, which isn’t a surprise since changes in technology never slow down.
So, if the job falls on you to keep a system up-and-running, where do you start? What needs to happen? These are the questions this course aims to answer.
In this course, we take a look at some of the tasks that operations engineers need to address. I use the term operations engineer as an umbrella to cover a wide variety of job titles. Titles such as ops engineer, operations engineer, site reliability engineer, and devops engineer, among others, all fall under this umbrella.
Regardless of the name of the title, the responsibilities involve keeping a system up-and-running, with little or no downtime. And that’s a tough thing to do because there are a lot of moving parts.
If you’re just starting out, and are interested in one of those roles, then the fundamentals in this course may be just what you need. These fundamentals will prepare you for more advanced courses on specific cloud providers and their certifications.
Topics such as high availability are often covered in advanced courses; however, they tend to be specific to a cloud provider. This course will help you learn the basics without needing to know a specific cloud provider.
If this all sounds interesting, check it out! :)
By the end of this course, you'll be able to:
- Identify some of the aspects of being an ops engineer
- Define why availability is important to ops
- Define why scalability is important to ops
- Identify some of the security concerns
- Define why monitoring is important
- Define why practicing failure is important
This is a beginner-level course for anyone who wants to learn, though it will probably be easier if you have either:
- Development experience
- Operations experience
What You'll Learn
|Lecture||What you'll learn|
|Intro||What will be covered in this course|
|Intro to Operational Concerns||What sort of things do operations engineers need to focus on?|
|Availability||What does availability mean in the context of a web application?|
|High Availability||How do we make systems more available than the underlying platform?|
|Scalability||What is scalability and why is it important?|
|Security||What security issues do ops engineers need to address?|
|Infrastructure as code||What is IaC and why is it important?|
|Monitoring||What things need to be monitored?|
|System Performance||Where are the bottlenecks?|
|Planning and Practicing Failure||How can you practice failure?|
|Summary||A review of the course|
Welcome back to Introduction to Operations, I'm Ben Lambert, and I'll be your instructor for this lesson.
In this lesson, we'll talk about how infrastructure is changing and what we need to do to keep up. We'll talk about infrastructure as code, what it is, and why it's become the new normal. And we'll look at some examples of what infrastructure as code looks like with different cloud providers.
Let's look back at one of the system designs we created earlier. Looking at this design, how long do you think it would take for you to create this environment on a Cloud provider's platform? An hour, two, three, maybe? I suppose it depends on the platform we're using. What if you needed to create three of these? One for developers, one for exploratory testing, and one for production?
If you had to do this manually, you might make mistakes when creating the environment, and some environments might end up slightly different. Maybe you forget some firewall settings, or maybe you install the servers with a slightly different version of the operating system. It's easy to make these kinds of mistakes, and these seemingly small mistakes can cause problems because the environments you're deploying to just don't match. Having just one third party system with a different version could bring down an entire site. Doing these sorts of things manually is certainly possible; however, it's slower, it lacks consistency, and it doesn't allow for easily reproducible environments.
These are some of the things that having infrastructure in code helps to solve. So what is infrastructure as code? Infrastructure as code, often abbreviated as IaC, is a technique where we define what our infrastructure should look like in some textual format. We use the word code, and we often use it interchangeably to mean a programming language. However, in this context, it doesn't have to be a programming language; it could be some sort of configuration format or object notation.
So YAML or JSON are perfectly valid and even common formats for IaC. However, it could be a full programming language, such as Python, Ruby, or anything else.
So, IaC allows us to specify what an environment should look like, and then use whatever IaC tool we choose to actually create that environment. That means we write out in some format, say YAML, what our environment should look like. We tell it we want two zones, each with auto scaling mechanisms for the virtual machines, a database with a read-only replica, and a cross-zone load balancer. Then we provide some info to that tool, such as the account credentials for our cloud provider, and we run it. Whether we run it once or a hundred times, the results are going to be the same. IaC gives us a fast and predictable way to create and manage systems.
Let's use an example. Here we have a snippet from an Ansible Playbook that will create five Amazon EC2 Instances based on some initial virtual machine images. If you ran that, a few minutes later you would have five servers sitting and waiting for you. So having some sort of codified version of your infrastructure allows you to create and recreate the environment you need when you need it, and you'll be able to make changes in that environment in a consistent way.
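To make the "once or a hundred times" idea concrete, here is a minimal Python sketch. It is not the Ansible snippet from the lesson and it doesn't talk to any real cloud provider: `FakeCloud`, `plan`, `apply`, and the image name are all made up for illustration. It only shows the general pattern of comparing a desired state against what exists and changing just the difference.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """In-memory stand-in for a cloud provider (hypothetical, illustration only)."""
    instances: dict = field(default_factory=dict)  # instance name -> image name


def plan(desired: dict, cloud: FakeCloud) -> dict:
    """Compare the desired state with what exists and list the changes needed."""
    to_create = {n: img for n, img in desired.items() if n not in cloud.instances}
    to_update = {n: img for n, img in desired.items()
                 if n in cloud.instances and cloud.instances[n] != img}
    to_delete = [n for n in cloud.instances if n not in desired]
    return {"create": to_create, "update": to_update, "delete": to_delete}


def apply(desired: dict, cloud: FakeCloud) -> int:
    """Make the fake cloud match the desired state; return how many changes were made."""
    changes = plan(desired, cloud)
    cloud.instances.update(changes["create"])
    cloud.instances.update(changes["update"])
    for name in changes["delete"]:
        del cloud.instances[name]
    return len(changes["create"]) + len(changes["update"]) + len(changes["delete"])


# Desired state: five web servers built from the same (made-up) image name.
desired_state = {f"web-{i}": "base-image-v1" for i in range(1, 6)}

cloud = FakeCloud()
first_run = apply(desired_state, cloud)   # creates the five instances
second_run = apply(desired_state, cloud)  # nothing left to change
print(first_run, second_run)  # prints "5 0"
```

The second run makes no changes because the environment already matches the definition, which is the property that makes repeated runs predictable.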
If you use IaC tools to create an environment and then go and change it manually, you risk having an environment that can't be reproduced easily, because your IaC won't be in sync with your environment. So IaC should serve as the canonical source for changes to the environment. Having a single method for making changes also allows systems to be audited for compliance reasons. Now, what happens if we make a change that breaks something? This is where having things in code is even more valuable. Because our infrastructure is in code, we can have it under version control. Having it versioned would allow us to roll back to the previous version and run that.
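Both ideas here, drift from manual changes and rolling back to a previous version, can be sketched in a few lines of Python. This is an illustration under simple assumptions: the environment is described by a flat dictionary, the `history` list stands in for commits in version control, and `find_drift` and `previous_version` are hypothetical helper names, not part of any IaC tool.

```python
def find_drift(declared: dict, actual: dict) -> dict:
    """Return every setting whose actual value differs from the declared one."""
    drift = {}
    for key in declared.keys() | actual.keys():
        if declared.get(key) != actual.get(key):
            drift[key] = {"declared": declared.get(key), "actual": actual.get(key)}
    return drift


def previous_version(history: list) -> dict:
    """Return the definition that was in place before the latest change."""
    return history[-2]


# Versions of the environment definition, oldest first (stand-in for commits).
history = [
    {"os_version": "22.04", "open_ports": [443]},
    {"os_version": "24.04", "open_ports": [443]},  # suppose this change broke something
]

# Someone opened port 8080 by hand, so the environment no longer matches the code.
actual_environment = {"os_version": "24.04", "open_ports": [443, 8080]}

drift = find_drift(history[-1], actual_environment)
rollback_target = previous_version(history)
print(drift)
print(rollback_target)
```

Rolling back then just means running the IaC tool again with `rollback_target` as the definition, and the drift report shows what a manual change left out of sync.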
So in some cases it can provide a rollback mechanism. So IaC allows us to specify what our infrastructure should look like, and then make it happen. It allows us to create and update environments very quickly once we've written up that code. It ensures consistency. There are a lot of settings that go along with configuring the network: firewall settings, access control lists, et cetera. And the same applies for the rest of the infrastructure. Having it all written down in code allows it to be reproduced without mistakes. This consistency tends to save money. It can reduce the time that we spend diagnosing why something works in one environment and not another, and it can reduce the amount of time it takes to implement changes. Patching an operating system on a hundred servers manually, or even with scripts, is a pain. It's not impossible, it's just a pain.
Having it specified in code, with some sort of IaC tool, allows us to patch just the right servers and audit that it happened correctly. Using an IaC tool allows operations engineers to roll out security patches with relative ease, and audit that all of the systems were patched and none of them were missed. Keeping systems up-to-date can be time consuming without something like this. And not keeping systems up-to-date can leave our systems vulnerable to attack. Infrastructure as code has become the new normal. Manual efforts or collections of random scripts just don't scale well, and end up causing inconsistencies.
Modern systems are becoming more complex, partly because the cloud offers us the ability to create the types of robust systems that were once the domain of larger companies. So we can use cloud providers to create a system that spans multiple data centers and multiple geographic regions, with edge locations to cache and serve up content to our users from the closest point. We can't tame this sort of complexity with manual efforts easily. Specifying how our system should be configured in some textual format is our current best method for managing this sort of thing. Third party tools such as Chef, Puppet, and Ansible recognize this and have evolved to help manage infrastructure as code. However, so have the cloud providers. AWS, Google Cloud and Azure all have some form of template that will allow you to specify how an environment should be set up. Let's take a quick look at their templates.
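The audit step can be pictured as a simple check over an inventory. The sketch below is only an illustration: the server names, the version tuples, and the `audit_patch` helper are invented, and a real IaC tool would collect this information from the servers for you.

```python
def audit_patch(inventory: dict, required: tuple) -> list:
    """Return the names of servers whose installed version is older than required."""
    return sorted(name for name, version in inventory.items() if version < required)


# Hypothetical inventory: server name -> installed package version as (major, minor, patch).
inventory = {
    "web-1": (1, 4, 2),
    "web-2": (1, 4, 2),
    "db-1": (1, 3, 9),  # this one was missed
}

missed = audit_patch(inventory, required=(1, 4, 2))
print(missed)  # prints "['db-1']"
```

An empty list would mean every server in the inventory is at or above the required version.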
Now, we're not going to go into depth; however, I want to give you an idea of what tools are available. AWS has CloudFormation Templates. Here's an example of using a CloudFormation Template to create an EC2 Instance. If you're not familiar with AWS, that just means creating a virtual machine. The template in this example uses a JSON format to specify what resources you want and to create or remove them. Next, we have Google Cloud Deployment Manager. This uses a YAML format. In this example, we're creating a virtual machine instance similar to how we did with CloudFormation, but we're just using a different file format. And if we were to look at a similar example for Azure using Resource Templates, you'll see that it's just like CloudFormation using JSON. This is a truncated example of creating a Windows virtual machine instance.
So, you could use third party tools such as Chef, Ansible, or Puppet; however, you can also use the option from your cloud provider. It depends on what you're trying to do. IaC is an important part of the modern way to manage software infrastructure. And IaC tools are something that operations engineers need to understand. Future courses will likely cover some of these tools in depth. So we'll end our discussion of IaC here; however, keep in mind that no matter which IaC tools you learn, the basic concepts are going to be the same across them all. So, what you've learned in this lesson should help to prepare you for when you do learn more.
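For readers without the slides, here is a rough idea of the shape of a CloudFormation template, built as a Python dictionary and printed as JSON. This is not the template shown in the lesson. The logical name `WebServer` is chosen for this example, and the `ImageId` value is a placeholder that would need to be replaced with a real machine image ID before the template could be used.

```python
import json

# Minimal CloudFormation-style template describing one EC2 instance.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Single EC2 instance (illustrative sketch)",
    "Resources": {
        "WebServer": {  # logical name, chosen for this example
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-EXAMPLE",  # placeholder, not a real image ID
                "InstanceType": "t2.micro",
            },
        }
    },
}

# Serialize to the JSON text that would be saved as the template file.
template_json = json.dumps(template, indent=2)
print(template_json)
```

The Google Cloud and Azure formats differ in syntax and field names, but the idea is the same: a text file lists the resources you want, and the provider's tooling creates them.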
In our next lesson, we're going to talk about monitoring, and why it's an important thing for operations. Okay, let's dive in.
Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.