What happens once your software is actually running in production? Ensuring that it stays up-and-running is important. And depending on what the system does, and how much traffic it needs to handle, that may not be particularly easy.
There are systems that allow developers to run their code without thinking much about the infrastructure beneath it. Platform-as-a-service options like Google's App Engine go a long way toward reducing, and in some companies removing, the operations workload. However, not every system can or will run on such platforms, which means that having qualified operations engineers remains important.
The role of an operations engineer is continually evolving, which isn't a surprise, since change in technology never slows down.
So, if the job falls on you to keep a system up-and-running, where do you start? What needs to happen? These are the questions this Course aims to answer.
In this Course, we take a look at some of the tasks that operations engineers need to address. I use the term operations engineer as an umbrella to cover a wide variety of job titles. Titles such as ops engineer, operations engineer, site reliability engineer, and DevOps engineer, among others, all fall under this umbrella.
Regardless of the title, the responsibilities involve keeping a system up-and-running with little or no downtime. And that's a tough thing to do, because there are a lot of moving parts.
If you’re just starting out, and are interested in one of those roles, then the fundamentals in this Course may be just what you need. These fundamentals will prepare you for more advanced Courses on specific cloud providers and their certifications.
Topics such as high availability are often covered in advanced Courses; however, those tend to be specific to a cloud provider. This Course will help you learn the basics without needing to know a specific cloud provider.
If this all sounds interesting, check it out! :)
Course Objectives
By the end of this Course, you'll be able to:
- Identify some of the aspects of being an ops engineer
- Define why availability is important to ops
- Define why scalability is important to ops
- Identify some of the security concerns
- Define why monitoring is important
- Define why practicing failure is important
Intended Audience
This is a beginner level Course for anyone who wants to learn, though it will probably be easier if you have either:
- Development experience
- Operations experience
Optional Pre-Requisites
What You'll Learn
Lecture | What you'll learn |
---|---|
Intro | What will be covered in this Course |
Intro to Operational Concerns | What sort of things do operations engineers need to focus on? |
Availability | What does availability mean in the context of a web application? |
High Availability | How do we make systems more available than the underlying platform? |
Scalability | What is scalability and why is it important? |
Security | What security issues do ops engineers need to address? |
Infrastructure as code | What is IaC and why is it important? |
Monitoring | What things need to be monitored? |
System Performance | Where are the bottlenecks? |
Planning and Practicing Failure | How can you practice failure? |
Summary | A review of the Course |
Welcome back to Introduction to Operations. I'm Ben Lambert, and I'll be your instructor for this lesson.
In this lesson, we'll talk about monitoring, and why it's an important part of Operations. Even the simplest of applications has a lot of moving parts, and produces a lot of data and logs. Let's imagine you have a basic web application running behind a load balancer, with a couple of servers and using a SQL database. So we'll have logs from the load balancer, the web servers, the applications on each virtual machine, and maybe from some reverse proxy, and any third-party systems that the application uses, and then from the database.
That ends up being a lot of moving parts. Do you know how well they're interacting? Can you pinpoint where performance issues are occurring? Monitoring will help you to measure the health of your system. It's important because if you don't know how your system is performing, you have no baseline for improvements. And you'll probably end up being surprised when a component of your system goes down. So, monitoring is an important thing because we need to know how different components that comprise our system are performing. But, with so many different things that need to be monitored, where do you start?
Here's a list of some of the things that you'll need to monitor: application performance, server performance, and cloud resources. Let's go through these and talk about how we can monitor each one, starting with application performance. Knowing what your application code is doing, and having an understanding of how it's behaving, is going to help you identify performance issues. And to do that, you'll need to use an application performance monitoring tool, also called an APM tool. These tools allow you to see what's happening at the code level. They can show you the methods being called and the SQL queries that are being run. Having this level of visibility will help developers know exactly where problems are occurring.
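Real APM agents instrument your code automatically, but to make the idea concrete, here's a toy sketch in Python of the core concept: wrapping a function so that each call's duration is recorded, the way an APM records transaction traces. The function name and timings are illustrative, not any particular APM's API.

```python
import functools
import time

def timed(fn):
    """Toy stand-in for what an APM agent does automatically:
    wrap a function and record how long each call takes."""
    timings = []

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Record (function name, elapsed seconds) for later inspection.
            timings.append((fn.__name__, time.perf_counter() - start))

    wrapper.timings = timings
    return wrapper

@timed
def load_search_page():
    time.sleep(0.01)  # simulate some work, e.g. queries and rendering
    return "ok"

load_search_page()
name, elapsed = load_search_page.timings[0]
```

A production APM does this across every request and query, aggregates the data, and surfaces the slowest spots, but the raw ingredient is just this kind of timing data.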
Let's imagine that you're in charge of running a small website. You're using an APM to monitor your application's performance, and the developers finish a set of features, commit it to git, and push it to GitHub. Your CI/CD process runs the tests, they all pass, and the change gets deployed to production. Over the next day or so, you notice that the average page load time for your search page has increased by a full second. It's not enough to cause any of the automated tests to throw up red flags and stop the build, but you still notice that it's taking a bit longer. You see that the page load function is calling a function that makes 30 SQL queries, with the only difference being the value in the where clause. It looks like it's being used to generate the facets in the left-hand sidebar for your faceted search. You bring this to the developer, and they know right away what's happening: the ORM was set to use lazy loading by default. So they switch it to eager loading, and now, in place of those 30 queries, there's only one. After the change is made and deployed, you find that the page load time is basically back to normal. If you didn't have this sort of quantifiable information, what do you think would have happened when you told the developers that the page seemed slower? Maybe it would have been fixed, and maybe not. But since you did have the data, it became easier both to show the problem and to show where it was located. Now, in case you're wondering, I have had a similar experience; only in that case, it was much worse than a single second of additional page load time.
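The lazy-versus-eager pattern in that story can be sketched without any ORM at all. This hypothetical example uses Python's built-in sqlite3 with a made-up `products` table to show the difference: one query per facet versus a single aggregated query that returns the same counts.

```python
import sqlite3

# Hypothetical schema just to illustrate the N+1 query pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO products (category) VALUES (?)",
    [("books",), ("books",), ("music",), ("games",)],
)

categories = ["books", "music", "games"]

# Lazy-loading style: one query per facet (N queries total).
lazy_counts = {}
for cat in categories:
    row = conn.execute(
        "SELECT COUNT(*) FROM products WHERE category = ?", (cat,)
    ).fetchone()
    lazy_counts[cat] = row[0]

# Eager style: a single aggregated query replaces all of them.
eager_counts = dict(conn.execute(
    "SELECT category, COUNT(*) FROM products GROUP BY category"
))

assert lazy_counts == eager_counts  # same data, far fewer round trips
```

With 30 facets and a network round trip per query, the per-facet version adds up fast, which is exactly the kind of pattern an APM's query trace makes visible.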
So how do you go about finding an APM tool? There are two hosted SaaS options that I like: New Relic's APM and AppDynamics' APM. They both offer similar functionality and help you gain insight into the interactions of both your apps and microservices. They can show you your slowest queries and slowest requests, and help provide insight into why they're slow. There are also APM tools that you can run yourself, such as Appdash; depending on your needs, something like that may work well. Oftentimes, integrating an APM tool into your application happens outside of the code and can be done without requiring any development effort.
Next, let's talk about server performance. With all of the software running on our servers, we have a lot of things that could break. A process could lock up, it could consume all of the memory and hinder other apps, or it could crash and cause other services that rely on it to crash. If we don't know what's happening under the hood, we can only guess and hope that everything is going well. We need some sort of quantifiable information, the same way we gained quantifiable information from APM tools. Server monitoring should, ideally, be done in one central location, giving you the ability to see how all of your servers are performing, regardless of whether they're in the cloud, on-site, or in some hybrid setup. A server monitoring tool should show you all of your servers, let you drill into a single server to see details about running processes, memory usage, and disk and network IO, and, ideally, allow you to set alerts.
There are a lot of tools available to select from. New Relic and AppDynamics both have good options; Splunk and Nagios are others, and DataDog is another good option. It's one that I like: it has a hosted option, a lot of integrations, and the ability to create your own dashboards. So, there are a lot of good options, and we've only named a few. Diving into these in depth is a bit outside the scope of this course, but I recommend taking the time to check out some of these options. Server monitoring is important. There are a lot of things happening on any given server. Do you have services running that should be disabled, eating up system resources? Is your application leaking memory? If you're not familiar with that term, a memory leak is when an application doesn't release memory when it's done using it, so it just keeps acquiring and holding on to more. Server monitoring will help you identify these and other problems, hopefully well before any sort of outage.
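At their core, those server monitoring checks are just periodic measurements compared against thresholds. As a minimal sketch, assuming nothing beyond the Python standard library, here's a disk usage check of the kind a monitoring agent would run on a schedule; the function names and the 90% threshold are illustrative choices, not any tool's defaults.

```python
import shutil

def disk_usage_percent(path="/"):
    """Return used disk space as a percentage for the given mount point."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def check_disk(path="/", threshold=90.0):
    """Return an alert message if usage crosses the threshold, else None.

    A real agent would ship this metric to a central system rather than
    returning a string, but the threshold logic is the same idea.
    """
    pct = disk_usage_percent(path)
    if pct >= threshold:
        return f"ALERT: {path} is {pct:.1f}% full"
    return None
```

Real tools add the hard parts (scheduling, history, dashboards, deduplicated notifications), but it helps to know that the underlying signal is this simple.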
Next, let's talk about cloud resource monitoring. Cloud providers offer a lot of resources that allow developers and operations to focus on what they do best. Services such as AWS Lambda offer the ability to run code in a fully managed container, which means one less server to manage. However, just like when you're running code on your own server, you want to know that it's performing within some acceptable threshold. Or, if you use something like Google's Container Engine, you probably want to be able to monitor what's happening inside the containers. The more you know about how your code and services are running, the more informed your decisions are. Cloud providers understand that this information is important, which is why they provide services to monitor cloud resources. AWS has CloudWatch, which allows you to monitor cloud resources and even include your own custom metrics through their API. And Google acquired Stackdriver and has integrated it not only into their own cloud platform, but it can monitor AWS as well. We won't go into detail on this; there are other courses that cover cloud monitoring, though I did at least want to mention it here to get you thinking about it. We mentioned earlier that even the smallest application produces a lot of log files. So how do we manage all of those logs without drowning in them? We need to have them in one central location.
If your logs are stored across different services and servers, you won't have a complete picture of what's happening. And if you replace a virtual machine with another one, or just terminate it as part of a scaling-down process, you'll lose the logs on it unless you're collecting them in one place. There are a lot of hosted options for centralized logging. Loggly, Logstash, Splunk, Papertrail, and Sumo Logic, among others, offer some mechanism to get all of your logs in one place and parse them for notices, warnings, errors, etc. And there are other options that you can run yourself.
Tools such as Graylog are open source, and something you can install and host in your own environment. Logs provide a lot of information about how your software is running, what kinds of problems it may be experiencing, and potential security threats. So being able to mine these logs has tremendous value to different departments. Once the logs are in a central location, all of those departments can parse them and look for the things that are relevant to them. Whether you go with a hosted solution or host it yourself, make sure you take the time to find the right solution for you.
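One practical step that makes centralized logging easier, whichever tool you choose, is emitting structured logs. As a sketch using only Python's standard logging module, here's a minimal JSON formatter; most log aggregators can ingest one-JSON-object-per-line output without custom parsing rules. The field names here are an illustrative choice, not a standard.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line, so a
    centralized log system can parse fields instead of raw text."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("webapp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits: {"level": "INFO", "logger": "webapp", "message": "search took 412ms"}
logger.info("search took %sms", 412)
```

In a real deployment you'd also include a timestamp, host, and request ID, and point the handler at a file or log shipper rather than stderr, but the structured-output idea is the same.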
Once you have all of these logs and all of the metrics from your system, what's next? Depending on the systems we select to store that data in, we'll have the ability to create alerts. An alert lets you specify a condition, and if that condition is met, one or more team members get a notification. As an example, if a service begins to take too long to respond, you may want to trigger an alert. So if, say, a web server's response time goes from an average of 400 milliseconds to 5 seconds, that may be something you want to know about right away and assign someone to look into. To handle the potential flood of alerts, tools like PagerDuty and VictorOps can help.
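That 400-milliseconds-to-5-seconds condition can be sketched as a tiny alerting rule. This is a toy illustration of the concept, not how any of the tools above implement it: track a rolling average of response times and flag when it crosses a threshold.

```python
from collections import deque

class ResponseTimeAlert:
    """Toy alerting rule: keep a rolling window of response times and
    fire when the window's average exceeds a threshold."""

    def __init__(self, threshold_ms, window=10):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)  # oldest samples fall off

    def record(self, latency_ms):
        """Record one sample; return an alert message or None."""
        self.samples.append(latency_ms)
        avg = sum(self.samples) / len(self.samples)
        if avg > self.threshold_ms:
            return f"ALERT: avg response time {avg:.0f}ms exceeds {self.threshold_ms}ms"
        return None

alert = ResponseTimeAlert(threshold_ms=1000)
for latency in [400, 420, 390]:
    assert alert.record(latency) is None  # healthy baseline, no alert
print(alert.record(5000))  # a spike pushes the average over the threshold
```

Averaging over a window instead of alerting on a single slow request is a common way to avoid paging someone over one-off blips; real systems layer on routing, escalation, and deduplication from there.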
They work by letting you send all of your logs and metrics to them, and their systems allow you to trigger alerts based on the conditions you want. Systems like this are important because you'll have a lot of data and logs, and if alerts are triggered from different systems, it can be difficult to know where exactly the problem really is and who should take care of it. If everything you monitor is under a single user interface, then you have one location that everyone uses, so everyone is working from the same information. And once you have all of this information, different teams can use it to make decisions.
We've covered a lot on monitoring, and there's still plenty to talk about. However, I think we've covered some of the basics, and it should give you an idea of some of the things that operations engineers will be monitoring. Let's do a quick summary. Using incident response systems like VictorOps to ingest all of your data will allow you to use one system as a single source of truth for alerting. And using server monitoring tools will help you to know what's happening with your servers. And the same is true of application performance monitoring tools, it'll help you identify problems at the code level.
And application performance ties into our next topic, and our next lesson. We'll talk about system performance. So if you're ready to keep going, let's get started.
Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.