Introduction to Operations
What happens once your software is actually running in production? Ensuring that it stays up-and-running is important. And depending on what the system does, and how much traffic it needs to handle, that may not be particularly easy.
There are systems that will allow developers to run their code and not need to think about it. Platform-as-a-service options like Google’s App Engine go a long way toward reducing and, in some companies, removing operations work. However, not every system can or will run on such platforms, which means that having qualified operations engineers is important.
The role of an operations engineer is continually evolving, which isn’t a surprise since technology never stops changing.
So, if the job falls on you to keep a system up-and-running, where do you start? What needs to happen? These are the questions this course aims to answer.
In this course, we take a look at some of the tasks that operations engineers need to address. I use the term operations engineer as an umbrella, to cover a wide variety of job titles. Titles such as ops engineer, operations engineer, site reliability engineer, and DevOps engineer, among others, all fall under this umbrella.
Regardless of the name of the title, the responsibilities involve keeping a system up-and-running, with little or no downtime. And that’s a tough thing to do because there are a lot of moving parts.
If you’re just starting out, and are interested in one of those roles, then the fundamentals in this course may be just what you need. These fundamentals will prepare you for more advanced courses on specific cloud providers and their certifications.
Topics such as high availability are often covered in advanced courses; however, they tend to be specific to a cloud provider. This course will help you learn the basics without needing to know a specific cloud provider.
If this all sounds interesting, check it out! :)
By the end of this course, you'll be able to:
- Identify some of the aspects of being an ops engineer
- Define why availability is important to ops
- Define why scalability is important to ops
- Identify some of the security concerns
- Define why monitoring is important
- Define why practicing failure is important
This is a beginner-level course for anyone who wants to learn, though it will probably be easier if you have either:
- Development experience
- Operations experience
What You'll Learn
|Lecture||What you'll learn|
|Intro||What will be covered in this course|
|Intro to Operational Concerns||What sort of things do operations engineers need to focus on?|
|Availability||What does availability mean in the context of a web application?|
|High Availability||How do we make systems more available than the underlying platform?|
|Scalability||What is scalability and why is it important?|
|Security||What security issues do ops engineers need to address?|
|Infrastructure as code||What is IaC and why is it important?|
|Monitoring||What things need to be monitored?|
|System Performance||Where are the bottlenecks?|
|Planning and Practicing Failure||How can you practice failure?|
|Summary||A review of the course|
Welcome back to Introduction to Operations. I'm Ben Lambert, and I'll be your instructor for this lesson.
In this lesson, we're going to talk about performance, and why it's something that operations engineers have to think about. What is performance in the context of a web application? It's the speed and efficiency of each component and the system as a whole. Okay, what does that mean? It basically boils down to this: how quickly does the system respond to user requests? Now, there's more to it than that. But if we were to try and make a definition that's as simple as possible, I think that would work.
So, we're saying system performance is how quickly the system responds to user requests. With this definition in mind, where does operations begin when trying to measure and improve system performance? Remember, in the lesson on monitoring, we said that once all the data was in place, teams could use it as they needed it. Well, this is one of those times. Having quantifiable data showing where performance problems might exist gives us a starting place when it comes time to make system changes. Having actual data is very important.
If we don't know how long requests are taking, then we need to first implement some sort of monitoring solution. The reason monitoring is important is that perceived performance is subjective. What does that mean? It means that two people observing the same thing may have differing opinions of what they're seeing. As an example, say two people are both watching a car pass by and one person says, wow, that car was moving fast, and the other replies, not really. Both are correct based on their perception of what fast means. So, if two users browse to your webpage and it takes three seconds to load, one person may find that to be just fine. And the other may be thinking, wow, anything over a second is so slow.
So we need to be working with quantifiable numbers when we think about system performance. But so far, we've only talked about measuring actual performance. We haven't talked about performance goals for a system. And that's because the acceptable threshold for performance needs to be a business decision. There are technical limitations to consider, sure. For example, making sure every page loads in under one millisecond is probably not going to happen. But it's a business decision to set the ideal performance numbers. Once you have a target, let's say it's under one second, then you have something to shoot for. And with the data you've collected from your monitoring, you can start identifying the biggest constraints and removing them.
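To make the idea of comparing measured data against a target concrete, here is a minimal sketch in Python. It assumes you already have a list of page load times in seconds from your monitoring. The sample values, the one-second target, the choice of the 95th percentile, and the function names `percentile` and `meets_target` are all illustrative assumptions, not something prescribed by this course.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of the samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Nearest-rank method: take the value at position ceil(pct% of n).
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_target(samples, target_seconds, pct=95):
    """True if the chosen percentile of the samples is within the target."""
    return percentile(samples, pct) <= target_seconds


# Hypothetical load times in seconds, as a monitoring tool might report them.
load_times = [0.42, 0.51, 0.48, 0.95, 0.60, 3.10, 0.55, 0.47, 0.52, 0.49]

print("median:", percentile(load_times, 50))
print("p95:", percentile(load_times, 95))
print("meets 1s target:", meets_target(load_times, 1.0))
```

In this made-up data set the typical request is comfortably under a second, but one slow outlier pushes the 95th percentile over the target. That is the kind of thing a number shows you and a gut feeling doesn't.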
We've talked about measuring system performance and setting a performance target. However, we haven't talked about why performance is important in this lesson. We mentioned previously that users these days are spoiled by sites with impressive levels of performance. Google, Netflix, and Amazon, among others, have instilled in a lot of us a subconscious expectation of high-performance sites. So, when we go to a site that takes five seconds to load, we often just lose interest in that site. Users have high expectations for how quickly sites and services should load. Okay, hopefully by this point, the value in having both a performance target and actual system metrics is clear. Without a target, we don't have anything to aim for with any system changes we make. And if we don't have actual metrics, we can't quantify the impact, either good or bad, of any of the changes we make.
So, once you have these numbers, what's next? This is where we start to go through the application and architecture and make sure we find and remove the bottlenecks. The more complex the software, the more places there are for bottlenecks to hide. Every time a web server runs some code or makes a call to a database or some REST API, we introduce latency. If things are running well, then that latency shouldn't be too noticeable.
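One way to see where the latency in a request comes from is to time each step separately. The sketch below is a simplified illustration in Python, not a replacement for a real APM tool. The component names and the `time.sleep` calls are stand-ins for real database, REST API, and rendering work, and the helpers `timed` and `slowest_component` are hypothetical names chosen for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated seconds spent in each named component.
timings = defaultdict(float)


@contextmanager
def timed(component):
    """Add the wall-clock time of the wrapped block to the component's total."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[component] += time.perf_counter() - start


def handle_request():
    """Simulated request handler; the sleeps stand in for real work."""
    with timed("database"):
        time.sleep(0.05)   # pretend database query
    with timed("rest_api"):
        time.sleep(0.02)   # pretend call to another service
    with timed("render"):
        time.sleep(0.005)  # pretend template rendering


def slowest_component(recorded):
    """Return the name of the component with the largest total time."""
    return max(recorded, key=recorded.get)


handle_request()
for name, seconds in sorted(timings.items(), key=lambda item: -item[1]):
    print(f"{name}: {seconds * 1000:.1f} ms")
print("slowest:", slowest_component(timings))
```

With a breakdown like this, the slowest step stands out, and that is where you would start looking.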
Let's talk through some of the common causes of system performance issues. Here we have a load balancer in front of some web application servers. Load balancers need to be fast. They need to handle a lot of traffic, and if they're not fast enough, then the entire application will be slow. High-quality load balancers, both in and out of the cloud, shouldn't be adding any substantial amount of time to the request. If they are, you need to test and verify that the cause is in fact the load balancer and not something else, such as the network. If it is the load balancer, look into the documentation for that particular vendor's product and make sure that the settings are correct. When it comes to cloud-based load balancers, if you're seeing latency, test the back-end servers and see how quickly they're responding.
It's not uncommon to see performance issues caused by application code that has some bit of logic that introduces latency. This could be your code or some third-party code. Slow database queries are also very common. So is not having enough CPU for your running processes, or having CPU consumed by some other process that is starving your application. And the same applies to memory.
Oftentimes in a web application, you'll be serving up a lot of static files: images, JavaScript, CSS, and PDFs, among other assets. For high-traffic sites, this tends to cause performance issues if you're not using a content delivery network. Services such as CloudFront on AWS allow you to serve up files to users from cache locations that are geographically close to them. This makes it faster for the users and frees up your servers to handle more important things than serving static assets.
These are just a few of the areas where performance issues are often found. When it comes to locating the source or sources of latency, I like to start by reviewing the data from the APM tools and by running the site through something like Google's PageSpeed Insights. I find these tools to be a great place to start for identifying performance problems. Tools like PageSpeed are a great way to understand what kinds of things could be causing applications to run slowly. I recommend that you check it out; it's free to use. Enter the URL for your site and click Analyze. Go through all of the items, even the ones that pass.
It'll help you recognize some common causes of slow performance. There are a lot of mechanisms that will help you improve the system's performance. Things like caching, edge locations, and content compression, among other things. And they're great to know about, though we won't be getting into that level of detail today. Throughout this course, we've talked about a lot of the things that could go wrong and some ways to help ensure that when things break, we have a fallback.
However, how do you know when those things break if the fallbacks are working? This is where planning and practicing failure comes in handy. What exactly does that mean? That's what we'll find out in our next lesson. So, if you're ready, let's get started.
Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.