Introduction to Operations
What happens once your software is actually running in production? Ensuring that it stays up-and-running is important. And depending on what the system does, and how much traffic it needs to handle, that may not be particularly easy.
There are systems that allow developers to run their code without needing to think about operations. Platform-as-a-service options like Google’s App Engine go a long way toward reducing, and in some companies removing, operations work. However, not every system can or will run on such platforms, which means that having qualified operations engineers is important.
The role of an operations engineer is continually evolving, which isn’t a surprise since change in technology never slows down.
So, if the job falls on you to keep a system up-and-running, where do you start? What needs to happen? These are the questions this course aims to answer.
In this course we take a look at some of the tasks that operations engineers need to address. I use the term operations engineer as an umbrella to cover a wide variety of job titles. Titles such as ops engineer, operations engineer, site reliability engineer, and devops engineer, among others, all fall under this umbrella.
Regardless of the name of the title, the responsibilities involve keeping a system up-and-running, with little or no downtime. And that’s a tough thing to do because there are a lot of moving parts.
If you’re just starting out, and are interested in one of those roles, then the fundamentals in this course may be just what you need. These fundamentals will prepare you for more advanced courses on specific cloud providers and their certifications.
Topics such as high availability are often covered in advanced courses; however, they tend to be specific to a cloud provider. This course will help you learn the basics without needing to know a specific cloud provider.
If this all sounds interesting, check it out! :)
By the end of this course, you'll be able to:
- Identify some of the aspects of being an ops engineer
- Define why availability is important to Ops
- Define why scalability is important to Ops
- Identify some of the security concerns
- Define why monitoring is important
- Define why practicing failure is important
This is a beginner-level course for anyone who wants to learn, though it will probably be easier if you have either:
- Development experience
- Operations experience
What You'll Learn
|Lecture|What you'll learn|
|---|---|
|Intro|What will be covered in this course|
|Intro to Operational Concerns|What sort of things do operations engineers need to focus on?|
|Availability|What does availability mean in the context of a web application?|
|High Availability|How do we make systems more available than the underlying platform?|
|Scalability|What is scalability and why is it important?|
|Security|What security issues do ops engineers need to address?|
|Infrastructure as code|What is IaC and why is it important?|
|Monitoring|What things need to be monitored?|
|System Performance|Where are the bottlenecks?|
|Planning and Practicing Failure|How can you practice failure?|
|Summary|A review of the course|
Welcome back to Introduction to Operations. I'm Ben Lambert, and I'll be your instructor for this lesson.
In this lesson, we're going to talk about what it means to be available and what factors may contribute to availability. We'll also talk about how the cloud has given everyone the potential to achieve higher availability.
In the previous lesson, we talked about experiencing the dreaded, albeit rare, Netflix outage and we implicitly defined availability as a part of that discussion. But let's define it explicitly here so that we're using a shared language throughout the course. We're going to say that availability is about having something that's ready and able to be used, which means if something is available, then it is usable, and if it's not available, then it's not usable.
In the context of a web application such as Netflix or Etsy, this would mean that if you're unable to use their services and the issue is on their end, then the system is unavailable. Now that's an important distinction. Their service is not unavailable if your internet connection is down. So based on this, we're talking about uptime and downtime. If a system is up, then it's available, and vice versa.
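To make the uptime and downtime idea concrete, here is a minimal sketch that turns the two into an availability percentage. The function name and the sample numbers are made up for illustration; they don't come from any real service.

```python
def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of the total observed time."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("observed time must be positive")
    return 100.0 * uptime_hours / total


# Hypothetical 30-day month (720 hours) with 3 hours of outages
# that were caused on the provider's end.
print(round(availability_percent(717, 3), 3))
```

Only outages on the provider's side count as downtime here. Time when your own internet connection was down would not be included.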
So we have an idea of what availability is, but we haven't talked about why it matters. Now there isn't a singular answer to why system availability matters. Each system is gonna have its own requirements for how available it needs to be, and so the reasons for its availability will be contextual. However, some of the common drivers for availability are money, safety, and information.
Now the first two may stand out as being fairly obvious: outages can cost companies money, and they could also cost lives. Imagine if something as important as a flight control system went down. This could potentially cost people their lives. Now the third one may be less quantifiable. Imagine that the systems that run the Large Hadron Collider had issues with availability. Would we still have discovered the Higgs boson particle in our lifetime? So having systems that are responsible for scientific research that have low availability could set back important work.
So there are many different reasons why availability matters for systems, and the reason is contextual based on the system in question.
That may lead you to wonder why we don't build all systems for 100% availability, and that's a good question. It'd be nice if all the systems we build were usable and available all the time. However, that ignores the real-world constraints that exist. Now there are a lot of potential reasons why we as engineers may build something with a lower availability.
First, some systems aren't that important. Imagine you have a small web application that's hosted internally to your company and its primary goal is to send a monthly company newsletter. Obviously this is an overly simplistic system to help make a point. However, a system like this doesn't require 100% uptime and probably doesn't even need to be at 95%. So not all systems are going to require high availability, and this becomes a factor that we need to consider when architecting systems. It's a business decision to determine the desired availability of a system.
Next, cost can be prohibitive. Even if a system could benefit from 100% availability, cost could be prohibitive. Even with the use of the cloud, some systems would be too expensive to architect for 100% uptime. So cost is a factor when architecting for high availability. This is another business decision. Businesses need to try and get the most uptime possible for their system's operational budget.
The cost of availability doesn't scale linearly. As an example, going from 95% availability to 99% availability may only cost a couple hundred dollars a month, but going from 99% to 99.999% may cost several thousand dollars monthly.
Next, system or platform limitations could be an issue. With the adoption of the cloud, this is becoming less of an issue. However, there are still legacy systems that need to be run and managed that don't actually support the techniques for horizontal scaling, which is one mechanism that we use to help with high availability.
So these are just three of the many possible reasons why systems may not be built with 100% availability. And cost tends to be the largest factor in why something like 99.999% uptime is a more achievable number.
So, when we talk about availability, it can seem like we're sometimes not talking about time because we talk about it in percentages. We say 99%, 99.9%, et cetera. Let's take a look at this table here. This shows the actual allowed downtime based on the availability number. This chart comes from Wikipedia, and you can see there's a pretty sizable difference between 90% uptime, allowing for 36.5 days of downtime annually, compared to, say, the roughly five and a quarter minutes (5.26 minutes) that a 99.999% uptime allows.
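The relationship between an availability percentage and the downtime it allows can be sketched in a few lines. This is an illustrative helper, not part of the lesson: the function name is made up, and it assumes a 365-day year, so the results can differ slightly from tables that use 365.25 days.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Return how much downtime a given availability allows over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # The unavailable fraction of the period is the downtime budget.
    return period * (1 - availability_percent / 100)


for percent in (90, 99, 99.9, 99.99, 99.999):
    print(f"{percent}% -> {allowed_downtime(percent)}")
```

Each extra nine cuts the yearly downtime budget by a factor of ten, which is why 99.999% leaves only about 315 seconds per year.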
Now that we know that availability is about having systems that are up and running and usable by those who need to use them, we should talk about some of the reasons that systems can become unavailable. When talking about something like this, there are really just too many possible reasons so the best we can do is narrow it down to a few common reasons. So that's what we'll do here.
First, we have software bugs. For websites that have minimal complexity, just a few webpages, there still could be hundreds of lines of code that are responsible for it. So if you factor in the web server, and the operating system, the number of lines of code skyrockets. So a single line of code changing or a configuration file changing could break everything, and the potential for breaking changes grows as the complexity of the site grows. So the code and configuration that is supposed to keep everything running could easily break something and bring it offline.
Next we have component overload. If a server could handle all the traffic we threw at it, then we wouldn't have much to talk about. However, servers can only handle a finite amount of traffic. The exact amount depends on a lot of factors such as the hardware, the operating system, the software running, the network, et cetera. So when a web server, a router, or some other component gets to the point that it's working too hard, it can start throwing errors and stop serving up responses.
Imagine it's like a tollbooth on the highway. When traffic is light, the cars slow down, go through the tollbooth one at a time, and then drive off, and the cars behind them are unaffected. However, when there's a lot of traffic, this single tollbooth is a bottleneck, forcing cars to come to a complete stop, sometimes for miles back.
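The tollbooth idea can be sketched as a tiny simulation. This is a simplified model with made-up numbers: one component that can process a fixed number of requests per tick, with anything it can't handle waiting in a backlog.

```python
def backlog_over_time(arrivals_per_tick, capacity_per_tick):
    """Return the backlog after each tick for one fixed-capacity component."""
    backlog = 0
    history = []
    for arrivals in arrivals_per_tick:
        # Work that arrives beyond the capacity has to wait for a later tick.
        backlog = max(0, backlog + arrivals - capacity_per_tick)
        history.append(backlog)
    return history


light = backlog_over_time([3, 4, 2, 5, 3], capacity_per_tick=5)
heavy = backlog_over_time([8, 9, 7, 10, 8], capacity_per_tick=5)
print(light)  # [0, 0, 0, 0, 0]
print(heavy)  # [3, 7, 9, 14, 17]
```

With light traffic the backlog never builds up. Once arrivals stay above capacity, the backlog keeps growing, which is the line of cars stopped for miles behind the tollbooth.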
Next we have natural disasters. If all of the resources required to run your site are hosted in a single data center and a natural disaster disrupts the normal operations of that data center, chances are your site will be down until the normal operations for that data center are resumed. This has happened before; it'll happen again. Nature doesn't care if we have uninterrupted access to the services we want to use. This can be a challenge because it may require a separate, redundant system in a different geographic region.
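As a rough illustration of why a redundant copy in another region helps, here is a sketch of the usual arithmetic. It rests on a big assumption that isn't from the lesson: the copies fail independently of each other, and the system is up as long as at least one copy is up. The function name and the 99% figure are made up for the example.

```python
def combined_availability(single_availability: float, copies: int) -> float:
    """Availability of N independent copies where any one copy is enough."""
    if not 0.0 <= single_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single_availability) ** copies


# Hypothetical data centers that are each available 99% of the time.
for copies in (1, 2, 3):
    print(copies, round(combined_availability(0.99, copies), 6))
```

Real regions share some failure causes, such as a bad deployment or a DNS problem, so treat the result as an upper bound rather than a promise.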
Next up we have hardware failure. Hardware components will eventually fail, and that can cause outages for anything that relies on that hardware. This means that we need to consider the potential for hardware failures when we architect systems. Cloud systems can help us abstract away from some of the potential implications of hardware failure, though not all.
Next we have malicious users. This can be in the form of a denial of service attack or something else. A denial of service attack is a form of attack where a system is flooded with requests to the point where it can no longer handle the legitimate traffic. This is a common tactic by groups looking to extort money from companies that require their systems to be up and running. They use a distributed denial of service attack and then offer to stop for a price. So this requires us to implement mechanisms that help prevent these sorts of issues from causing an outage for legitimate users.
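One example of such a mechanism is rate limiting. The lesson doesn't name a specific technique, so treat this as an illustrative sketch only: a per-client sliding-window limiter where the class name, the limits, and the client IDs are all hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> recent request times

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit, reject this request
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
print([limiter.allow("client-a", t) for t in (0.0, 1.0, 2.0, 3.0)])
# [True, True, True, False]
```

A distributed attack spreads its requests across many clients, so a per-client limit like this is only one layer. In practice it would sit alongside protections at the network or provider level.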
And this has only been five of the potentially limitless causes for systems to become unavailable. However, in these five, we've covered a lot of the typical causes for system failures. Let's do a quick recap because we've covered a lot in this lesson.
Availability means that a system is up and running and usable. And availability matters for different reasons, depending on what the system does and what the users expect of it. Unless you start working on things like communications satellites or something along those lines, targeting 100% uptime just won't be practical. And there are a variety of reasons that systems will become unavailable.
In our next lesson, we're gonna continue the conversation on availability. Now I know it may feel like we're spending a lot of time on it, and that's true. However, availability is an important part of modern systems, so we'll dive in a bit more. We're gonna go in depth in the next lesson and talk about how to design systems that have a greater availability than the underlying platform, and how managing risk can help us ensure our systems can remain available even if one component fails.
Okay, whenever you're ready, let's dive in.
Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.