Introduction to Operations
What happens once your software is actually running in production? Ensuring that it stays up and running is important. And depending on what the system does, and how much traffic it needs to handle, that may not be particularly easy.
There are systems that allow developers to run their code without needing to think about operations. Platform-as-a-service options like Google's App Engine go a long way toward reducing, and in some companies removing, the operations burden. However, not every system can or will run on such platforms, which means having qualified operations engineers is important.
The role of an operations engineer is continually evolving, which isn't a surprise, since change in technology never slows down.
So, if the job falls on you to keep a system up-and-running, where do you start? What needs to happen? These are the questions this course aims to answer.
In this course, we take a look at some of the tasks that operations engineers need to address. I use the term operations engineer as an umbrella covering a wide variety of job titles. Titles such as ops engineer, site reliability engineer, and DevOps engineer, among others, all fall under this umbrella.
Regardless of the title, the responsibilities involve keeping a system up and running with little or no downtime. And that's a tough thing to do, because there are a lot of moving parts.
If you’re just starting out, and are interested in one of those roles, then the fundamentals in this course may be just what you need. These fundamentals will prepare you for more advanced courses on specific cloud providers and their certifications.
Topics such as high availability are often covered in advanced courses; however, those tend to be specific to a single cloud provider. This course will help you learn the basics without needing to know any one provider.
If this all sounds interesting, check it out! :)
By the end of this course, you'll be able to:
- Identify some of the aspects of being an ops engineer
- Define why availability is important to ops
- Define why scalability is important to ops
- Identify some of the security concerns
- Define why monitoring is important
- Define why practicing failure is important
This is a beginner-level course for anyone who wants to learn, though it will probably be easier if you have either:
- Development experience
- Operations experience
What You'll Learn
| Lecture | What you'll learn |
| --- | --- |
| Intro | What will be covered in this course |
| Intro to Operational Concerns | What sort of things do operations engineers need to focus on? |
| Availability | What does availability mean in the context of a web application? |
| High Availability | How do we make systems more available than the underlying platform? |
| Scalability | What is scalability and why is it important? |
| Security | What security issues do ops engineers need to address? |
| Infrastructure as code | What is IaC and why is it important? |
| Monitoring | What things need to be monitored? |
| System Performance | Where are the bottlenecks? |
| Planning and Practicing Failure | How can you practice failure? |
| Summary | A review of the course |
Welcome back to Introduction to Operations, I'm Ben Lambert and I'll be your instructor for this lesson.
In this lesson we're going to pick up our discussion on availability. We talked about it in our last lesson in a rather abstract way, so in this lesson we'll dive into more concrete information. We're going to talk about how to create highly available systems in general and not specific to any one Cloud provider. From our previous lesson and maybe even from your experiences you know that systems can become unavailable for a lot of different reasons. Let's look back at the five that we talked about in our last lesson and see if we can't turn those potential reasons for outages into a more resilient system.
First up we had software bugs. In this example we have several servers and let's imagine that they're all running the same web application. Software bugs can come in the form of bugs in our code or bugs in someone else's code and by that I mean bugs that are introduced into the operating system or some third party piece of software that we use. So if we deployed some buggy piece of software to all of our systems at the same time and that bug resulted in an outage then our system would be unavailable because none of the servers would be able to respond. And so deployment strategies that can minimize this potential for downtime are how we ensure availability from software related changes. In the Introduction to Continuous Delivery course we talked about two deployment methods for avoiding downtime. If you haven't watched that course then you may want to check out at least the lesson on deployment. Mitigating downtime from software bugs requires a lot of testing at different levels of the continuous integration and continuous delivery process. And also requires deploying in a way that doesn't change the currently running servers so that should you need to roll back you can use the existing servers. There are different ways of doing this such as blue-green and canary deployments.
The important thing to know at this point is that there are well established patterns for deploying software changes to production that will help avoid downtime and you can always look them up later by searching for Zero Downtime Deployments.
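To make that concrete, here's a minimal sketch, in Python, of the kind of guardrail a canary deployment automates: serve a small slice of traffic from the new version and only promote it if its error rate stays close to the baseline. The function name, parameters, and thresholds here are all hypothetical, not from any particular tool.

```python
def canary_healthy(canary_errors, canary_requests,
                   baseline_errors, baseline_requests,
                   tolerance=1.5, floor=0.001):
    """Promote the canary only if its error rate stays within
    `tolerance` times the baseline error rate. The `floor` keeps a
    near-zero baseline from making the check impossibly strict.
    All names and thresholds are illustrative, not a standard."""
    if canary_requests == 0:
        return False  # no traffic observed yet; don't promote
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    return canary_rate <= max(baseline_rate, floor) * tolerance
```

If a check like this fails, traffic is routed back to the old version, which is still running untouched; that's the roll-back safety net these deployment patterns provide.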
Next, we talked about component overload. This is a pretty common issue and can be caused both by legitimate traffic and by malicious users whose goal is to cause downtime. Have you ever seen an interesting article on Hacker News or reddit, clicked the link, and found that the page doesn't load, or maybe throws an error? This is sometimes called the reddit hug of death. What happens is that too many people are trying to reach that site at the same time and the server or servers are overloaded with too much traffic. It could be that too many requests are hitting the servers and using up all the memory, or taxing the hard drive, or hitting some maximum number of sockets, or it could be something else. But the gist of it is that the server or servers aren't able to keep up with the demand. So how do you avoid something like this? Well, this is where scaling can come into play, either vertical or horizontal depending on the amount of traffic.
Next, we have natural disasters. If all of your servers are in a data center, in one place, let's say they're in San Francisco and an earthquake happens and cuts off communication to the data center then your system is going to be unavailable. Natural disasters happen and so when we plan for high availability we need to consider this. There are different ways that we can deal with this depending on what you've identified as your availability requirements. Now, either way it's going to result in the need for setting up an environment in a different geographical region. This environment may be always up and running and syncing a read-only copy of the database or it may just be that you have a disaster recovery plan in place so that you can easily get a backup of your production environment running in a new region. Now the option you choose will depend on your specific uptime requirements.
Next, we have hardware failure. It doesn't matter if you're using your own hardware or it's in the Cloud and somebody else manages it for you, if hardware fails then your system can too. So this is where redundancy comes into play which means you have a backup of components that kick in to replace broken components. Redundancy happens at all layers of technology. Servers may have redundant power supplies, hard drives, et cetera. Our system may use a database and that database may have a redundant database that's synced up and ready to become the primary should something happen to the current primary.
We often use multiple data centers inside of a geographical region to serve as a redundant copy of that system that allows us to prevent a single point of failure. Removing single points of failure is important. If that one component breaks then the entire system can go down.
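The failover idea behind redundancy can be sketched in a few lines of Python. The names here are made up for illustration; real failover is handled by load balancers, cluster managers, or database tooling, but the logic is the same.

```python
def pick_available(replicas, is_healthy):
    """Return the first healthy replica. With at least one redundant
    copy, a single broken component no longer takes the system down;
    only losing every copy at once causes an outage."""
    for replica in replicas:
        if is_healthy(replica):
            return replica
    raise RuntimeError("all redundant copies failed: this is the "
                       "single-point-of-failure scenario")

# Example: the primary database is down, so the replica takes over.
status = {"db-primary": False, "db-replica": True}
active = pick_available(["db-primary", "db-replica"], status.get)
```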
Next we have malicious users. In regards to availability, malicious users can cause an overload of a component, like we talked about earlier, and cause a denial of service. Preventing a service disruption from a denial of service attack can require some planning before you implement your architecture. You need to start by reducing the attack surface. The attack surface is a way of saying the systems and components that are publicly facing and can be attacked. You do this through some form of firewall; virtual private Clouds offer this sort of firewall for you in the Cloud.
Depending on what's being attacked, a web application firewall may also help to mitigate a flood of HTTP requests. And you'll need to be able to scale your system out so that it can accommodate the additional load. There are different services that can help you mitigate these denial of service attacks in the Cloud, things like Route 53 and CloudFront on AWS. And App Engine allows for blacklisting IP addresses and subnets on Google Cloud, so there are a couple of Cloud options that can help you.
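One building block these services rely on is rate limiting. Here's an illustrative token-bucket limiter in Python (a simplified sketch, not any particular service's implementation) that rejects excess requests before they can exhaust memory, sockets, or other resources:

```python
import time

class TokenBucket:
    """Each client may make `rate` requests per second, with short
    bursts up to `capacity`; anything beyond that is rejected."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last request.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, capacity=3)
results = [bucket.allow() for _ in range(5)]  # burst of 5 requests
```

A limiter like this allows the first three requests of the burst and rejects the rest; production systems keep one bucket per client IP or API key.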
So we've covered the same five common reasons that we talked about previously for systems to become unavailable, only we talked a little bit about how to combat them. Let's go through some designs and talk about the availability of each.
In this example we have a single web server connected to a database server, and both are directly exposed to the world, so we have a couple of single points of failure. Now that may sound confusing, because we're saying that there are multiple single points of failure, and it may sound like that can't happen. However, the word single refers to a single component, meaning a component that doesn't have a backup. If either the web server or the database server were to go down, our system would be unavailable. Now, depending on our requirements this may be okay. Remember, it is a business decision to determine the ideal availability. So if this configuration meets the availability, budgetary, and security requirements, then this may be all you need. Assuming that the web and database servers are large enough to handle the expected load, we still need to know whether this design meets our uptime requirements.
So we start by determining the availability of the underlying system. Let's assume that the hardware for our web server has an uptime of 99.95% and our database server has the same. Because both systems need to be up and running for our system to be available, we can calculate the estimated availability by multiplying the two availability numbers together: 0.9995 times 0.9995 is 0.99900025, or roughly 99.9% uptime, and that's without considering potential downtime from system updates and deployments. 99.9% works out to roughly 43 minutes of potential downtime per month, or around nine hours per year. What if we required a better base uptime?
Let's start fixing up this design and see if we can't improve that uptime. We need to add a redundant web server so that we have a backup and so we'll start by adding a load balancer. For the sake of this discussion we're going to assume that the load balancer is capable of scaling to handle the load as needed so we'll kind of be ignoring it a bit. And we'll be adding a second web server and again, we'll assume that it has a base uptime of 99.95%. So now, the base uptime of our web servers has gone up to .99999975 by having the two of them because each individually has a .0005 chance of failure. So the odds of both of them failing is .00000025. Now, if we recalculate the availability of this system we can see that we're at roughly 99.95% uptime. So we were at 99.9% availability and now we've increased it to 99.95 by adding an additional web server.
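The arithmetic in the last two paragraphs generalizes to two rules: components in series multiply their availabilities, and a redundant group fails only when every copy fails at once. A quick Python sketch, using the same 99.95% figure (the function names are mine, just for illustration):

```python
def in_series(*availabilities):
    """All components must be up, so multiply their availabilities."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

def redundant(availability, copies=2):
    """The group is down only if every copy is down simultaneously."""
    return 1 - (1 - availability) ** copies

SERVER = 0.9995
single = in_series(SERVER, SERVER)              # one web + one db: ~99.9%
with_lb = in_series(redundant(SERVER), SERVER)  # two web + one db: ~99.95%
```

Running the numbers this way makes it easy to see where the next improvement should go: after adding the second web server, the single database dominates the remaining downtime.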
We're now looking at a base downtime of roughly four hours per year. But we still have a single point of failure with that database. So let's take another pass and try to improve the uptime even further. Here we've added a read-only replica that will kick in only if the primary database fails. So our database is no longer a single point of failure, and we end up with an estimated base uptime of 99.9999%. Up until now we've been talking about this stuff at a generic level, so there are things we haven't considered in our design. Things like, what happens if a server does go down? How do we get it back online, or how do we get a server to replace it? The same goes for the database server: if it goes down, what happens next?
What happens if our data center goes down due to some natural disaster? Or what happens if our site goes viral and we're suddenly seeing 10 times more traffic? So what we've done is architect a system that looks okay on paper, but it's not great. For noncritical, low-traffic sites we could possibly include a disaster recovery plan and call this good. However, for systems that need to be highly available we need to do a bit more work, so let's keep building this out.
In this example, we're going to be offline if something happens to our data center so we can add in an additional level of availability by deploying to a mirrored environment. Now the mirrored environment could be either a cold, warm, or hot site. The word site in this context refers to a data center.
Here's the difference between them. A cold site is basically having access to data center space for recovery purposes; the hardware, software, and data still need to be brought in before you can use it.
A warm site is similar to a cold site, except that the data center contains the required hardware and is just waiting for you to load your software and data and use it. This allows you to get up and running faster than a cold site.
And a hot site is all set up with the hardware and software and you just need to switch over the traffic to it. When talking about the Cloud this is a little bit different because the Cloud providers abstract away the hardware.
A cold environment basically just means that you have your code, assets, and data backup so that you can redeploy them in the Cloud to some location. A warm environment in the Cloud often means that you have the infrastructure configured, however, it's not ready to handle the amount of traffic from the production environment yet. Maybe you need to start up more VMs or get the data loaded from the production environment. So it's not an immediate switch over but it's much faster than a cold environment because the infrastructure is already configured. And a hot environment is one where everything is set up. The data is always synced and it will be able to handle the traffic when you switch it over, so it's ready to use at any time. And this is another area where it's going to depend on the requirements of the system whether you go with cold, warm, or hot. The more mission critical a system the higher the availability it will require and that means we'll need it to be more resilient.
So let's mirror our environment. This is similar to our previous design, except we're now going to use a separate data center to run our mirrored environment. Notice that the environment on the right doesn't have the same number of servers as the primary environment on the left. That's because this is a warm environment, meaning that you can switch traffic over to the backup environment should you need to, but it's not going to be able to handle the load immediately; it may need to spool up some additional servers first. This is useful when you need to minimize the risk of a data center outage but don't have the budget for a hot environment.
Here, we have a hot environment. It's a mirror of what's in the primary environment and it's just sitting there waiting to have traffic routed to it. Again, it depends on the system's requirements which one you want to build. You may notice that these two data centers are located in the same geographic region.
Now, if there was a natural disaster, it's possible that they're both going to be out of commission. Depending on what happened, it's possible that your system being unavailable is not the top priority. If your data centers are down because aliens came to Earth and destroyed them, then you probably have other things on your mind. However, if a small earthquake disrupted your data centers' connectivity, then users outside of that area may not expect to see a disruption in availability. So, that's where leveraging multiple regions makes sense. Again, this could be a cold, warm, or hot environment.
Maybe having the ability to restore to another region, in a reasonable amount of time is acceptable. So, as long as you have a disaster recovery plan that doesn't involve those down data centers then maybe you're okay. Or maybe you want a warm environment like the one here, where you could route traffic to the other region but you're going to lose some of the requests as the new environment spools up enough resources to meet the demand. Or for something where less risk of an outage can be tolerated maybe a hot site like this one here is optimal and you use both environments, serving up requests to the users from the closest region and this will help avoid latency as well.
So with this design here we currently have a platform that will be highly available, though, these designs don't cover things like data replication, failover, network configuration, and recovery methods and that's because those topics start to go from beginner to intermediate. And it also starts to get into actual implementations. I want to keep this at a generic enough level that it will work across multiple Cloud providers.
The Cloud has made it easier to create highly available systems because we could push a button or enter a command on the command line and have an entire environment built for us. Cloud platforms make it so that we don't need to think about a lot of the challenges of running and maintaining hardware. We get to think about data centers in a more abstract way, terms like zone allow us to think about the concept without needing to understand how that data center is managed.
Cloud platforms also offer a lot of value due to the ability to use resources on demand. If you need extra servers to handle a spike in traffic, you can add them behind the load balancer and then shut them down when the traffic dies down, saving you the cost of buying servers that you would only need on occasion. And Cloud services for file storage tend to be highly available and highly durable. So Cloud platforms offer a level of ease for highly available systems that is difficult to achieve without them. We're going to wrap up this discussion on availability by summarizing the key takeaways.
First, don't over engineer. You need to consider the real world constraints when architecting systems.
Second, when you need high availability avoid single points of failure with redundancy. This includes at the component level and at the environment level. By component I mean things like servers, databases, et cetera. And by environment I mean the complete system so having a mirror of that environment running in another location.
Next, the level of availability is a business decision not a technical one.
Next, understand the percentages. For example, you'll want to know the difference between 99.95% and 99.99% uptime.
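Those percentages are easier to reason about when converted into allowed downtime. A small sketch:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_per_year(uptime_pct):
    """Minutes of downtime per year that a given uptime percentage allows."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

for sla in (99.9, 99.95, 99.99, 99.999):
    print(f"{sla}% uptime allows about "
          f"{downtime_per_year(sla):.0f} minutes down per year")
```

So the jump from 99.95% to 99.99% is the difference between roughly four and a half hours and under an hour of allowed downtime per year, which is why each extra nine tends to cost considerably more to engineer.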
The designs we made so far have all assumed that the number of servers that exist will support the traffic. However, what if your site goes viral and you gain 10 times the traffic? In our next lesson, we're going to talk about how to solve that problem with scalability.
So, if you're ready, let's get started.
Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.