What happens once your software is actually running in production? Ensuring that it stays up-and-running is important. And depending on what the system does, and how much traffic it needs to handle, that may not be particularly easy.
There are systems that allow developers to run their code without having to think about operations. Platform-as-a-service options like Google’s App Engine go a long way toward reducing and, in some companies, removing operations work. However, not every system can or will run on such platforms, which means that having qualified operations engineers is important.
The role of an operations engineer is continually evolving, which isn’t a surprise since technology never stops changing.
So, if the job falls on you to keep a system up-and-running, where do you start? What needs to happen? These are the questions this course aims to answer.
In this course, we take a look at some of the tasks that operations engineers need to address. I use the term operations engineer as an umbrella to cover a wide variety of job titles. Titles such as ops engineer, operations engineer, site reliability engineer, and DevOps engineer, among others, all fall under this umbrella.
Regardless of the name of the title, the responsibilities involve keeping a system up-and-running, with little or no downtime. And that’s a tough thing to do because there are a lot of moving parts.
If you’re just starting out, and are interested in one of those roles, then the fundamentals in this course may be just what you need. These fundamentals will prepare you for more advanced courses on specific cloud providers and their certifications.
Topics such as high availability are often covered in advanced courses; however, those courses tend to be specific to a cloud provider. This course will help you learn the basics without needing to know a specific cloud provider.
If this all sounds interesting, check it out! :)
By the end of this course, you'll be able to:
- Identify some of the aspects of being an ops engineer
- Define why availability is important to ops
- Define why scalability is important to ops
- Identify some of the security concerns
- Define why monitoring is important
- Define why practicing failure is important
This is a beginner-level course for anyone who wants to learn, though it will probably be easier if you have either:
- Development experience
- Operations experience
What You'll Learn
|Lecture|What you'll learn|
|---|---|
|Intro|What will be covered in this course|
|Intro to Operational Concerns|What sorts of things do operations engineers need to focus on?|
|Availability|What does availability mean in the context of a web application?|
|High Availability|How do we make systems more available than the underlying platform?|
|Scalability|What is scalability and why is it important?|
|Security|What security issues do ops engineers need to address?|
|Infrastructure as code|What is IaC and why is it important?|
|Monitoring|What things need to be monitored?|
|System Performance|Where are the bottlenecks?|
|Planning and Practicing Failure|How can you practice failure?|
|Summary|A review of the course|
Welcome back to Introduction to Operations. I'm Ben Lambert and I'll be your instructor for this lesson.
In this lesson we'll talk about what it means to have a scalable system, and we'll talk about why you need to scale. So, what does scalability mean? Scalability is the ability for a system to grow as needed to maintain its performance under additional workload.
Well, what does that really mean? Imagine it's a beautiful day and you're spending it in the park with hundreds of like-minded people wanting to enjoy nature. An ice cream truck pulls up and starts serving ice cream. However, everyone rushes to get some ice cream, and hundreds of people are now queued up and waiting. This is not scalable at all. One truck to serve all of these people. People towards the back of the line may just leave, and by midway through most of the good ice cream has already gone. So, you'll have a lot of disappointed people. Now imagine that for every 10 people a new truck pulls up.
With just 10 people at each truck it's unlikely that they're going to run out of ice cream, and there's no one that's going to wait too long to have theirs. So, this is a form of scaling. We're adding more trucks to support the demand. Of course, this is a silly example. It's just not practical. However, I find that silly examples stick in my mind and help me to remember things. So hopefully it's the same way for you.
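To put numbers on the truck picture, here is a minimal Python sketch. The function name `trucks_needed` is hypothetical, and the one-truck-per-10-people ratio simply comes from the example above; real capacity planning would start from measured per-server throughput instead.

```python
import math


def trucks_needed(people: int, people_per_truck: int = 10) -> int:
    """Return how many trucks keep every queue at or below people_per_truck."""
    if people_per_truck <= 0:
        raise ValueError("people_per_truck must be positive")
    # Round up: 301 people still need a 31st truck for the last person.
    return math.ceil(people / people_per_truck)


if __name__ == "__main__":
    print(trucks_needed(300))
```

Swap "people" for requests per second and "trucks" for servers and the same rounding-up logic gives a rough first estimate of how many servers a given load calls for.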
Now a real-world example would be something like Netflix. They serve so much data, all over the world, that when they release something new they may need to add new servers to the load balancer to handle the increase in traffic. And since they're leveraging the AWS cloud they probably have it automatically scaled via something like auto scaling groups. There are different ways to scale. You can scale up or you can scale out, also called vertical and horizontal scaling, respectively.
So, what does it mean to scale up? We'll go back to our earlier design where we had just one web server and a database server, and we'll use that as an example. Let's say we're running a fairly small server, a single CPU with two gigabytes of RAM. If we're running a small site, then maybe that's fine. However, if our traffic starts to increase, then we could use a larger server to handle the increased load. This is a common thing to do, especially for systems that will grow slowly or are running legacy software that may not scale out easily. If you're running some internal software, something like a time tracking system with infrequent traffic, then scaling up may be the best choice. Scaling up is a viable option. However, there is a cap. You'll only be able to scale up to the largest server you can find. 40 CPUs and 160 gigs of RAM is a pretty large server, and a pretty costly one too. So, if you're going to scale up you'll need to consider that at some point you'll be running on the best hardware you can find and you won't be able to grow anymore. What happens if you do hit that cap, or you just want to ensure that it won't?
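The cap on scaling up can be sketched in a few lines of Python. Everything here is illustrative: the `ServerSize` type, the `CATALOG` list, and the `scale_up` function are hypothetical names, and only the smallest and largest sizes mirror the numbers mentioned above. The middle sizes are made up.

```python
from typing import NamedTuple, Optional


class ServerSize(NamedTuple):
    name: str
    cpus: int
    ram_gb: int


# Hypothetical catalog, ordered smallest to largest. Only the first and last
# entries come from the lesson; the middle two are invented for illustration.
CATALOG = [
    ServerSize("small", 1, 2),
    ServerSize("medium", 4, 16),
    ServerSize("large", 16, 64),
    ServerSize("largest", 40, 160),
]


def scale_up(cpus_needed: int, ram_gb_needed: int) -> Optional[ServerSize]:
    """Return the smallest size that fits, or None once we are past the cap."""
    for size in CATALOG:
        if size.cpus >= cpus_needed and size.ram_gb >= ram_gb_needed:
            return size
    return None  # Nothing bigger to buy: scaling up has run out of road.


if __name__ == "__main__":
    print(scale_up(8, 32))
    print(scale_up(64, 256))
```

The `None` result is the interesting case: once the workload outgrows the largest entry in the catalog, no amount of money buys a bigger single server.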
That's where scaling out comes into play. In our ice cream truck example we were talking about scaling out. We add additional resources to handle the load. If you're running a high-traffic website, then you may need to scale by adding additional web or application servers. This allows the traffic to be distributed across more servers, so it won't tax any one system. The ability to scale out doesn't just happen though. It's something that needs to be supported by the underlying technology stack. You need to have a system that doesn't have any state stored on the web or application servers. If you have a system that allows for file uploads, then you need to make sure that they're saved in a central location so that all servers can access those files. The same goes for session state. If you're using local session state then you need to switch to something more centralized. When thinking about scaling out, you need to know if the application and tech stack will support it.
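Here is a small Python sketch of why local session state gets in the way of scaling out. The `WebServer` class and `run` function are hypothetical, the load balancer is reduced to a simple round-robin, and a plain dict stands in for whatever central session store a real system would use.

```python
from itertools import cycle


class WebServer:
    def __init__(self, name, shared_sessions=None):
        self.name = name
        # With no shared store, each server keeps its own private sessions.
        self.sessions = shared_sessions if shared_sessions is not None else {}

    def login(self, session_id, user):
        self.sessions[session_id] = user

    def whoami(self, session_id):
        return self.sessions.get(session_id)


def run(servers):
    """Round-robin two requests: log in on one server, read on the next."""
    load_balancer = cycle(servers)
    next(load_balancer).login("abc", "alice")
    return next(load_balancer).whoami("abc")


if __name__ == "__main__":
    local = [WebServer("web-1"), WebServer("web-2")]
    store = {}  # stand-in for a central session store
    shared = [WebServer("web-1", store), WebServer("web-2", store)]

    print(run(local))   # None: web-2 never saw the login handled by web-1
    print(run(shared))  # alice: both servers read the same central store
```

With local state the second request lands on a server that has never heard of the user. With the shared store, any server can answer, which is what makes it safe to add or remove servers.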
One way to answer the question, can we scale out, is to ask what would happen if we terminate the server running our app. If the answer is something like, we'll lose the user uploaded assets, then your developers need to address that. If the answer is more along the lines of, we'll need to deploy a new server with the latest version of the app, then maybe you're ready. You need to make sure that if all servers need access to something, it's centrally located.
Creating highly scalable systems isn't impossible without the cloud. But for most of us, the cost to create scalable systems without the cloud is just too high. Cloud platforms offer mechanisms to scale pretty easily without having to really think about it. If you need to go from 10 to 100 servers, cloud providers won't even blink. AWS, Azure, and Google Cloud all have auto scaling functionality. This allows you to scale out based on some metric such as CPU load, and back down when the load dies down.
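The idea of scaling out on a metric and back in when load drops can be sketched as a simple decision function. This is not any cloud provider's API. The function name `desired_servers`, the 70% and 30% CPU thresholds, and the one-server-at-a-time step are all assumptions chosen for illustration.

```python
def desired_servers(
    current: int,
    avg_cpu: float,
    scale_out_at: float = 70.0,  # assumed threshold, percent CPU
    scale_in_at: float = 30.0,   # assumed threshold, percent CPU
    min_servers: int = 1,
    max_servers: int = 100,
) -> int:
    """Return the server count a simple CPU-based policy would ask for."""
    if avg_cpu > scale_out_at:
        target = current + 1   # busy: add a server
    elif avg_cpu < scale_in_at:
        target = current - 1   # idle: remove a server
    else:
        target = current       # within the comfortable band: do nothing
    # Never drop below the floor or grow past the ceiling.
    return max(min_servers, min(max_servers, target))


if __name__ == "__main__":
    print(desired_servers(10, 85.0))
    print(desired_servers(10, 20.0))
```

The gap between the two thresholds is deliberate: if scale-out and scale-in triggered at the same value, a load hovering around it would cause servers to be added and removed over and over.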
So, if we look back at one of our previous designs, we can see that with some sort of auto scaling for the servers we now have a system that will have pretty high uptime and one that will handle the traffic as it comes to us, at least for the web servers. The database may or may not be able to keep up, depending on several factors that we won't go into in this course, such as the size of the server and whether or not the database is replicated to additional instances. So scalability is a feature that allows us to create highly available systems because when we need additional compute resources we can just add them. And just as important, when we no longer need those resources we just remove them. Now that we have an understanding of what it means to have an available system, and to use scaling to improve that availability, we should talk about some of the security concerns that operations needs to think about.
And that'll be the subject of our next lesson. So, if you're ready, let's dive in.
Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.