Advanced Techniques for AWS Monitoring, Metrics and Logging
1h 10m

Modern AWS cloud deployments are increasingly distributed systems, comprising many different components and services that interact with each other to deliver software. To ensure quality delivery, companies and DevOps teams need more sophisticated methods of monitoring their clouds, collecting operational metrics, and logging system occurrences.

This course aims to teach advanced techniques for logging on AWS, going beyond the basic uses of CloudWatch Metrics, CloudWatch Logs, and health monitoring systems. Students of this course will learn:

  • How to use logs as first-class building blocks
  • How to move away from thinking of logs as files
  • How to treat monitoring, metrics, and log data as events
  • How to reason about using streams as log transport buffers
  • How CloudWatch Log Groups are structured internally
  • How to build an ELK stack for log aggregation and analysis
  • How to build Slack ChatOps systems

If you have thoughts or suggestions for this course, please contact Cloud Academy at


Welcome to the Advanced AWS Monitoring, Metrics, and Logging course. This course is intended for DevOps engineers who want to get the most out of monitoring, metrics, and logging and, in general, keep track of the state of their cloud.

So in this lecture, we'll go through an introduction to the course, which will include a definition of the three terms that make up the title of the course. We'll also go over the intent of the course and what you're trying to learn while watching this video series. We'll go over the scope of the content, so everything that'll be covered beyond just the intent of what you should learn; the benefits of running advanced systems for monitoring, metrics, and logging on AWS; the different lectures in the course that you should expect to see; and then a brief summary of the intent of the course in a final mission statement.

So without further ado, let's get into it. So first, let's define the term "monitoring". When we think of monitoring, we should be thinking of how we can observe and check the progress or quality of something over a period of time, and keep it under a systematic review. Now in Amazon Web Services, this means verification that your cloud works, and I've included icons for three of the main services that people should be familiar with for doing this. You may not have used all of them before, but you'll become more familiar with them over this course as we talk about how we might do these things.

That's the Route 53 icon in the top-left there. We can use Route 53 health checks, where we ping a certain DNS endpoint and do a holistic check to make sure that we get a 200 status code, or just a general heartbeat or response back.
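
As a rough illustration of the heartbeat idea described above, the sketch below fetches an endpoint once and classifies the response. The function names and the "2xx or 3xx counts as healthy" rule are assumptions made for this example, not a description of how Route 53 itself is implemented.

```python
from urllib.request import urlopen
from urllib.error import URLError, HTTPError


def is_healthy(status_code: int) -> bool:
    """Treat any 2xx or 3xx response as a passing heartbeat (assumed rule)."""
    return 200 <= status_code < 400


def check_endpoint(url: str, timeout: float = 2.0) -> bool:
    """Fetch `url` once and report whether the response looks healthy."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return is_healthy(response.status)
    except HTTPError as err:
        # The server answered, but with an error status such as 500.
        return is_healthy(err.code)
    except (URLError, OSError):
        # No answer at all: DNS failure, refused connection, or timeout.
        return False
```

A managed health check repeats something like this on a schedule and from several locations, which is what turns a single probe into monitoring over time.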

That's the CloudWatch icon up there in the top-right, which can include CloudWatch Events, Metrics, and Logs, as well as Alarms.

Then, we have the ELB icon there. There are ELB health checks for when we're thinking about doing auto-scaling, and making sure that individual instances behind a load balancer are healthy.

So we should also look at metrics. When we think about defining metrics, we have a standard or a system of measurement. So in AWS, this means quantifying cloud behavior and state. So standard or system of measurement here means that we have a specific thing that we're measuring whenever we think about a metric. It's a little bit different from monitoring, where we're looking more for a binary "is it online or not" over a period of time, whereas with a metric we're looking to quantify something using CloudWatch, which is that icon there. So quantifying our cloud behavior and state could include things like reading the amount of traffic that comes across EC2 instances or an Elastic Load Balancer, or reading the amount of provisioned throughput or consumed throughput for DynamoDB tables. Anything that we can put numbers to and graph nicely over time, that's what you should be thinking about for metrics.
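
To make "putting a number to something" concrete, here is a minimal sketch that builds one data point in the shape CloudWatch's PutMetricData call expects in its `MetricData` list. The helper name, the metric name, the dimension, and the namespace in the trailing comment are all hypothetical, and nothing is actually sent to AWS.

```python
from datetime import datetime, timezone


def build_metric_datum(name: str, value: float, unit: str, dimensions: dict) -> dict:
    """Shape one data point like an entry in a PutMetricData `MetricData` list."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


datum = build_metric_datum(
    name="RequestCount",                  # hypothetical metric name
    value=42,
    unit="Count",
    dimensions={"Service": "checkout"},   # hypothetical dimension
)

# With boto3 installed and credentials configured, this could be published with:
#   boto3.client("cloudwatch").put_metric_data(Namespace="MyApp", MetricData=[datum])
```

Each datum is one measurement of one specific thing at one moment; graphing many of them over time gives the quantified view of the cloud.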

So we also use metrics as input for CloudWatch Alarms, which is that icon in the bottom-right there, where that red presumably would be a threshold that I cross. One of the reasons that I would be using metrics is to see whenever I cross certain thresholds and to trigger certain behaviors.
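
The threshold idea can be sketched in a few lines. This is a deliberately simplified model, assuming an alarm fires when the most recent N data points all exceed a threshold; the function name and sample values are made up, and real CloudWatch Alarms offer more options than this.

```python
from typing import List


def alarm_state(datapoints: List[float], threshold: float, evaluation_periods: int) -> str:
    """Return "ALARM" if the most recent `evaluation_periods` points all exceed `threshold`."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(point > threshold for point in recent) else "OK"


cpu_percent = [41.0, 55.0, 83.0, 91.0, 88.0]  # made-up sample data
print(alarm_state(cpu_percent, threshold=80.0, evaluation_periods=3))  # prints "ALARM"
```

Requiring several consecutive breaches rather than one is a common way to avoid reacting to a single noisy reading.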

So moving on to logging, this one might surprise people: recording of performance events or day-to-day activities. So in Amazon Web Services, this means a time sequence of system occurrences, which is very generic. It's not what you might expect. Most people coming from a non-cloud background, or from a cloud that doesn't have sophisticated value-added tools to help manage logging, might think of logging as dumping out files that represent things that happened in your application code. That's a useful abstraction for maybe a single desktop computer, but logging is extremely difficult to handle in the cloud when you think of logs as just individual files, rather than as log events, performance events, or day-to-day activities in general. It's very difficult to manage that level of complexity in the cloud, so that's why you're watching this course.
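
One way to picture "a time sequence of system occurrences" instead of a file is to treat every log entry as a small timestamped record. The sketch below is only an illustration; the field names and the sample sources are assumptions chosen for this example.

```python
import json
import time


def make_log_event(source: str, message: str, **fields) -> dict:
    """Represent one system occurrence as a timestamped event."""
    return {
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
        "source": source,
        "message": message,
        **fields,
    }


events = [
    make_log_event("web-1", "request served", status=200, latency_ms=12),
    make_log_event("web-2", "request failed", status=500, latency_ms=340),
]

# Events from many machines merge into one time-ordered sequence.
for event in sorted(events, key=lambda e: e["timestamp"]):
    print(json.dumps(event))
```

Because each event carries its own timestamp and source, it no longer matters which machine or file it came from.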

So the intent of this course is to enable actionable understanding of your cloud, which is a very generic statement. But what that means is that we go from thinking about logs as afterthoughts, or things that we might use for debugging whenever things go wrong, and move forward towards this "after" state, where monitoring, metrics, and logging is a first-class design task that delivers massive business value. So what does that mean? Well, it means that when we implement monitoring, metrics, and logging correctly, as a DevOps engineer, one of the primary things that you should be praised for is the level of sophistication of your monitoring, metrics, and logging system: making it a very easy system to extract value from, to answer questions about the operational metrics of a system, to ensure high availability, and to ensure good software delivery practices. So monitoring, metrics, and logging is very important, and as we move towards this value-added thinking around monitoring, metrics, and logging, rather than treating them as afterthoughts, we'll be better DevOps engineers.

So the scope of our content is that we fundamentally rethink the log. So as we alluded to earlier when we defined the log, in this course we need to fundamentally rethink the log away from a set of files that you might have on your virtual machine, or on bare metal if you're coming from a data center that you actually have.

We have to think about logs and metrics, and how they offer value. So one of the statements that I made on an earlier slide was that we are going to turn these things that are typically afterthoughts into value-added systems. So we have to cover how we can do something like that and extract that value. So we'll learn the skills to extract insight from the logs. When I talk about value, typically the type of value that we're thinking of when we talk about using logs is insight. So we have three levels of completeness of information or data. We have data, which might just be individual lines in your log. We have information, which is a slightly higher level of abstraction where I can speak an English sentence and explain what the data means. Then, insight, which is where I take away some critical thing that I didn't know before, or I've learned something new from that kind of information or data. So we want to be able to use logs to extract new insights that we've never thought of or seen before, and deliver value that way.
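
The three levels can be walked through on a toy example. The log lines and their "<service> <level> <message>" layout below are invented for illustration; the point is only how raw lines become a summary, and how a summary can point at something worth investigating.

```python
from collections import Counter

# Data: raw lines (made-up sample in a hypothetical "<service> <level> <message>" format).
raw_lines = [
    "checkout ERROR payment gateway timeout",
    "checkout INFO order placed",
    "search INFO query served",
    "checkout ERROR payment gateway timeout",
    "checkout ERROR card declined",
]

# Information: a summary you could state as a sentence ("checkout logged 3 errors").
errors_per_service = Counter(
    line.split()[0] for line in raw_lines if line.split()[1] == "ERROR"
)

# Insight candidate: which error message dominates, pointing at something to investigate.
top_error, count = Counter(
    line.split(maxsplit=2)[2] for line in raw_lines if line.split()[1] == "ERROR"
).most_common(1)[0]

print(errors_per_service)
print(f"Most frequent error: {top_error!r} ({count} occurrences)")
```

The code only surfaces a candidate; deciding that the payment gateway is the real problem is the insight a person, or a later system, takes away.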

So we want to learn practical methods for handling logs. If we're going to be messing with all these logs and these metrics, which are really kind of just a subset of logs, then we need practical methods for handling these things and handling complexity. Because as you know in a highly dynamic cloud environment where we've got lots of distribution and lots of moving parts, sometimes the challenge can just be managing complexity.

We'll design some automation around log event streams. In addition to having human eyes derive insight from logs, you should know that the insight you can get from logs, if they're structured, can also drive automation. So we'll get into that a little bit later. But there are a number of places where you can read or sift through logs and, depending on what you see, take automated actions. So we're going to get logging superpowers, effectively.
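
A minimal sketch of sifting through structured events and acting on what is seen might look like the following. The event types, the handler functions, and the rule table are all hypothetical; the handlers only return a description of the action rather than performing it.

```python
from typing import Callable, Dict, Iterable, List


def restart_service(event: dict) -> str:  # hypothetical action
    return f"restart {event['source']}"


def page_on_call(event: dict) -> str:  # hypothetical action
    return f"page on-call about {event['source']}"


# Map an event type to the automation it should trigger.
RULES: Dict[str, Callable[[dict], str]] = {
    "health_check_failed": restart_service,
    "disk_full": page_on_call,
}


def process(stream: Iterable[dict]) -> List[str]:
    """Walk the event stream and collect the action chosen for each matching event."""
    actions = []
    for event in stream:
        handler = RULES.get(event.get("type"))
        if handler is not None:
            actions.append(handler(event))
    return actions


stream = [
    {"type": "request_served", "source": "web-1"},
    {"type": "health_check_failed", "source": "web-2"},
    {"type": "disk_full", "source": "db-1"},
]
print(process(stream))  # prints ['restart web-2', 'page on-call about db-1']
```

This only works because the events are structured: a rule can match on a field instead of guessing from free text.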

Monitoring and metrics are also in the title of the course. But I like to think about those as subsets of logging in general, which is the more generic thing where monitoring is a little bit more binary in the context of the cloud, where we're monitoring for uptime. Metrics are a little bit more oriented towards quantitative goals.

So what are the benefits of some of the advanced systems that we'll be building? Advanced logging helps manage systems in a number of different ways. Logging techniques should scale with the business. We should immediately yield the convenient insights that we were talking about earlier, and we will reduce the ongoing DevOps effort of managing our cloud.

Finally, when we look at the lectures in our course, we have Events Everywhere, Handling Distribution, Try the ELK Stack, and ChatOps with Slack. So what does that get us? In Events Everywhere, we'll be talking about how to rethink the log in general, changing our brains around the paradigm shift that is going from thinking about logs as a file-oriented system to an event-driven system. In Handling Distribution, we'll get into the nature of the systems that we'll use to manage the complexity around delivering these insights and this additional value from logging systems. Try the ELK Stack will be a show-and-tell, where I walk through a very, very common, the most common actually, log extraction and insight extraction tool on Amazon. The ELK stack is now offered as a service from Amazon. And then ChatOps with Slack. So ChatOps alludes to surfacing your system's automation in chat: if system events occur, rather than emailing you or phone-calling you, since most people spend a lot of their time in chat and chats are effectively an event stream as well, we can insert log notices into our chat system. I've picked Slack, which is an enterprise chat system that got really popular in 2014 and 2015, where we can do easy API-driven development to insert insight into our group chats.
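
As a small taste of inserting a log notice into chat, the sketch below formats an event as a Slack incoming-webhook payload, which is a JSON body with a `text` field. The event fields and function names are assumptions for this example, and the webhook URL must be supplied by the caller; the posting function is defined but not called here.

```python
import json
from urllib.request import Request, urlopen


def format_notice(event: dict) -> dict:
    """Turn a log event into a Slack incoming-webhook payload."""
    return {"text": f"[{event['level']}] {event['source']}: {event['message']}"}


def post_to_slack(webhook_url: str, event: dict) -> None:
    """POST the notice as JSON to an incoming webhook (URL supplied by the caller)."""
    body = json.dumps(format_notice(event)).encode("utf-8")
    request = Request(webhook_url, data=body, headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=5) as response:
        response.read()


event = {"level": "ERROR", "source": "web-2", "message": "health check failed"}  # made-up event
print(format_notice(event))  # prints {'text': '[ERROR] web-2: health check failed'}
```

Keeping the formatting separate from the posting makes it easy to check what would be sent before wiring it to a real channel.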

So when we think about doing these different lectures, we should be thinking about all of these different graphs and insights that we might be deriving. We will mostly see some of these in the ELK stack, but start thinking about your logs as a tool to derive insight and do analytics, which is why we have all these graphs here on the side.

So in summary, this course will teach you how to extract value from observing your AWS systems. That's generically speaking, but effectively it means we first rethink the way that we are handling logs in general, and start packaging them and utilizing them in a different way. Then, we design appropriate technical systems to handle the new format of the logs in the best way possible, and make them as usable and consumable as possible by other systems, and by both technical and sometimes even nontechnical users. Then, we'll go over some practical examples of how we can actually utilize these things, so you can envision implementing them in your own systems or your own company.

So next up, we'll be doing Events Everywhere, which is the video course that'll teach us how to rethink how we do log systems.

About the Author

Nothing gets me more excited than the AWS Cloud platform! Teaching cloud skills has become a passion of mine. I have been a software and AWS cloud consultant for several years. I hold all 5 possible AWS Certifications: Developer Associate, SysOps Administrator Associate, Solutions Architect Associate, Solutions Architect Professional, and DevOps Engineer Professional. I live in Austin, Texas, USA, and work as development lead at my consulting firm, Tuple Labs.