Modern AWS cloud deployments are increasingly distributed systems, comprising of many different components and services interacting with each other to deliver software. In order to ensure quality delivery, companies and DevOps teams need more sophisticated methods of monitoring their clouds, collecting operational metrics, and logging system occurrences.
This course aims to teach advanced techniques for logging on AWS, going beyond the basic uses of CloudWatch Metrics, CloudWatch Logs, and health monitoring systems. Students of this course will learn:
- How to use logs as first-class building blocks
- How to move away from thinking of logs as files
- Treat monitoring, metrics, and log data as events
- Reason about using streams as log transport buffers
- How CloudWatch Log Groups are structured internally
- To build an ELK stack for log aggregation and analysis
- Build Slack ChatOps systems
If you have thoughts or suggestions for this course, please contact Cloud Academy at email@example.com.
Welcome back to Advanced AWS Monitoring Metrics and Logging on CloudAcademy.com. In this lecture, we'll be talking about how events are everywhere and how this relates to logging.
First of all, monitoring metrics and logging, they're all events. We'll talk about how we can use metrics and how they're slightly different than normal logs. We can talk about how logging is everything and the value that it holds. We'll go over event sources and consumers, so different things that are creating events and different things that are reading or seeing these events come across and acting on them, and how streams solve a lot of the problems associated with managing these logs. Finally, after going through all of the logic in the previous slides, we'll talk about why we're saying "death to the log file" as we eluded to in the introduction lecture.
So again, events are everywhere. Monitoring metrics and logging, they're all events. So let that sink in for a moment. Events are, "Hello. Boom. Something happened. Event." Monitoring is a typically binary, "Oh, holistic check. Is my system online?" This can be a DNS ping or HTTP check or something. Metrics are typically a quantifiable, time-sequenced thing, so metrics of usage, metrics of traffic, metrics of throughput. Logging is a little bit more generic and is the parent category to both of these, in that it can be unstructured or structured. So every health check, metrics collection ping, line of log content is an event, or a thing that happened.
You can imagine that there can be a timestamp associated with all three categories of these things, so we just think about it as recording an event or a thing that happened. Like normal analytics, what happens has value, so there's a logical jump here. If we think about what most businesses, or at least small businesses or companies that haven't undergone technology modernization efforts, most businesses think only about analytics from a marketing or even a physical operations perspective perhaps on a factory floor.
These business analysis, business intelligence, business analytics tools, these are a sophisticated ecosystem, but we're just seeing log analytics catch on. So if we think about events as different things that occur and we have analytics over events in the marketing space, we should also be able to do analytics over the logging space when we have log events.
When we think about metrics, we should think about this kind of graph where they're entirely quantitative and our metrics are best used for time-series algorithms and reaction. Because metrics are inherently quantified already, they have some sort of number associated with them, the metric that we like to think about, we can make decisions based on what we see.
So here we actually look at a DynamoDB consumed write capacity units throughput table, and this is a metric in the CloudWatch Console. This metric tells me that I just consumed about a quarter of a million requests in five minutes after a period of essentially no requests overtime. So metrics are useful because they let you do things like detect when something like that happens. Where I go from almost no requests coming across my system or making writes to my DynamoDB table, and suddenly one minute there was 0.25 million requests, and the previous 5-minute period there was over 200,000 requests. We want to be able to figure out when these kind of spikes happen, and we do that using metrics in the CloudWatch Metrics Console.
So logging is everything. Logging, the more generic field of things that we're talking about, where metrics are technically just log events with a well-quantified field to them. They still have timestamps. So business logic is loggable and maybe quantifiable, the quantifiable part being the metrics, and we can think about logs as append-only series of events in a flexible semi-or-unstructured format that can be quantified. That's a little bit of a doozy of a sentence to think about. But append-only, meaning, if you think about a log file, when you concatenate to the end of the log file you append to each new line that you come out with. Log files are just a file representation of a stream, like we're looking at here in the bottom-right. So we have an ordering where the top or left, it depends on the orientation of your log, in a file it'd be the top of the file, is the oldest event. The next record that you write to the log will always be guaranteed to be the newest record at the time. It's also append-only is very important here, when we think about it. We're not modifying 0 through 11 in that log event right there in the bottom-right. We are creating number 12, and that append-only property is very useful to us, as we'll see a little bit later here.
They're also best served with structure. That's my little joke there. I have a small JSON object that I've typed into a text editor. But what this tells us is that a metric is actually one of the more structured log types. It has an associated quantified piece to it. But all logs should be served with structure and consistent fields. So rather than just logging or printing text out to the console whenever you're trying to do your debugging, it's more helpful if you return the kind of error or the nature of the error, or any kind of details or parameters that were provided to a method when you log out an error.
When we think about logging, we should be thinking about how to move from free text, or unstructured format, to more structured formats since that's what computers and people will be able to do analysis over.
Note that I don't mention the storage medium here, that stream on the bottom-right there. It doesn't say that it's stored to disk, that log sequence. It doesn't say that it's stored to disk. It doesn't say that it's in a database. It doesn't say that it's in memory. It doesn't really matter, because logs can be transported or represented in a number of different ways. So don't think about files when you think about logs. Think about these sequences of events that may or may not have quantitative properties on them and should be structured if you're doing things correctly.
Talking about event sources and consumer, first of all, data availability is good. This is an age-old mantra of anybody that's ever done business intelligence or analytics from a data warehousing or data-like perspective that generally making more data available to more different parties is a good thing. Because we get better insights, better integration, and more actionable insights as we diversify the way that we can consume these things.
Metrics and log events are data. Right? So particularly if we think about from the previous slide, where I had a "Best": "Served" "With": "Structure" JSON object, if my logs are in JSON format there are very sophisticated and well-built out tools to help us do analysis over things like JSON objects. Because they're effectively a serialized representation of objects in memory, once we have our logs into JSON or some representation like that, then we can use them as first-class citizens or primary data sources.
So events are good, because we're saying if data availability and having more data about our business that's relevant is clean, if that kind of data availability is good and metrics and log events are data, then we should be looking towards our metrics and log event data as good, or sources of value for the business.
One example where somebody created an actionable system that actually delivers value just using logs, this is a metric. Which in my mind, it's a subset of the log and that's how you should think about it as well. Auto Scaling actually uses this method. So if we look at, we have an Auto Scaling group here on the left, we have a group of instances. If all of these instances are publishing metrics, which they are, to Amazon CloudWatch and we have a thresholding algorithm on them, which are our CloudWatch alarms. If you've ever done an Auto Scaling threshold alarm, that's one in the same. Then, we have logic that once we emit to the alarm and say, "Oh, I flagged a specific pattern in the log data, and I've come up with some analysis and some actionable thing that I need to do based on what I've seen," then we trigger the Auto Scaling and command something like an auto scaling system to scale up and down. So this is actually effectively how you would accomplish auto-scaling.
Beyond AWS, you know Amazon is very sophisticated and they have a very straightforward infrastructure¬-driven requirement to use a form of log to deliver value, we can also think about delivering value via a normal SaaS company by using log data to make self-managing systems. So let's trace through the user journey here. In the bottom-left we have a user or an end user, customer. The user can be a human being in the case of a web application, or it could be another software system in the case of more of a systems-oriented or a service-oriented architecture. They interact with your service, which I've represented as a collection of EC2 instances here, but it could be any number of complex things or stacks.
We produce business logs from the different components of our service, and we emit them over to CloudWatch Logs. We can pick them up from CloudWatch Logs using Amazon Elasticsearch Service, or ELK, which we'll get into a little bit later in the course. Based on thresholds coming out of that system, we can create custom alarms, notify an SNS topic, and have different actions be taken based on that SNS topic. We can write the SNS topic into a support database using a Lambda perhaps, or another system watching for this, and have customer support read out of a support database. We can page engineers based on different things that we see come out of the logs. We can also, if we have sophisticated enough logs that provide us with enough information, sometimes perform actions that solve the problem completely autonomously, which would be a healing Lambda automaton. It could also be EC2 instances as well. But we can implement self-healing logic if we have business logic logs that can flag things like increased error rates, or something like that.
So beyond just metrics that you're thinking about for throughput, we can also do more sophisticated logic like detecting patterns in the actual text or enumerated inputs inside of business logs, and create fairly simply, even though there's a lot of boxes and arrows here. There's lots of different value-added services that Amazon provides to us, such that we can string together one of these self-managing SaaS systems without creating our own instances or software simply by stringing together a couple different Amazon Web Services managed services.
So we have a complexity management problem already, just looking at that other slide over there. We think about strings as time-sequenced event buffers. We already talked about streams, and we already talked about events. Events are these different things that pop up in our logs, or they're the individual things that happen. Streams are entire sequences of these events. Streams also work as buffers, because if we have a stream that is the data intermediary to carry between two systems that might want to share log data, the streams will allow us to do a number of different things. They'll allow us to natively support log event style, so that one's clear. If we think about logs as a sequence of events and streams are simply time sequences of events, then this is a natural thing to want to start streaming our logs.
We also get a unified transport to other services from a stream. So if we think about what Kinesis is primarily used for, if you're familiar with the rest of the Amazon platform, it's primarily used for moving data in the correct order from one place to another. If we think about firehoses, if you've ever heard of that before, data firehoses, those are just streams as well. The firehose typically just means the velocity with which they come through. Then, three and four, they allow different consumption and production rates. If we think about if we have a system that is creating things, events very quickly or in a burst-y matter, and then another system that slowly DQs things in a constant rate, we can use a stream to buffer between those two systems as we have the spiky production and consistent consumption. Or even the other way around, where we might have constant addition to the stream, and then a spiky read off of the stream to do analysis and perhaps a spark, or write to a database.
It also allows us to decouple producer and consumer systems. So rather than having to keep a registry of all of the DNS addresses of all other micro-services in a complex system, we could have our log systems simply write to a stream that never changes location or DNS address. Write to that stream, and not concern itself with the consumers that are picking these things up.
We could also potentially have multiple producers writing to the same stream, and multiple consumers reading off of the same stream. So we can have a many-to-many relationship in which the producer only has to be aware of one thing, where the stream is, and the consumer only has to be aware of one thing. Even if we had 10 producers and 10 consumers all producing to and reading from the same stream, the only thing that any of those services need to be aware of is the stream location, so we have a nice decoupling there.
Streams are great, and we talked about why they're great. We need to realize that logs aren't just files that you open and scan when something breaks. They're a primary data transport mechanism, that is we can put the data that come out of our log streams and do a number of different things on them, even replicate databases. They're a primary first-class citizen for data, and log events are fundamental design building blocks.
So if we looked at my self-healing system, the primary thing that it was operating over were logs. All of my business logic was centered around the log, and even in Auto Scaling the primary thing that Auto Scaling does is use the design of the log system as the fundamental building block. It just reads out of this log stream, these time-sequenced events of load, and then make decisions for when to scale out. Hopefully, you've learned a thing or two about why streams and events are everywhere in a logging problem. Next up, we'll be talking about how to handle distribution, that is how to handle the distributed nature of log producers and log consumers, and in general how to write a distributed system that handles logs in the cloud.
Nothing gets me more excited than the AWS Cloud platform! Teaching cloud skills has become a passion of mine. I have been a software and AWS cloud consultant for several years. I hold all 5 possible AWS Certifications: Developer Associate, SysOps Administrator Associate, Solutions Architect Associate, Solutions Architect Professional, and DevOps Engineer Professional. I live in Austin, Texas, USA, and work as development lead at my consulting firm, Tuple Labs.