Modern AWS cloud deployments are increasingly distributed systems, comprising many different components and services that interact with each other to deliver software. To ensure quality delivery, companies and DevOps teams need more sophisticated methods of monitoring their clouds, collecting operational metrics, and logging system occurrences.
This course aims to teach advanced techniques for logging on AWS, going beyond the basic uses of CloudWatch Metrics, CloudWatch Logs, and health monitoring systems. Students of this course will learn:
- How to use logs as first-class building blocks
- How to move away from thinking of logs as files
- How to treat monitoring, metrics, and log data as events
- How to reason about using streams as log transport buffers
- How CloudWatch Log Groups are structured internally
- How to build an ELK stack for log aggregation and analysis
- How to build Slack ChatOps systems
If you have thoughts or suggestions for this course, please contact Cloud Academy at email@example.com.
Welcome back to Advanced Amazon Web Services Monitoring Metrics and Logging on CloudAcademy.com. In this lecture, we'll be talking about how we handle distribution and the distributed nature of stacks that we need to log and monitor.
First, we'll talk about how we need to stream everything to make this goal of handling distributed event sources work correctly. We need to understand the log group structure. We need to talk about how we perform unification and distribution of our streams and log events into multiple consumers. We need to talk about the different kinds of stream sinks that we can use when we're creating our log streams, and how they can be useful for sinking data or transactions into individual sinks. We'll talk a little bit about how you might build an archival and backup solution, that is, how we might logically achieve the long-term auditability requirements that you may have, rather than using the logs only as an in-flight or real-time log stream. And then we'll talk briefly about aggregation with the ELK stack and how that is logically achieved.
So, we're talking about handling distribution, and about how we want to stream it all. When we're streaming in an Amazon Web Services cloud, we say we need to stream it all because streams are a helpful transport mechanism, as we learned in the other lesson. But when we're thinking about streaming logs specifically in Amazon Web Services, we should be thinking of Kinesis and CloudWatch Logs streams and how they are your friends, and we're going to use them copiously as we set up these logging systems.
So first of all, we need to log everything, truly everything, to CloudWatch Logs first and use this as a source of truth. It's a common temptation to log to disk, or even to standard out, because it's easy and convenient. If we're logging to standard out, we should be using a CloudWatch Logs daemon to pick up each of the lines that we log and emit them to CloudWatch Logs. Amazon provides a tool that lets us reroute anything that's sent to standard out or to disk to CloudWatch Logs, and it runs as a daemon on Amazon Linux, Ubuntu, or CentOS distributions. So, rather than just using a plain file, we should be thinking about using a daemon like that and submitting to CloudWatch Logs first. And if we're using something like Lambda, CloudWatch Logs is natively supported.
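As a rough sketch of what such a daemon does with lines from standard out, the snippet below turns raw lines into the event shape that the CloudWatch Logs PutLogEvents API expects: a millisecond `timestamp` plus a `message`, in chronological order. The helper name `to_log_events` and the sample lines are made up for illustration, and the actual call to AWS is left as a comment.

```python
import time


def to_log_events(lines, clock=time.time):
    """Convert raw output lines into PutLogEvents-style event dicts.

    `to_log_events` is a hypothetical helper, not part of any AWS tool.
    `clock` returns seconds since the epoch and can be swapped out in tests.
    """
    events = []
    for line in lines:
        message = line.rstrip("\n")
        if not message:
            continue  # skip blank lines; they carry no information
        events.append({"timestamp": int(clock() * 1000), "message": message})
    # PutLogEvents expects the events of a batch in chronological order.
    events.sort(key=lambda event: event["timestamp"])
    return events


if __name__ == "__main__":
    batch = to_log_events(["GET /health 200\n", "\n", "POST /orders 201\n"])
    # With boto3 installed and credentials configured, a batch like this could
    # be sent with logs.put_log_events(logGroupName=..., logStreamName=...,
    # logEvents=batch).
    print(batch)
```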
So all processing should be done as a stream, starting with CloudWatch. This is an important notion: the top of our log stream funnel should start with CloudWatch, because in Amazon we have the ability to do long-term retention if we're streaming into CloudWatch. But we also have the ability to read off of it like an in-flight stream that might be more transient or temporary. So we get the best of both worlds when we use CloudWatch Logs, because we get long-term persistence as well as event ordering and event buffering through the streaming behavior.
So, when we're thinking about how logs work in CloudWatch and how the log model works, it follows the same model as this diagram that we're looking at. This is actually how Apache Kafka works, and CloudWatch is in some ways similar to Kafka in that it has similar partitioning and such. So let's take a look.
CloudWatch as a service encompasses this entire diagram. Inside it we have these two inner rectangular shapes: one log group on the left and another log group on the right. Log groups are the logical grouping of our logs. AWS Lambda already creates log groups based on the Lambda name and namespaces them under /aws/lambda. You can also create log groups yourself when you're running an EC2 instance, naming the group along the same vein. So for instance, I might decide to use one log group for each process that I'm running inside of an auto-scaling group. If I'm running two primary processes at a time on each instance in an auto-scaling group, I might use two log groups, shared by every instance.
So, log groups are great, but they're not the unit that you read from. They're just the logical unit of abstraction. When we're thinking about actually pulling data off in those time-sequenced recordings like we were talking about in the other lecture, only the log streams that belong to these log groups are guaranteed to be in order. There's no absolute ordering of events inside of a log group at all. There is absolute ordering inside of a log stream. That's a key distinction, because we might have multiple parallel streams being produced at the same time but not have any way to globally order them if we don't create a system on top of them. So, if you think about the CloudWatch Logs model: CloudWatch has many log groups, a log group has many streams, and a stream has many events. And inside of each stream, the events are guaranteed to be in order, as you can see here.
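The hierarchy just described can be modeled in a few lines. This is only an in-memory illustration of the relationships (group has many streams, stream has many ordered events); the class names and the sample group and stream names are invented, and nothing here talks to CloudWatch.

```python
from dataclasses import dataclass, field


@dataclass
class LogEvent:
    timestamp: int  # milliseconds since the epoch
    message: str


@dataclass
class LogStream:
    name: str
    events: list = field(default_factory=list)

    def append(self, event):
        # Within one stream, events keep the order in which they arrive.
        self.events.append(event)


@dataclass
class LogGroup:
    name: str
    streams: dict = field(default_factory=dict)

    def stream(self, name):
        # Create the stream on first use, then reuse it.
        return self.streams.setdefault(name, LogStream(name))


def build_example_group():
    """Two instances write to the same group through separate streams."""
    group = LogGroup("/app/api")  # hypothetical group name
    group.stream("instance-a").append(LogEvent(1000, "a: start"))
    group.stream("instance-b").append(LogEvent(900, "b: start"))
    group.stream("instance-a").append(LogEvent(1500, "a: request"))
    return group
```

Each stream in the example is ordered on its own, but the group holds no single ordering across `instance-a` and `instance-b`; that has to be built on top.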
So, when we're thinking about having these multiple logical streams, it's important to remember that a useful streaming application for logs may include systems that need to consume multiple streams, or systems that need to consume streams in a different format than the one they're produced in. For that, we have two different operations that we can perform, and they are our primary tools for log streaming system design. On the left here, we have unification, where we unify two streams into a single stream and publish it to a consumer who may be concerned with a correctly, globally ordered stream: stream C is the unification, or interleaving, of A and B. Or we may not be concerned with the global ordering at all. We might just be concerned with mixing two topics together, such as API and database logs. This is also how you might merge two streams within a log group into stream C. So we could imagine that if we stream into Kinesis from two different streams inside of the same log group, then stream C might be the Kinesis stream, or the unified log, which represents the merged events from all streams inside of a group. And if we have a consumer that is interested in reading all events from that entire group, then consumer A will be happy with stream C.
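One way to picture unification is a timestamp-ordered merge of streams that are each already in order. The sketch below uses Python's `heapq.merge` for that; the `unify` name and the tuple layout are assumptions for the example, and it presumes the timestamps from different producers are comparable.

```python
import heapq


def unify(*streams):
    """Interleave already-ordered streams into one globally ordered stream.

    Each stream is an iterable of (timestamp_ms, message) tuples sorted by
    timestamp, the way events inside a single log stream are.
    """
    return list(heapq.merge(*streams, key=lambda event: event[0]))


if __name__ == "__main__":
    stream_a = [(1, "api: request"), (4, "api: response")]
    stream_b = [(2, "db: query"), (3, "db: result")]
    for event in unify(stream_a, stream_b):
        print(event)
```

If the consumer only cares about mixing topics and not about global order, simply concatenating the batches as they arrive would do, and the merge step can be dropped.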
We also have the ability to do a transformation with a map or a filter, or both at the same time: a filter-map or a map-filter. Here we have a stream before, we run it through some sort of logical process, and we emit into a stream after. In Amazon, this is typically achieved by reading out of a Kinesis stream and running a functor on a Lambda over shards of the events coming out, a shard being a frame of multiple events coming out at once as we optimize for network throughput. So for instance, if we're streaming a million records every hour, we wouldn't necessarily want to read out of the stream one record at a time. We might want to chunk 10 at a time, and that would be a shard.
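Here is a minimal sketch of the before-stream, transform, after-stream idea, with plain Python lists standing in for the two streams. The field names (`ts`, `level`, `msg`) are assumptions about the log format. In Kinesis documentation "shard" names a partition of the stream, so the sketch uses the word "batch" for the group of records read together.

```python
import json


def chunk(records, size):
    """Yield records in batches of `size` instead of one at a time."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def transform(batch):
    """Filter, then map: keep ERROR events and reshape them."""
    out = []
    for raw in batch:
        event = json.loads(raw)
        if event.get("level") != "ERROR":  # filter step
            continue
        # map step: emit a smaller record for the downstream stream
        out.append(json.dumps({"ts": event["ts"], "error": event["msg"]}))
    return out


def run(before, size=10):
    """Read the 'before' stream in batches and build the 'after' stream."""
    after = []
    for batch in chunk(before, size):
        after.extend(transform(batch))
    return after
```

In a deployed version, `transform` would roughly be the body of the Lambda, and the batch size would be set on the event source rather than by a helper like `chunk`.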
So with these map or filter operations, we can have a Lambda read out of the upstream stream, execute some sort of mapping or filtering logic, and put the result into a second stream. When we combine these two methods, unification and distribution through a map or filter (the transformation), we realize that we can chain them together. We can design systems of arbitrary complexity because we can also have multiple consumers per stream, right? So, the key concepts: merge and alter.
But there's this third concept, stream sinks, where we can sink the output of a stream into multiple different places. Rather than showing a fanout into two individual streams, this shows where fanout typically happens: at the end consumers. So this is an example where we have an application database that's receiving application-style reads and writes and producing a transaction log. We can use a stream as the transport mechanism for that transaction log, so the app database only needs to know where to submit its stream. And then everything downstream, our Elasticsearch for free-text search, our Hadoop for ad hoc analytics, our Redshift for analysis and business intelligence SQL, our S3 and subsequently Glacier backup, even our replica application database (since the transaction log represents a change stream that we can use to recreate a replica), only needs to know where the transaction log is, because that is where the data we're curious about exists. Those consumers don't need to know where the application database lives. That way, if any of the databases go down, either upstream or downstream, the only piece that needs a consistent address is that transaction log.
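The decoupling described here can be sketched with a small in-memory stand-in for the transaction log. The producer only appends, and each sink keeps its own read position. `TransactionLog` and the sink names are hypothetical; a real system would use Kinesis or a similar stream in this role.

```python
class TransactionLog:
    """A stand-in for a stream: producers append, sinks read independently."""

    def __init__(self):
        self._events = []
        self._sinks = {}    # sink name -> callable that receives one event
        self._offsets = {}  # sink name -> index of the next event to deliver

    def append(self, event):
        # The producer knows only about the log, never about the sinks.
        self._events.append(event)

    def attach(self, name, handler):
        # A newly attached sink starts reading from the beginning of the log.
        self._sinks[name] = handler
        self._offsets[name] = 0

    def deliver(self):
        # Each sink advances its own offset, so adding a sink later does not
        # disturb the producer or the other sinks.
        for name, handler in self._sinks.items():
            for event in self._events[self._offsets[name]:]:
                handler(event)
            self._offsets[name] = len(self._events)


if __name__ == "__main__":
    log = TransactionLog()
    search_index, archive = [], []
    log.attach("search", search_index.append)
    log.append({"op": "put", "key": "a"})
    log.deliver()
    log.attach("archive", archive.append)  # added later, still sees everything
    log.deliver()
    print(search_index, archive)
```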
So log streams are great for replication, as we see with the app database and the replica app database. The log is actually the ideal method for replicating any kind of state across two different databases, simply because if you replay that time-ordered sequence of events on the other database, you get an exact copy. You can also use the transaction log to recreate the secondary database at any point in time, since you can play forward only partially if you want when you restore the database.
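Replay and point-in-time restore can be shown with a toy key-value store. The log format below, tuples of `(sequence, op, key, value)`, is an assumption made for the example and not any database's real transaction log format.

```python
def replay(log, until=None):
    """Rebuild key-value state by applying log entries in order.

    `until` optionally stops the replay at a sequence number, which gives a
    point-in-time copy instead of the latest state.
    """
    state = {}
    for sequence, op, key, value in log:
        if until is not None and sequence > until:
            break
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


if __name__ == "__main__":
    log = [
        (1, "put", "user:1", "ada"),
        (2, "put", "user:2", "bob"),
        (3, "delete", "user:1", None),
        (4, "put", "user:2", "bobby"),
    ]
    print(replay(log))           # full replay: the replica matches the primary
    print(replay(log, until=2))  # partial replay: state as of sequence 2
```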
So, a common question is: if we're not looking at logs as files anymore, files that we go and peruse when we need to look up what went wrong in the past, how do we do archival, backup, and auditability in this brave new world of log events and streaming? Well, rather than just looking up files and writing directly to S3, simply look at the S3 objects as another sink. So, again, we can have our database creating a transaction log stream, maybe publishing to Kinesis, and we can have a Lambda reading off of Kinesis and writing an S3 object for each shard. So we have logs that are divided by time, and we have many objects over time. We can then still have the other consumers read from the same transaction log stream. So archival and auditability are actually no longer a special case at all. It's a primary case where it's just another sink. It's just another reader off of the same stream.
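A Lambda in this role might look roughly like the sketch below. It decodes the base64 record data that Lambda receives from a Kinesis event source and writes one object per delivered batch. The `put_object` callable and the key layout are assumptions; in a real function the callable would probably wrap an S3 client.

```python
import base64


def archive_batch(event, put_object):
    """Write one object per Kinesis batch delivered to a Lambda function.

    `event` follows the shape Lambda receives from Kinesis: a "Records" list
    whose items hold base64 data and a sequence number under "kinesis".
    `put_object(key, body)` is a hypothetical storage callable.
    """
    records = event["Records"]
    if not records:
        return None
    lines = [
        base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        for record in records
    ]
    first = records[0]["kinesis"]["sequenceNumber"]
    last = records[-1]["kinesis"]["sequenceNumber"]
    key = f"transaction-log/{first}-{last}.log"  # assumed key layout
    put_object(key, "\n".join(lines) + "\n")
    return key
```

Naming each object after the first and last sequence number keeps the objects divided by position in the stream, so a later audit can find the right object without scanning all of them.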
Once we write into the S3 objects, we can set lifecycle rules on the S3 bucket; you can go and look up the documentation in Amazon Web Services if you so choose. A lifecycle rule effectively tells S3 to change the storage class of data after a certain amount of time. Storage class changes can include moving to AWS Glacier, which is storage on magnetic tape. At the time of this recording, Glacier costs seven-tenths of a cent in US dollars per gigabyte per month in the primary regions, US East and US West, versus three cents for S3 storage. So it's roughly four times cheaper to store in Glacier, and we might imagine that after a certain period of time, it would be advantageous for us to simply lifecycle the data into Glacier.
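To make the rule and the arithmetic concrete, the sketch below builds a lifecycle configuration in the shape accepted by S3's PutBucketLifecycleConfiguration operation and compares monthly storage cost at the two prices quoted in the lecture, assuming those prices are per gigabyte per month. The rule ID, prefix, and 90-day threshold are placeholder values.

```python
def glacier_lifecycle(prefix, days):
    """Build a lifecycle configuration that transitions objects to Glacier.

    The returned dict could be passed as LifecycleConfiguration to boto3's
    put_bucket_lifecycle_configuration; no AWS call is made here.
    """
    return {
        "Rules": [
            {
                "ID": "archive-logs-to-glacier",  # placeholder rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


def monthly_cost(gigabytes, dollars_per_gb_month):
    """Storage cost for one month at a flat per-gigabyte price."""
    return gigabytes * dollars_per_gb_month


if __name__ == "__main__":
    print(glacier_lifecycle("transaction-log/", 90))
    s3_cost = monthly_cost(1000, 0.03)        # price quoted in the lecture
    glacier_cost = monthly_cost(1000, 0.007)  # price quoted in the lecture
    print(s3_cost, glacier_cost, s3_cost / glacier_cost)
```

At those prices the ratio works out to a little over four, which matches the "four times cheaper" figure.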
But the important piece here is that the S3 object writing is the part that finishes dealing with the streams, and then we batch after that. Archival, backup, and auditability are not a special case anymore when you're using streams. It's just another consumer. So it's a very clean and consistent design. If you really want files, you should do it this way.
So, if you think about how we do event sources and consumers, we can also realize that we can make the ELK stack another consumer. The ELK stack stands for Elasticsearch, Logstash, and Kibana. Now, Elasticsearch is a free-text search database and an arbitrary query analysis NoSQL engine. It's typically used for free-text search, which lends itself really well to doing lookups or forensic work on your cloud system, because you can do a free-text search for an error message or a reference code and see every time that reference code or error message appears in the entire history of your log stream. So E is great in ELK.
L stands for Logstash and it's our indexing mechanism. It is the way that data is flumed into Elasticsearch typically. Since that part is pretty much abstracted away by CloudWatch Logs and Lambda, we don't really think about that too much when we're rolling our own ELK stack solution on AWS but that's what the L stands for.
K is Kibana. The E in ELK, that search engine in the Elasticsearch service, is just an API-driven database that speaks JSON as its wire protocol over HTTP. Kibana is a graphical interface, an indexing system, and some prebuilt logic and prebuilt indexes on the cluster. It runs on the Elasticsearch cluster itself, and it's a GUI that lets you do things like create graphs, do log analysis, and apply general business logic to the indexed logs inside of the E portion, the Elasticsearch portion. So K, Kibana, is what adds the GUI and the credibility value.
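Since Elasticsearch is driven by JSON over HTTP, a forensic lookup for a reference code comes down to a small request. The sketch below only builds the path and body for a phrase search and does not send anything. The index name and the `message` and `@timestamp` field names are assumptions about how the log documents were indexed.

```python
import json


def build_search_request(index, phrase, size=50):
    """Return (path, body) for a phrase search against an Elasticsearch index."""
    body = {
        "size": size,
        "query": {"match_phrase": {"message": phrase}},  # free-text phrase match
        "sort": [{"@timestamp": {"order": "asc"}}],      # oldest occurrence first
    }
    return f"/{index}/_search", json.dumps(body)


if __name__ == "__main__":
    path, body = build_search_request("logs", "ERR-4031")  # hypothetical values
    print("POST", path)
    print(body)
```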
So, if we look at this entire flow chart here, it's the same thing that we've been seeing with those white boxes earlier in the slides, only this has some exact services named. Elasticsearch, yet again, is just another sink. For our database, presumably this would be something like DynamoDB, or even a SQL database if you wanted to implement your own transaction log scraper. In this case, I've used the Dynamo logo. DynamoDB can submit its changes to a change stream: DynamoDB Streams is something that is supported now. In effect, you are turning on what looks almost exactly like a Kinesis stream that is populated with all of the events from a change stream off the database. So it's almost exactly like a transaction log.
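To give a feel for what the change stream carries, the sketch below reduces a DynamoDB Streams batch, as delivered to a Lambda, to compact change-log entries. The output layout and the function name `summarize_changes` are invented for the example.

```python
def summarize_changes(event):
    """Turn a DynamoDB Streams batch into compact change-log entries.

    Each record names the change type in "eventName" (INSERT, MODIFY, or
    REMOVE) and carries the item key under record["dynamodb"]["Keys"].
    """
    entries = []
    for record in event["Records"]:
        change = record["dynamodb"]
        entries.append({
            "sequence": change["SequenceNumber"],
            "op": record["eventName"],
            "keys": change["Keys"],
            # NewImage is missing for REMOVE events, and whenever the stream
            # is configured not to include new images.
            "new": change.get("NewImage"),
        })
    return entries
```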
You can have a Lambda poll that transaction log, or change log, so we have a Lambda reading off the change stream. Then, whenever these changes occur, the Lambda writes to the CloudWatch log streams that Lambda creates by default, so those logs are automatically indexed by CloudWatch. That's a default thing if we set the role policy correctly to allow Lambdas to create the logs. We can also configure CloudWatch to stream into another Lambda. So we can stream into this Lambda off of our CloudWatch Logs and have the Lambda insert into the Elasticsearch service in the correct way for Kibana to operate efficiently. That is actually a console action that is handled for you.
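The Lambda that sits between CloudWatch Logs and Elasticsearch does roughly two things: unpack the subscription payload and reshape it for indexing. The sketch below illustrates both steps under stated assumptions; it is not the function the console generates. The document field names and the index name are made up, and the bulk body is built but not sent.

```python
import base64
import gzip
import json


def decode_subscription(event):
    """Unpack the payload CloudWatch Logs delivers to a subscribed Lambda.

    The payload arrives base64-encoded and gzip-compressed under
    event["awslogs"]["data"].
    """
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def to_bulk_body(payload, index):
    """Build a newline-delimited body for the Elasticsearch bulk API."""
    lines = []
    for log_event in payload["logEvents"]:
        # Action line: index this document, reusing the log event id.
        lines.append(json.dumps({"index": {"_index": index, "_id": log_event["id"]}}))
        # Document line: field names here are assumptions, not a fixed schema.
        lines.append(json.dumps({
            "@timestamp": log_event["timestamp"],
            "message": log_event["message"],
            "log_group": payload["logGroup"],
            "log_stream": payload["logStream"],
        }))
    # Every line of a bulk body, including the last, ends with a newline.
    return "".join(line + "\n" for line in lines)
```

Keeping the log group and log stream on each document is what later lets one search span several groups and still tell the sources apart.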
And then we have a fast, searchable, unified graphical logs UI, which is excellent for Ops management. This is one of the more common sinks that people think of, because it uses logs for the same purpose that people are used to using them for: debugging. This is just the extremely sophisticated way to debug an entire cloud, because you don't need to submit only the API Lambda logs or the database transaction logs to the Elasticsearch service. We can also submit logs from anywhere else in the cloud and still have them aggregated in a centralized place. So not only do we get unification of the multiple streams inside of log groups, but we can also aggregate across log groups and then create pretty charts and different analytics and metrics based not only on the hard numeric metrics that CloudWatch natively supports. We can also create arbitrary metrics off of a query that might be derived from working over the JSON representation of different logs.
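As a small illustration of a metric derived from the JSON representation of logs, the sketch below counts ERROR events per minute. It assumes each message is a JSON object with `ts` (epoch milliseconds) and `level` fields, which is a made-up format; in practice this kind of aggregation would more likely be expressed as a query in Elasticsearch or Kibana.

```python
import json
from collections import Counter


def errors_per_minute(messages):
    """Count ERROR-level events per minute from JSON log messages.

    Messages that are not JSON objects are skipped rather than treated as
    errors, since aggregated logs often mix formats.
    """
    counts = Counter()
    for raw in messages:
        try:
            event = json.loads(raw)
        except ValueError:
            continue
        if not isinstance(event, dict):
            continue
        if event.get("level") == "ERROR":
            counts[event["ts"] // 60000] += 1  # bucket by whole minute
    return dict(counts)
```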
So next we're going to do a little hands-on demonstration and try the ELK stack out by creating one inside the console.
Nothing gets me more excited than the AWS Cloud platform! Teaching cloud skills has become a passion of mine. I have been a software and AWS cloud consultant for several years. I hold all 5 possible AWS Certifications: Developer Associate, SysOps Administrator Associate, Solutions Architect Associate, Solutions Architect Professional, and DevOps Engineer Professional. I live in Austin, Texas, USA, and work as development lead at my consulting firm, Tuple Labs.