Welcome to Pizza Time!
Deploying the First Iteration of Our Business Application
Adding Analysis, Monitoring and Cost Management
Pizza Time is Going Global!
In this group of lectures we will introduce you to the Pizza Time business and system requirements. Pizza Time requires a simple ordering solution that can be implemented quickly and with minimal cost, so we will do a hands-on deployment of version 1.0 of our solution. To achieve that we are going to use a single region and deploy the application using AWS Elastic Beanstalk. We will implement a simple database layer that will be capable of delivering on these initial requirements for Pizza Time.
It turns out that the Pizza Time business is a success and the business now wants to target a global audience. We discuss and design how we can increase the availability of our initial Pizza Time application, then begin to deploy V2 of Pizza Time - a highly available, fault tolerant business application.
Hi and welcome to this lecture.
In this lecture, we will have an overview of monitoring with CloudWatch. I will talk briefly about CloudWatch itself, we are going to define a few concepts, and then we are going to have a hands-on demo of EC2 monitoring, RDS monitoring and S3 monitoring. We are going to talk about CloudWatch alarms, logs and events. So, CloudWatch is a monitoring tool. I guess that there is nothing new in here for you. So let's go further.
To really understand CloudWatch we need to define three important concepts in here. We need to define Metrics, Namespaces, and dimensions.
So, the first and the most basic concept that we need to define here is the metric. The metric is a basic piece of information about something. In this case it is the Number Of Objects, but that doesn't make much sense on its own, because this metric will only tell us a numeric value at a given time. But objects of what?
We need a namespace in order to put some sense into a metric. So right now we can see that there is more sense in here. We have the S3 namespace and, inside the S3 namespace, we have a metric called number of objects. That makes more sense for us. The same thing happens with other services. We have, for example, another metric, the CPU utilization metric, which will hold the information about the CPU utilization. But the CPU utilization of what? We can have more than one compute service on AWS, and for services like RDS, which are not really compute services, we also have information about the CPU utilization. So again we need a namespace to put some sense into this metric.
So in this case, it is the EC2 namespace. So we have defined the most basic concepts in here. We have the metric, which holds a point-in-time measurement: it will hold a timestamp and it will also hold a value.
We have a namespace. In this case, we are using the AWS namespaces. These namespaces are created by default on your AWS account. If you are using EC2, it will create the EC2 namespace, and it will create the S3 namespace if you are using S3. But that's really not enough, because if you think about EC2 you can have a lot of instances. How would you organize those metrics? You have the CPU utilization metric for every single instance that you have, and it would be a little bit complicated to really separate those metrics from each other. And that's when dimensions come into play.
The dimension will hold more characteristics about the metric itself. For example in this case, it will hold the instance ID related to the CPU utilization metric so by using dimensions you have the ability to separate the CPU utilization metric from each of your instances. And we will see when we talk a little bit more about custom metrics that we can put even more information on a dimension.
We could aggregate metrics for all our production servers, because maybe we are putting things behind an autoscaling group and we don't really care about a single instance but we care about that entire group of instances.
So we can aggregate that data inside a bigger dimension, telling it, “these are the instances working on production and these are the instances working on development”. The same thing happens for S3, since it doesn't make much sense to have the number of objects for all of S3; that would be like a global metric. We also need a dimension to say, “hey, this is the bucket and this is its number of objects”.
From now on, things can start to get a bit nasty, because every time you create a new dimension you are actually creating a new metric.
When you go to the CloudWatch console, you will see that we have a metric called CPU utilization for a given instance and we have a CPU utilization for another instance. If we change the information on our dimensions, that will appear as a new metric, so every time we create a new dimension, we are actually creating a new metric.
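To make that idea concrete, here is a minimal Python sketch (not an AWS API) that models a metric's identity as the combination of namespace, metric name and dimensions. The `MetricKey` class, the `metric_key` helper and the instance IDs are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricKey:
    """Hypothetical model of what makes a CloudWatch metric unique."""
    namespace: str
    name: str
    dimensions: frozenset  # set of (dimension name, dimension value) pairs


def metric_key(namespace, name, **dimensions):
    """Build a hashable key from a namespace, a metric name and its dimensions."""
    return MetricKey(namespace, name, frozenset(dimensions.items()))


# Same namespace and metric name, different dimension values (made-up IDs).
a = metric_key("AWS/EC2", "CPUUtilization", InstanceId="i-0aaa1111")
b = metric_key("AWS/EC2", "CPUUtilization", InstanceId="i-0bbb2222")
same_as_a = metric_key("AWS/EC2", "CPUUtilization", InstanceId="i-0aaa1111")

# Changing a dimension value yields a different key, i.e. a separate metric.
distinct_metrics = {a, b, same_as_a}
print(len(distinct_metrics))  # 2
```

Under this model, two instances reporting CPU utilization count as two metrics, which matches what shows up in the console.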
Just to illustrate a little bit more what a metric looks like, this is the information that we need to provide when we are sending a new metric to CloudWatch using the AWS CLI. We need to provide a metric name, a timestamp and a value, and we need to say what type of unit it is. In this case, the unit is Count. We are going to cover that in the next lecture, so don't worry about the unit right now.
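The same pieces of information can be sketched in Python. The function below only builds a dictionary shaped like the CloudWatch PutMetricData request; it does not call AWS. The namespace `PizzaTime`, the metric name `OrdersPlaced` and the `Environment` dimension are assumptions made up for this example.

```python
from datetime import datetime, timezone


def build_metric_payload(namespace, name, value, unit="Count",
                         dimensions=None, timestamp=None):
    """Build keyword arguments shaped like a PutMetricData request."""
    datum = {
        "MetricName": name,
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }
    if dimensions:
        # Each dimension is sent as a Name/Value pair.
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in dimensions.items()
        ]
    return {"Namespace": namespace, "MetricData": [datum]}


# Hypothetical custom metric for the Pizza Time application.
payload = build_metric_payload(
    "PizzaTime", "OrdersPlaced", 3, dimensions={"Environment": "production"}
)

# With boto3 installed and credentials configured, this could be sent with:
#   boto3.client("cloudwatch").put_metric_data(**payload)
print(payload["MetricData"][0]["Unit"])  # Count
```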
As I said, we can have default metrics. The EC2 service will have its own metrics. The RDS service will have its own metrics and so on. And we also have the ability to send custom metrics to CloudWatch. We will cover that in the next lecture but right now what you need to know is that we can send metrics from whatever we want to send to CloudWatch.
Also, metrics live for 14 days. So a single metric will live for only 14 days. If this metric, for example, is 14 days old, it gets deleted. You can still see the CPU utilization for this particular instance in the EC2 console, but you won't be able to see the deleted metrics, so that will change your graphs. The same thing will happen to these other metrics: after a particular metric gets 14 days old, it gets deleted, and so on.
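As a rough illustration of that expiry rule, the sketch below filters a list of data points by age. It is a local simulation, not CloudWatch itself; the 14-day value is the retention period stated in this lecture, and the dates and values are arbitrary.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=14)  # retention period as stated in this lecture


def visible_datapoints(datapoints, now):
    """Keep only (timestamp, value) pairs younger than the retention period."""
    return [(ts, value) for ts, value in datapoints if now - ts < RETENTION]


now = datetime(2016, 6, 15, tzinfo=timezone.utc)  # arbitrary example date
datapoints = [
    (now - timedelta(days=15), 40.0),  # too old, would no longer be shown
    (now - timedelta(days=13), 35.0),
    (now - timedelta(hours=1), 12.5),
]

remaining = visible_datapoints(datapoints, now)
print([value for _, value in remaining])  # [35.0, 12.5]
```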
Metrics exist only in the region in which they are created. So if you are using EC2 instances in the Oregon region you will only be able to see these metrics in the CloudWatch console in the Oregon region. There is only one exception for this rule which is billing.
CloudWatch also monitors your billing information, but all these metrics are sent to the Northern Virginia region. Even if you are deploying resources to other regions, for example the Oregon region, you still need to go to the Northern Virginia region in order to see the billing metrics. You can't delete any metrics. Since metrics automatically expire after 14 days, the only thing that you need to do is stop sending data to the metric. After 14 days the metric will disappear.
You can export the data from CloudWatch using the AWS CLI or one of the SDKs, but there is no out-of-the-box solution for that. So you need to implement your own logic and you need to deploy things by yourself.
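As a sketch of what that custom export logic might look like, the function below turns a list of data points into CSV text. It assumes the data points are dictionaries with `Timestamp`, `Average` and `Unit` keys, the shape returned in the `Datapoints` list of a GetMetricStatistics response; the sample values are made up, and fetching the data from AWS is left out.

```python
import csv
import io
from datetime import datetime, timezone


def datapoints_to_csv(datapoints, statistic="Average"):
    """Serialize data points to CSV text, oldest first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["timestamp", statistic.lower(), "unit"])
    for point in sorted(datapoints, key=lambda p: p["Timestamp"]):
        writer.writerow(
            [point["Timestamp"].isoformat(), point[statistic], point["Unit"]]
        )
    return buffer.getvalue()


# Made-up sample data, deliberately out of order.
sample = [
    {"Timestamp": datetime(2016, 6, 15, 10, 5, tzinfo=timezone.utc),
     "Average": 48.0, "Unit": "Percent"},
    {"Timestamp": datetime(2016, 6, 15, 10, 0, tzinfo=timezone.utc),
     "Average": 12.5, "Unit": "Percent"},
]

csv_text = datapoints_to_csv(sample)
print(csv_text)
```

The resulting text could then be written to a file or uploaded wherever you keep long-term data.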
Enough talking, let's go now to the AWS console and have an overview of monitoring with CloudWatch. So here at the AWS console, let's go first to the CloudWatch console. And in here, we can see that we have something familiar. We can see the namespaces. These are the services that I've been using on this account, and if we click here, for example on EC2, we can see the metrics that we have for all of the instances that I launched in the last two weeks in this account.
And as you can see, this is the instance on which we launched the Pizza Time application. If we select the CPU utilization, we are not actually selecting the metric for this particular instance at this moment. We are actually selecting a dimension which holds the CPU utilization from all the instances that I launched in the last two weeks.
If we click here, then we are selecting only the instance that we want, only the instance that is running the Pizza Time application. We have a very useful graph down here. We can specify how far back we want to go with this graph too. For example, if you want to see the average utilization of our instance over the last day, we can simply select the period of time in here and we can see it.
We can change that to hours, and we can also set a custom time span in here. If you want the last five and a half hours, we can simply select the time and the graph will be generated for us.
If we want to share this same graph, maybe because you identified an outage in the application and you want to share this graph with someone in your team, you can simply click in here and copy the URL for this graph. You can share it with people that have access to the account, and every time people access the URL that you shared, it will show the same graph that you are seeing on your screen. So that's a very useful tool.
I also want to show you that you don't really need to go to the CloudWatch console to see metrics about your resources. For example, let's go to the EC2 console, and in here I want to see the metrics for the instance that is running the Pizza Time application. So we go to the instances page, select the instance that we want, and we can see the monitoring tab, where we have some useful information. This information is related to the single instance, so we have the CPU utilization and we have information about the disks.
Just one thing to consider in here is that all these disk metrics relate to the disks of the physical machine. In this case, since we are using an EBS volume to launch this instance, we won't have any metrics in here, because we are not actually using a physical disk to run this instance.
I will show you where to check the EBS metrics in a moment. We also have network metrics, and these last three metrics are very important. The first one is status check failed system. It changes its value to one if there is a system failure, meaning an outage on the AWS side. The second one is status check failed instance. This status check monitors your instance, so for example if your operating system can't boot up, or if your instance gets stuck for some reason, it changes the value of this metric to one. The last one is status check failed any. It changes to one every time either of the other two status checks fails. So if there is a failure of any kind, no matter if it is in our instance or in the AWS system, that will change this metric to one.
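The relationship between the three status check metrics can be sketched in a few lines of Python. This is only an illustration of the logic described here, using the metric names `StatusCheckFailed_System`, `StatusCheckFailed_Instance` and `StatusCheckFailed`; the function name is made up.

```python
def status_check_failed(system_failed, instance_failed):
    """Combine the two individual checks the way the 'any' metric is described.

    Each argument is 0 (check passed) or 1 (check failed), mirroring
    StatusCheckFailed_System and StatusCheckFailed_Instance.
    """
    return 1 if (system_failed or instance_failed) else 0


print(status_check_failed(0, 0))  # 0: everything is healthy
print(status_check_failed(1, 0))  # 1: problem on the AWS side
print(status_check_failed(0, 1))  # 1: problem inside the instance
```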
So these are very important metrics and they are great for troubleshooting, because every time you see a problem, especially a problem in the AWS system, you need to stop your instance and then start it again, because that will force AWS to relaunch your instance on a different host.
And by the way, that also solves a lot of different problems. I've seen people solving problems, DNS problems, only by stopping their instance and starting it again. That will be my first suggestion. Every time you see a problem that doesn't really make much sense, try to stop your instance if you can and start it again.
Another thing that I want to show you in here in the EC2 console is that right now the EC2 instance is on basic monitoring. That means that AWS will collect metrics every five minutes. Every five minutes AWS will collect information about the CPU utilization, the disk reads and so on. If we want to increase that, maybe because we have a very critical application that we want to look at closely, we can enable detailed monitoring. And what detailed monitoring will do is, instead of checking your instance every five minutes, it will check your instance every minute. Every minute your instance will be sending data to CloudWatch, and you can have a faster response to a failure of any kind by using detailed monitoring. I won't enable that for these instances, so I will just close this.
And, talking a little bit about EBS volumes, if you want to see the metrics for the EBS volumes, we can see that on CloudWatch. We can simply click here, select the EBS ID, and for EBS we also have the monitoring tab. We will be able to see some metrics related to the EBS volume itself. We can also see that on the CloudWatch console, but I find that it is easier to see the metrics for a particular volume or a particular instance in the EC2 console. It is much quicker. I have the information that I need right there. Continuing with monitoring with CloudWatch, let's now take a look at the RDS console.
If we go to the RDS console and we select the RDS database, we will have kind of the same information that we have for EC2. Let's take a look. We can select the instance, and in here we already have some basic monitoring information. This monitoring information comes from CloudWatch, and if we want more detail we can click on show monitoring and we will have kind of the same interface that we have for EC2. We can see the CPU utilization. We can see the connections. We can see the free storage space and so on. So again this is very useful, because we can simply select our database instance and instantly see what is going on with our instance.
But for other services we don't have that very useful monitoring tab in the console. We always need to go to the CloudWatch console. That is the case for the S3 service. We don't have a monitoring tab or a monitoring view inside the S3 console.
Every time we want to know some information about a particular S3 bucket, we need to go in here. We need to select the metric that we want, and we will visualize the data.
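The difference between the two monitoring types comes down to simple arithmetic, sketched below. The five-minute and one-minute periods are the ones mentioned in this lecture; the constant and function names are made up for the example.

```python
BASIC_PERIOD_SECONDS = 300     # basic monitoring: one data point every five minutes
DETAILED_PERIOD_SECONDS = 60   # detailed monitoring: one data point every minute


def datapoints_per_hour(period_seconds):
    """Number of data points collected per hour for a given period."""
    return 3600 // period_seconds


print(datapoints_per_hour(BASIC_PERIOD_SECONDS))     # 12
print(datapoints_per_hour(DETAILED_PERIOD_SECONDS))  # 60
```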
An alarm is a way to evaluate a metric against a threshold over a given period. The alarm has a state, and depending on the value of that metric over time, the state of the alarm will change.
We have three possible states for an alarm. The “OK” state is when the metric stays within the threshold you defined. For example, this is the CPU utilization metric. Let's say that we create an alarm for this metric. So we say that we want our CPU utilization below 50%, and we can see here that in the last 50 minutes our CPU utilization is always below 50%, so in this case the state of our alarm would be OK. If, somehow, our CPU utilization goes beyond that and reaches 50%, then our alarm state would change to “alarm”, meaning that we would probably have a problem. We could also specify a low CPU alarm, saying that every time the CPU utilization is below, let's say, 20% we would have an “alarm” state, and we could set that alarm to remove an instance from the autoscaling group to save some money.
Insufficient data is self-explanatory. It is when we don't have enough data to either have an OK state or an alarm state. By the way, you can change the state of an alarm using the AWS CLI, some of the SDKs and the PowerShell tools for AWS, and you can have multiple actions per alarm.
You can set a notification that will use the SNS service to notify a topic and send a message to the subscribers of that topic.
You can have an autoscaling action. This autoscaling action only works if you are monitoring an autoscaling-related metric. So if you are monitoring the CPU utilization for all the instances in a given autoscaling group, you can set an autoscaling action to, for example, increase or decrease the number of instances in that group.
And this is kind of new: you can also set EC2 actions. An EC2 action will take action on the EC2 instance related to the alarm itself. You can have four types of EC2 actions. You can have the recover action, but note that the recover action can only be used if you are monitoring the status check failed system metric. Remember that this metric relates to problems on the AWS side of things, so we can use the recover action to launch that instance on another host, and that will most likely solve the problem. We can set a stop action that will stop the instance. We don't need to associate this action with the status check failed system metric. We can use other metrics for that. We can terminate the instance and we can reboot the instance.
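The rule about which metric each EC2 action can be attached to, as described in this lecture, can be written down as a small check. This is a local sketch, not an AWS API; the function name and the lowercase action names are assumptions for illustration.

```python
EC2_ACTIONS = {"recover", "stop", "terminate", "reboot"}


def is_action_allowed(action, metric_name):
    """Return True if the EC2 action may be used with an alarm on this metric."""
    if action not in EC2_ACTIONS:
        raise ValueError(f"unknown EC2 action: {action}")
    if action == "recover":
        # Recover only makes sense for failures on the AWS side.
        return metric_name == "StatusCheckFailed_System"
    return True  # stop, terminate and reboot can use other metrics


print(is_action_allowed("recover", "StatusCheckFailed_System"))  # True
print(is_action_allowed("recover", "CPUUtilization"))            # False
print(is_action_allowed("stop", "CPUUtilization"))               # True
```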
When you go to the CloudWatch console and click on create alarm, you see this screen, and in this screen you can see that you need to specify a name for the alarm and a description, and in here you specify the threshold. So you can see the condition in here: the CPU utilization, in this case, is greater than or equal to zero (or we could specify 50) for one consecutive period.
We specify the period in here, so we say we will evaluate this alarm against the threshold every five minutes, and if the CPU utilization is at or beyond 50% for one or more consecutive periods, the state of the alarm will change.
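Here is a simplified Python sketch of how an alarm state could be derived from the most recent periods. It is an assumption-laden model for illustration only: the real CloudWatch evaluation has more options for handling missing data, and the function name and sample values are made up. The state names match the three states discussed in this lecture.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Return "OK", "ALARM" or "INSUFFICIENT_DATA".

    datapoints: one value per period, most recent last; None marks a
    period with no data. The alarm fires when every one of the last
    `evaluation_periods` values is greater than or equal to the threshold.
    """
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(p is None for p in recent):
        return "INSUFFICIENT_DATA"
    if all(p >= threshold for p in recent):
        return "ALARM"
    return "OK"


print(evaluate_alarm([30, 55, 60, 70], threshold=50, evaluation_periods=3))  # ALARM
print(evaluate_alarm([30, 55, 40, 70], threshold=50, evaluation_periods=3))  # OK
print(evaluate_alarm([60], threshold=50, evaluation_periods=3))              # INSUFFICIENT_DATA
```

With one evaluation period a single high value is enough to change the state; requiring several consecutive periods makes the alarm less sensitive to short spikes.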
Here we can see the instance ID related to this alarm.
We can see the metric name and in here we can specify the actions for this alarm. We can select more than one action.
We can notify more than one topic.
We can do more than one autoscaling action and so on.
Once we set everything, we can click on create alarm. Another important thing, because people sometimes get confused about it, is that changing the period of the alarm won't change the EC2 monitoring type.
We say that we will evaluate the alarm every five minutes. We can also change that to evaluate the alarm every one minute but that won't change the type of monitoring. If your instance is using basic monitoring, it will remain as basic monitoring.
We have many ways of creating an alarm. The easiest way is by going to the CloudWatch console and we select alarms. We click “create alarm” and we will see a page where we need to select the metric for our alarm.
So I will create an EC2-related alarm. So I'll go to per-instance metrics and I will select the instance that is running our Pizza Time application.
I will select the CPU utilization metric and I will say that it is a high CPU alarm. I will define the threshold as more than 50% for, I would say, three consecutive periods. I will leave the period as five minutes, but we can change that. At this point the only thing that I want to do is receive a notification.
So I will keep this notification action and I will create a new list for me. I will say that the topic name is admin and I will insert an email in here. So every time that the state is “alarm” I will receive a notification.
And we can also say that every time the state is OK we will also receive a notification, but that's really not needed. I will delete that and create the alarm.
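The same alarm can also be described in code. The sketch below only builds a dictionary shaped like a CloudWatch PutMetricAlarm request, using the settings chosen in this demo (CPU utilization above 50% for three five-minute periods, one notification action). The instance ID, account number, region and topic ARN are placeholders, not values from the lecture.

```python
def build_high_cpu_alarm(instance_id, topic_arn, threshold=50.0,
                         period_seconds=300, evaluation_periods=3):
    """Build keyword arguments shaped like a PutMetricAlarm request."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "AlarmDescription": "High CPU alarm",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Notify the SNS topic when the alarm goes into the ALARM state.
        "AlarmActions": [topic_arn],
    }


# Placeholder identifiers, for illustration only.
alarm = build_high_cpu_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-west-2:123456789012:admin",
)

# With boto3 installed and credentials configured, this could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
print(alarm["AlarmName"])  # high-cpu-i-0123456789abcdef0
```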
We need to confirm this email, but I will do it later. I really don't need to do this right now.
CloudWatch can also store logs for us. This is very useful when we have autoscaling groups, or when we don't want to, for example, get inside an instance to analyze logs. We can send the logs to CloudWatch Logs and have a central place to store them, and the logs will remain even if we terminate the instance. That is very good, because sometimes when we are using autoscaling we can have a lot of instances being terminated and we don't know the reason. So we can use CloudWatch Logs to store the logs of our instances, and even if an instance gets terminated we can still check the logs of the instances in that autoscaling group.
Events are a little bit like alarms, but they don't monitor metrics. They monitor changes in the AWS resources, so we can select an event source. We can select an instance change or we can select an autoscaling change. Maybe we want to do a specific action every time we have an EC2 launch in our autoscaling groups. We can create a schedule, which will basically behave as a cron job, and we can also monitor AWS API calls, but for that we also need to enable CloudTrail in our account. And those events can trigger some targets. We can trigger a Lambda function. We can send a message to an SNS topic. We can put a message in an SQS queue and so on.
That's it for this lecture. We already covered a lot. We will continue covering CloudWatch in the next lecture.
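To give a feel for how an event rule selects events, here is a simplified Python sketch. The pattern follows the JSON shape used for EC2 instance state-change events, and the schedule strings use the rate and cron expression syntax. The `matches` function is a made-up local matcher that only handles exact values, not the real matching engine, and the instance IDs are placeholders.

```python
# Pattern: react when an EC2 instance reaches the "terminated" state.
PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated"]},
}

# Schedule expressions, which behave like a cron job.
EVERY_FIVE_MINUTES = "rate(5 minutes)"
DAILY_AT_NOON_UTC = "cron(0 12 * * ? *)"


def matches(pattern, event):
    """Simplified matcher: every pattern field must list the event's value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern, for example the "detail" section.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


terminated = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "terminated"},
}
running = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "running"},
}

print(matches(PATTERN, terminated))  # True
print(matches(PATTERN, running))     # False
```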
About the Author
Eric Magalhães has a strong background as a Systems Engineer for both Windows and Linux systems and currently works as a DevOps Consultant for Embratel. Lazy by nature, he is passionate about automation and anything that can make his job painless, so his interest in topics like coding, configuration management, containers, CI/CD and cloud computing went from a hobby to an obsession. He holds multiple AWS certifications and, as a DevOps Consultant, helps clients understand and implement the DevOps culture in their environments. Besides that, he plays a key role in the company, developing pieces of automation using tools such as Ansible, Chef, Packer, Jenkins and Docker.