Metrics and Anomaly Detection
This course covers Amazon CloudWatch and CloudWatch Alarms using Anomaly Detection.
Amazon CloudWatch is the monitoring and observability service from AWS. The phrase Anomaly Detection implies that this feature is used to detect outliers but this is an understatement. It is a feature of CloudWatch that uses Machine Learning to automate the creation of alarms and their related thresholds.
This course includes a review of Amazon CloudWatch and the challenges of setting and maintaining alarms. It covers how machine learning with Anomaly Detection helps with setting alarms and maintaining their thresholds.
You'll learn how to create a CloudWatch Alarm using Anomaly Detection and learn what types of metrics are suitable for use with Anomaly Detection.
- Gain a high-level understanding of Amazon CloudWatch
- Review how monitored metrics go into an ALARM state
- Learn about the challenges of creating CloudWatch Alarms and the benefits of using machine learning in alarm management
- Know how to create a CloudWatch Alarm using Anomaly Detection
- Learn what types of metrics are suitable for use with Anomaly Detection
This course is for anyone who wants or needs to create CloudWatch Alarms that are almost completely automated.
To get the most out of this course, you should have some experience running workloads in the AWS cloud, know what Amazon CloudWatch is, know how to create CloudWatch Alarms, and know how to trigger an action or notification based on an Alarm's state.
In 2019, AWS enhanced Amazon CloudWatch by adding a feature called CloudWatch Anomaly Detection. It is powered by Machine Learning and improves CloudWatch Alarms by automating their creation and maintenance. Removing most--but not all--manual intervention makes the Alarms more effective and efficient.
While its name--Anomaly Detection--implies that it was created to detect outliers, in reality, it was designed to automate the process of creating and maintaining CloudWatch Alarms. It uses Machine Learning to determine what normal and abnormal behavior for the metric is.
When using Anomaly Detection, Amazon CloudWatch will monitor chosen metrics and automatically go into an ALARM state when problematic behavior is detected.
If you're new to Machine Learning, there's a lot to it. However, there are some basic principles that are fairly easy to outline. I want to cover them, briefly, to help you understand the process.
While Machine Learning uses a lot of math--especially Linear Algebra--once a model has been created, it can be reused. This means that people who are not mathematicians can use it almost like a commodity.
One of the things that makes Machine Learning algorithms powerful is that they use data for training purposes. Instead of looking for a specific thing--like an image or a word--Machine Learning can be used to find patterns that might otherwise be hidden from view.
It's no accident that it's called Machine Learning. The algorithm learns as it is being used and becomes more refined--it improves--over time. I suppose that you could say that it gets smarter. However, some people are uncomfortable assigning intelligence to machines.
I can illustrate--in a general way--how this works without using any math.
Consider the letter "A." If I were to create an algorithm to detect the letter "a" in an image, I'd have to define, mathematically, what the letter looks like first.
The challenge is that, with this method, my algorithm would only find the exact version of the letter. The typewritten version of the letter "a" is very different from how I write it by hand. Depending on my needs, I'd have to define--with a fair amount of precision--what I needed to find.
Artificial intelligence--a precursor to Machine Learning--can remedy this. AI can be used to create an algorithm to recognize the letter "a" in various shapes and sizes.
By itself, this works really well. However, there's a drawback based on our human biases that is not apparent until it is pointed out.
What happens when the letter A looks like this? Is that still an "A" or is it now a "V?" What about this? Are these letters or are they some type of generic pointer?
My point is that, when defining what it is we're looking for, human beings often start with an established point of view or perspective and work backwards from it. Even using Artificial Intelligence, programs that I'd write would assume a certain orientation of the letter A because that's what I expected to find.
Because of this, using a human perspective with a computer model--like what happens using AI--means that some valid matches will be missed. Machine Learning, however, is designed to find patterns in the data no matter how it is orientated.
So, Machine Learning will find the letter A as it hides inside the data even if it is upside down, at an angle, upper-case, or lower-case. Is it perfect? No. Not at all. It's only as good as its training data.
Also, there are different types of Machine Learning algorithms. Ones designed to analyze trends inside of financial data are not going to be suitable for tasks involving image recognition. However, it is much more efficient and requires less human capital to manage and maintain than building individual algorithms manually per use case.
Bringing this back to CloudWatch Anomaly Detection it means that, when turned on, the service will learn--over time--what is normal for what is being monitored.
As much as I'd like to think otherwise, normal is not an absolute value. It's a range that is dependent on multiple factors. I suppose it's also a setting on my clothes washer. However, clothes-washing technology is out of the scope of this lecture.
Tomorrow, if everyone started wearing bright pink shoes and nothing else on their feet, this would become normal behavior. And, well, if bright pink shoes are wrong, I don't want to be right.
Anyway, back to being normal. Even in day-to-day life, many people struggle to determine what normal is, or what it means. Like naming things, it's a challenge that school never prepared me to do.
For Amazon CloudWatch Anomaly Detection, normal is defined by analyzing the historical values for a chosen metric and looking for predictable patterns that repeat hourly, daily, or weekly.
After collecting historical data, the service creates a best-fit model. This model helps to predict future values and to better differentiate between normal and problematic behavior.
For it to work, the metric being monitored needs to have a repeatable pattern. If a metric has wide fluctuations or is frequently idle with bursts of activity, Anomaly Detection will not have any practical value. When graphed, the model is shown as a gray band.
When creating a CloudWatch Alarm and using Anomaly Detection, you'll be prompted to enter an "Anomaly detection threshold." This is a number of standard deviations. A larger threshold produces a thicker band and a smaller threshold produces a thinner one.
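To make the effect of the threshold concrete, here is a minimal sketch. It assumes a deliberately simplified band of the mean plus or minus `threshold` standard deviations over a window of past values. CloudWatch's real model is more sophisticated and is not described in this lecture, so the function name `simple_band`, the sample data, and the formula are illustrative assumptions only.

```python
from statistics import mean, pstdev


def simple_band(history, threshold):
    """Return (lower, upper) as mean +/- threshold * standard deviation.

    Illustrative assumption only: this is NOT CloudWatch's algorithm.
    """
    center = mean(history)
    spread = pstdev(history)
    return center - threshold * spread, center + threshold * spread


# Hypothetical CPU utilization samples, in percent.
cpu_history = [48.0, 50.0, 52.0, 49.0, 51.0]

narrow = simple_band(cpu_history, 1)  # smaller threshold, thinner band
wide = simple_band(cpu_history, 3)    # larger threshold, thicker band

print("threshold 1:", narrow)
print("threshold 3:", wide)
```

With the same history, tripling the threshold triples the width of the band, which is why a larger number tolerates more variation before a value counts as anomalous.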
The model can be adjusted and fine-tuned as desired and multiple models can be used for the same CloudWatch metric. Alarms can be set to trigger when the metric moves outside of the band, is greater than the band threshold, or is lower than the band threshold.
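The three alarm conditions can be sketched as a small function. This is only an illustration of the logic; the function name `breaches_band` and the condition labels are made up for this example. The comments note the CloudWatch `ComparisonOperator` value that corresponds to each condition.

```python
def breaches_band(value, lower, upper, condition):
    """Return True if `value` violates the band under the given condition."""
    if condition == "outside":
        # ComparisonOperator: LessThanLowerOrGreaterThanUpperThreshold
        return value < lower or value > upper
    if condition == "greater":
        # ComparisonOperator: GreaterThanUpperThreshold
        return value > upper
    if condition == "lower":
        # ComparisonOperator: LessThanLowerThreshold
        return value < lower
    raise ValueError(f"unknown condition: {condition!r}")


# Hypothetical band of 40 to 55 percent CPU utilization.
for sample in (35.0, 50.0, 60.0):
    print(sample, {c: breaches_band(sample, 40.0, 55.0, c)
                   for c in ("outside", "greater", "lower")})
```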
CloudWatch Anomaly Detection has over 12,000 built-in models and almost eliminates the need to do manual configuration and experimentation when creating CloudWatch Alarms. Any standard or custom CloudWatch metric can be used with Anomaly Detection as long as it has a discernible trend or pattern.
Anomaly Detection algorithms can adjust for the seasonality and trend changes of metrics. Seasonality changes could be hourly, daily, or weekly. Once built, the model will be updated every five minutes with any new metric data.
It takes some time to build a model and--while the model is being built--the alarm state will show INSUFFICIENT_DATA.
After enabling anomaly detection for a metric, CloudWatch applies statistical and machine learning algorithms to continuously analyze it to determine normal baselines and reveal anomalies with minimal user intervention.
The algorithms generate an anomaly detection model. This model then generates a band of expected values that represent normal behavior for that metric. The expected values generated by the model can be used one of two ways. They can be used to create alarms or as part of a visualization with a graph.
Remember, an anomaly detection alarm is based on a metric's expected value. These types of alarms don't have static thresholds for determining an Alarm's state. Instead, what they do is compare the metric's value to the expected value based on the anomaly detection model.
The Anomaly Detection band is shown in gray, the metric itself is blue, and, when the metric's actual value goes outside of the Anomaly Detection band it is shown in red. In this graph, there's a spike at 11:45.
In this example, the unexpected change started at 6:00 and lasted until 8:00. At 8:00, this behavior was anticipated and--once again--is back within the range of expected values. Here, there are several spikes.
Anomaly Detection can be enabled using the AWS Management Console, the AWS CLI, or the AWS SDK. Anomaly detection can be performed using the built-in metrics available from AWS as well as with custom metrics.
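As a rough sketch of the SDK route, the following uses the AWS SDK for Python (boto3) and its `put_metric_alarm` call with an `ANOMALY_DETECTION_BAND` metric math expression. The alarm name, instance ID, period, and evaluation settings are placeholder assumptions, not values from this lecture. The helper only builds the request, so it can be inspected without AWS credentials.

```python
def build_anomaly_alarm_params(instance_id, band_width=2):
    """Build put_metric_alarm arguments for an anomaly detection alarm.

    All names and numbers here are illustrative placeholders.
    """
    return {
        "AlarmName": f"cpu-anomaly-{instance_id}",  # hypothetical name
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "ad1",  # points at the band expression below
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": "CPUUtilization",
                        "Dimensions": [
                            {"Name": "InstanceId", "Value": instance_id}
                        ],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "ad1",
                "ReturnData": True,
                "Label": "CPUUtilization (expected)",
                # band_width is the anomaly detection threshold
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
            },
        ],
    }


def create_alarm(instance_id):
    """Send the request. Requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder works without it

    cloudwatch = boto3.client("cloudwatch")
    return cloudwatch.put_metric_alarm(**build_anomaly_alarm_params(instance_id))


params = build_anomaly_alarm_params("i-0123456789abcdef0")
print(params["Metrics"][1]["Expression"])
```

Note that there is no static `Threshold` value in the request. The alarm compares metric `m1` against the band produced by expression `ad1`.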
Before I go any further, I've got a story to share about my experience using Anomaly Detection. When I was creating Alarms using Anomaly Detection, I experienced some--let's call it--unexpected behavior. Okay, more accurately, I thought there was something wrong with the service, the documentation, or my ability to follow directions.
After using it for a while, the Anomaly Detection behavior changed to be in line with what I expected. What happened was that, when I put 50% load on my instance, it went into an Alarm state and stayed there. It didn't seem to learn that 50% was normal.
I went back to the documentation and found this line.
When you enable anomaly detection for a metric, CloudWatch applies machine learning algorithms to the metric's past data to create a model of the metric's expected values.
My problem was that I'd created an EC2 instance and had left it running idle for a significant amount of time before creating the alarm and putting load on the CPU. By significant, I mean days.
This is embarrassing because unused instances cost money. For me, it wasn't much. However, if this was a production environment with tens or hundreds of instances, pennies add up to dollars quickly.
Even though I wasn't looking at the CloudWatch data, it was being collected and the algorithm was using CPU metric data hovering around 1% utilization to create the model.
Also, while researching, I found a blog post with this nugget of information.
For the best result, at least three days of data is recommended.
The lesson I learned was that the model learns from normal data and will only be as good as the data provided.
The fix is that, when creating a model, you can exclude abnormal time ranges from the training data as well as known events--such as performance testing--in advance.
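One way to express such an exclusion with the AWS SDK for Python (boto3) is the `ExcludedTimeRanges` setting of `put_anomaly_detector`. The sketch below is an assumption about how it might be wired up for an EC2 CPU metric. The instance ID and the dates are placeholders, and the helper only builds the request so it can be checked without calling AWS.

```python
from datetime import datetime, timezone


def build_detector_params(instance_id, excluded_ranges):
    """Build put_anomaly_detector arguments that skip given time ranges.

    excluded_ranges is a list of (start, end) datetime pairs.
    """
    return {
        "SingleMetricAnomalyDetector": {
            "Namespace": "AWS/EC2",
            "MetricName": "CPUUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Stat": "Average",
        },
        "Configuration": {
            # Data in these ranges is not used to train the model.
            "ExcludedTimeRanges": [
                {"StartTime": start, "EndTime": end}
                for start, end in excluded_ranges
            ],
        },
    }


def apply_detector(instance_id, excluded_ranges):
    """Send the request. Requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder works without it

    cloudwatch = boto3.client("cloudwatch")
    return cloudwatch.put_anomaly_detector(
        **build_detector_params(instance_id, excluded_ranges)
    )


# Hypothetical idle period to leave out of training.
idle_start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
idle_end = datetime(2024, 1, 4, 0, 0, tzinfo=timezone.utc)
detector_params = build_detector_params(
    "i-0123456789abcdef0", [(idle_start, idle_end)]
)
print(detector_params["Configuration"]["ExcludedTimeRanges"])
```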
As part of my testing, I created a second, fresh EC2 instance. Instead of having to remove training data, I wanted to see what would happen when I started with no metric data.
After running for a couple of hours, I looked at the CloudWatch dashboard I created.
Here, you can see that it's trying to match the load curve that I created. I have a repeating pattern of 50% load for 4 minutes then a 25% load for 2 minutes. Seeing this made me happy. It's always good to see things work as they should.
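The program used in the lecture is not shown, so the following is not that program. It is a minimal sketch of how a repeating load pattern like this one might be generated, assuming a simple busy-wait duty cycle on a single core. The names `CYCLE`, `target_load`, `burn`, and `run_forever` are invented for this example.

```python
import time

# (target CPU fraction, duration in seconds): 50% for 4 minutes, 25% for 2.
CYCLE = [(0.50, 240), (0.25, 120)]


def target_load(elapsed_seconds):
    """Return the target CPU fraction at a given time into the run."""
    position = elapsed_seconds % sum(duration for _, duration in CYCLE)
    for fraction, duration in CYCLE:
        if position < duration:
            return fraction
        position -= duration
    return CYCLE[-1][0]  # not reached; kept as a safe fallback


def burn(fraction, seconds, slice_seconds=0.1):
    """Busy-loop for `fraction` of each time slice, then sleep the rest."""
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        busy_until = time.monotonic() + fraction * slice_seconds
        while time.monotonic() < busy_until:
            pass  # spin to consume CPU
        time.sleep((1 - fraction) * slice_seconds)


def run_forever():
    """Follow the load pattern until interrupted (for example, Ctrl+C)."""
    start = time.monotonic()
    while True:
        burn(target_load(time.monotonic() - start), 1)


print([target_load(t) for t in (0, 239, 240, 359, 360)])
```

Calling `run_forever()` starts the load. It approximates the target on one core only, so the utilization reported for a multi-core instance would be lower.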
However, that happiness was short-lived. For the next hour, I got regular notifications that my instance was in an ALARM state. I don't want to sound too sarcastic but, oh joy.
After the hour had passed, it started to self-correct and my happiness started to return. The problem I'm experiencing is that this instance is only a couple of hours old. It's working with a small amount of data and making some assumptions.
Here's what it looks like over a couple of hours. After another hour went by, the curve started to fit again. After about three hours, the model looked like this. The model's curves are starting to tighten. Still, those red spots on the bottom and top of the curves on the right-hand side triggered an alarm notification.
Here's the metric with the model a couple of hours later. I'll make it a little larger. The model is starting to recognize the pattern better. Zooming into the last hour, I can see that, overall, it's starting to fit.
I'm still getting alarms but there are fewer of them. The more data points that are collected, the better the model, the fewer alarms, and the more accurate the graph. About 24 hours later, the graph looked like this.
A closer look at the past three hours reveals that the model is, indeed, starting to predict the performance of the EC2 instance with more and more accuracy.
Most of the false positives are still on the bottom of the graph but, every now and then, there's still one at the top that triggers an alarm. While I was building this second instance for testing, I let the Python program that generated the load run continuously. I wanted to see the model improve over time.
After the second day, this is what I saw. Very nice. It's getting closer and closer to what I expected. Then, as luck would have it, disaster struck. At some point in the night, something happened to my instance.
Before I continue… Full disclosure, I am many things. A professional programmer is NOT one of them. I sometimes wish I'd gone down that career path because I do enjoy coding. However, I am smart enough to know and humble enough to admit that my code would NEVER keep an airplane in the air or a patient alive in the hospital.
Looking at my instance, my CPU utilization had gone back to ZERO. I'm seriously not happy. Looking a little deeper, I found this in my CloudWatch Dashboard.
Logging into my EC2 instance, I see that the process was now defunct. In Linux, a defunct process is one that has finished running--or has been killed--but still has an entry in the process table because its parent process has not yet collected its exit status. In my case, the process had been corrupted and was now a zombie.
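On Linux, `ps` reports a zombie with a state code that begins with `Z` and usually appends `<defunct>` to the command name. As a small sketch of how that check might be automated, the parser below scans the output of `ps -eo pid,ppid,stat,comm`. The function names and the sample output are made up for illustration.

```python
import subprocess


def find_zombies(ps_output):
    """Return (pid, ppid, command) for each zombie in `ps` output.

    Expects the columns produced by: ps -eo pid,ppid,stat,comm
    """
    zombies = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[2].startswith("Z"):
            zombies.append((int(parts[0]), int(parts[1]), parts[3]))
    return zombies


def list_zombies():
    """Run ps on the local machine and return its zombie processes."""
    output = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_zombies(output)


# Hypothetical sample output, not taken from the lecture.
sample = """  PID  PPID STAT COMMAND
    1     0 Ss   systemd
 2041  1987 Z    python3 <defunct>
 2050  1987 S    bash
"""
print(find_zombies(sample))
```

The parent process ID is included because a zombie only disappears once its parent collects the exit status or the parent itself exits.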
It was in an ALARM state for a while but eventually went back into an OK state. The lesson learned is that Anomaly Detection is great for creating an alarm system that is dynamic and flexible.
However, by itself, its weakness, to me, is that if a system breaks--like mine did--it will accept the change as being a new normal. I'm not opposed to having/creating a new normal. However, if that new normal includes a zombie process, it's probably not the normal I want.
In a production environment, I will need to have other alarms to ensure that the system is running as expected. I'll call that a lesson learned.
Overall, using Anomaly Detection is much easier than having to guess on my own and figure out what thresholds should be. It is not a silver bullet or a magic wand that will make all of your monitoring problems go away. Instead, it is another tool to use that will take some of the friction out of your life while monitoring.
That's it for this lecture. To summarize, Anomaly Detection can learn and model the expected behavior of a metric based on prior data that continues to adapt over time. It will generate an Anomaly Detection confidence band based on the normal ranges generated by the model. Metric values that fall outside the band are considered anomalies.
Alarms can be created based on this normal pattern and triggered when they are “Outside the band”, “Greater than the band” or “Lower than the band.” Amazon CloudWatch Anomaly Detection can be configured using the AWS Console and it also has AWS API support. This means it can be configured using the AWS CLI, the AWS SDKs, and AWS CloudFormation.
In another lecture, I'm going to walk through how to create a CloudWatch Alarm with Anomaly Detection and show its related visualizations. I'm Stephen Cole for Cloud Academy, thanks for watching!
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that there are no shortcuts to certification but it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.