Observability in AWS
This course covers the core learning objective to meet the requirements of the 'Architecting for Management & Governance in AWS - Level 2' skill
- Understand the different AWS management services available to monitor the performance of a solution
- Apply Amazon CloudWatch monitoring controls to respond to system-wide performance changes
- Apply AWS Config controls to manage compliance based upon business guidelines
Hello! I'm Stephen Cole, a trainer and certification specialist with Cloud Academy, and I'm here to present an overview of Amazon CloudWatch and CloudWatch Alarms.
Before I get to those things, I want to get a little philosophical. One of the life lessons that years of schooling never prepared me for is this: Naming things is hard. This is true, I think, for all things: Children, pets, variables in a computer program, and AWS Services & Features.
I'm sure that AWS tries to name things accurately. However, sometimes, while accurate, the names can cause some confusion. The documentation and training materials for Amazon CloudWatch that come from AWS often seem to reference the features of CloudWatch as independent services.
You might hear about things like CloudWatch Alarms, CloudWatch Logs, and CloudWatch Insights. To be clear, Amazon CloudWatch is the name of the service. The features of CloudWatch include Alarms, Logs, Metrics, Events, ServiceLens, Container Insights, Lambda Insights, and Anomaly Detection.
Looking at the AWS Console, there is no option that can be selected for CloudWatch Anomaly Detection. While Anomaly Detection is a feature of Amazon CloudWatch, it is more accurately described as a feature of CloudWatch Alarms. And, as I mentioned a moment ago, CloudWatch Alarms is really a feature of Amazon CloudWatch.
Again, naming things is hard but, as always, I'll do my best to minimize confusion. Taking this a step further, Amazon CloudWatch Anomaly Detection accurately describes the feature but it does not come close to explaining how useful it is. Personally, I think it would help if the name of the feature described why someone would want to use it. That is, what's the pain point that would cause someone to look for it?
A better name--at least in my opinion--would have included the concept of Alarm automation and maintenance. With this in mind, I think it's best to start with a quick review of Amazon CloudWatch and CloudWatch Alarms to see why alarm automation is needed.
Amazon CloudWatch is a monitoring and observability service from AWS. It can provide a unified view of the performance and health of AWS resources as well as applications & services running both inside the AWS cloud and on on-premises systems.
CloudWatch was originally launched in 2009 and, over the years, has been expanded to the point where it can be used to monitor infrastructure, systems, applications, and business metrics.
Amazon CloudWatch is also an observability service. Observability is a term from control theory and is a measure of how well internal states of a system can be inferred by knowledge of its external outputs.
Lots of fancy textbook-sounding words there. Simply put, observability is answering questions about the inside of a system by looking at the outside of it. Using CloudWatch as an observability service removes the need to insert code into a system to debug, predict, or answer new questions.
CloudWatch Alarms can be set to notify when there are operational issues, CloudWatch Dashboards provide visualizations, CloudWatch Events trigger automations to remediate issues, and CloudWatch Logs Insights can query log data at scale.
The automation of metric collection and reporting--such as this--minimizes the risk of human error. When combined with automation and notifications, it also maximizes the uptime of a system or an environment.
Amazon CloudWatch is a powerful tool that helps troubleshoot issues, discovers insights for application optimization, and can ensure an environment is running smoothly.
That's the good news.
The not-so-good news is that, sometimes, working with Amazon CloudWatch is almost an art form. My experience with creating CloudWatch Alarms is that they can be tricky to create and maintain. No two systems have the same needs.
Alarms with their corresponding thresholds cannot generally be reused between different applications.
When I was a young child, I'd ride with my father to various places in the car. He'd sometimes say, "Did you hear that? What was that sound?" All I heard was road noise. As an adult with my own automobile, I've learned to recognize the sounds my car makes. When I'm a passenger in someone else's car--even if it is the same make and model--all I hear is road noise.
I've learned to recognize the sounds of my environment and how my car feels when I drive it. I know, almost instinctively, when something's wrong.
Similarly, in the cloud, each application, system, or environment has its own unique feel, sound, and shape. Monitoring needs to be adapted to suit unique needs.
That's why I feel, in addition to the science involved, there's an art to it. Alarms for one application are not suitable for all; even across an organization. Metrics have to be selected, thresholds chosen, and time periods for monitoring determined.
Also, the term monitoring is a general one and its meaning depends on its context. A group of software components used for data collection, its processing, and presentation is referred to as a monitoring system.
Amazon CloudWatch is used to create an awareness of the state of a system or environment.
State awareness is a process that is both proactive and reactive.
Proactive monitoring is a type of surveillance. It involves watching visual indicators such as time-series data and dashboards. The techniques used in the monitoring systems like CloudWatch include real-time processing, statistics, and data analysis.
Reactive monitoring uses automation to trigger notifications when there has been a change in a system’s state that has deviated significantly from a baseline. This is often called alerting.
Whether migrating existing systems to AWS or building solutions directly in the cloud, a monitoring and observability system promotes efficient and cost-effective operation.
There are two types of metrics that are pushed to Amazon CloudWatch: ones that are Period-Driven and others that are Event-Driven.
Period-driven metrics are ones that occur with a regular frequency and interval.
Some services publish periodic data points as metrics to CloudWatch, but there are times a service might have periods without data points.
Consider Amazon EC2. The default period for monitoring an EC2 instance is five minutes.
This means that the CPUUtilization metric of an EC2 instance pushes a data point every five minutes to Amazon CloudWatch. When the instance is stopped, EC2 does not push any data points to CloudWatch.
The point that I'm trying to get to is that metric data is pushed to CloudWatch every period and this concept is important to understand because it's fundamental to how CloudWatch Alarms works. I'll get to Alarms, shortly.
Now, if a CloudWatch metric is not Period-Driven, it is Event-Driven.
Consider the metric that counts the number of 500-level HTTP error codes that originate from an Application Load Balancer.
The ALB sends data points when there's an error or an event. If there are no errors during a period, the result is an empty dataset.
If an alarm is monitoring a metric that has no data points during a given time by design, the state of the alarm is INSUFFICIENT_DATA during those periods. This is different from sending zero. A zero would affect any math used on the metrics over time.
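To make the difference between "no data" and "zero" concrete, here is a small sketch in plain Python. The error counts are made up for illustration, and `None` is just my stand-in for a period where the load balancer sent nothing. It is not how CloudWatch stores data; it only shows why the math changes.

```python
from statistics import mean

# Hypothetical 5xx error counts for four periods.
# None marks a period in which no data point was sent at all.
error_counts = [4, None, None, 8]

# Average over the data points that were actually reported.
reported = [count for count in error_counts if count is not None]
avg_reported = mean(reported)

# Average if every empty period had been reported as an explicit zero.
zero_filled = [0 if count is None else count for count in error_counts]
avg_zero_filled = mean(zero_filled)

print(avg_reported, avg_zero_filled)
```

The same four periods give two different averages, which is why an empty period and a reported zero are not interchangeable.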
Sometimes, not every expected data point for a metric gets reported to CloudWatch. This can happen when a connection is lost, a server goes down, or when a metric reports data intermittently by design.
If an alarm's thresholds are narrow, lost data could trigger an alarm. Likewise, if an alarm's thresholds are wide, lost or missing data might not trigger an alarm when it should.
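CloudWatch lets you specify how missing data points should be treated when evaluating an alarm's conditions: as breaching the threshold, as not breaching it, as ignored, or as missing. Treating missing data as breaching pushes an alarm toward the ALARM state, and treating it as not breaching pushes it toward OK.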
This allows you to configure an alarm so that it goes to an ALARM state only when appropriate for the type of data being monitored.
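Here is a deliberately simplified sketch of that idea in plain Python. The option names `missing`, `ignore`, `breaching`, and `notBreaching` are the values CloudWatch accepts for an alarm's missing-data setting, but the `evaluate` function itself is hypothetical and much simpler than the real evaluation logic. It only looks at one window of data points and assumes the alarm fires when every point in the window is above the threshold.

```python
VALID_TREATMENTS = ("missing", "ignore", "breaching", "notBreaching")


def evaluate(datapoints, threshold, treat_missing="missing", current_state="OK"):
    """Return the alarm state for one evaluation window (simplified model).

    datapoints: list of numbers, with None for a period that has no data.
    """
    if treat_missing not in VALID_TREATMENTS:
        raise ValueError(f"unknown treatment: {treat_missing}")

    flags = []  # True means "this period counts as breaching"
    for value in datapoints:
        if value is None:
            if treat_missing == "breaching":
                flags.append(True)
            elif treat_missing == "notBreaching":
                flags.append(False)
            # "missing" and "ignore" leave the gap out of the evaluation
        else:
            flags.append(value > threshold)

    if not flags:
        # Nothing to evaluate: "ignore" keeps the current state,
        # "missing" reports that there is not enough data.
        return current_state if treat_missing == "ignore" else "INSUFFICIENT_DATA"
    return "ALARM" if all(flags) else "OK"


empty_window = [None, None, None]
for option in VALID_TREATMENTS:
    print(option, evaluate(empty_window, threshold=80, treat_missing=option))
```

The point of the sketch is that the same empty window can end in any of the three states, depending only on the setting you choose.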
Inside Amazon CloudWatch there are two types of alarms and alarms can be in one of three states.
The two types of CloudWatch alarms are Metric Alarms and Composite Alarms.
The three states are OK, Alarm, and Insufficient Data.
A Metric Alarm watches a single metric or the result of an expression or calculation.
I think of a Metric Alarm as being elemental like Hydrogen or Oxygen. It's an alarm that, at its core, watches a single fundamental property and cannot be reduced or simplified.
A Composite Alarm combines the states of other alarms through a rule expression. It goes into an ALARM state when that rule is satisfied, for example when ALL of the alarms it references are in ALARM, and it provides an overall state for a grouping of resources like an application, an AWS Region, or an Availability Zone.
To me, this is like a molecule. Instead of Hydrogen or Oxygen, I now have a combination of the two and it's water, H2O.
Composite alarms are useful because they can reduce alarm noise and allow you to focus on operational issues. When a software deployment breaks a specific application, you might end up spending time managing alarms rather than troubleshooting and fixing the problem.
You can create multiple metric alarms that have no actions of their own and, at the same time, create a composite alarm that uses them. Then, set an alert on the composite alarm that triggers when all of the underlying alarms are in an ALARM state.
If an application issue affects several resources, you will receive a single alarm notification instead of one for each affected service-component or resource.
For example, correctly identifying application issues sometimes requires that you have multiple alarms in place.
Say you have a batch job running at night. It is normal and expected to see 100% CPU utilization since all spare capacity during off hours is being used.
However, if at the same time IO utilization is at 80% it might indicate that there's a problem with the application.
A composite alarm allows you to alarm only when both conditions occur.
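The batch job example can be sketched in plain Python. The alarm names `batch-cpu-high` and `batch-io-high` are hypothetical, and `composite_state` is my own simplified stand-in for what CloudWatch does when it evaluates a rule. The rule string follows the `ALARM("name") AND ALARM("name")` form that composite alarm rules use.

```python
def build_rule(alarm_names, operator="AND"):
    """Build a composite alarm rule string from child alarm names."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_state(child_states, operator="AND"):
    """Simplified evaluation: combine child alarm states into one state."""
    in_alarm = [state == "ALARM" for state in child_states.values()]
    combined = all(in_alarm) if operator == "AND" else any(in_alarm)
    return "ALARM" if combined else "OK"


rule = build_rule(["batch-cpu-high", "batch-io-high"])
print(rule)

# Nightly batch job: CPU is pegged, but IO looks normal, so stay quiet.
print(composite_state({"batch-cpu-high": "ALARM", "batch-io-high": "OK"}))

# Both conditions at once: this is the case worth a notification.
print(composite_state({"batch-cpu-high": "ALARM", "batch-io-high": "ALARM"}))

# With boto3 installed and credentials configured, a rule like this would be
# passed as the AlarmRule argument, roughly:
#   boto3.client("cloudwatch").put_composite_alarm(
#       AlarmName="batch-job-unhealthy", AlarmRule=rule)
```

I used AND here because it matches the example. Swapping in OR would alarm when either child alarm fires, which is the opposite trade-off.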
When using CloudWatch Alarms, choose a metric to monitor, and when that metric reaches a value--high or low--you can trigger a notification.
It's not enough to reach a value. To trigger an alarm, the metric has to cross a predefined threshold--high or low--and stay there for a period of time.
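As a rough sketch, these are the pieces you choose when defining a metric alarm: the metric, the threshold, the direction, and how long it has to stay there. The keys in this dictionary are the parameter names that the boto3 `put_metric_alarm` call accepts. The function name, the alarm name, and the default values (80 percent, three periods of five minutes) are assumptions I picked for illustration.

```python
def cpu_alarm_params(instance_id, threshold=80.0, period_seconds=300, periods=3):
    """Return keyword arguments for a CPU alarm on one EC2 instance."""
    return {
        "AlarmName": f"cpu-high-{instance_id}",    # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_seconds,                  # length of one period
        "EvaluationPeriods": periods,              # how many periods to look at
        "DatapointsToAlarm": periods,              # how many must be breaching
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
    }


params = cpu_alarm_params("i-0123456789abcdef0")  # placeholder instance ID
print(params["Period"] * params["EvaluationPeriods"], "seconds above threshold")

# With boto3 installed and credentials configured, the alarm would be created
# roughly like this:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

With these defaults, the metric has to sit above the threshold for 900 seconds, or fifteen minutes, before the alarm fires.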
What state are you in?
No, this is not about being in the United States. It's about a state of existence.
If you've ever taken a physics class, you're aware that matter exists in one of four states: Solid, liquid, gas, and plasma.
Here's your unexpected physics lesson of the day. The state of matter changes as energy is added to it.
I use this analogy because adding or removing energy from matter takes time. Time is a key component to CloudWatch Alarms.
CloudWatch Alarms exist in one of three states: OK, Alarm, and Insufficient Data.
Instead of adding energy to an Alarm to change its state, you add information. This information is, as you can probably guess, data points. Just as matter must be heated or cooled for a while before it changes state, data points must stay above or below a threshold for a chosen amount of time before the alarm actually changes state.
This chosen amount of time is one or more periods.
By having to go above or below a threshold for an amount of time, this behavior minimizes false alarms. A quick spike or dip is ignored and prevents outliers from needlessly triggering notifications or automations.
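Here is a minimal simulation of that behavior in plain Python. The function `alarm_states` and the CPU values are hypothetical, and the model is simplified: the alarm is in ALARM only while the most recent periods have all been above the threshold.

```python
def alarm_states(values, threshold, periods_to_alarm):
    """Return the alarm state after each period (simplified model)."""
    states = []
    streak = 0  # consecutive periods above the threshold so far
    for value in values:
        streak = streak + 1 if value > threshold else 0
        states.append("ALARM" if streak >= periods_to_alarm else "OK")
    return states


# One data point per period. The single spike to 95 is ignored;
# only the sustained run of 90, 92, 97 triggers the alarm.
cpu = [50, 95, 60, 90, 92, 97, 55]
for value, state in zip(cpu, alarm_states(cpu, threshold=80, periods_to_alarm=3)):
    print(value, state)
```

Setting `periods_to_alarm` to 1 would make the lone spike fire the alarm, which is exactly the kind of false alarm the waiting period is meant to avoid.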
Still, finding the right thresholds and knowing the amount of time that has to pass before an alarm is triggered requires some effort and fine tuning. The numbers are a science, but finding the thresholds feels like an art.
It's also a balancing act.
It's important to catch signs of trouble early, but false alarms can trigger automations that could scale out a compute environment and waste money or--probably worse--make an environment scale-in and impact the end-user experience in a substantially negative way.
This is something that I've long called an RBE, a Resume Building Event. Yes, it's important to learn from failure. However, those lessons belong in test and development environments and far away from production.
A fixed threshold--also known as a static threshold--may be good enough for a given application or environment. However, applications that grow organically or seasonally are difficult to manage using static thresholds.
If the thresholds are too broad, they might not detect unusual behavior.
To compensate, tighter thresholds will catch behavior out of the ordinary but they will also create more false positives over time.
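A tiny sketch in plain Python shows the trade-off. The request counts are invented: traffic grows steadily, and the only genuinely unusual value is the jump to 180. The function name and both thresholds are assumptions chosen to make the point.

```python
def breaching_periods(values, threshold):
    """Return the positions of the values that exceed a static threshold."""
    return [index for index, value in enumerate(values) if value > threshold]


# Hypothetical requests per period for an application that grows over time.
# Position 3 (the value 180) is the one real anomaly.
requests = [100, 105, 110, 180, 120, 125, 130, 135]

tight = breaching_periods(requests, threshold=115)
wide = breaching_periods(requests, threshold=200)

print("tight threshold flags:", tight)
print("wide threshold flags:", wide)
```

The tight threshold catches the anomaly but then keeps flagging ordinary growth, while the wide threshold stays quiet and misses the anomaly entirely.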
False positives equate to needless work. For years, I've tried to explain to my dad that I'm not lazy, I'm efficient. They look similar but are, in fact, quite different things.
My goal, then, is to figure out what to monitor, decide what thresholds are important--including for how long--and avoid having to research outliers that are essentially accidents.
Amazon CloudWatch is a monitoring and observability service. Monitoring involves collecting metric data. Observability is about inferring the state of a system based on all available data.
Monitoring and observability are required in 21st Century applications and environments.
That said, even the monitoring systems need to be monitored. Historically, this is the role of a systems operator. It's important to regularly adjust and recalibrate alarm thresholds to deal with both cyclic and seasonal behavior.
Often, outliers are random events that have no real meaning. However, it's possible that seemingly random occurrences are indicative of larger issues that are either overlooked or discovered only after some type of catastrophic event.
Now, please excuse my Latin pronunciation.
"Quis custodiet ipsos custodes?" is a Latin phrase found in the work of a Roman poet and is literally translated as "Who will guard the guards themselves?"
In modern English, this translates to, "Who watches the watchers?"
That's my question for you: who's watching your Amazon CloudWatch Alarms and their configurations?
Apparently, AWS had the same question asked of them. I don't know if they asked their customer base or if customers reached out to AWS with their issues about monitoring and updating Amazon CloudWatch Alarms. Either way, the why of the question doesn't matter. The issue is, creating Alarms and maintaining them is challenging--and sometimes frustrating--work.
In response to this issue, AWS came up with a solution called Amazon CloudWatch Anomaly Detection. I'm going to present Anomaly Detection in another lecture. While it sounds as if it is a feature designed to detect or identify outliers, this is only partially true. Anomaly Detection is bigger than this. It actually uses Machine Learning to automate alarms.
It feels, in some ways, like a bit of magic; a type of technological wizardry that has potential to simplify CloudWatch Alarms, and who doesn't need a little magic once in a while?
This brings me to the end of this lecture; I hope you found it informative and useful.
As a quick review: while Amazon CloudWatch can be configured to respond nearly instantly to events, it was designed to watch metrics over time and only trigger an alarm once a metric has crossed a threshold and stayed there over time.
An alarm can be in one of three states: OK, INSUFFICIENT_DATA, or ALARM.
There are two types of alarms: Metric Alarms and Composite Alarms. Metric Alarms watch a single metric or the result of an expression. Composite Alarms combine the states of multiple alarms and, as in the example from this lecture, can trigger only when all of them are in an ALARM state. Composite Alarms are useful because they reduce alarm noise and the number of false positives.
Creating alarm thresholds and their related time periods traditionally takes a fair amount of trial and error. No two environments are alike, even inside a single organization.
Someone needs to monitor CloudWatch to ensure that it accurately catches events and that alarms are real. That is, someone must watch the watcher.
Because this can be a labor--or programmatically--intensive task, AWS has introduced CloudWatch Anomaly Detection, a feature that uses Machine Learning to automate alarm creation and updating.
Before I close, I want to remind you that your comments, questions, and suggestions are of great value to us at Cloud Academy. We use this information to improve our courses in terms of both content and delivery.
To share your thoughts, send an email to firstname.lastname@example.org. I'd love to hear from you.
I'm Stephen Cole with Cloud Academy and thank you for watching!