Observability of a System


This course covers the core learning objective to meet the requirements of the 'Architecting for Management & Governance in AWS - Level 2' skill

Learning Objectives:

  • Understand the different AWS management services available to monitor the performance of a solution
  • Apply Amazon CloudWatch monitoring controls to respond to system-wide performance changes
  • Apply AWS Config controls to manage compliance based upon business guidelines

Let’s walk through an architecture, and get an idea of how some of the observability services in AWS can work together to help you solve a problem. Let’s take an example application: say a website that sells cat photos. This e-commerce site runs on an Amazon Elastic Container Service (ECS) cluster on Amazon EC2 instances. It also has a series of Lambda functions that read from a DynamoDB table and update the status of the top buyers of cat photos on your site. So you have the e-commerce side and you also have the leaderboard side of your application.

Let’s say you want to begin instrumenting this app with the basics of metrics, traces, and logs. 

For metrics and logs, you use CloudWatch and install the CloudWatch agent on your EC2 instances. CloudWatch provides out-of-the-box metric functionality for AWS services like EC2, ECS, Lambda, RDS, and DynamoDB. While the AWS-provided metrics are helpful, most customers need more visibility into their system, so you can also choose to create CloudWatch custom metrics to get the additional information you need from your app. For example, perhaps you’d want to collect data on page views for your cat photo site. That won’t come out of the box from CloudWatch, so you’d need to create a custom metric. 
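As a rough illustration of the page-view example, here is a minimal sketch of the request a custom metric might use. The namespace `CatPhotos/Storefront`, the metric name `PageViews`, the `Page` dimension, and both helper functions are hypothetical names chosen for this example. The keyword arguments follow the shape of the CloudWatch `PutMetricData` call, and the client is passed in so the sketch runs without AWS credentials.

```python
from datetime import datetime, timezone


def build_page_view_metric(page: str, views: int) -> dict:
    """Build keyword arguments shaped like a CloudWatch PutMetricData request."""
    return {
        "Namespace": "CatPhotos/Storefront",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": "PageViews",  # hypothetical metric name
                "Dimensions": [{"Name": "Page", "Value": page}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(views),
                "Unit": "Count",
            }
        ],
    }


def publish_page_views(cloudwatch_client, page: str, views: int) -> dict:
    """Send the metric through any object that exposes put_metric_data."""
    params = build_page_view_metric(page, views)
    cloudwatch_client.put_metric_data(**params)
    return params
```

With boto3 installed and credentials configured, the call might look like `publish_page_views(boto3.client("cloudwatch"), "/photos/tabby", 3)`.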

Custom metrics used to be especially important for container workloads. In the past, if you wanted to get service-level or task-level metrics for ECS, you’d have to create custom metrics.

Now, however, you can use a CloudWatch feature called Container Insights to get this information. It enables you to more easily monitor Amazon ECS and Amazon EKS workloads. You can think of it as your “one-stop shop” for all data regarding your ECS or EKS clusters, with alarms, metrics, and logs for your containers. 

With Container Insights, you get additional metrics such as task count, service count, deployment count, and container instance count, so you no longer need to create custom metrics for these data points. Once you set up Container Insights for Amazon ECS or EKS, you can work with these metrics like any other, creating dashboards or setting up alarms based on them.
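One way to turn Container Insights on for an existing ECS cluster is through the cluster's `containerInsights` setting. The sketch below builds the arguments for the ECS `UpdateClusterSettings` call. The helper names and the cluster name `cat-photos-cluster` are hypothetical, and the client is injected so the example can run offline.

```python
def container_insights_settings(cluster_name: str, enabled: bool = True) -> dict:
    """Build keyword arguments shaped like an ECS UpdateClusterSettings request."""
    return {
        "cluster": cluster_name,
        "settings": [
            {
                "name": "containerInsights",
                "value": "enabled" if enabled else "disabled",
            }
        ],
    }


def enable_container_insights(ecs_client, cluster_name: str):
    """Apply the setting through any object that exposes update_cluster_settings."""
    return ecs_client.update_cluster_settings(
        **container_insights_settings(cluster_name)
    )
```

With boto3, this might be called as `enable_container_insights(boto3.client("ecs"), "cat-photos-cluster")`.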

So, with Container Insights and CloudWatch metrics, you can gain insight into the e-commerce side of your application. For additional metrics on the leaderboard side of your app, you can use CloudWatch Lambda Insights. With Lambda Insights, you can track Lambda-specific metrics like cold starts and Lambda worker shutdowns. 

To enable it, you modify your Lambda function’s monitoring details and turn on Lambda Insights with the push of a button. Behind the scenes, AWS adds a Lambda layer to your function and adds policies to your execution role so it can collect the data it needs. 

Once Container Insights and Lambda Insights are enabled, they begin generating log events using the embedded metric format, which enables metric data to be captured in logs. These log events are called performance logs. CloudWatch then extracts the metric data from the performance logs. 
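To make the embedded metric format more concrete, here is a minimal sketch of a log line in that format: a JSON object whose `_aws` block tells CloudWatch which top-level keys are metrics and which are dimensions. The function `emf_log_line` and the namespace, metric, and dimension values are hypothetical. The real performance logs emitted by Container Insights and Lambda Insights carry many more fields.

```python
import json
import time


def emf_log_line(namespace, metric_name, value, unit="Count",
                 dimensions=None, timestamp_ms=None):
    """Return one JSON log line in the CloudWatch embedded metric format."""
    dimensions = dimensions or {}
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # milliseconds since epoch
    event = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set, listing the keys that act as dimensions.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value lives at the top level, keyed by the metric name.
        metric_name: value,
    }
    event.update(dimensions)  # dimension values also live at the top level
    return json.dumps(event)


# Hypothetical example values for the cat photo leaderboard.
line = emf_log_line("CatPhotos/Leaderboard", "TopBuyerUpdates", 1,
                    dimensions={"FunctionName": "update-leaderboard"})
print(line)
```

Written to standard output from a Lambda function, a line like this ends up in CloudWatch Logs, where the metric can be extracted without a separate API call.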

These performance logs and any logs you collect in CloudWatch can be queried and analyzed using CloudWatch Logs Insights. 

With Logs Insights, you can run queries to search through and analyze your log data. It uses its own purpose-built query language with commands to display, filter, sort, and limit your log data, making it easier to find trends and correlate data. 
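As a small illustration of that query language, the sketch below assembles a query from the `fields`, `filter`, `sort`, and `limit` commands, joined with pipes. The helper `build_logs_insights_query` is a hypothetical convenience for this example and not part of any AWS SDK. The resulting string is what you would paste into the Logs Insights console or pass as the query string when starting a query through the API.

```python
def build_logs_insights_query(fields, filter_expr=None,
                              sort_field="@timestamp", descending=True,
                              limit=20):
    """Compose a CloudWatch Logs Insights query from its basic commands."""
    parts = ["fields " + ", ".join(fields)]
    if filter_expr:
        parts.append("filter " + filter_expr)
    direction = "desc" if descending else "asc"
    parts.append(f"sort {sort_field} {direction}")
    parts.append(f"limit {limit}")
    # Commands are chained with the pipe character.
    return " | ".join(parts)


# Find the 20 most recent log events whose message contains ERROR.
query = build_logs_insights_query(
    ["@timestamp", "@message"],
    filter_expr="@message like /ERROR/",
)
print(query)
# fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20
```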

Next, you can begin instrumenting your application with traces. To do this, you can install the X-Ray daemon on your EC2 instances, or create a Docker image that runs the X-Ray daemon on your ECS cluster. For Lambda, no installation is needed; you just enable X-Ray with the push of a button in the Lambda function monitoring details. 

Once X-Ray is enabled, you can then instrument your application with X-Ray using the AWS SDKs or the AWS Distro for OpenTelemetry. Once you’ve instrumented your application, you can then view a service map of your infrastructure, see latency information, request metadata, and more to improve the performance of your application. 
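To give a feel for what instrumentation produces, here is a minimal sketch of an X-Ray segment document and the payload the X-Ray daemon accepts over UDP (by default on `127.0.0.1:2000`). In practice the AWS SDKs or the AWS Distro for OpenTelemetry build and send these for you; the helper functions and the service name `cat-photos-web` below are hypothetical, and the sketch only constructs the payload without sending it.

```python
import json
import secrets
import time


def new_trace_id(now=None):
    """Trace ID format: version 1, 8 hex digits of epoch time, 24 random hex digits."""
    epoch = int(now if now is not None else time.time())
    return f"1-{epoch:08x}-{secrets.token_hex(12)}"


def build_segment(name, start, end, trace_id=None):
    """Build a minimal segment document with its required fields."""
    return {
        "name": name,
        "id": secrets.token_hex(8),  # 16 hex digit segment ID
        "trace_id": trace_id or new_trace_id(start),
        "start_time": start,  # epoch seconds, may be fractional
        "end_time": end,
    }


def daemon_payload(segment):
    """Prefix the segment with the header line the daemon expects."""
    header = json.dumps({"format": "json", "version": 1})
    return (header + "\n" + json.dumps(segment)).encode("utf-8")


started = time.time()
segment = build_segment("cat-photos-web", started, started + 0.25)
payload = daemon_payload(segment)
```

Sending `payload` with a UDP socket to a running daemon would make the segment appear in the trace list; the service map is built from many such segments.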

X-Ray integrates deeply with CloudWatch ServiceLens, CloudWatch Synthetics, and CloudWatch RUM. ServiceLens provides an end-to-end view of your application, and is often the first place people look to troubleshoot their app. In ServiceLens, you can see bottlenecks, and identify which users of your services are impacted, as well as look at metrics and log data. 

With Synthetics, you can use canaries that perform the same actions as your users, so you can monitor for issues like dead links, failed transactions, and high latency before they affect the customer experience. And with CloudWatch RUM, you can look at client-side data for your application to get better insight into actual user sessions. 

So a typical journey to debug a problem might look like this: 

  • A CloudWatch alarm is raised

  • You receive the alarm notification and start to investigate the problem by looking at the metric associated with the alarm.

  • From there, you can get an end-to-end view of your app in ServiceLens. With this, you can see if there’s high latency or a bottleneck.

  • After that, you can view X-Ray trace data to see how customer requests are flowing through your application.

  • Once you pinpoint which service you think may be the issue, you can look at metrics and logs to verify the impact and correlate data on what might have been the root cause. 

  • And then, you can query logs using CloudWatch Logs Insights and query metrics using CloudWatch Metrics Insights to search for patterns, and answer questions like “why did my metric spike?” and more.  
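The first step of that journey, the alarm itself, can be sketched as follows. This builds arguments shaped like a CloudWatch `PutMetricAlarm` request for the built-in ECS `CPUUtilization` metric. The alarm name, cluster and service names, the 80 percent threshold, and the SNS topic ARN are all hypothetical values chosen for illustration, and the client is injected so the sketch runs offline.

```python
def build_cpu_alarm(cluster: str, service: str, topic_arn: str,
                    threshold: float = 80.0) -> dict:
    """Build keyword arguments shaped like a CloudWatch PutMetricAlarm request."""
    return {
        "AlarmName": f"{service}-high-cpu",  # hypothetical naming scheme
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,              # seconds per evaluation period
        "EvaluationPeriods": 3,    # alarm after three breaching periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # for example, an SNS topic that notifies you
    }


def create_cpu_alarm(cloudwatch_client, cluster, service, topic_arn):
    """Create the alarm through any object that exposes put_metric_alarm."""
    params = build_cpu_alarm(cluster, service, topic_arn)
    cloudwatch_client.put_metric_alarm(**params)
    return params
```

With boto3, the client would be `boto3.client("cloudwatch")`; the notification from the alarm action is what starts the investigation described above.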

That’s all for this one! See you next time.


About the Author

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.