Observability of a System

In this course, we will explain the concept of observability in AWS.

Learning Objectives

  • What observability is 
  • What AWS services can help you achieve observability
  • The open-source services you can use to gain insight into your systems 

Intended Audience

This course has been created for those looking for an overview of the observability services in AWS, including native services like CloudWatch and X-Ray and the monitoring services built on top of them, as well as open-source tools like Grafana and Prometheus, and standards like OpenTelemetry.


To get the most out of this course, you should have a good understanding of the AWS ecosystem, including knowledge of compute services like ECS, EKS, Lambda, and EC2. Some familiarity with Amazon CloudWatch and AWS X-Ray will also help you follow the practical parts of this course. If you’d like to go deeper on Amazon CloudWatch or AWS X-Ray specifically, separate courses cover each service.


Let’s walk through an architecture and get an idea of how some of the observability services in AWS can work together to help you solve a problem. Let’s take an example application: say, a website that sells cat photos. This e-commerce site runs on an Amazon Elastic Container Service (ECS) cluster backed by Amazon EC2 instances. It also has a series of Lambda functions that read from a DynamoDB table and update the status of the top buyers of cat photos on your site. So you have the e-commerce side and the leaderboard side of your application.

Let’s say you want to begin instrumenting this app with the basics of metrics, traces, and logs. 

For metrics and logs, you use CloudWatch and install the CloudWatch agent on your EC2 instances. CloudWatch provides out-of-the-box metric functionality for AWS services like EC2, ECS, Lambda, RDS, and DynamoDB. While the AWS-provided metrics are helpful, most customers need more visibility into their system, so you can also choose to create CloudWatch custom metrics to get the additional information you need from your app. For example, perhaps you’d want to collect data on page views for your cat photo site. That won’t come out of the box from CloudWatch, so you’d need to create a custom metric. 
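As a rough illustration of the page-view example, here is a minimal Python sketch that builds the parameters for a CloudWatch custom metric. The namespace `CatPhotos/Web`, the metric name `PageViews`, and the `Page` dimension are hypothetical names chosen for this example. The sketch only builds the request; sending it assumes boto3 and AWS credentials, as noted in the final comment.

```python
from datetime import datetime, timezone


def page_view_metric(page: str, views: int) -> dict:
    """Build PutMetricData parameters for a custom page-view metric."""
    return {
        "Namespace": "CatPhotos/Web",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": "PageViews",  # hypothetical metric name
                "Dimensions": [{"Name": "Page", "Value": page}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(views),
                "Unit": "Count",
            }
        ],
    }


params = page_view_metric("/photos/tabby", 3)
print(params["Namespace"], params["MetricData"][0]["MetricName"])

# With boto3 installed and credentials configured, these parameters
# could be sent with:
#   boto3.client("cloudwatch").put_metric_data(**params)
```

Keeping the request as a plain dictionary makes it easy to inspect or unit test before anything is published to CloudWatch.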

Custom metrics used to be especially important for container workloads. In the past, if you wanted to get service-level or task-level metrics for ECS, you’d have to create custom metrics.

Now, however, you can use a CloudWatch feature called Container Insights to get this information. It enables you to more easily monitor Amazon ECS and Amazon EKS workloads. You can think of it as your “one-stop shop” for all data regarding your ECS or EKS clusters, with alarms, metrics, and logs for your containers. 

With Container Insights, you get additional metrics such as task count, service count, deployment count, and container instance count, so you no longer need to create custom metrics for these data points. Once you set up Container Insights for Amazon ECS or EKS, you can view these metrics like any other, creating dashboards or setting up alarms based on them.
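To make this concrete, the sketch below builds two requests: one that turns on Container Insights for an ECS cluster, and one that defines an alarm on a Container Insights metric. The cluster name, service name, alarm name, and threshold are hypothetical. The `ECS/ContainerInsights` namespace and the `RunningTaskCount` metric reflect my understanding of the Container Insights metrics for ECS, so check them against your own account before relying on them.

```python
def enable_container_insights(cluster: str) -> dict:
    """Parameters for the ECS UpdateClusterSettings call."""
    return {
        "cluster": cluster,
        "settings": [{"name": "containerInsights", "value": "enabled"}],
    }


def running_task_alarm(cluster: str, service: str, min_tasks: int) -> dict:
    """Parameters for a CloudWatch alarm on a service's running task count."""
    return {
        "AlarmName": f"{service}-running-tasks-low",  # hypothetical name
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "RunningTaskCount",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,             # seconds per datapoint
        "EvaluationPeriods": 3,   # three consecutive breaching datapoints
        "Threshold": float(min_tasks),
        "ComparisonOperator": "LessThanThreshold",
    }


settings = enable_container_insights("cat-photos")
alarm = running_task_alarm("cat-photos", "storefront", 2)
print(settings["settings"][0], alarm["AlarmName"])

# Assuming boto3 and credentials, these could be applied with:
#   boto3.client("ecs").update_cluster_settings(**settings)
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```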

So, with Container Insights and CloudWatch metrics, you can gain insight into the e-commerce side of your application. For additional metrics on the leaderboard side of your app, you can use CloudWatch Lambda Insights. With Lambda Insights, you can track Lambda-specific metrics like cold starts and Lambda worker shutdowns. 

To turn it on, you edit your Lambda function’s monitoring settings and enable Lambda Insights with the push of a button. Behind the scenes, AWS adds a Lambda layer to your function and attaches additional policies to your execution role so it can collect the data it needs. 

Once Container Insights and Lambda Insights are enabled, they begin generating log events using the embedded metric format, which enables metric data to be captured in logs. These log events are called performance logs. CloudWatch then extracts the metric data from the performance logs. 
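To show what “metric data captured in logs” looks like, here is a minimal sketch that builds one log line in the embedded metric format. It is not the actual schema of the performance logs written by Container Insights or Lambda Insights; the namespace, the `Service` dimension, and the `PageViews` metric are hypothetical values for the cat photo site.

```python
import json
import time


def emf_log_line(namespace: str, service: str, metric: str,
                 value: float, unit: str = "Count") -> str:
    """Return one JSON log line in the embedded metric format."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # Each inner list is one set of dimension keys.
                    "Dimensions": [["Service"]],
                    "Metrics": [{"Name": metric, "Unit": unit}],
                }
            ],
        },
        # Dimension and metric values sit at the top level of the record.
        "Service": service,
        metric: value,
    }
    return json.dumps(record)


line = emf_log_line("CatPhotos/Web", "storefront", "PageViews", 1)
print(line)
```

The metadata under `_aws` tells CloudWatch which top-level fields to treat as metrics and dimensions. The rest of the record remains ordinary, queryable log data.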

These performance logs and any logs you collect in CloudWatch can be queried and analyzed using CloudWatch Logs Insights. 

With Logs Insights, you can run queries to search through and analyze your log data. It uses its own purpose-built query language with commands to display, filter, sort, and limit your log data, making it easier to find trends and correlate data. 
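As a small example of that query language, the sketch below assembles a query that displays, filters, sorts, and limits log events, then wraps it in the parameters for starting a query. The log group name `/ecs/cat-photos` and the `ERROR` pattern are hypothetical. Running the query assumes boto3 and credentials, as noted in the final comment.

```python
import time


def build_logs_insights_query(pattern: str, limit: int = 20) -> str:
    """Compose a Logs Insights query from piped commands."""
    return " | ".join([
        "fields @timestamp, @message",         # display
        f"filter @message like /{pattern}/",   # filter
        "sort @timestamp desc",                # sort
        f"limit {limit}",                      # limit
    ])


def start_query_params(log_group: str, query: str, minutes: int = 60) -> dict:
    """Parameters for the CloudWatch Logs StartQuery call."""
    now = int(time.time())  # epoch seconds
    return {
        "logGroupName": log_group,
        "startTime": now - minutes * 60,
        "endTime": now,
        "queryString": query,
    }


query = build_logs_insights_query("ERROR")
request = start_query_params("/ecs/cat-photos", query)
print(query)

# Assuming boto3 and credentials:
#   boto3.client("logs").start_query(**request)
```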

Next, you can begin instrumenting your application with traces. To do this, you can install the X-Ray daemon on your EC2 instances, or create a Docker image that runs the X-Ray daemon on your ECS cluster. For Lambda, no installation is needed: you just enable active tracing with the push of a button in the Lambda function’s monitoring settings. 

Once X-Ray is enabled, you can instrument your application using the AWS X-Ray SDKs or the AWS Distro for OpenTelemetry. Once you’ve instrumented your application, you can view a service map of your infrastructure, see latency information, request metadata, and more, all of which helps you improve the performance of your application. 
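To give a sense of the request metadata that ties a trace together, here is a minimal sketch that parses the value of an `X-Amzn-Trace-Id` tracing header into its parts. The sample header value is only an illustration, and the helper name `parse_trace_header` is hypothetical. In a real application, the X-Ray SDK or the OpenTelemetry distro handles this header for you.

```python
def parse_trace_header(header: str) -> dict:
    """Split a tracing header value into root, parent, and sampling flag."""
    fields = dict(
        part.split("=", 1) for part in header.split(";") if "=" in part
    )
    root = fields["Root"]
    # A root trace ID has three parts: version, start time as
    # hexadecimal epoch seconds, and a unique identifier.
    _version, epoch_hex, _unique_id = root.split("-")
    return {
        "root": root,
        "parent": fields.get("Parent"),
        "sampled": fields.get("Sampled") == "1",
        "start_epoch": int(epoch_hex, 16),
    }


sample = (
    "Root=1-5759e988-bd862e3fe1be46a994272793;"
    "Parent=53995c3f42cd8ad8;Sampled=1"
)
trace = parse_trace_header(sample)
print(trace["root"], trace["sampled"])
```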

X-Ray integrates deeply with CloudWatch ServiceLens, CloudWatch Synthetics, and CloudWatch RUM. ServiceLens provides an end-to-end view of your application, and is often the first place people look to troubleshoot their app. In ServiceLens, you can see bottlenecks, and identify which users of your services are impacted, as well as look at metrics and log data. 

With Synthetics, you can use canaries to perform the same actions as your users, so you can monitor for issues like dead links, transaction issues, and latency issues, and ensure a positive customer experience. And with CloudWatch RUM, you can look at client-side data for your application to get better insight into actual user sessions. 

So a typical journey to debug a problem might look like this: 

  • A CloudWatch alarm is raised

  • You receive an alarm and start to investigate the problem by looking at the metric associated with the alarm.

  • From there, you can get an end-to-end view of your app in ServiceLens. With this, you can see if there’s high latency or any bottlenecks.

  • After that, you can view X-Ray trace data to see how customer requests are flowing through your application.

  • Once you pinpoint what service you think may be the issue, you can look at metrics and logs to verify the impact and correlate data on what might have been the root cause. 

  • And then, you can query logs using CloudWatch Logs Insights and query metrics using CloudWatch Metrics Insights to search for patterns and answer questions like “why did my metric spike?” and more.
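For the last step of this journey, here is a minimal sketch of a Metrics Insights query wrapped in a metric data request. The query ranks EC2 instances by average CPU utilization, which is one way to start answering “why did my metric spike?” on the e-commerce side. The query ID `q1`, the time window, and the limit are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def top_cpu_query(limit: int = 5) -> str:
    """Metrics Insights query: instances with the highest average CPU."""
    return (
        'SELECT AVG(CPUUtilization) FROM SCHEMA("AWS/EC2", InstanceId) '
        f"GROUP BY InstanceId ORDER BY AVG() DESC LIMIT {limit}"
    )


def metric_data_request(expression: str, hours: int = 1) -> dict:
    """Parameters for the CloudWatch GetMetricData call."""
    end = datetime.now(timezone.utc)
    return {
        "MetricDataQueries": [
            {"Id": "q1", "Expression": expression, "Period": 300}
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
    }


metrics_request = metric_data_request(top_cpu_query())
print(metrics_request["MetricDataQueries"][0]["Expression"])

# Assuming boto3 and credentials:
#   boto3.client("cloudwatch").get_metric_data(**metrics_request)
```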

That’s all for this one! See you next time.


About the Author

Alana Layton is an experienced technical trainer, technical content developer, and cloud engineer living in Seattle, Washington. Her career has included teaching about AWS all over the world, creating AWS content that is fun, and working in consulting. She currently holds six AWS certifications. Outside of Cloud Academy, you can find her testing her knowledge in bar trivia, reading, or training for a marathon.