In this course, we will explain the concept of observability in AWS. You will learn:
- What observability is
- What AWS services can help you achieve observability
- The open-source services you can use to gain insight into your systems
This course has been created for those looking for an overview of the observability services in AWS, including native services like CloudWatch and X-Ray and the monitoring services built on top of them, as well as open-source tools like Grafana and Prometheus and standards like OpenTelemetry.
To get the most out of this course, you should have a good understanding of the AWS ecosystem, including knowledge of compute services like ECS, EKS, Lambda, and EC2. Some familiarity with Amazon CloudWatch and AWS X-Ray will also help you better understand some of the practical parts of this course. If you’d like to take a course specifically on Amazon CloudWatch or X-Ray, feel free to check out our dedicated courses on those services.
Let’s say you are a cloud operations engineer for a website that produces quality photos of cats. The photos are so high quality that talk show hosts and influencers are constantly talking about how great they are. Because of this, the traffic to your website has increased over time. This sometimes leads to performance issues on your site. This past Friday, there was a huge issue where customers could no longer see their cat photos.
You, of course, were on vacation when your boss called and said, “The cat photo system is down. What’s the problem and when are we going to be back online?” So you go back to your hotel, pull out the work laptop that you brought just in case, and log in to your AWS account. Unfortunately, this has become a typical Friday for you, where you log in to the AWS console to fix some problem with the infrastructure for the cat photos website. You follow your typical problem-solving method: you first detect the issue, then investigate it, and finally remediate the problem.
The detect stage usually begins when the error occurs. This is ideally followed by an alert, where you may be paged and a trouble ticket gets created. In this case, the alert was your boss calling you on your vacation. Lucky you. From there, you investigate the problem by looking at traces, logs, and metrics and attempt to correlate the data to find the issue.
Once you find the cause, you can react to the problem and issue a fix, thus remediating the issue. At this point, you should understand the root cause and can collaborate with others to ensure this problem doesn’t happen again in the future. This scenario contains examples of both monitoring and observability. Monitoring tells you whether a system is working properly or not; in this case, the alert told you that it wasn’t. Observability gives you information about WHY a system isn’t working, which you discovered by looking at logs, metrics, and traces.
Logs, metrics, and traces are what we call the foundation of observability. Metrics are typically numerical data from a specific time period, such as information about CPU utilization or system error rate.
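To make the idea of a metric concrete, here is a minimal Python sketch that turns raw per-minute request counts into a single error-rate number for a time period. The sample data, the `MinuteSample` class, and the `error_rate` function are all made up for illustration; none of them come from an AWS API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MinuteSample:
    """Request counts observed during one minute (hypothetical sample data)."""
    timestamp: datetime
    requests: int
    errors: int


def error_rate(samples: list[MinuteSample]) -> float:
    """Return errors as a percentage of all requests in the period."""
    total_requests = sum(s.requests for s in samples)
    total_errors = sum(s.errors for s in samples)
    if total_requests == 0:
        return 0.0  # no traffic, so nothing failed
    return 100.0 * total_errors / total_requests


start = datetime(2024, 1, 5, 18, 0)
samples = [
    MinuteSample(start, requests=200, errors=2),
    MinuteSample(start + timedelta(minutes=1), requests=300, errors=3),
    MinuteSample(start + timedelta(minutes=2), requests=500, errors=45),
]

print(f"Error rate over {len(samples)} minutes: {error_rate(samples):.1f}%")
```

The third minute carries most of the errors, which is the kind of spike a metric graph makes visible at a glance.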
Logs represent timestamped events that happened over a period of time. With logs, you can get information about your resources and requests, and even create counters for how often things happened. You can also see your debugging data, including any warnings or errors, to help you troubleshoot issues.
And traces record the paths taken by requests, typically made by an app or a user on the site. For example, if someone presses “buy cat photo” on your website, tons of systems are working behind the scenes: shopping cart updates, payment processing, and user profile services, all working together so the customer can successfully buy a cat photo. Tracing helps you see how the backend systems interact to fulfill the user’s request.
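As a rough illustration of using logs as both counters and debugging data, the Python sketch below parses a handful of structured log lines, counts how often each level occurred, and pulls out the warnings and errors. The log lines and their field names (`time`, `level`, `message`) are assumptions for this example; real applications choose their own format.

```python
import json
from collections import Counter

# Hypothetical structured log lines; real applications choose their own fields.
raw_logs = [
    '{"time": "2024-01-05T18:00:01Z", "level": "INFO", "message": "photo served"}',
    '{"time": "2024-01-05T18:00:02Z", "level": "WARN", "message": "slow response"}',
    '{"time": "2024-01-05T18:00:03Z", "level": "ERROR", "message": "photo not found"}',
    '{"time": "2024-01-05T18:00:04Z", "level": "ERROR", "message": "photo not found"}',
]

events = [json.loads(line) for line in raw_logs]

# Counter: how often each log level occurred.
level_counts = Counter(event["level"] for event in events)

# Troubleshooting view: only warnings and errors, in the order they were logged.
problems = [e for e in events if e["level"] in {"WARN", "ERROR"}]

print(dict(level_counts))
for event in problems:
    print(event["time"], event["level"], event["message"])
```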
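The “buy cat photo” request can be sketched as a tiny trace. The Python below is a simplified model, not the X-Ray data format: the `Span` class, the service names, and the durations are all hypothetical. Each span points at its parent, which is what lets a tracing tool rebuild the path a request took.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One unit of work within a trace (all names here are hypothetical)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: int


# One trace for a single "buy cat photo" request.
trace = [
    Span("a", None, "web-frontend", 480),
    Span("b", "a", "shopping-cart", 40),
    Span("c", "a", "payment", 390),
    Span("d", "c", "user-profile", 25),
]


def render(spans: list[Span], parent_id: Optional[str] = None, depth: int = 0) -> list[str]:
    """Return the call tree as indented lines, walking from the root span down."""
    lines = []
    for span in spans:
        if span.parent_id == parent_id:
            lines.append(f"{'  ' * depth}{span.service} ({span.duration_ms} ms)")
            lines.extend(render(spans, span.span_id, depth + 1))
    return lines


print("\n".join(render(trace)))

# The slowest downstream call is a natural place to start investigating.
slowest = max((s for s in trace if s.parent_id is not None), key=lambda s: s.duration_ms)
print("Slowest downstream service:", slowest.service)
```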
In AWS, the observability stack starts with these monitoring primitives: the metrics, traces, and logs. To instrument AWS applications, you can use two main services: Amazon CloudWatch and AWS X-Ray. Logs and metrics are captured in Amazon CloudWatch, and traces are captured in AWS X-Ray.
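As one example of sending data to CloudWatch, the sketch below builds the request for a custom metric using the shape expected by the `PutMetricData` API. The namespace, metric name, and dimension values are hypothetical, and the actual call is left as a comment because it needs the `boto3` library and AWS credentials.

```python
from datetime import datetime, timezone


def build_metric_request(error_count: int) -> dict:
    """Build keyword arguments for CloudWatch's PutMetricData call."""
    return {
        "Namespace": "CatPhotos",  # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": "PhotoLoadErrors",  # hypothetical metric name
                "Dimensions": [{"Name": "Service", "Value": "photo-api"}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(error_count),
                "Unit": "Count",
            }
        ],
    }


request = build_metric_request(error_count=3)
print(request["Namespace"], request["MetricData"][0]["MetricName"])

# With AWS credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**request)
```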
These two services are considered the backbone of the observability stack, and AWS has built other native monitoring services using their functionality. For example, X-Ray functionality and CloudWatch functionality made the creation of Amazon CloudWatch ServiceLens possible. This service uses X-Ray to provide an end-to-end view of your application and combines that with CloudWatch metrics and logs, so you have metrics, traces, and logs all in one place.
Over the years, CloudWatch has become a suite of services, with ServiceLens being just one of them. Other services like Amazon CloudWatch Synthetics and CloudWatch RUM were created to better monitor and test the end-user experience.
And eventually, CloudWatch functionality was expanded to include a set of insights services, such as:
- Container Insights
- Lambda Insights
- Contributor Insights
- Application Insights
- Logs Insights
- And Metrics Insights
These services are meant to provide additional metrics and logging information for your containers, Lambda functions, and applications, and to provide querying functionality for both metrics and logs. In summary, metrics, traces, and logs are the foundation of observability in AWS. However, there are now many other services you can use to correlate data more easily and properly instrument your application.
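To give a feel for the querying functionality, here is a minimal sketch of a CloudWatch Logs Insights query that counts error lines in five-minute buckets, wrapped in the parameters the `StartQuery` API expects. The log group name and time window are hypothetical, and the call itself is shown only as a comment because it needs `boto3` and AWS credentials.

```python
from datetime import datetime, timedelta, timezone

# Count log lines containing "ERROR" in five-minute buckets.
QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errors by bin(5m)
""".strip()


def build_query_request(log_group: str, end: datetime, hours: int = 1) -> dict:
    """Build keyword arguments for the CloudWatch Logs StartQuery call."""
    start = end - timedelta(hours=hours)
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": QUERY,
    }


request = build_query_request(
    "/cat-photos/photo-api",  # hypothetical log group name
    end=datetime(2024, 1, 5, 19, 0, tzinfo=timezone.utc),
)
print(request["queryString"])

# With AWS credentials configured, the query could be started like this:
#   import boto3
#   query_id = boto3.client("logs").start_query(**request)["queryId"]
# Results are then fetched by polling get_query_results(queryId=query_id).
```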
Alana Layton is an experienced technical trainer, technical content developer, and cloud engineer living out of Seattle, Washington. Her career has included teaching about AWS all over the world, creating AWS content that is fun, and working in consulting. She currently holds six AWS certifications. Outside of Cloud Academy, you can find her testing her knowledge in bar trivia, reading, or training for a marathon.