
Amazon CloudWatch

Duration: 3h 8m


Course Description

The AWS exam guide outlines that 60% of the Solutions Architect–Associate exam questions could be on the topic of designing highly-available, fault-tolerant, cost-efficient, scalable systems. This course teaches you to recognize and explain the core architecture principles of high availability, fault tolerance, and cost optimization. We then step through the core AWS components that can enable highly available solutions when used together so you can recognize and explain how to design and monitor highly available, cost efficient, fault tolerant, scalable systems.

Course Objectives

  • Identify and recognize cloud architecture considerations such as functional components and effective designs
  • Define best practices for planning, designing, and monitoring in the cloud
  • Develop to client specifications, including pricing and cost
  • Evaluate architectural trade-off decisions when building for the cloud
  • Apply best practices for elasticity and scalability concepts to your builds
  • Integrate with existing development environments

Intended Audience

This course is for anyone preparing for the Solutions Architect–Associate for AWS certification exam. We assume you have some existing knowledge and familiarity with AWS, and are specifically looking to get ready to take the certification exam.


Prerequisites

Basic knowledge of core AWS functionality. If you haven't already completed it, we recommend our Fundamentals of AWS Learning Path. We also recommend completing the other courses, quizzes, and labs in the Solutions Architect–Associate for AWS certification learning path.

This Course Includes:

  • 11 video lectures
  • Detailed overview of the AWS services that enable high availability, cost efficiency, fault tolerance, and scalability
  • A focus on designing systems in preparation for the certification exam

What You'll Learn

Lecture Group: Designing for high availability, fault tolerance, and cost efficiency
What you'll learn: How to combine AWS services together to create highly available, cost efficient, fault tolerant systems.

Lecture Group: Designing for business continuity
What you'll learn: How to recognize and explain Recovery Time Objectives and Recovery Point Objectives, and how to recognize and implement AWS solution designs to meet common RTO/RPO objectives.

Lecture Group: Ten AWS Services That Enable High Availability
What you'll learn: Regions and Availability Zones, VPCs, ELB, SQS, EC2, Route 53, EIP, CloudWatch, and Auto Scaling.

If you have thoughts or suggestions for this course, please contact Cloud Academy at


Here we are at service number two on our High Availability Top 10, and it is Amazon CloudWatch. Now, CloudWatch is seriously useful. The de facto monitoring service for AWS resources, it's like the pulse of your solution, providing you with the metrics to quickly diagnose and dynamically adjust to any availability or scalability issue. CloudWatch tracks aspects such as CPU, disk, and network activity. Amazon RDS database instances, Amazon DynamoDB, Elastic Load Balancers, and Amazon Elastic Block Store volumes are all examples of where CloudWatch can monitor your services for you.

CloudWatch Basic Monitoring for Amazon EC2 instances provides seven pre-selected metrics, which are collected at five-minute intervals, and three status-check metrics, which are recorded at one-minute intervals. Now, that's what it says in the documentation. However, the first thing you'll notice in the console is that there are actually nine available metrics when you first start up an instance.

The metric choices we've got are CPU utilization, which is the percentage of allocated EC2 compute units that are currently in use on the instance. We've got disk-read ops and disk-write ops. Both of those are collected as a count value, and they're the completed read or write operations from all instance store volumes available to the instance in a specified time. We've also got disk-read bytes and disk-write bytes, which are the bytes read from and written to all instance store volumes available to the instance. Another difference is the networking ones. We've got network-in and network-out, which are the number of bytes received and sent on all network interfaces by that instance. Network packets-in and network packets-out are the number of packets received and sent on all network interfaces by the instance. Now both of these two are only available in the basic monitoring service.

So that's where there's a little difference, because we've actually got nine options showing there, where the documentation clearly says we have seven, so the network ones are combined.

Now, there are three status-check metrics, which are collected at one-minute intervals. Status-check failed is a combination of the status-check failed instance and status-check failed system metrics, and it reports whether either of those two checks has failed in the last minute, as a zero or a one. Status-check failed instance reports whether the instance has passed the EC2 instance status check in the last minute. And status-check failed system reports whether the instance has passed the EC2 system status check in the last minute. Both are very, very useful for telling whether your system is active and healthy. And again, the values that are used for those are a zero or a one. So, those are the key metrics that we get with basic monitoring.

Now, there are others that can show up. If we've launched a T2 instance, for example one of the free-tier instances, then you see two additional panels in your CloudWatch reports. We've got two metrics which are captured for T2 instances only: CPU credit usage, which is a count value, and CPU credit balance, which is another count value. And again, those are collected at five-minute intervals.

Now, the metrics that are collected by CloudWatch are available for two weeks. So if you want to keep the statistics for longer than that, you can retrieve them using the GetMetricStatistics API. Or there are a number of partner solutions that you can use to collate and store CloudWatch metrics and give yourself an extra layer of reporting. CloudWatch metrics cannot be deleted, but they do automatically expire after this two-week period if no new data is published to them. So, CloudWatch stores the metrics for terminated EC2 instances or deleted Elastic Load Balancers for that two-week period.
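
As a rough illustration of retrieving statistics through the GetMetricStatistics API mentioned above, here is a minimal Python sketch. It assumes the boto3 SDK, whose CloudWatch client exposes `get_metric_statistics`. The helper names `build_cpu_request` and `fetch_datapoints` and the instance ID are hypothetical, chosen only for this example, and the client is passed in so the helpers themselves do not need AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def build_cpu_request(instance_id, end_time, hours=1, period_seconds=300):
    """Build keyword arguments for a GetMetricStatistics call (hypothetical helper)."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": period_seconds,  # 300 seconds matches the five-minute basic interval
        "Statistics": ["Average"],
    }


def fetch_datapoints(client, request):
    """Call GetMetricStatistics and return the datapoints sorted by timestamp."""
    response = client.get_metric_statistics(**request)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# Usage (needs boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   client = boto3.client("cloudwatch")
#   request = build_cpu_request("i-0123456789abcdef0", datetime.now(timezone.utc))
#   for point in fetch_datapoints(client, request):
#       print(point["Timestamp"], point["Average"])
```

Running something like this on a schedule and writing the results to your own storage is one way to keep statistics beyond the retention window.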

Now when you're looking at the graphs, the window can look quite different for various metrics, but it can also look quite similar. If you view the same time window with a five-minute period versus a one-minute period, the graphs can look alike, but you can see that the data points are displayed in different places on the graph. For the period you specify in your graph, CloudWatch finds all the available data points and tries its best to calculate a single, aggregate point to represent that entire period. When it's a five-minute period, the single data point is placed at the beginning of the five-minute time window. And when we've got a one-minute period, the single data point is placed at the one-minute mark. So it's best to use the one-minute period for troubleshooting and other activities that require the most precise graphing. And remember that one-minute intervals are only available when you have detailed monitoring enabled.

With our default metrics, one common point of confusion is memory. With virtualized technology, it is quite difficult to report on memory usage. So while we've got those seven or nine metrics that we can actively show with basic monitoring, and they do give us a lot of granularity on how our EC2 instances are performing, we can't by default see what the memory usage is. If we did want to report on that, we would need to install a daemon or an agent on the EC2 instance itself so that CloudWatch can collect and report on those metrics.

Now, monitoring data is retained for two weeks, even if your AWS resources have been terminated. CloudWatch provides the ability to set alarms on these and other metrics. It is also possible to include custom metrics and create CloudWatch alarms on them. You can create a CloudWatch alarm that sends an Amazon Simple Notification Service message when the alarm changes state.
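
To make the earlier point about graph periods concrete, the following Python sketch mimics how one-minute samples might be rolled up into a single aggregate point per period, stamped at the start of that period. It is an illustration only, not CloudWatch's actual implementation. The `aggregate` function and the sample values are assumptions made up for this example, and the average is just one possible statistic.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def aggregate(datapoints, period_seconds):
    """Average (timestamp, value) samples into fixed-size periods.

    Each aggregate point is stamped with the start of its period.
    """
    buckets = defaultdict(list)
    for timestamp, value in datapoints:
        epoch = int(timestamp.timestamp())
        period_start = epoch - (epoch % period_seconds)
        buckets[period_start].append(value)
    return [
        (datetime.fromtimestamp(start, tz=timezone.utc), sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Ten made-up one-minute CPU samples containing a short spike to 90%.
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
samples = [
    (start + timedelta(minutes=i), float(v))
    for i, v in enumerate([10, 20, 90, 20, 10, 30, 30, 30, 30, 30])
]

five_minute = aggregate(samples, 300)  # two points, at 12:00 and 12:05
one_minute = aggregate(samples, 60)    # ten points, one per minute

print(five_minute)
print(max(value for _, value in one_minute))
```

In this made-up data both five-minute points average out to 30, so the spike is only visible at the one-minute period, which is why the finer period is the better choice for troubleshooting.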
An alarm watches a single metric over a time period you specify and performs one or more actions based on the value of the metric relative to a given threshold over a number of time periods. The action is a notification sent to an Amazon Simple Notification Service topic or an Auto Scaling policy. Alarms invoke actions for sustained state changes only. CloudWatch alarms will not invoke actions simply because they are in a particular state. The state must have changed and been maintained for a specified number of periods. After an alarm invokes an action due to a change in state, its subsequent behavior depends on the type of action that you have associated with the alarm. For Auto Scaling policy notifications, the alarm continues to invoke the action for every period that the alarm remains in the new state. For Amazon Simple Notification Service notifications, no additional actions are invoked.

An alarm has three possible states. Try to remember these.

  • OK: the metric is within the defined threshold.
  • ALARM: the metric is outside of the defined threshold.
  • INSUFFICIENT_DATA: the alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state.

Let's have a look at our sample question. In the basic monitoring package for EC2, and the key words here are basic monitoring and EC2, Amazon CloudWatch provides the following metrics:

A. Web server visible metrics such as number of failed transaction requests. No, that would not be a default metric under the basic monitoring package. You could do that, but it would be part of a detailed or a custom monitoring metric.

B. Operating system visible metrics such as memory utilization. No, that's something that we can't get out of the basic package. Memory is not available.

C. Database visible metrics such as number of connections. No, not in the basic EC2 package. Now if it said, "Basic monitoring package for RDS," then that answer might be correct.

D. Hypervisor visible metrics such as CPU utilization. Yes, because the hypervisor reporting is something that CloudWatch does provide as part of the basic EC2 package.

So the answer is D.
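
Returning to the alarm behavior described earlier in this lecture, the sketch below is a deliberately simplified Python model of the three alarm states and of notifying only on state changes. It is not how CloudWatch is implemented. The `evaluate_alarm` function, the "greater than threshold" comparison, and the sample CPU values are all assumptions made for illustration.

```python
OK, ALARM, INSUFFICIENT_DATA = "OK", "ALARM", "INSUFFICIENT_DATA"


def evaluate_alarm(values, threshold, evaluation_periods):
    """Return (states, notifications) for a sequence of per-period values.

    values holds one metric value per period, or None when no data arrived.
    Simplified rule: ALARM only when every one of the last
    `evaluation_periods` values is above the threshold.
    """
    state = INSUFFICIENT_DATA
    states, notifications = [], []
    for index in range(len(values)):
        window = values[max(0, index - evaluation_periods + 1): index + 1]
        if len(window) < evaluation_periods or any(v is None for v in window):
            new_state = INSUFFICIENT_DATA
        elif all(v > threshold for v in window):
            new_state = ALARM
        else:
            new_state = OK
        if new_state != state:
            # Notify once per transition, like an SNS-style action.
            notifications.append((index, state, new_state))
        state = new_state
        states.append(state)
    return states, notifications


# Made-up CPU percentages: one isolated spike, then a sustained breach.
cpu = [40, 45, 85, 50, 90, 95, 92, 60, 55, 50]
states, notifications = evaluate_alarm(cpu, threshold=80, evaluation_periods=3)

for index, old, new in notifications:
    print(f"period {index}: {old} -> {new}")
```

With these values the isolated 85% reading does not trigger anything. Only the three consecutive readings above 80% move the alarm into the ALARM state, and a notification is recorded for each change of state rather than for every period spent in a state.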

About the Author

Learning paths: 23

Andrew is an AWS certified professional who is passionate about helping others learn how to use and gain benefit from AWS technologies. Andrew has worked for AWS and for AWS technology partners Ooyala and Adobe. His favorite Amazon leadership principle is "Customer Obsession," as everything at AWS starts with the customer. Passions around work are cycling and surfing, and having a laugh about the lessons learnt trying to launch two daughters and a few startups.