Scaling and Monitoring

Getting Started with an Amazon Web Services Solution Real World Practices: In this course, we will untangle the AWS landscape and teach you what you need to know to build applications on AWS. This includes understanding what AWS has to offer, whether those prepackaged services make sense for your use case, and best practices around scaling, monitoring, security, and cost.


Hello and welcome to the Getting Started with an AWS Solution Real World Practices course from Cloud Academy. My name is Adam Hawkins and I'm your instructor for this lesson. This lesson covers scaling, monitoring, and alerting with CloudWatch. The objectives for this lesson are to design an auto scaling application, create a metric collection plan, and create an alarm plan.

Let's start off with a brief introduction to CloudWatch. CloudWatch is the AWS product for time series data collection, visualization, and alerting. All AWS services automatically send data back to CloudWatch. This may be data points like the total number of requests through an ELB or the disk usage on an RDS instance. You may configure alarms that fire when certain metrics go above or below certain thresholds. The alarms may trigger your autoscaling actions or even connect to SNS topics. Remember how I mentioned that SNS is actually the plumbing for all other AWS services? CloudWatch is a useful tool in your toolbox, but don't expect it to be the complete solution for all of your metric collection and visualization needs. CloudWatch is at its best when used to connect other AWS services, with its data fed into more advanced visualization tools. That being said, there may be areas where you'll benefit from CloudWatch. If you're working with predefined metrics without many facets or complex visualization, then CloudWatch may fit the bill. Let's spend the rest of the lesson examining how to use CloudWatch.

Odds are, some application tiers will be horizontally scalable. For most applications, this is a load balancer with machines behind it. This is markers four, five, and six in the diagram. If you receive more traffic, just add more machines. If you receive less traffic, just remove them. AWS makes this easy with the off-the-shelf metrics and CloudWatch integrations. You should bake autoscaling into your solution from the beginning. Start by creating autoscaling groups. These are the yellow scale-out arrows in the diagram. Remember to set the minimum and maximum size. You want to ensure that the minimum meets your traffic and availability requirements. Setting the maximum avoids continually scaling up into infinity. That's expensive and will most likely do more harm than good. Next, prepare the instance configuration for instances created by autoscaling triggers. How do you automatically configure which packages are installed on that machine or which version of the code to use?
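The autoscaling group setup above can be sketched in code. This is a minimal sketch using boto3-style parameters; the group name, launch template name, and ARN are hypothetical placeholders, and the live API call is commented out so the sketch stands on its own.

```python
# Sketch of an autoscaling group with explicit min/max bounds.
# Names like "web-asg" and "web-launch-template" are placeholders.
asg_params = {
    "AutoScalingGroupName": "web-asg",
    "LaunchTemplate": {"LaunchTemplateName": "web-launch-template"},
    "MinSize": 2,    # minimum that meets traffic and availability requirements
    "MaxSize": 10,   # hard ceiling so you never scale up into infinity
    "DesiredCapacity": 2,
    "TargetGroupARNs": ["arn:aws:elasticloadbalancing:..."],  # the ELB tier
}

# To apply for real:
# import boto3
# boto3.client("autoscaling").create_auto_scaling_group(**asg_params)
```

The important part is that `MinSize` and `MaxSize` are both explicit, so the group can never shrink below your availability floor or grow past your cost ceiling.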

This is a big topic that usually leads into configuration management or different kinds of automation strategies. There are many different solutions to this problem. You may opt to do everything with instance user data, use a prebuilt AMI, or run a configuration management tool like Ansible after the instance boots. EC2 user data is the easiest thing to start with if you don't have a better idea. Lastly, configure CloudWatch to coordinate the entire process. Choose a scaling metric. This may be your CPU usage, request count, or a custom metric reported to CloudWatch.
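Here's a rough sketch of the user data approach, since it's the easiest starting point. The package name and AMI ID are hypothetical placeholders; the only real requirement is that the EC2 API expects user data base64-encoded when set through a launch template.

```python
import base64

# User data is a shell script that runs once on first boot.
# The package and AMI below are placeholders, not recommendations.
user_data = """#!/bin/bash
yum install -y nginx          # install required packages
systemctl enable --now nginx  # start the service on boot
"""

launch_template_data = {
    "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
    "InstanceType": "t3.micro",
    # The API expects user data base64-encoded:
    "UserData": base64.b64encode(user_data.encode()).decode(),
}
```

Anything more involved than a handful of shell commands is usually a sign to move to a prebuilt AMI or a configuration management tool.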

You also need to determine the upper and lower breach thresholds. You can select a value and a calculation method such as sum or average. You will also need to select the time required to trigger the alarm. Here's an example. The average CPU usage is greater than 85% for 10 minutes. Then connect autoscaling actions to each alarm. This defines how many instances to add or remove from the autoscaling group. You will need to tune these depending on your load patterns. Autoscaling reduces your overall cost and increases scalability. Who doesn't want these things? Let's see what else you can do with CloudWatch. CloudWatch provides a lot of useful data out of the box and I think this is the most powerful feature. Create an EC2 instance and inspect it in the EC2 console. You'll see graphs after a few minutes. That's CloudWatch data. You'll find basic data for all of your AWS infrastructure. The data is a great place to start, but it may not include everything that you need. Here's an example. CloudWatch reports EC2 CPU usage but not memory usage.
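The "average CPU greater than 85% for 10 minutes" example can be sketched as alarm parameters. This is a minimal boto3-style sketch; the alarm name and action ARN are placeholders, and the live call is commented out.

```python
# Sketch of the "average CPU > 85% for 10 minutes" alarm.
# The alarm name and scaling-policy ARN are placeholders.
alarm_params = {
    "AlarmName": "web-asg-cpu-high",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",            # the calculation method
    "Threshold": 85.0,                 # upper breach threshold, in percent
    "ComparisonOperator": "GreaterThanThreshold",
    "Period": 300,                     # seconds per evaluated data point
    "EvaluationPeriods": 2,            # 2 x 300s = 10 minutes sustained
    "AlarmActions": ["arn:aws:autoscaling:..."],  # scale-out policy to fire
}

# To apply for real:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

Note how the 10 minutes comes from `Period` times `EvaluationPeriods` rather than a single duration field, which is how CloudWatch expresses "sustained for N minutes."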

Here's another. CloudWatch reports ElastiCache memory usage but not the total available memory. CloudWatch also doesn't provide math functions such as dividing two metric series. However, CloudWatch does provide percentiles on data points. This is how you can determine your P50, P90, or P95. This is especially useful for latencies. Metric data points may have multiple dimensions. Dimensions let you compare different metric streams. An active usage metric may use the page dimension to differentiate between the chat and my-account pages. There's one big drawback here though. You can only compare metrics that share all of the same dimensions. CloudWatch recently bumped its data retention from two weeks to 15 months at no extra cost. This addressed a major drawback in the service. Now data is kept for a longer time, with less granularity as you go farther back in time. This should be enough for the majority of users. CloudWatch uses APIs just like all of the other AWS services. Thus, you may use the AWS CLI or other relevant SDK to report custom metric data and dimensions back to CloudWatch. This is mandatory if you want to autoscale or alert on data not automatically provided by CloudWatch. You can use the APIs to fill in the gaps by calculating ratios, I'm looking at you memory statistics, and manually reporting them back.
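Here's roughly what filling that gap looks like: compute the ratio yourself and ship it as a custom metric with a dimension. The namespace, metric name, and dimension below are our own choices for illustration, not AWS-defined names.

```python
# Compute a ratio CloudWatch won't calculate for you: memory used percent.
def memory_used_percent(used_bytes: int, total_bytes: int) -> float:
    return round(100.0 * used_bytes / total_bytes, 2)

# Shape it as a custom metric. "MyApp" and "MemoryUsedPercent" are
# our own placeholder names; InstanceId is a made-up example value.
datapoint = {
    "Namespace": "MyApp",
    "MetricData": [{
        "MetricName": "MemoryUsedPercent",
        "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        "Value": memory_used_percent(used_bytes=6_442_450_944,   # 6 GiB
                                     total_bytes=8_589_934_592), # 8 GiB
        "Unit": "Percent",
    }],
}

# Report it with:
# import boto3
# boto3.client("cloudwatch").put_metric_data(**datapoint)
```

Run this on a schedule (cron, or the CloudWatch agent) and the custom metric becomes something you can alarm and autoscale on like any built-in metric.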

You may also use the collectd plugin to report data back to CloudWatch if that's your style. You also get alarms. You may know these as alerts, but CloudWatch calls them alarms. Regardless of what they're called, the point is they need to get someone's attention. Here you have all of the power of SNS on hand. This gives you access to sending email, SMS, or even calling a Lambda function to execute custom code. CloudWatch really does excel at alarms. If you just need to send an email to your on-call team, then this is just the ticket.
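The email case can be sketched in two steps: create an SNS topic, subscribe the on-call address, then list the topic in the alarm's actions. The topic name and address are placeholders, and the live calls are commented out so the sketch stays self-contained.

```python
# Step 1: an SNS topic for on-call notifications (placeholder name).
topic_params = {"Name": "oncall-alerts"}
# import boto3
# topic_arn = boto3.client("sns").create_topic(**topic_params)["TopicArn"]

# Step 2: subscribe the on-call address (placeholder email).
subscription_params = {
    # "TopicArn": topic_arn,
    "Protocol": "email",               # could also be "sms" or "lambda"
    "Endpoint": "oncall@example.com",
}
# boto3.client("sns").subscribe(**subscription_params)

# Step 3: put the topic ARN in the alarm's AlarmActions list so the
# alarm notifies the topic whenever it breaches.
```

Swapping `"email"` for `"lambda"` and pointing `Endpoint` at a function ARN is how you'd run custom code instead of sending mail.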

Let's recap this lesson with a summary and recommendations on how to use CloudWatch. CloudWatch is great if you aren't sure what data you need. The out-of-the-box data is generally good enough to assess what's missing for your particular requirements. Look for ways to fill in the gaps when you know what else you need. You can also use CloudWatch's dashboard for simple visualizations over your entire application. If you're already using a dedicated metrics SaaS, then you're probably better off just sticking with that.

These options are more full featured and usually have integrations to automatically pull in data from CloudWatch, so you don't need to re-implement anything. You'll need CloudWatch for autoscaling triggers regardless of whether you use it as your primary metric collection tool. Remember to use the AWS CLI or SDK to push in custom data as needed. If you opt to use CloudWatch as your primary metric collection tool, then you can use it for alarms as well. If you don't, then make sure your tool can also ping you when data matches certain conditions. Remember, it's not production until it's monitored, and it's not monitored unless there's someone there to actually handle what's going wrong.

All right, that's enough about CloudWatch for now. We touched on how autoscaling can save you money. The next lesson is all about other ways to save you money and keep your costs under control. AWS is not cheap these days, so it's more important than ever. Join me in the next lesson where we talk about dollars and cents. Cheers.

About the Author

Adam is a backend/service engineer turned deployment and infrastructure engineer. His passion is building rock solid services and equally powerful deployment pipelines. He has been working with Docker for years and leads the SRE team at Saltside. Outside of work he's a traveller, beach bum, and trance addict.