How to design high availability and fault tolerant architectures
Designing solutions for elasticity and scalability
The course is part of this learning path
Designing Resilient Architectures. In this module, we explore the concepts of business continuity and disaster recovery, the well-architected framework and the AWS services that help us design resilient, fault-tolerant architectures when used together.
We will firstly introduce the concepts of high availability and fault tolerance and introduce you to how we go about designing highly available, fault-tolerant solutions on AWS. We will learn about the AWS Well Architected Framework, and how that framework can help us make design decisions that deliver the best outcome for end users. Next, we will introduce and explain the concept of business continuity and how AWS services can be used to plan and implement a disaster recovery plan.
We will then learn to recognize and explain the core AWS services that when used together can reduce single points of failure and improve scalability in a multi-tier solution. Auto Scaling is a proven way to enable resilience by enabling an application to scale up and down to meet demand. In a hands-on lab we create and work with Auto Scaling groups to improve add elasticity and durability. Simple Queue service increases resilience by acting as a messaging service between other services and applications, thereby decoupling layers, reducing dependency on state. Amazon Cloudwatch is a core component of maintaining a resilient architecture - essentially it is the eyes and ears of your environment, so we next learn to apply the Amazon CloudWatch service in a hands-on environment.
We then learn to apply the Amazon CloudFront CDN service to add resilience to a static website that is served out of Amazon S3. Amazon Cloudfront is tightly integrated with other AWS services such as Amazon S3, AWS WAF and Amazon GuardDuty making Amazon CloudFront an important component to increasing the resilience of your solution.
Hello, and welcome to this lecture, where I shall provide an overview of the Amazon CloudWatch service.
The primary function of Amazon CloudWatch is to provide a means of monitoring your resources that you're running within AWS via a series of metrics which are individual to each service that you are using. This allows you to quickly react to events, and diagnose, and dynamically adjust any availability or scalability issue that you might be experiencing.
Each service and resource sends data to your CloudWatch dashboard as metrics. These metrics vary depending on the service. For example, when monitoring EC2 you could have metrics such as CPUUtilization and NetworkIn. Whereas, for the S3 service your metrics could be BucketSizeBytes and NumberOfObjects. The metrics are very dependent, as each service is used differently, and as such contains different metric variables. Amazon CloudWatch also offers the ability of creating custom metrics for your applications if you need to measure specific components of your infrastructure.
CloudWatch offers you two modes of recording your metric data, these being basic monitoring and detailed monitoring. Basic monitoring is the default monitoring type when configuring Amazon CloudWatch, which records metrics every five minutes. However, if you want to do a more precise timeframe for your monitoring, then you would use detailed monitoring.
Detailed monitoring for instance types ensures the metric data is recorded at one minute intervals, as opposed to five minutes with basic monitoring. However, detailed monitoring comes at an additional cost. It's important to remember that any data captured by Amazon CloudWatch is retained for two weeks, even if your AWS resources have been terminated.
It's great that CloudWatch is constantly monitoring the environment, but we need to create alarms to respond to events that occur within your environment and across your resources. You should think of alarms as predefined thresholds. Let's take a look at an example of how you could use these alarms. Let's say we have an application server, and when this server reaches 90% CPU utilization, a sub-process in our customer application stops responding. As CloudWatch already tracks CPU utilization we can use this information by creating an alarm to notify us of the impending disaster.
As a result, we could configure an alarm to trigger at 75% CPU utilization. Where the response of this trigger alarm would be to launch another server to even the load. This alarm could have an auto scaling action assigned to it to launch its additional server automatically for you. Or the alarm could be configured to send you a message when the alarm is triggered via the simple notification service, SNS.
An alarm has three possible states, the first one being 'OK'. This simply means that the metric associated of the alarm is within the predefined threshold. In our example on the previous slide, a CPU usage of 50% would have an 'OK' state, as it had not reached the 75% threshold limit.
'Alarm', this status means that the metric is outside of the threshold level, and the alarm is activated. Therefore, in our example, a CPU usage of 75% would have triggered the alarm.
And, finally, 'insufficient data'. This indicates that the metric has not collated enough available data to determine the alarm state. This is usually the case when the alarm has just been configured, and when you are waiting for metric data to be sent to CoudWatch.
CloudWatch also provides a repository for logging. This is a very effective way to capture all of the logs across your application and web servers. For example, let's say you manage a popular WordPress site with 12 front end web servers. You then experience an unexpected event which requires you to view the logs. Wouldn't it be nice to simply go to a single place to look at the system and log data of all your servers? With CloudWatch logging you can direct all of the service and applications to send their logs to CloudWatch, allowing you to review them from a single place. You can then also export the logs, and use your favorite third-party tool to perform additional detailed analysis.
By utilizing Amazon CloudWatch and its available features, it gives you the ability to ensure the following points. That your customers are receiving a good end user experience. That you understand how changes are impacting the overall performance of your environment. You are scaling your environment efficiently. Effective root cause analysis of service interruptions, and you can identify how to resolve future problems.
That now brings me to the end of this short lecture providing an overview of Amazon CloudWatch.
About the Author
Head of Content
Andrew is an AWS certified professional who is passionate about helping others learn how to use and gain benefit from AWS technologies. Andrew has worked for AWS and for AWS technology partners Ooyala and Adobe. His favorite Amazon leadership principle is "Customer Obsession" as everything AWS starts with the customer. Passions around work are cycling and surfing, and having a laugh about the lessons learnt trying to launch two daughters and a few start ups.