Designing for high availability, fault tolerance and cost efficiency
AWS Services That Enable High Availability
The AWS exam guide outlines that 15% of the SysOps Administrator–Associate exam questions could be on the topic of designing highly-available, fault-tolerant, cost-efficient, scalable systems. This course teaches you to recognize and explain the core architecture principles of high availability, fault tolerance, and cost optimization. We then step through the core AWS components that can enable highly available solutions when used together so you can recognize and explain how to design and monitor highly available, cost efficient, fault tolerant, scalable systems.
- Identify and recognize cloud architecture considerations such as functional components and effective designs
- Define best practices for planning, designing, and monitoring in the cloud
- Develop to client specifications, including pricing and cost
- Evaluate architectural trade-off decisions when building for the cloud
- Apply best practices for elasticity and scalability concepts to your builds
- Integrate with existing development environments
This course is for anyone preparing for the Solutions Architect–Associate for AWS certification exam. We assume you have some existing knowledge and familiarity with AWS and are specifically looking to get ready to take the certification exam.
You should have basic knowledge of core AWS functionality. If you haven't already completed it, we recommend our Fundamentals of AWS learning path. We also recommend completing the other courses, quizzes, and labs in the Solutions Architect–Associate for AWS certification learning path.
This Course Includes
- 11 video lectures
- Detailed overview of the AWS services that enable high availability, cost efficiency, fault tolerance, and scalability
- A focus on designing systems in preparation for the certification exam
What You'll Learn
|Lecture Group||What you'll learn|
|Designing for high availability, fault tolerance and cost efficiency||How to combine AWS services together to create highly available, cost efficient, fault tolerant systems.|
|Designing for business continuity||How to recognize and explain Recovery Time Objective and Recovery Point Objectives, and how to recognize and implement AWS solution designs to meet common RTO/RPO objectives|
|Ten AWS Services That Enable High Availability||Regions and Availability Zones, VPCs, ELB, SQS, EC2, Route53, EIP, CloudWatch, and Auto Scaling|
If you have thoughts or suggestions for this course, please contact Cloud Academy at firstname.lastname@example.org.
High availability, fault tolerance, and cost-efficiency are all fundamental concepts in the AWS design. It's rare we have the opportunity to deliver on all three. So let's ensure we understand each of these and the design decision compromises we may be required to make to design solutions that can deliver on all or some of these three states.
A core design principle inherent to each of these states is elasticity. Elasticity is the ability to scale up and down to meet requirements. You do not have to guess capacity when provisioning a system in AWS. AWS's elastic services enable you to scale services up and down within minutes, improving agility and reducing costs, as you're only paying for what you use.
AWS enables you to test, provision, and go live with solutions faster and more efficiently than with traditional IT infrastructure. Auto Scaling helps you maintain application availability and allows you to scale your capacity up or down automatically, according to conditions you define. You can use Auto Scaling to help ensure that you are running your desired number of Amazon EC2 instances.
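To make "according to conditions you define" more concrete, here is a minimal Python sketch of a threshold-based scaling decision. It is a simplified model for study purposes, not the Auto Scaling API: the function name `desired_capacity`, the CPU thresholds, the group sizes, and the step size are all assumptions chosen for illustration.

```python
def desired_capacity(current, avg_cpu, min_size=2, max_size=10,
                     scale_out_at=70.0, scale_in_at=30.0, step=1):
    """Return the instance count a simple threshold policy would ask for.

    All thresholds and sizes are hypothetical values for illustration.
    """
    if avg_cpu > scale_out_at:
        target = current + step      # add capacity under load
    elif avg_cpu < scale_in_at:
        target = current - step      # remove idle capacity to save cost
    else:
        target = current             # within the comfortable band
    # Never leave the group's configured bounds.
    return max(min_size, min(max_size, target))


if __name__ == "__main__":
    for cpu in (85.0, 50.0, 10.0):
        print(cpu, "->", desired_capacity(current=4, avg_cpu=cpu))
```

The clamp at the end mirrors the idea of a minimum and maximum group size: the minimum protects availability, and the maximum caps cost.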
The operating scale of AWS means you can keep adding resources as required to meet the demands of your users. A highly available environment aims to have minimal service interruption. A fault-tolerant environment aims to have zero service interruption. AWS provides inherent services that make it possible to design for high availability and fault tolerance. However, they need to be implemented correctly to create a highly available fault-tolerant solution.
Amazon Simple Storage Service, Amazon Simple Queue Service, and Elastic Load Balancing have been built with fault tolerance and high availability in mind. Amazon Elastic Compute Cloud and Amazon Elastic Block Store provide specific features, such as Availability Zones, Elastic IP addresses, and snapshots. However, you need to implement these services correctly to create a highly available system.
Exam questions will test your ability to identify the right mix of products to achieve a desired business objective. An AWS-certified system administrator needs to have a thorough understanding of the options, benefits, costs, and constraints when designing a solution on AWS. Now, the AWS global footprint is a real benefit for AWS customers. And of course, in sysops, it's tempting for us to want to leverage this, too, and to make every one of our solutions multi-region so that we can deliver that truly bulletproof solution. And you might find your first highly available fault-tolerant whiteboard design describes a full active-active solution with huge synchronous databases running in multiple regions. Wouldn't that be great, right? However, there are a number of cost and performance factors to consider.
Running synchronous services is generally quite expensive. And latency between regions and inter-region data transfer costs might not create the best outcome for your customer. So you need to be designing for cost optimization as well as performance at every point, to be always checking and re-checking that your design will, in fact, achieve the best outcome for your customer.
Designing for the cloud often means that biggest isn't necessarily best. And decoupling your services and reducing components into loosely coupled units that could run on smaller machines may improve performance and reduce single points of failure. More good news is that you don't necessarily need to run services across multiple regions to achieve a highly available solution. Availability zones are designed to provide fault isolation. So leveraging more than one availability zone within a region can provide very high levels of availability and fault tolerance. You might consider locating services in more than one region, where fault tolerance is the priority and cost is not the primary constraint.
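The benefit of using more than one Availability Zone can be illustrated with some rough arithmetic. The sketch below assumes that failures in each zone are independent and that any one healthy copy can serve traffic, which is a simplification. The 99% single-zone figure is a hypothetical number for illustration, not an AWS SLA, and the function names are assumptions.

```python
def parallel_availability(single, copies):
    """Availability of `copies` redundant units when any one suffices.

    Assumes failures are independent, which is a simplification.
    """
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability):
    """Expected downtime per year (8760 hours) for a given availability."""
    return (1.0 - availability) * 8760


if __name__ == "__main__":
    one_az = 0.99  # hypothetical figure, not an AWS SLA
    for n in (1, 2, 3):
        a = parallel_availability(one_az, n)
        print(f"{n} AZ(s): {a:.6f} availability, "
              f"{downtime_hours_per_year(a):.2f} h/year down")
```

Under these assumptions, each extra zone multiplies the remaining unavailability by the single-zone failure probability, which is why two or three zones in one region are often enough before a multi-region design is considered.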
For many solution designs, you may achieve an acceptable level of fault tolerance at a lower cost using multiple availability zones in one region. One of the key factors is determining just what the acceptable level of fault tolerance is. You need to ascertain if the customer can be without the system or not. If not, then you need to design for the highest durability possible. If an outage to the system is not actually mission-critical, then you need to work out how long the organization could be without the system. And when the system is back up, what is the last acceptable point in time to recover to? Now, that process is generally called business continuity planning.
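The two questions above (how long the organization can be without the system, and the last acceptable point in time to recover to) correspond to the Recovery Time Objective and Recovery Point Objective. A minimal Python sketch of checking a backup plan against them might look like this. The function name `meets_objectives` and all durations are hypothetical, and the model treats worst-case data loss as the backup interval and worst-case downtime as the restore time.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_time, rpo, rto):
    """Check a backup-and-restore plan against RPO and RTO targets.

    Worst-case data loss is taken as the time between backups, and
    worst-case downtime as the time to restore. Both are simplifications.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": restore_time <= rto,
    }


if __name__ == "__main__":
    result = meets_objectives(
        backup_interval=timedelta(hours=24),  # hypothetical nightly snapshot
        restore_time=timedelta(hours=2),      # hypothetical restore duration
        rpo=timedelta(hours=4),
        rto=timedelta(hours=4),
    )
    print(result)
```

In this example the nightly snapshot restores quickly enough for the RTO but loses up to a day of data, so it fails a four-hour RPO. A tighter RPO usually means more frequent backups or replication, which costs more.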
The AWS Well-Architected Framework provides architectural best practices across four pillars for designing reliable, secure, efficient, and cost-effective systems in the cloud. The framework provides a set of questions that allow you to assess an existing or proposed architecture and also a set of AWS best practices for each pillar.
Under security, we talk about the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies. Under reliability, it's the ability of the system to recover from, say, infrastructure or service failures, to dynamically acquire computing resources to meet demand, and to mitigate disruptions such as misconfigurations or transient network issues. For performance efficiency, we're looking for the ability to use computing resources efficiently to meet system requirements and to maintain that efficiency as demand changes and technologies evolve.
For cost optimization, we're really looking at the ability to avoid or eliminate unneeded cost or suboptimal resources. Simply shifting an application to AWS will not necessarily improve durability or availability. It's important to redesign your stack for the cloud at every opportunity so you gain maximum advantage from the scale and services AWS provides. You should look for ways to decouple services, such as using Simple Queue Service to manage requests between layers, or deploying ElastiCache or DynamoDB to store and retrieve session data, which reduces dependency on service state and increases performance. You might also optimize business workflows using Simple Workflow Service, or use Lambda for serverless architectures.
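The decoupling idea can be sketched in a few lines of Python. This is an in-memory illustration only: the standard library `queue.Queue` stands in for a Simple Queue Service queue, and the tier functions `web_tier` and `worker_tier` and the sample order names are hypothetical.

```python
import queue


def web_tier(requests, q):
    """Producer: accept requests and enqueue them without doing the work."""
    for r in requests:
        q.put(r)


def worker_tier(q):
    """Consumer: drain the queue at its own pace and return the results."""
    results = []
    while True:
        try:
            job = q.get_nowait()
        except queue.Empty:
            break
        results.append(job.upper())  # placeholder for real processing
        q.task_done()
    return results


if __name__ == "__main__":
    q = queue.Queue()  # in-memory stand-in for an SQS queue
    web_tier(["order-1", "order-2"], q)
    print(worker_tier(q))
```

Because the two tiers only share the queue, the worker tier can fail, restart, or scale independently while requests wait in the queue, which removes a direct dependency between layers.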
If we take a design-for-failure approach, we might improve the durability and self-healing of our architecture. We might use Route 53 to support load balancers and send traffic to multiple app servers that are using a replicated master-slave database. It may make sense to consider adding Direct Connect to reduce the risk of connectivity failure. And if you are presented with a single point of failure, it's most likely you will need to implement a service to address it. Many systems in organizations may be able to tolerate some level of service interruption.
Defining this acceptance point should be part of agreeing the recovery time objective and recovery point objective with your business stakeholders. Exam questions on these topics will generally provide you with the information you need to decide the level of availability and fault tolerance required. So ensure you read the questions carefully.
About the Author
Head of Content
Andrew is an AWS certified professional who is passionate about helping others learn how to use and gain benefit from AWS technologies. Andrew has worked for AWS and for AWS technology partners Ooyala and Adobe. His favorite Amazon leadership principle is "Customer Obsession" as everything AWS starts with the customer. His passions outside work are cycling and surfing, and having a laugh about the lessons learnt trying to launch two daughters and a few startups.