Designing Highly Available, Cost Efficient Cloud Solutions
Developing Cloud Solutions
In this course we will:
Understand the core AWS services, uses, and basic architecture best practices
Identify and recognize cloud architecture considerations, such as fundamental components and effective designs.
Elasticity and Scalability
Regions and AZs
Amazon Elastic Load Balancer
Amazon Simple Queue Service
Amazon Elastic IP Addresses
Amazon Auto Scaling
High availability, fault tolerance, and cost efficiency are all fundamental concepts in AWS design. It's rare that we have the opportunity to deliver on all three, so let's make sure we understand each of them, and the design compromises that we, as developers, may need to make to deliver on some or all of them.
A core design principle behind all three is elasticity: the ability to scale up and down to meet requirements. You do not have to guess capacity when provisioning a system in AWS. AWS's elastic services enable you to scale up and down within minutes, improving agility and reducing costs, because you only pay for what you use. AWS enables you to test, provision, and go live with solutions faster and more efficiently than with traditional IT infrastructure.
Auto Scaling helps you maintain application availability. It allows you to scale your capacity up or down automatically, according to conditions you define. You can use Auto Scaling to help ensure that you are running your desired number of Amazon EC2 instances. The operating scale of AWS means you can keep adding resources as required to meet the demands of your users.
A highly available environment aims to have minimal service interruption. A fault-tolerant environment aims to have zero service interruption. AWS provides services that make it possible to design for high availability and fault tolerance, but they need to be implemented correctly to create a highly available, fault-tolerant solution. Amazon Simple Storage Service, Amazon Simple Queue Service, and Elastic Load Balancing have been built with fault tolerance and high availability in mind. Amazon Elastic Compute Cloud and Amazon Elastic Block Store provide specific features such as Availability Zones, Elastic IP addresses, and snapshots; however, you need to implement these features correctly to create a highly available system.
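To make the idea of scaling "according to conditions you define" a little more concrete, here is a minimal Python sketch of a step-style scaling decision. It does not call any AWS API; the `ScalingPolicy` class, the `desired_capacity` function, and the CPU thresholds are all hypothetical names and values chosen for illustration, and a real Auto Scaling policy has more options than this.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy; field names and thresholds are illustrative only."""
    min_instances: int
    max_instances: int
    scale_out_cpu: float  # add an instance when average CPU % is above this
    scale_in_cpu: float   # remove an instance when average CPU % is below this


def desired_capacity(current: int, avg_cpu: float, policy: ScalingPolicy) -> int:
    """Return the instance count a simple one-step policy would ask for."""
    if avg_cpu > policy.scale_out_cpu:
        target = current + 1
    elif avg_cpu < policy.scale_in_cpu:
        target = current - 1
    else:
        target = current
    # Never go below the minimum or above the maximum the policy allows.
    return max(policy.min_instances, min(policy.max_instances, target))


# Assumed example values: keep between 2 and 6 instances.
policy = ScalingPolicy(min_instances=2, max_instances=6,
                       scale_out_cpu=70.0, scale_in_cpu=30.0)

for current, cpu in [(2, 85.0), (4, 50.0), (2, 10.0), (6, 90.0)]:
    print(current, cpu, "->", desired_capacity(current, cpu, policy))
```

The minimum and maximum bounds are what tie elasticity back to both availability (never fewer than the minimum) and cost (never more than the maximum).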
Exam questions will test your ability to identify the right mix of products to achieve the desired business objective. An AWS Certified Developer needs a thorough understanding of the options, benefits, costs, and constraints when designing a solution on AWS.
The AWS global footprint is a real benefit for AWS customers. As developers, it's tempting to leverage it by making every one of our solutions multi-region, so that we can deliver a truly bulletproof solution. You might find that your first highly available, fault-tolerant whiteboard design describes a full active-active solution with huge synchronous databases running in multiple regions. Wouldn't that be great? However, there are a number of cost and performance factors to consider. Running synchronous services is generally quite expensive, and inter-region latency and data transfer costs might not create the best outcome for your customer. So you need to design for cost optimization as well as performance at every point, always checking and rechecking that your design will in fact achieve the best outcome for your customer.
Designing for the cloud often means that biggest isn't necessarily best. Decoupling your services and breaking components down into loosely coupled units that can run on smaller machines may improve performance and reduce single points of failure.
More good news is that you don't necessarily need to run services across multiple regions to achieve a highly available solution. Availability Zones are designed to provide fault isolation, so using more than one Availability Zone within a region can provide very high levels of availability and fault tolerance. You might consider locating services in more than one region where fault tolerance is the priority and cost is not the primary constraint.
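A rough way to see why more than one Availability Zone helps is to work through the arithmetic. The sketch below assumes that zones fail independently and that the service stays up as long as at least one zone is up; both are simplifying assumptions. The 99% per-zone figure is an illustrative number, not an AWS commitment, and the function names are hypothetical.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Availability of a service that is up while at least one zone is up.

    Assumes zone failures are independent, which is a simplification.
    """
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The service is down only when every zone is down at the same time.
    return 1.0 - (1.0 - zone_availability) ** zones


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours in a 365-day year."""
    return (1.0 - availability) * 365 * 24


# Illustrative per-zone availability, chosen only for the example.
for zones in (1, 2, 3):
    a = combined_availability(0.99, zones)
    print(zones, round(a, 6), round(downtime_hours_per_year(a), 3))
```

Under these assumptions each extra zone cuts expected downtime sharply, which is the intuition behind reaching for multiple zones in one region before paying for a multi-region design.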
For many solution designs, you may achieve an acceptable level of fault tolerance at a lower cost using multiple Availability Zones in one region. One of the key factors is determining just what the acceptable level of fault tolerance is. As a developer, you need to ascertain whether the customer can be without the system or not. If not, then you need to design for the highest durability possible. If an outage is not actually mission critical, then you need to work out how long the organisation could be without the system (the recovery time objective) and, when the system is back up, what the last acceptable point in time to recover to is (the recovery point objective). That process is generally called business continuity planning.
The AWS Well-Architected Framework provides architectural best practices across four pillars for designing reliable, secure, efficient, and cost-effective systems in the cloud. The framework provides a set of questions that allow you to assess an existing or proposed architecture, and also a set of AWS best practices for each pillar. Security is the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies. Reliability is the ability of a system to recover from infrastructure or service failures, to dynamically acquire computing resources to meet demand, and to mitigate disruptions such as misconfigurations or transient network issues. Performance efficiency is the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve. Cost optimization is the ability to avoid or eliminate unneeded cost or suboptimal resources.
Simply shifting an application to AWS will not necessarily or immediately improve durability or availability.
It's important to redesign your stack for the cloud at every opportunity, so you gain maximum advantage from the scale and services AWS provides. You should look for ways to decouple services, using Simple Queue Service to manage requests between layers. Consider deploying ElastiCache or DynamoDB to store and retrieve session data, which reduces dependency on service state and increases performance. Perhaps optimize business workflows using Simple Workflow Service, or use Lambda for serverless architectures.
If we take a design-for-failure approach, we might improve the durability and self-healing of our architecture. We might use Route 53 to support load balancers that send traffic to multiple app servers that are using a replicated master-slave database. It may make sense to consider adding Direct Connect to reduce the risk of connectivity failure. And if you are presented with a single point of failure, it's most likely you will need to implement a service to address it.
Many systems and organisations may be able to tolerate some level of service interruption. Defining this acceptance point should be part of agreeing the recovery time objective (RTO) and recovery point objective (RPO) with your business stakeholders. Exam questions on this topic will generally provide you with the information you need to decide the level of availability and fault tolerance required, so make sure you read the questions carefully.
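The recovery time and recovery point objectives mentioned above can be turned into a simple check against a proposed design. The sketch below is a simplification under stated assumptions: worst-case data loss is treated as equal to the interval between backups, and the restore time is an estimate you supply. `RecoveryObjectives`, `check_design`, and all of the durations are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    """Targets agreed with business stakeholders (illustrative values below)."""
    rto: timedelta  # longest outage the business has agreed to tolerate
    rpo: timedelta  # largest window of data loss the business has agreed to tolerate


def check_design(objectives: RecoveryObjectives,
                 backup_interval: timedelta,
                 estimated_restore_time: timedelta) -> dict:
    """Compare a design's backup interval and restore time with the objectives.

    Simplification: worst-case data loss is assumed to equal the backup interval.
    """
    return {
        "meets_rpo": backup_interval <= objectives.rpo,
        "meets_rto": estimated_restore_time <= objectives.rto,
    }


# Assumed example: the business tolerates 4 hours of outage and 1 hour of data loss.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))

# A nightly backup that restores in 2 hours meets the RTO but not the RPO.
nightly = check_design(objectives,
                       backup_interval=timedelta(hours=24),
                       estimated_restore_time=timedelta(hours=2))
print(nightly)
```

A result like this is the kind of signal that tells you whether a cheaper design is acceptable or whether more frequent backups or replication are needed.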
About the Author
Andrew is an AWS certified professional who is passionate about helping others learn how to use and gain benefit from AWS technologies. Andrew has worked for AWS and for AWS technology partners Ooyala and Adobe. His favorite Amazon leadership principle is "Customer Obsession", as everything AWS starts with the customer. Passions around work are cycling and surfing, and having a laugh about the lessons learnt trying to launch two daughters and a few start-ups.