Exam Preparation: Domain One: Designing Resilient Architectures
In Domain One, we learned how elasticity and scalability help us design cloud services, and how AWS provides the ability to scale up and down to meet demand rather than having to provision systems on estimated usage. That ability increases our agility and reduces our cost, because we only pay for what we use. A stateless application stores no session information, so it has no knowledge of previous interactions. Loose coupling involves breaking systems down into smaller components that act and react independently, so if one component fails, the failure does not cascade to other components in the system.
A system is highly available when it can withstand the failure of one component or many components. If you design with the notion that a component could or will eventually fail, your system is unlikely to fail when that component does fail. Elastic architectures can absorb and support growth in traffic, the number of users, or the size of the data you're storing without a drop in performance. For best elasticity, systems need to be built on top of a scalable architecture, and they need to be able to scale in a linear manner, so adding additional resources results in a proportional increase in the ability to serve and deliver on system load. Any growth in resources should also introduce an economy of scale, with costs reducing as that system scales. Vertical scaling results from an increase in the specification of a specific resource, so adding more memory to a machine, for example. Vertical scaling will often hit a limit, and it's not regarded as the most cost-efficient or highly available way to scale a system. Horizontal scaling results from adding more resources, and it's the preferred way to leverage the elasticity of the AWS Cloud infrastructure.
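To make the idea of linear, horizontal scaling a little more concrete, here is a minimal sketch. It assumes identical instances that each serve a fixed number of requests per second; the function name `instances_needed` and the figure of 500 requests per second are hypothetical and only there for illustration.

```python
import math


def instances_needed(requests_per_second: float, capacity_per_instance: float) -> int:
    """Return how many identical instances are needed to serve the load.

    Assumes perfectly linear scaling: each added instance contributes the
    same capacity. Real systems rarely scale this cleanly.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # Always keep at least one instance running, even with no load.
    return max(1, math.ceil(requests_per_second / capacity_per_instance))


# Hypothetical figure: each instance serves 500 requests per second.
CAPACITY_PER_INSTANCE = 500

if __name__ == "__main__":
    for load in (400, 1000, 2000, 4000):
        print(load, "req/s ->", instances_needed(load, CAPACITY_PER_INSTANCE), "instances")
```

With horizontal scaling, doubling the load simply doubles the instance count, whereas a single vertically scaled machine would eventually hit the largest specification available.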
A key part of the Solutions Architect Associate brief is being able to recognize how you might use AWS services together to create highly available, fault-tolerant, scalable, cost-efficient solutions. So, we ran through the 10 AWS components that can help us design cost-efficient, highly available, fault-tolerant systems when used together. The first, if you remember, were Regions and Availability Zones, which are designed for fault isolation. Having multiple Availability Zones within one region can often provide a high level of durability and high availability without the need to use more than one region. If we do want to extend our customer's footprint to another region, it's also very possible to migrate AMIs, data services, et cetera, from one region to another. Then there's the Virtual Private Cloud, which is that secure section of the AWS cloud. Its CIDR block can be between /16 and /28 in size. The default VPC comes with a subnet in each Availability Zone, an internet gateway, a default route table, a network access control list and a security group. A subnet is a public subnet if the VPC has an internet gateway attached and the subnet's route table has a route to that internet gateway.
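The CIDR range and the public subnet rule can be checked with a short sketch using Python's standard `ipaddress` module. This makes no AWS API calls. The helper names and the shape of the `routes` dictionary (destination CIDR mapped to a target ID) are assumptions made for illustration; the only AWS detail relied on is that internet gateway IDs start with `igw-`.

```python
import ipaddress
from typing import Dict, List


def is_valid_vpc_cidr(cidr: str) -> bool:
    """True if cidr is an IPv4 network with a prefix between /16 and /28."""
    try:
        network = ipaddress.ip_network(cidr, strict=True)
    except ValueError:
        return False
    return network.version == 4 and 16 <= network.prefixlen <= 28


def split_into_subnets(cidr: str, new_prefix: int) -> List[str]:
    """Carve a VPC block into equally sized subnets, e.g. one per AZ."""
    network = ipaddress.ip_network(cidr)
    return [str(subnet) for subnet in network.subnets(new_prefix=new_prefix)]


def is_public_subnet(routes: Dict[str, str]) -> bool:
    """Hypothetical route table check: destination CIDR -> target ID.

    The subnet counts as public when its default route points at an
    internet gateway.
    """
    return routes.get("0.0.0.0/0", "").startswith("igw-")


if __name__ == "__main__":
    print(is_valid_vpc_cidr("10.0.0.0/16"))
    print(split_into_subnets("10.0.0.0/16", 18))
    print(is_public_subnet({"10.0.0.0/16": "local", "0.0.0.0/0": "igw-0abc"}))
```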
Then we looked at the Elastic Load Balancer. It's a managed service which detects the health of instances and routes traffic to the healthy ones. The Elastic Load Balancer adds another layer of availability and security, and as a managed service, ELB can terminate or pass through SSL connections. Then we had Simple Queue Service, which enables us to increase fault tolerance by decoupling layers, reducing dependence on service state and helping us manage communications between services.
And of course, Elastic Compute Cloud, EC2, that on-demand computing service, with instance types available in various flavors. With On-Demand instances, you pay hourly. With Reserved instances, you commit to a one- or three-year term, for example with a partial upfront payment, to reduce the cost of predictable usage patterns. Then we have Scheduled instances, which can be booked for a specific time of the day, week or month. They're ideal when we have patterns of usage that are quite regular, or reports that need to be run on a certain date every month or every year. Spot pricing is marketplace pricing based on supply and demand, where you're basically bidding and paying for unused excess AWS capacity. Often, it's a blend of those that can give you the best price.
Now, remember that placement groups must be in the same Availability Zone and that placement groups do not support micro or medium sized instances. Elastic IP addresses allow us to maintain service levels by swapping resources behind an Elastic IP address. We can have up to five Elastic IP addresses per region by default. If you stop an instance, the Elastic IP address remains associated with the instance.
And then Route 53, that powerful DNS service, where we can manage our top-level domains. It can provide graceful failover to a static site, which could be hosted in S3, in the event of an outage.
It can do active/active or active/passive failover based on Elastic Load Balancer health checks or EC2 health checks, and it can support weighted or geo-targeted traffic distribution.
Okay, so CloudWatch is the eyes and ears of our environment. We have some great monitoring tools: CloudWatch, CloudTrail, and AWS Config.
For CloudWatch, we get basic EC2 monitoring enabled by default. Basic monitoring provides seven metrics at five-minute intervals and three metrics at one-minute intervals. The Elastic Load Balancer reports at one-minute intervals by default. Detailed monitoring enables one-minute intervals on the same metrics, but it comes at a charge, so you have to pay extra to use detailed monitoring. CloudWatch also has an agent, which can send log files to CloudWatch and so provide us with more instance debugging and reporting information. Now, CloudWatch notifies us of a change in state, and the three alarm states are OK, ALARM, and INSUFFICIENT_DATA. If an instance or ELB has just started, it would most likely return an INSUFFICIENT_DATA state.
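The three alarm states can be illustrated with a deliberately simplified sketch. This is not how CloudWatch is implemented, and it does not call the CloudWatch API; the function `alarm_state` and its rule (every one of the most recent periods must breach the threshold) are assumptions chosen to show why a freshly started resource tends to report INSUFFICIENT_DATA.

```python
from typing import Optional, Sequence


def alarm_state(datapoints: Sequence[Optional[float]], threshold: float, periods: int) -> str:
    """Evaluate the most recent `periods` datapoints against a threshold.

    A datapoint of None stands for a period with no data reported.
    Simplified rule: ALARM only if every recent datapoint is above the
    threshold; INSUFFICIENT_DATA if any recent datapoint is missing.
    """
    recent = list(datapoints)[-periods:]
    if len(recent) < periods or any(point is None for point in recent):
        return "INSUFFICIENT_DATA"
    if all(point > threshold for point in recent):
        return "ALARM"
    return "OK"


if __name__ == "__main__":
    # A just-started instance has not reported enough datapoints yet.
    print(alarm_state([], threshold=80.0, periods=2))
    # CPU above 80% for the last two periods.
    print(alarm_state([50.0, 90.0, 95.0], threshold=80.0, periods=2))
```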
Alright, Auto Scaling has three core components: the launch configuration, the Auto Scaling group, and the scaling plan. The launch configuration is your template for what you want your machines to do when Auto Scaling starts them, and you can basically configure that machine to do exactly what you want with your launch configuration. The Auto Scaling group is literally the group of instances that run inside that group, and the scaling plan defines how instances are added to or removed from that Auto Scaling group.
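As a rough sketch of how a scaling plan acts on a group, the snippet below models a group with minimum, maximum and desired sizes, and a simple threshold rule on average CPU. The class, the function `apply_scaling_plan` and the 70% and 30% thresholds are all hypothetical; real Auto Scaling policies are configured through AWS rather than written like this.

```python
from dataclasses import dataclass


@dataclass
class AutoScalingGroup:
    """Toy model of a group: size limits plus the current desired capacity."""
    min_size: int
    max_size: int
    desired_capacity: int


def apply_scaling_plan(
    group: AutoScalingGroup,
    cpu_percent: float,
    scale_out_above: float = 70.0,  # hypothetical threshold
    scale_in_below: float = 30.0,   # hypothetical threshold
    step: int = 1,
) -> int:
    """Add or remove `step` instances based on average CPU, within the limits."""
    if cpu_percent > scale_out_above:
        target = group.desired_capacity + step
    elif cpu_percent < scale_in_below:
        target = group.desired_capacity - step
    else:
        target = group.desired_capacity
    # Never go below min_size or above max_size.
    group.desired_capacity = max(group.min_size, min(group.max_size, target))
    return group.desired_capacity


if __name__ == "__main__":
    group = AutoScalingGroup(min_size=2, max_size=4, desired_capacity=2)
    for cpu in (85.0, 85.0, 85.0, 10.0):
        print(cpu, "->", apply_scaling_plan(group, cpu))
```

The minimum and maximum sizes are what keep the group from scaling in to zero or scaling out without limit, whatever the metric does.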
We saw how the four pillars of the AWS Well-Architected Framework can be a guide for designing with best practices. In security, we design to protect information, systems and assets while delivering business value through risk assessments and mitigation strategies. In reliability, we aim to deliver systems that can recover from infrastructure or service failures and that can dynamically acquire computing resources to meet demand. In performance efficiency, AWS enables us to use computing resources efficiently to meet system requirements and to maintain that efficiency as demand changes and evolves. So, we need to be always looking for better ways to use services together and to look for ways to break monolithic stacks down into smaller, less dependent services. Then, in cost optimization, our goal is to create the best possible outcome for our end customer. We need to avoid or eliminate unneeded cost or sub-optimal resources. Now, that may mean using smaller, more loosely coupled services rather than going straight for the biggest and best available. We need to always be looking for ways to reduce single points of failure and to reduce costs. AWS has a global footprint, but we may not need to use the biggest instances in multiple regions, and it may be that by using multiple Availability Zones within one region and by using a blend of On-Demand and Reserved instances, we can create a highly available, cost-efficient solution.
So, in exam questions, look for clues to help you determine the business requirements and constraints in any of the scenarios you get. Look for the Recovery Time Objective and the Recovery Point Objective. The Recovery Time Objective is the maximum amount of time the customer can be without the system in the event of a disaster. The Recovery Point Objective is the last possible point in time that the business data must be recoverable to. Now, remember that the Recovery Point Objective is generally expressed as a time value as well, so the maximum period of data loss the business can tolerate.
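A quick way to reason about RTO and RPO in a scenario is to compare them with how often data is backed up and how long a restore is expected to take. The sketch below does just that; the function names and the example figures are hypothetical, and it assumes the worst-case data loss equals the full interval between backups.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """If a disaster strikes just before the next backup, we lose one full interval."""
    return backup_interval


def meets_objectives(
    backup_interval: timedelta,
    expected_recovery_time: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> bool:
    """True when both the data-loss window and the recovery time are within target."""
    return worst_case_data_loss(backup_interval) <= rpo and expected_recovery_time <= rto


if __name__ == "__main__":
    daily = timedelta(hours=24)
    restore = timedelta(hours=6)  # hypothetical restore duration
    print(meets_objectives(daily, restore, rpo=timedelta(hours=24), rto=timedelta(hours=8)))
    print(meets_objectives(daily, restore, rpo=timedelta(hours=1), rto=timedelta(hours=8)))
```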
Now, redundancy can either be standby or active. When a resource fails with standby redundancy, the functionality is recovered on another resource using failover. Failover is likely to require some sort of time gap, which means your system may be unavailable during that failover period. With active redundancy, requests are spread across multiple redundant resources, so if one resource fails, the rest will absorb those additional requests and the system will continue. The benefit of active redundancy is that it generally achieves better utilization and it has a smaller blast radius in the event of failure.
There are four design patterns we can deploy in AWS to meet RPO and RTO objectives. The first is backup and restore, which is like using AWS as a virtual tape library. It's generally going to have a relatively high recovery time objective, since we're going to have to bring back archives to restore first, which could take four to eight hours or longer. We're going to have a generally high recovery point objective as well, simply because our point in time will be our last backup, and if, for example, we're using daily backups only, then it could be 24 hours. Cost wise, backup and restore is very low, and it's easy to implement.
The second option is pilot light, and that's where we have a minimal version of our environment running on AWS, which can be lit up and expanded to production size from the pilot light. Our recovery time objective is likely to be lower than backup and restore, as we have some services installed already, and our recovery point objective will be since our last data snapshot.
The third option is warm standby, where we have a scaled-down version of a fully functional environment always running in AWS. That's going to give us a lower recovery time objective than perhaps pilot light, as some services are always running, and it's likely that our recovery point objective will be lower as well, since it will be since our last data write if we're using asynchronous databases with a master/slave multi-AZ database service. The cost of running warm standby is going to be higher than pilot light or backup and restore. The benefit of warm standby is that we can use the environment for dev and test or for skunkworks to offset the cost.
And the fourth option is multi-site, where we have a fully operational version of our environment running in AWS or in another region. That's likely to give us our lowest RTO, simply because it could be a matter of seconds if we're using active/active failover through Route 53. Our recovery point objective, likewise, will be significantly lower than the other options. If we're using synchronous databases, then yes, it could be a matter of seconds. If we're still using asynchronous databases, then we're going to have an RPO of the last data write. The cost and maintenance overhead of running a multi-site environment needs to be factored in and considered. One benefit is that you have a regular environment for testing DR processes.
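To tie the four patterns back to exam-style scenarios, here is a minimal sketch that picks the cheapest pattern able to meet a given RTO and RPO. The patterns are listed from lowest to highest running cost, as described above. The durations attached to each pattern are illustrative assumptions only, not AWS figures, and the names `Pattern`, `PATTERNS` and `cheapest_pattern` are hypothetical.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Pattern(NamedTuple):
    name: str
    typical_rto: timedelta  # illustrative assumption
    typical_rpo: timedelta  # illustrative assumption


# Ordered from lowest to highest running cost.
# All durations below are made-up examples for the sake of the sketch.
PATTERNS = [
    Pattern("backup and restore", timedelta(hours=8), timedelta(hours=24)),
    Pattern("pilot light", timedelta(hours=2), timedelta(hours=4)),
    Pattern("warm standby", timedelta(minutes=30), timedelta(minutes=5)),
    Pattern("multi-site", timedelta(seconds=60), timedelta(seconds=5)),
]


def cheapest_pattern(rto: timedelta, rpo: timedelta) -> Optional[str]:
    """Return the first (cheapest) pattern whose assumed RTO and RPO fit the targets."""
    for pattern in PATTERNS:
        if pattern.typical_rto <= rto and pattern.typical_rpo <= rpo:
            return pattern.name
    return None  # no pattern in this toy table is fast enough


if __name__ == "__main__":
    print(cheapest_pattern(rto=timedelta(hours=12), rpo=timedelta(hours=24)))
    print(cheapest_pattern(rto=timedelta(hours=1), rpo=timedelta(hours=1)))
```

The point of the ordering is the trade-off: each step down the list buys a lower RTO and RPO at a higher running cost, so the scenario's stated objectives decide how far down you need to go.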
Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.