Concepts and Skills
Practical HA Design
Many businesses host critical infrastructure and technical business assets in the AWS Cloud. Yet, even with so much at stake, many businesses neglect to ensure that their software systems stay online no matter what happens with AWS! In the CloudAcademy Advanced High Availability DevOps Video Course, you will learn the critical technical and business analysis skills required to ensure that customers can always interact with your cloud.
Watch and Learn:
- Why AWS isn't magic, and you should always plan your strategy with failure in mind
- Mental models for classifying business and IT risk in the AWS Cloud
- The "Big Three" model for increasing the availability of software systems in a methodical way
- Four possible ways to handle IT risk, depending on your needs
- Clear action items for surviving various types of AWS outages, even entire region failures!
- How to walk through and design highly automated distributed REST APIs in 30 minutes or less
- Financial risk and cost assessment skills to sell the idea of investing in High Availability to key business stakeholders
- When to stop investing in High Availability due to diminishing returns and business needs
This course is essential for any current or future DevOps practitioner or Advanced AWS Engineer wanting to go beyond pure technical skills, and move to a business value and strategic decision making role.
If you have thoughts or suggestions for this course, please contact Cloud Academy at firstname.lastname@example.org.
Welcome back to the Advanced High Availability course for Amazon Web Services on CloudAcademy.com. So far we've covered an introduction to availability, the basic concepts of availability, and what it really means. We talked about the different kinds of risks that we will see in Amazon Web Services system design. And we talked about advanced techniques that we can use to mitigate different kinds of risks technically at different scopes. In this lecture, we'll apply the things that we learned in the previous three lectures to do a demonstration of how we might plan for high availability of a new system or two.
So without further ado, let's get started. Okay so in this part of the lecture, we'll be talking about different ways that we can plan. So let's walk through a sample scenario where we're trying to build a RESTful JSON API. So I'm just using a text editor to write out some of our requirements so we can understand what we're trying to build.
Okay, so let's look at our set of requirements here, understanding that we are writing a JSON API that communicates with REST principles, fairly straightforward there. We need to be able to stay online through nearly anything, so our business is telling us that we need to stay online even if there's an AWS regional failure. Maybe it's a financial transactions API or something, and it just needs to be available at all times. It should be able to scale nearly infinitely, so when we're looking at something like that, we are making sure that we're not creating a system with any bottlenecks in it. We should be able to auto scale to reduce cost, so something where as many of the pieces as possible can scale during normal operation and can scale up in the case of failure. It shouldn't have any manual steps to recover after failure, so we need a system where, if a region goes down and we have a second region running those transactions while the first is down, the second region can replicate its data back over to the first without any human intervention after the failure. And it should have as low latency as possible. So this sounds like a pretty tall order, but this is regularly something that you might be asked to do as an advanced DevOps practitioner, and it's something that you need to be aware of as you're looking at doing advanced high availability techniques.
So let's look at each of these by themselves. So JSON API that communicates with REST principles, we can do that in just about any scripting language that we want, but we understand that we need at least a database and an API or application layer that has our business logic. So we're going to have at least two layers here. We also need to be able to stay online through regional failures, so we need to be looking at something like a multi-region deployment. This is a guaranteed requirement since we have specifically said that it needs to stay online throughout AWS regional failures. We need to be able to scale nearly infinitely, so we should be creating a system that will scale horizontally very well, as scaling upwards vertically isn't going to work well at all, either for the advanced availability requirements or for the scaling. And we should be able to auto scale to reduce cost, so we should be looking at things that are able to resize themselves at the application and perhaps even database layer, we might find out. And we need to be able to recover after these failures without having any other problems. So if we recall from our previous lectures, these would include techniques like a write buffer queue for a multi-master database. And we did have as low latency as possible, so this interestingly enough, if we are looking at having as low latency as possible, means that we need to be able to read and write to the database from multiple regions. So if we're doing that for multi-region, we need to be looking at a multi-region, multi-master cluster database, or some sort of database that has those same behaviors. So this is a very tall order, but we can do it if we plan through this iteratively.
So let's walk through the multiple layers at which we might be able to do this and see if we can leverage AWS-specific services to reach these goals without too much development cost. So first, let's think about a JSON API that communicates with REST principles. My first thought here is that I want to use DynamoDB rather than RDS or a SQLite database, simply because DynamoDB already natively communicates in REST. It's a very easy database to configure and manage, and if I look further down, I am actually able to scale DynamoDB nearly infinitely since Amazon handles that for us. And while DynamoDB does not, as of the recording of this video, natively support auto-scaling, we can actually create systems that will do the auto-scaling for us with a little bit of extra logic. So for this set of requirements, I'm going to think DynamoDB immediately.
So if we're looking at this second requirement here on line four, we need to be able to stay online through nearly everything, including regional failures, so we'll need to somehow create a system that will be able to stay online. If we're using DynamoDB, we should be able to go multi-master, multi-region to meet the rest of the requirements, including the low latency. That's why we need to be multi-master rather than just active/passive with a failover, because active/passive won't give us low latency in the second region; it's only used for disaster recovery. So let's take a look at how we might build something like this.
First I'm going to look at doing this for a simple system, asking what this would look like if we were doing a naïve version that didn't try to stay online at all. So, naïve. I'm using a tool called Lucidchart; it's just a flow-charting tool. You could do this with pen and paper, but of course I need to screen share with you all. So let's pull up the icon for DynamoDB. If I was doing a very naïve system, I would probably just do something like this, where I have Route 53 pointing to an instance, which points to DynamoDB, and I call it a day for my reads and writes. This might be in a single region, and of course we need to go into our DynamoDB. And since we're doing a naïve system, we would only build this out in one region, which would look something like this: just a single AWS cloud, or region.
Now the first thing I should be looking at if I come across a system like this, given these requirements, is where do we fail here? Do we meet number one? Yes, we could create this with any scripting language we wanted on the instance. And DynamoDB is very good at handling JSON, so this would actually work for a system that needed to be very simple. But we don't meet number four at all. We don't meet number five at all because we're not scaling this compute tier here; we don't have any auto-scaling. We do have manual steps to recover after failure because we will drop transactions, and users will not have been able to communicate with our service during the downtime. And we will have high latency anywhere that's not close to the primary data center, wherever the single region is. So this isn't good. We need to figure out how to make this better.
So let's think about how to make this auto-scale first. Primarily here, we're looking at being able to scale infinitely and auto-scale. So what would that look like? If we're still looking at a single region, we can meet a couple more of our requirements by creating an auto-scaling group that triggers on load. Route 53 is already good at this, but we need to add some extra componentry in between Dynamo and Route 53. Rather than the single EC2 instance we had before, we need to add an auto-scaling group and an elastic load balancer. So we're still getting traffic from the internet, and we're still in a single region, but now we've added a little bit of extra complexity: Route 53, rather than routing to an EC2 instance, will be routing to a load balancer, which in turn will hit one of the instances in the auto-scaling group, controlled by auto-scaling logic.
So this meets an extra requirement here, where now I'm actually meeting the ability to scale nearly infinitely. I still don't have auto-scaling for my DynamoDB, since the service doesn't support it natively. However, I can scale it manually by going and increasing or decreasing the amount of provisioned read/write throughput. So I've now met a couple of extra requirements. I've pretty much met this one, since I can scale nearly infinitely with an elastic load balancer, a set of instances in an auto-scaling group, and a DynamoDB table. I may need to ask for extra provisioning overhead from Amazon, since I will cap out at 10,000 reads and writes on a single table in a single region. That's just an Amazon Web Services soft cap, but you can request limit increases if your use case actually requires it. So this is pretty good. This is how most people deploy their systems. However, it's missing the sophistication that helps us meet all of our other requirements.
So let's add another iteration of how we might think about designing for these requirements with high availability: auto-scaling with the DB. Another reason that we might want auto-scaling on the database, beyond just cost mitigation, is that it helps us with our disaster recovery. For instance, if I'm in a scenario where I need to switch over to the secondary region, or shift all traffic from a dual-region deployment into a single region, I'll need even the database layer to automatically scale itself to handle the increased load that was previously distributed across two installations in two different regions. So how might I do something like this? Well, as we all know, we have Amazon Lambda as a new entrant to the ecosystem. I have a custom icon here, so bear with me while I resize it. I may need to have something like an Amazon Lambda operating off of something like a CloudWatch alarm. So if DynamoDB throughput exceeds some alarm value, we can trigger a Lambda which scales the DB. So we might have our monitoring metrics feed into something like CloudWatch, or whatever monitoring system of our choice; CloudWatch is a nice hosted one. That sends an event to an Amazon Lambda, which then in turn scales the DB. So this would actually give us database auto-scaling within a single region. We don't have our high availability requirement yet, but we've set up the groundwork for it since our database can now scale, and if we have a failover scenario, we'll need the database we fail over to to scale quickly. But now we have auto-scaling on every layer of the stack, which is really, really good. So let's move on and realize that we've effectively gotten these two. And of course, we had that one from the beginning.
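To make that CloudWatch-to-Lambda feedback loop concrete, here's a minimal sketch of what the scaling Lambda might look like. The table name, the thresholds, and the way the consumed-capacity number reaches the handler are all illustrative assumptions, not a prescribed implementation:

```python
# Hypothetical scaling thresholds -- tune these for your own workload.
SCALE_UP_FACTOR = 2.0
SCALE_DOWN_FACTOR = 0.5
MAX_WRITE_UNITS = 10000  # the regional soft cap mentioned in the lecture
MIN_WRITE_UNITS = 5

def next_capacity(current_units, consumed_units):
    """Decide a new provisioned write capacity from recent consumption.

    Scale up when we're using more than 80% of what's provisioned,
    scale down when we're under 20%, otherwise leave it alone.
    """
    utilization = consumed_units / float(current_units)
    if utilization > 0.8:
        proposed = int(current_units * SCALE_UP_FACTOR)
    elif utilization < 0.2:
        proposed = int(current_units * SCALE_DOWN_FACTOR)
    else:
        return current_units
    return max(MIN_WRITE_UNITS, min(MAX_WRITE_UNITS, proposed))

def handler(event, context):
    """Triggered by a CloudWatch alarm on ConsumedWriteCapacityUnits."""
    import boto3  # imported lazily so the pure logic is testable offline
    dynamodb = boto3.client("dynamodb")
    table = dynamodb.describe_table(TableName="my-api-table")["Table"]
    current = table["ProvisionedThroughput"]["WriteCapacityUnits"]
    # Assumption: the alarm wiring passes the consumed units in the event.
    consumed = float(event["consumed_write_units"])
    target = next_capacity(current, consumed)
    if target != current:
        dynamodb.update_table(
            TableName="my-api-table",
            ProvisionedThroughput={
                "ReadCapacityUnits": table["ProvisionedThroughput"]["ReadCapacityUnits"],
                "WriteCapacityUnits": target,
            },
        )
    return {"previous": current, "target": target}
```

The scaling decision is kept in a pure function so it can be tuned and tested without touching AWS; the handler just wires it to `describe_table` and `update_table`.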
So now we need multi-regionality, we need no manual steps to recover, and we need low latency. Let's start with multi-regionality. If I'm looking at this system and trying to think of how I might design it to be a multi-region one, the first instinct that I should have is to look at this region as a single unit, realize that I can draw on the techniques that we talked about in the previous lectures, use latency-based or failover DNS in Route 53, and place Route 53 as a global resource outside of our single region, since Route 53 operates worldwide as a DNS service must. We might get internet traffic from two different regions, and we can use latency and failover DNS to achieve something naïve like this, where we have a single master zone and the second system is simply used as a backup while we are operating normally.
So if you look at this, all I've done is replicate the same behavior that we had before across regions. What I'm missing, though, is some sort of synchronization tool working to keep the databases in the same state, so let's design that in. One way that I might do that is by using a stream of events: we can have items come across into a stream, because DynamoDB now supports streams, off of our master here, and load things in there. Our stream will then be read by a Lambda function, which can perform cross-region writes directly into the other region's database table.
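A minimal sketch of that stream-reading Lambda might look like the following. The remote region, table name, and the injected write callables are illustrative assumptions; the record shapes follow the DynamoDB Streams event format:

```python
def replicate_record(record, put_item, delete_item):
    """Replay one DynamoDB stream record against the remote region's table.

    `put_item` and `delete_item` are callables wrapping the remote table's
    write operations (e.g. boto3 client methods), injected so the routing
    logic stays testable without AWS credentials.
    """
    event_name = record["eventName"]  # INSERT | MODIFY | REMOVE
    if event_name in ("INSERT", "MODIFY"):
        put_item(record["dynamodb"]["NewImage"])
    elif event_name == "REMOVE":
        delete_item(record["dynamodb"]["Keys"])

def handler(event, context):
    """Lambda entry point: fan each stream record out to the other region."""
    import boto3  # assumption: the remote region and table name are illustrative
    remote = boto3.client("dynamodb", region_name="us-west-2")
    for record in event["Records"]:
        replicate_record(
            record,
            put_item=lambda image: remote.put_item(TableName="my-api-table", Item=image),
            delete_item=lambda keys: remote.delete_item(TableName="my-api-table", Key=keys),
        )
```

Inserts and modifies replay the item's new image; deletes replay only the keys, mirroring what the stream gives us for each event type.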
So something like this would actually work for our multi-region system. If we can imagine, we have this going on during normal functionality: we have an elastic load balancer, we hit our instances, we have auto-scaling groups, the database itself will scale up and down depending on its usage, and we're actually streaming changes into the other DynamoDB table. What this gives us is that when we have a failure, we switch to this dotted line, which would be our failover traffic. We can fail the DNS over to the other ELB, and these will scale up. So this is actually a pretty good system, in that it will stay online relatively well in the event of a full region failure. We might see about 5, 10, or 15 minutes of lag though, as my AMIs are copied down into the instances and the group expands beyond the single instance, which is our cold standby, and as our DynamoDB table goes from however many write units it needed while we were streaming changes into it, to a full-fledged read and write database that is servicing all traffic. So that's good, and we like the single failover. However, while we have met this one, we still have not met these two other requirements.
So to build a really, really good high availability system that's scalable, we need to also add this last requirement here first, which is low latency. So this would be: use a multi-master setup. We want multi-master, multi-region because, if we notice, in normal operations for this, air quotes, "simple" multi-region deployment, all reads and writes are going to the same data center. If this one is based in Virginia but we have people reading and writing in Singapore, for instance, somewhere far away on the other side of the world, we may need to also service requests from this secondary data center using multi-master. So let's see what that looks like. Rather than doing this failover, we get rid of it and change the traffic model so that we're servicing the closer requests and doing active/active. Now we can actually do something extremely similar to what we had before, by realizing that we can take the entire model and make the replication step here a first-class citizen. So rather than doing this unilaterally, we can do it bilaterally and perform the same operations in two directions.
So now all we've done here is change the system so that we've got these streaming Lambda writes going in both directions as we're doing reads and writes out of the database. So the key here is: use a write conflict mechanism. If I perform a write to the same resource from two different regions while we've got this asynchronous replication behavior going on in both directions, then the two databases could try to overwrite each other's writes. We just need to make sure that we're using the highest priority writer, which we will assume is the last write to modify a specific object. So if I modify an object at a specific time, then that timestamp will appear on the object, and when I'm doing these asynchronous replications, I can decline the write from the other region if I see a more recent stamp. That way the behavior is consistent, and we are certain that the two sides of this brain converge over time.
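The last-write-wins rule described above can be sketched as a tiny decision function. The `updated_at` attribute is an assumed application-level convention here (DynamoDB doesn't stamp items for you):

```python
def should_apply_replicated_write(local_item, incoming_item):
    """Last-writer-wins conflict resolution for cross-region replication.

    Accept the incoming replicated write only if its timestamp is newer
    than what we already have locally. `updated_at` is an illustrative
    application-maintained attribute, stamped at modification time.
    """
    if local_item is None:
        return True  # nothing local to conflict with
    return incoming_item["updated_at"] > local_item["updated_at"]
```

Note that on a timestamp tie this sketch declines the replicated write and keeps the local copy; a real system might add something like a region ID as a deterministic tie-breaker.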
So this actually meets our low latency requirement, because now, with multiple regions, we can service the closer requests using the closest data center. We've got auto-scaling at the API layer, and we've got failover at the API layer because the ELB will do failover for you. And DynamoDB actually already has built-in failover, with its own within-region replication across availability zones for high availability. So we've actually met all of our requirements except for this last one, which is dealing with that nasty edge case, or corner case, where we have manual steps to recover.
So imagine we black this data center out for downtime. The other system will be humming along and this DynamoDB instance will be fine. But note that this Lambda function, or whatever replication strategy we're using (you could use an EC2 instance with a script on it as well), will have some backed-up state, and you'll need to keep track of the changes that occurred while the second data center was offline. So we'll need to use this same write conflict mechanism to replay the backed-up writes from the region that was still online during the failure, since we'll have piled up a lot of events in this stream that need to be propagated out to the other region when it comes back online.
As you can guess, the technique that we'll be using to prevent that nasty read/write desynchronization problem is, rather than going DynamoDB to stream to Lambda and writing directly, to buffer through a queue. It's a markedly more complicated system than the original naïve version, but it will actually meet all of our requirements. Because instead of doing direct writes from a Lambda, or trying to directly emit or synchronize events from one site to the other, we're buffering into a queue, so that if there are any issues and we are unable to emit data into the other region at any point, we have a queue which will store all of the writes that we attempted, to be de-queued later by a periodically executing Lambda function or some other mechanism. The queue itself doesn't actually need to be SQS here; however, that's a good service for it. You could actually use another DynamoDB table as the temp table for all the objects that need to exist during the replication phase. But the idea here is the same: we have the ability to do auto-scaling, and we have the ability to do all of these things.
Now we can actually add a little bit more high availability if, in addition to servicing these requests, we also put Amazon CloudFront in front of each region, CloudFront being another global resource. We can think of it as mounting Route 53 on top of a correctly configured and dynamically cached CloudFront application. Then in the event of downtime, all of my reads will be served really, really quickly anyway, and scalably. So if we have user uploads, then we can be serving the uploads out of multiple buckets. Right now, Amazon only supports one-directional cross-region replication for buckets; however, we can also implement Lambda logic if we want to, which will allow us to do bi-directional regional replication.
So if you can imagine, now we have a system that will survive pretty much any outage, because all of the components inside of each region are availability zone failover safe and instance failover safe, since DynamoDB manages all of that for you already, and an elastic load balancer paired with a correctly configured auto-scaling group can handle single instance, or even availability zone, failures if cross-zone load balancing is enabled. So that's great. We're extending that notion of a load balancer and instances into a similar one where Route 53 is analogous to the elastic load balancer, and each of the regions actually supporting and servicing requests is analogous to the set of instances.
So if we imagine I delete an entire side, an entire data center, then of course we have a fully operational second region in which we can execute the request. It's important to note that with this level of complexity that needs to stay in sync, both from an infrastructure and from a code level, we should be using CloudFormation to deploy each of these regions. This will give us consistency that we wouldn't otherwise get if a human was handcrafting these entire regions. CloudFormation lends itself extremely well to doing this kind of thing, because you can actually create latency records that join a group of records in Route 53 as one of your resources when creating a CloudFormation template.
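As a rough illustration of that latency-record idea, a CloudFormation resource for one region's record might look something like the fragment below. The hosted zone, record names, and the `ApiLoadBalancer` resource are assumptions, not part of any real template from the course:

```yaml
# Illustrative fragment only: one latency record per deployed region,
# each pointing at that region's load balancer.
ApiLatencyRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneName: example.com.
    Name: api.example.com.
    Type: A
    SetIdentifier: us-east-1-api      # must be unique per regional record
    Region: us-east-1                 # the latency-based routing key
    AliasTarget:
      HostedZoneId: !GetAtt ApiLoadBalancer.CanonicalHostedZoneNameID
      DNSName: !GetAtt ApiLoadBalancer.DNSName
```

Deploying the same template in each region with a different `Region` and `SetIdentifier` gives Route 53 the full latency group automatically, which is exactly the consistency benefit being described.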
So if we look back at our requirements, we've actually met all of them now, since we no longer have any manual steps to recover; we're now operating over a queue-based cross-region write. So let's take a look at the progression of our complexity here. We started with the naïve implementation, which is how a lot of first-time-in-the-cloud companies will design their systems, not realizing that Amazon can and does go down, and that there are sometimes problems with instances. We had created a lot of single points of failure there, and we didn't meet any of our requirements around latency and the other non-availability concerns.
So to move towards our availability and latency requirements, we created an auto-scaling group with instances in it that is addressed by a load balancer, so that Route 53 points at the load balancer instead. We then added auto-scaling to DynamoDB, even though it's not natively supported, by using a feedback loop: CloudWatch monitoring metrics on the table usage trigger a Lambda based on some sort of logic, which then scales the database up or down whenever appropriate. Then we decided that that wasn't enough from a high availability standpoint, and that we needed to go multi-region, since that was the requirement that was handed to us from the business. We did so by creating an active/passive system where the instances and DynamoDB in the second region are scaled to almost zero while we're doing data synchronization, just hoping and praying that this left-hand, primary, single-master data center doesn't go down. However, the failover from active to passive takes a long time, we don't get the latency benefits of having requests serviced from a secondary data center, and we're still incurring significant cost, both from a development and time standpoint and from a dollar standpoint. That doesn't make much sense; we want to leverage that side of the universe in a more positive way. So we looked to a multi-region deployment that is also multi-master. In addition to the primary or single-master side, we actually service the closer requests using the secondary data center, creating a master/master database setup by using two DynamoDB tables, leveraging the change stream that comes out of each DynamoDB table that we can subscribe to, and having a Lambda that is subscribed to the stream replicate into the other region.
However, this still doesn't meet our requirement of low-maintenance high availability, where we automatically recover from failure. So we have to add a queue and some extra caching around our storage, and then we should also add CloudFormation to make sure that we have consistency between our environments. The key concept here for data synchronization is that we have a stream, a Lambda enqueuing the values somehow, and then another Lambda reading off of the queue and trying to replicate the change out to the other region, as well as the feedback loop for failed replication writes across regions.
So this is an example of how we might apply our DevOps knowledge: walk through with concrete reasoning based on the different requirements, and understand how to author an architecture diagram or plan for creating such a system.
So this ends our planning demonstration. Hopefully you've learned a thing or two about how to plan for high availability when working with Amazon Web Services, and applied the knowledge that we talked about in our previous three lectures.
Next we'll be doing a costing demo where we assess what the cost of failure is and what the cost of continuing to pursue high availability is in an existing system.
About the Author
Nothing gets me more excited than the AWS Cloud platform! Teaching cloud skills has become a passion of mine. I have been a software and AWS cloud consultant for several years. I hold all 5 possible AWS Certifications: Developer Associate, SysOps Administrator Associate, Solutions Architect Associate, Solutions Architect Professional, and DevOps Engineer Professional. I live in Austin, Texas, USA, and work as development lead at my consulting firm, Tuple Labs.