Concepts and Skills
Practical HA Design
Many businesses host critical infrastructure and technical business assets in the AWS Cloud. Yet, even with so much at stake in the AWS Cloud, many businesses neglect to ensure that their software systems stay online no matter what happens with AWS! In the CloudAcademy Advanced High Availability DevOps Video Course, you will learn critical technical and business analysis skills required to ensure customers can always interact with your cloud.
Watch and Learn:
- Why AWS isn't magic, and you should always plan your strategy with failure in mind
- Mental models for classifying business and IT risk in the AWS Cloud
- The "Big Three" model for increasing the availability of software systems in a methodical way
- Four possible ways to handle IT risk, depending on your needs
- Clear action items for surviving various types of AWS outages, even entire region failures!
- How to walk through and design highly automated distributed REST APIs in 30 minutes or less
- Financial risk and cost assessment skills to sell the idea of investing in High Availability to key business stakeholders
- When to stop investing in High Availability due to diminishing returns and business needs
This course is essential for any current or future DevOps practitioner or Advanced AWS Engineer wanting to go beyond pure technical skills and move to a business value and strategic decision making role.
If you have thoughts or suggestions for this course, please contact Cloud Academy at email@example.com.
Hello and welcome back to CloudAcademy's Advanced High Availability course for Amazon Web Services. In today's lecture, we'll be talking about advanced techniques that we can use to engineer high availability systems, including the big three focus areas. These are the big three criteria for how to make a system highly available and reduce single points of failure, etc. We'll also talk about chaos games, which are also called war games in other industries. These are ways that we can simulate failures, or actually run real failures, in our staging or production environments to test the efficacy of our systems. We'll also talk about de-risking our people. So in addition to our actual programming or infrastructure design, we have DevOps processes that we need to de-risk because human beings can be sources of risk as well. We'll also talk about deployment risks. Deployment risks are interesting in that they are very common because this is usually where lots of change is happening in our system and we're very hands-on. So we need to talk about ways that we can mitigate the risk of something bad happening as we launch new code or new infrastructure. We'll also talk about server risk. This is risk associated with running EC2 instances that occurs at the server or instance level, including disk, CPU, network, etc. We'll talk about zone risk or availability zone risk. These are risks that occur at the availability zone level. We'll also need to go into depth on how to mitigate the chance that an availability zone failure affects the overall availability of your business's IT infrastructure. And finally, we'll talk about the biggest disaster of them all in Amazon, which is when a region goes out. It's actually pretty uncommon, but it's happened before, so we need to talk about it.
Okay, so the big three. These are the three areas that we need for system engineering on high availability systems. Firstly, we need to eliminate single points of failure. That should be pretty straightforward, and if you've done any kind of high availability yet on Amazon, you'll be familiar with this. It's the technique that you use when you say, add a second server behind a load balancer, or particularly within a second availability zone. But there's an implicit part there that we need to think about, which is knowing when to crossover or failover. So when a server, zone, or region goes down, how do we know to shift traffic to the stand-by or the second master, or whatever solution that we pick? And finally, we need to detect failures quickly. That is, even if we have the system fail itself over during a disaster scenario, and our end users don't perceive anything, we want our operations team or our DevOps process to know about failure so we can do a post disaster assessment, and eventually repair the system to be ready and back to the first initial state before we did any failovers. So again, eliminate single points of failure. Failover across those points of failure that are now multiple reliably, and detect said failures quickly, so you can close the loop and improve your process in an ongoing manner.
Okay, so real quickly let's talk about chaos games. The best way to test availability techniques is by forcing failures. Seems a little bit obvious, but this is taken to the extreme by Netflix's Chaos Monkey. I chose this as an example because it's been very popular. It randomly kills EC2 instances so you can always be sure that your process is working. So you would deploy something like this into staging, ensure that it works, and then deploy it into production, so it will keep you on your toes, right? You'll know that you're engineering things for high availability if the entire time that you've been running things in staging, and later production, you've been deleting instances and you've designed your application such that you can handle those failures. Then the likelihood that you are taken down by some freak accident or random event is much, much lower because you have prepared your infrastructure for those events. So I highly suggest doing war games or chaos games where you intentionally simulate failures and keep everybody on their toes.
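To make the chaos game idea concrete, here is a minimal sketch of one round: pick a random instance and hand it to a termination function. This is not Chaos Monkey's actual implementation. The function name, the `terminate` callback, and the `min_survivors` guard are all assumptions made for illustration; in a real setup the callback would wrap whatever call your tooling uses to terminate an EC2 instance.

```python
import random
from typing import Callable, Optional, Sequence


def run_chaos_round(
    instance_ids: Sequence[str],
    terminate: Callable[[str], None],
    rng: Optional[random.Random] = None,
    min_survivors: int = 1,
) -> Optional[str]:
    """Terminate one randomly chosen instance and return its ID.

    Returns None (and terminates nothing) when killing an instance would
    leave fewer than `min_survivors` instances running.
    """
    rng = rng or random.Random()
    # Hypothetical safety guard: never kill the last remaining instance(s).
    if len(instance_ids) <= max(min_survivors, 0):
        return None
    victim = rng.choice(list(instance_ids))
    terminate(victim)  # Assumed callback; would call your cloud tooling.
    return victim


if __name__ == "__main__":
    killed = []
    run_chaos_round(["i-aaa", "i-bbb", "i-ccc"], killed.append)
    print("terminated:", killed)
```

Running this on a schedule in staging first, and only later in production, mirrors the order suggested in the lecture.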
So let's get into how we'll actually de-risk these things. So the hard truth is that people fail too. So we need to talk about de-risking our people, or our human capital first. These are your key players on your IT team. They can be the biggest sources of risk, even bigger than technical issues. Your team members could quit, die, or get sick. Now this is pretty morbid, but it's important for business to understand that they could quit, die, or get sick, right? So ex-employees could also be upset and do damage. So we already talked about some of this risk in one of the previous lectures, but we need to realize that this is a very, very real consequence in that we need to view this from the lens of not only HR, but also IT, and implement processes that'll prevent these things from happening. Someone could sleep through an alarm. Markedly less tragic or malicious, but still nevertheless, this is pretty critical if an alarm goes off and no one's awake. Of course, people make mistakes in the console. So the last two ones here are neither malicious nor tragic, but they're pretty common and you can see how these kind of issues would happen if somebody's asleep at the wheel, so to say.
So how are we going to de-risk our people? Well, we're going to remove our single human points of failure, just like we learned in those big three. Two plus people should have access to the root account under multi-factor authentication. This is so critical. So many people that I talk to when I'm consulting forget that if they have a single person with the root token, or the root MFA, or the root password, that it's very difficult potentially to recover those Amazon Web Services accounts. It would be a shame if somebody was unavailable with the root account and they needed to, say, upgrade a support plan on short notice. It would be really, really not good if somebody was sick, or left the company, or died, that had the only MFA token for a major production infrastructure deployment.
Businesses also need to actively monitor their IAM credentials and SSH access. Specifically, SSH private keys and key pairs. You need to remove old users. This is incredibly critical. So many businesses that I work with forget to remove their old users. I still have administrator access keys and secret access keys for some of my clients that I haven't worked with for over a year. Fortunately, I'm on good terms with them, and they have never had any problems with me, but you can imagine that if you are not following this process and you have a less than graceful departure with somebody, this is a huge security risk, which of course could affect the availability of your systems. Have an on-call rotation if possible, if you have enough people. Or pay a team to do 24/7 disaster recovery. Barring this, you should at least have somebody sleeping with the phone, if you're in a super early stage, seed level startup where you might only have one person that understands how Amazon works. But realize that if that person doesn't wake up, or if it's the weekend and they've had a couple drinks, you might be high and dry if anything happens. There may be no way for you to recover if you only have a single person and they're out of commission for whatever reason.
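For the credential monitoring step, a small sketch of the kind of check you could run on a schedule: flag access keys that have never been used or have been idle too long. The `AccessKey` record and the 90-day threshold are assumptions for illustration; in practice the records would be filled from your own inventory of IAM users and keys.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class AccessKey:
    """Hypothetical record describing one access key."""
    user: str
    key_id: str
    last_used: Optional[datetime]  # None means the key was never used.


def stale_keys(
    keys: List[AccessKey], now: datetime, max_idle_days: int = 90
) -> List[AccessKey]:
    """Return keys that were never used or not used within max_idle_days."""
    cutoff = now - timedelta(days=max_idle_days)
    return [k for k in keys if k.last_used is None or k.last_used < cutoff]


if __name__ == "__main__":
    now = datetime(2016, 6, 1)
    inventory = [
        AccessKey("alice", "key-1", datetime(2016, 5, 20)),
        AccessKey("old-consultant", "key-2", datetime(2015, 3, 1)),
        AccessKey("bob", "key-3", None),
    ]
    for key in stale_keys(inventory, now):
        print(f"review or remove: {key.user} ({key.key_id})")
```

A report like this only surfaces candidates. Someone still has to own the decision to remove the user.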
You should also automate everything possible and remove any manual step. So this is so important for two reasons. One, the need for an on-call rotation or a paid team can be mitigated to a certain extent if you have, say, one very skilled DevOps engineer that understands how to do heavy duty Amazon automation. If they do it correctly, then developers that don't have so much experience with Amazon can just press a button to run the script and automate repairs, disaster recovery, redeployments, etc. So that's one. It reduces the risk of that single person that might not be available, and it documents everything. And number two is that humans make mistakes. Automated steps are repeatable. They don't really make mistakes unless there's a bug, but you can test software. You can't test if somebody's groggy and they're running manual steps, and they forget a piece, etc. Auto deploys are much more secure because they're less likely to have transient issues than a human who may be tired, or whatever other situation might crop up that causes them to do a poor deploy.
Of course, document all processes. That is, if somebody decides to leave the company suddenly, you may not be able to retain their knowledge even for their two week notice. How will you know how to redeploy software if you are not documenting things, and you had somebody just write a one-off script? You have to go and reverse engineer the way the things are deployed, and spend a bunch of time rebuilding all of that intellectual capital. So make sure to document all processes, including deployment, recovery, anything that goes into the Amazon Web Services technical infrastructure.
Okay, so we also have deployment risk that we need to talk about mitigating. The majority of outages are due to deploys. So if you go and look up any kind of statistic on deployment and outage errors, I've seen anything from 60 to 90% of outages are due to deploys where somebody either misconfigured something and relaunched a new set of environment variables, or bad code was deployed. That's why we have this epic fail here, because we end up focusing on AWS outages for high availability when in reality, the majority of high availability risk comes from not Amazon Web Services outages and technical problems, but from deploys and improperly managed processes.
Okay, so how do we mitigate deployment risk? Well, deploys should be handled with care, of course. But one of the ways that we can do this is using immutable infrastructures and deploys. So this means rather than modifying infrastructure in place either at the server or maybe the entire stack or stack layer level, you should be using immutable infrastructures and deploys. So if you go and take the deployment course, you'll learn more about immutable infrastructure. But essentially, this means don't modify things in place, just redeploy new pieces of software so it's clean every time.
Remove 100% of manual steps, except for hitting start. So this one may seem obvious to some of you, but it also, beyond just making life easier and saving time, this is also a tremendous risk mitigation step because we are removing the possibility that we do a step that wasn't there before, or we miss a step that should be there whenever we do a manual deploy. So this is a very, very common mistake that is made, and the way to mitigate human errors during deployment is by automating 100% of the steps, except for hitting start of course.
Then of course, you need to select DRIs, or directly responsible individuals. So a directly responsible individual is somebody whose butt is on the line whenever a deployment happens. So if you can imagine in a company where there's shared responsibility for a deploy, there may or may not be a lot of finger pointing or, "Oh well, that's how it went," etc. By appointing directly responsible individuals, you cause a single person to explicitly acknowledge that they're assuming responsibility for a deploy, which makes people very, very motivated to make sure that the deploys go well. Once you have heavy, heavy automation and directly responsible individuals, you can eliminate a lot of risk because you're double and triple checking everything as well as removing as many manual steps as possible. So there are fewer places for things to go wrong, and those fewer places are more intently watched by these DRIs.
Try to deploy during off-hours with two or more engineers. So the first part of the sentence, deploying off-hours, should be pretty straightforward. If you are an e-commerce retail site that gets a lot of peak traffic in the afternoon, you probably shouldn't be running a risky deploy during the afternoon. Try off-hours, like 3:00 or 4:00 in the morning. If you're a very small company, chalk it up to startup craziness. If you're a larger company, just pay somebody overtime to go and do a deploy if it's important. Or if you're automated enough that you can be deploying in the dark, that's fine. If your deployments are not fully automated, I highly recommend using two engineers at a time. This means that we get consensus. It's effectively the pair programming of the DevOps world, where it's much harder for two people to miss a mistake than it is for one person to miss a mistake. Again, document all processes and configurations, and so on and so forth. Now this means not only this is how you launch the new code, but also this is how we constructed our cloud from the very beginning. This is where all of our Amazon Web Services account credentials are, this is who's allowed to access Amazon, this is who should carry the root MFA token, etc. So go way, way beyond the processes and the scripts that happen on the Amazon Cloud, and include business maintenance processes and so forth.
And finally, deploy to staging or small groups first, and then promote to production. So this includes blue/green deployments, this includes Canary and rolling deployments, so on and so forth. By creating a smaller environment, that's lower risk for you to actually test code, there is a huge reduction in risk that would otherwise be assumed during a launch to production. You can not only test the code itself, but also test the deployment script by using the same deployment scripts to push to staging and production. You get good parity and you can be reasonably confident that deployments will work.
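One way to picture the "small group first, then promote" step is as a gate that looks at how the canary group behaved before letting the release continue. The sketch below is only an illustration. The function name, the 1% error threshold, and the minimum of 100 requests are assumptions, not values from the course.

```python
def should_promote(
    canary_errors: int,
    canary_requests: int,
    max_error_rate: float = 0.01,
    min_requests: int = 100,
) -> bool:
    """Decide whether a canary release may be promoted to production.

    Promotion is refused until the canary has served enough traffic to be
    meaningful, and refused if its error rate exceeds the threshold.
    """
    # Too little traffic tells us nothing; also avoids dividing by zero.
    if canary_requests < max(min_requests, 1):
        return False
    return canary_errors / canary_requests <= max_error_rate


if __name__ == "__main__":
    print(should_promote(canary_errors=1, canary_requests=200))  # healthy canary
    print(should_promote(canary_errors=5, canary_requests=200))  # too many errors
```

Because the same deployment script pushes to staging and production, a gate like this can sit between the two steps without changing the script itself.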
So sometimes, EC2 instances die, and you may have a cat come out of your screen. EC2 instances can die due to OS, disks, EBS, I/O. Mechanics don't really matter. Servers can die. Servers do die. Servers die pretty frequently in Amazon. That's one of the caveats that you sign up for whenever you join Amazon Web Services. That they, at any time, can tell you that your instance is going to degrade. They usually try to give you a week's notice so you can kill the old instance and bring a new one online, but sometimes, in the case of emergency, that is when an instance is very quickly failing, they will not give you enough warning and they'll tell you a couple hours before. So we need to plan for these kind of failures.
So our best practices include baking AMIs with new code. So one of the fastest ways for us to launch new servers is to bake AMIs, that is make AMIs and create them whenever it's time to run a new deploy. So whenever I run my continuous integration box, I should create an Amazon machine image for staging, say, deploy the machine images behind my auto-scaling group or whatever technique I use to seamlessly switch the AMIs in, and then whenever I verify that things are working in staging, I can promote to production using the identical AMIs.
I should also be detecting server death with health checks. These can be either auto-scaling group health checks or Elastic Load Balancer health checks that ping via some port. So we need to be able to detect failures quickly via these health checks, and then quickly replace degraded instances with the AMI. So rather than waiting for an instance to fail completely, you should just terminate the old instance and add a new instance in with the AMI.
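The detect-then-replace loop can be sketched as a small tracker that counts consecutive failed health checks per instance and reports which instances have crossed a threshold. This models the idea only. It is not how ELB or auto-scaling health checks are implemented, and the class name and threshold of three are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


class HealthTracker:
    """Tracks consecutive health check failures per instance."""

    def __init__(self, unhealthy_threshold: int = 3) -> None:
        self.unhealthy_threshold = unhealthy_threshold
        self._failures: Dict[str, int] = defaultdict(int)

    def record(self, instance_id: str, passed: bool) -> None:
        """Record one health check result; a pass resets the failure count."""
        if passed:
            self._failures[instance_id] = 0
        else:
            self._failures[instance_id] += 1

    def to_replace(self) -> List[str]:
        """Return instances that should be terminated and relaunched from the AMI."""
        return sorted(
            instance_id
            for instance_id, count in self._failures.items()
            if count >= self.unhealthy_threshold
        )


if __name__ == "__main__":
    tracker = HealthTracker(unhealthy_threshold=2)
    tracker.record("i-aaa", False)
    tracker.record("i-aaa", False)
    tracker.record("i-bbb", True)
    print(tracker.to_replace())
```

Requiring several consecutive failures is a common way to avoid replacing an instance over a single dropped ping, at the cost of slightly slower detection.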
So never store any state in compute layers. That's another best practice. Externalize state to a database tier. There's a big, big problem with storing persistent data that's mutable on servers that you're expecting to replace, these AMI servers that we're talking about being able to quickly replace in the case of degradation. Storing database tiers on those layers and co-mingling them with stateless application code or business logic is a very poor idea because then we can't just terminate instances, since we have data on those instances that we need to retain. It's very dangerous if you think about how much can go wrong. Say, for instance, you have an unscheduled termination of an EC2 instance with database data on it. You're in big trouble.
Another best practice: use cluster databases or failover replicas for your database tier. Prefer cluster databases, which can handle node loss. So this would be something like a Riak or an Elasticsearch, or a MongoDB, anything where if a single node dies, the rest of the nodes can re-elect a master if needed and rebuild all of the system. Barring using a cluster database, like a Mongo or an Elasticsearch or Riak, keep a synchronous replica. So for something like an RDS or an ElastiCache, this is pretty easy. There's a button that you can check that allows you to do multi-availability zone deployments that will handle the synchronous replicas and perform failover whenever you need. And beyond preferring cluster databases and these replicas, just use Amazon Web Services' provided services, which outsource these functions. These services are built to endure single computer failures. This would include something like a DynamoDB or even a Redshift. So using Amazon Web Services' managed services is a very good way to do this kind of risk mitigation.
So we also have availability zone risk beyond just the server risk. Now this is a little bit more serious. As you can see here, I've Xed out a little availability zone in my box model of Amazon Web Services. Key takeaway is that availability zones can and do die all the time. It happens more frequently than you would expect. Now we need to consider four key areas whenever we start talking about this level of death where it's beyond a single server, and we're now talking about an entire zone. We can be looking at network, compute, storage, and database failures, the big four kinds of computational abstraction that we can deal with in the cloud here. So what happens if an availability zone dies? Well, let's walk through some of the techniques we can use to mitigate risk on the network front, the compute front, the storage front, and the database front.
So let's look at our networking best practices for zone risk. Use failover DNS with health checks, not raw IPs. This is pretty important because raw IP addresses are typically a little bit slower to fail over than the actively managed Route 53 failover DNS with health checks. So if Route 53, or whatever monitoring system, detects that a host has gone down in one availability zone or the other, the DNS is able to redirect records, if you're using something besides an Elastic Load Balancer, which already supports that. Some services already support DNS and failover, in which case you can just use those services from Amazon and not worry about implementing it yourself.
We should also use dual tunnel VPNs for inter-data center routes. So you can imagine if you have a VPN tunnel in availability zone, US East 1A, let's say, the very first availability zone ever. If you have only one VPN tunnel in a single availability zone in a single region, that's not so good when that availability zone goes down. Then you can no longer use your VPN to tunnel into the data center. So make sure that we use dual tunnel VPNs for inter-data center routes. This might include tunneling into two separate availability zones for the same account.
Now when we're talking about network address translation, which are those instances that you put inside of a public subnet and then route towards the internet gateway to allow other instances in private subnets to network over the NAT, and thus communicate with the internet, use two routes and an active monitoring switch. So there are two kinds of services that work for NATs at this point. There are NAT gateways and NAT instances. NAT gateways are a relatively new addition. They are simply a managed version of a NAT instance that Amazon takes care of for you. We still need to make sure to use two routes and active monitoring switches because those gateways still live in a single availability zone. This goes equally for NATs that you roll yourself and run as instances, as instances are susceptible to these availability zone outages. Thus we should spread the NAT, and the ability to originate outbound requests from private subnets, across two instances.
So we also need to follow some compute best practices for zone risk. We should use ELBs, or HAProxy with appropriate health checks. So we talked about this one a little bit, but it's very important for us on the compute layer to be able to scale outwards very quickly, as well as handle requests for HTTPS, SSL, etc. The ELBs, beyond just their usefulness for DNS and that kind of thing, are very useful for the fact that they can cross zone load balance, meaning that we can put a bunch of servers behind a single ELB in a region and get very good coverage within that region. If, for instance, US East 1A goes down, then my US East 1B set of instances is also behind the same load balancer, so DNS doesn't have to switch. The load balancer will begin routing traffic exclusively to the online availability zone until the failed availability zone comes back online.
Now when we do something like that with cross zone load balancing, we also need to make sure that we have 100% capacity at N plus one redundancy for our system. That is, we can't run a system that requires four servers on two servers in one availability zone and two in the other if we want the system to remain highly available whenever a zone fails or traffic spikes. You would need to ensure that you have four servers in each zone or, for instance, you could also spread it out across more availability zones and do two in each of three availability zones, thus reducing your required node count from eight to six, yet still having it such that if any of the three availability zones that you're using fails, you'll still have four servers available.
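The arithmetic behind this can be written down in a few lines. The sketch below assumes servers are spread evenly across zones and that we want to survive the loss of exactly one zone. The function names are made up for illustration.

```python
import math


def servers_per_zone(required: int, zones: int) -> int:
    """Servers to run in each zone so `required` remain after losing one zone."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    # The surviving (zones - 1) zones must carry the full required capacity.
    return math.ceil(required / (zones - 1))


def total_servers(required: int, zones: int) -> int:
    """Total servers across all zones under the same assumption."""
    return servers_per_zone(required, zones) * zones


if __name__ == "__main__":
    for zones in (2, 3):
        print(
            f"{zones} zones: {servers_per_zone(4, zones)} per zone, "
            f"{total_servers(4, zones)} total"
        )
```

For a system that needs four servers, this gives four per zone (eight total) with two zones and two per zone (six total) with three zones, which matches the numbers in the lecture.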
So we also have storage best practices for zone risk. With EBS, or Elastic Block Store, take regular snapshots for restores. This is very important because within an availability zone, you may come up against a situation where it's not just your EBS node or volume that goes down, but the entire system, and you have no ability to go in and take a snapshot or reattach EBS to anything else. You should be taking regular snapshots for restores and be ready to share them with other systems.
So S3 is extremely good at handling multiple server failures, including availability zone failures. I would highly suggest using S3 over EBS if you at all can simply because S3 is so, so durable, very cheap, and it already has this multi-availability zone support intact.
Now there's a slightly newer service that you may or may not have been able to see yet called the AWS EFS, or Elastic File System, which also works across availability zones. So if you are using EFS as part of the preview, then you are good to go since EFS is constantly synchronized across multiple computers and is shared across availability zones within a region in Amazon.
So when we think about availability zones, we also need to work on our database techniques, rather than just our API layer. We need to think about how to synchronize and maintain state across potentially multiple users, multiple sessions, multiple availability zones, etc. So how do we handle multiple availability zones for databases? We prefer multi-master clusters deployed across multiple availability zones. So that's a very hard one to say, but these would be things again like Elasticsearch, where you can create a cluster that will be able to handle a failure of a large sector of the nodes, and still carry on as long as the replication factor is high enough.
We should always use RDS and ElastiCache with a multiple availability zone setting. There's almost no reason not to do it other than a nominal cost savings, and it saves you a world of hurt if there's ever a failure, because RDS and ElastiCache are already creating snapshot backups and can restore very, very quickly based off of their synchronous replicas.
So if we have DynamoDB, we can also realize that it just works across multiple availability zones. The good news for anybody really bought into the Amazon Web Services platform is that Dynamo works across multiple availability zones, and you won't lose your availability if there are any issues.
Okay, so beyond the availability zone, even entire regions can die, which is a markedly less common occurrence, but sometimes it happens. As of this recording, it happened about six months ago and people were offline for quite a while. I also remember once when the Virginia data center was hit by three bolts of lightning and the system went offline for about eight hours because of the major natural disaster. So without further ado, we need to realize that the entire region can die, and we're going to cover the same four main areas of focus again.
So what are our best practices for network high availability during regional failures to mitigate regional risk? Well, first we should be looking at latency and failover DNS across regions to differing Elastic Load Balancers. Think about how you might do load balancing within a single region in your availability zones. So you have multiple instances distributed across multiple availability zones with the Elastic Load Balancer as the entry point into your system. Now imagine that we extend this notion of having multiple systems, multiple subsystems able to service requests, across multiple regions. Now Elastic Load Balancers only work within a single region, but what we can do to distribute traffic across multiple regions is use this latency DNS.
Latency-based DNS looks at where requests are coming from and routes them to the lowest latency area and routes them to the Elastic Load Balancers. So presumably your Elastic Load Balancers act like your instances do in the availability zone scenario and the latency DNS acts kind of like your Elastic Load Balancer does in the availability zone scenario. So this works with this latency and failover because we can failover from one group to the other since we already have more than one set of Elastic Load Balancers, one set of subsystems that can service requests. If one of them fails, then presumably we can scale up and handle the traffic after we do the failover.
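The routing decision described here, lowest latency first with failover when a region is unhealthy, can be modeled in a few lines. This is a toy model of the decision only, not the Route 53 API. The function name and the input dictionaries are assumptions for illustration.

```python
from typing import Dict, Optional


def choose_region(
    latency_ms: Dict[str, float], healthy: Dict[str, bool]
) -> Optional[str]:
    """Pick the healthy region with the lowest measured latency.

    `latency_ms` maps region name to latency as seen by the requester.
    `healthy` maps region name to its latest health check result; regions
    missing from it are treated as unhealthy. Returns None if nothing is up.
    """
    candidates = {
        region: ms for region, ms in latency_ms.items() if healthy.get(region, False)
    }
    if not candidates:
        return None
    return min(candidates, key=lambda region: candidates[region])


if __name__ == "__main__":
    latency = {"us-east-1": 20.0, "us-west-2": 80.0}
    print(choose_region(latency, {"us-east-1": True, "us-west-2": True}))
    print(choose_region(latency, {"us-east-1": False, "us-west-2": True}))
```

In the first call an East Coast user lands on Virginia. In the second, Virginia is failing its health check, so the same user is sent to Oregon, whose load balancer then has to absorb the extra traffic.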
So we should also prefer accessing Amazon Web Services via the SDK or CLI, and not the console. Now the reason for this has to do with regional failures. The console itself for each region is actually serviced by the nearest region or, for some services, by the specific region that hosts those services. So if I'm accessing my Virginia RDS instances, presumably I will be seeing the console rendered using Amazon Web Services tools from the East region. Now the API, and therefore the SDK and the CLI, rely more directly on just the systems that are online and not on the console, because the console actually uses CloudFront, DynamoDB, S3, SQS, etc. So if any of those services go down in one of these massive regional outages, we actually won't be able to access the console. So if our existing processes or disaster recovery playbooks tell us to do things in the console, it's probably a bad idea because chances are if we have a massive outage, the console will be out for a certain region as well.
So we should also put as much as we can in CloudFront. Now we want to put things in CloudFront during these regional outages because for the majority of applications the read side of the application can handle slightly stale content, and because CloudFront is actually already distributed across lots of edge locations, if a region goes down, then CloudFront will still from the edge locations be able to service get requests and at least have a partially degraded experience for these different regions. Now if you combine all these techniques so CloudFront's actually reading from multiple origins using a latency DNS, then you can stay online across any of these massive outages.
Okay, so moving to compute best practices. We should be copying our Amazon machine images across regions during deploys. So during our availability zone and single instance failure slides, we talked about using AMIs as a way to quickly spin up new versions of our code if a single server fails or if an entire availability zone fails, we need to be able to quickly activate a lot of new deploys. We don't want to have to spend time waiting for dependent software packages to install as we launch new instances across regions. So if we employ those networking best practices where we have a latency failover DNS plus Elastic Load Balancers, we'll have a sort of failover across multiple regions, but that won't be very useful if we're not using AMIs because we can't spin up instances fast enough to handle the switched over traffic. So that's why we want to use AMIs over install scripts that run every time you launch a new instance.
So we should also be running 100% capacity in two plus regions, or at least have ELBs and auto-scaling groups ready. Depending on how latency and failover sensitive your application is, there will still be a delay, even with these AMIs, while an instance spins up and the AMI is copied onto it at startup. So imagine a scenario in which we have instances running in the US West 2 region, which is the Oregon region, and the US East 1 region, which is the Virginia region, we've employed our latency failover DNS, and our AMIs are copied across regions. Let's pretend the traffic is distributed 50/50. Even with all of this in place, if we have a total failure in the East region, then I will be left with only 50% capacity in the Oregon region, or the West region, and 100% of the traffic will begin routing to those instances very quickly. So for that period of time, users may experience very slow behavior, or it just won't work at all, or the extra traffic could even crash the system. So for very latency or availability sensitive applications, we need to be running at slightly above capacity in each region in anticipation of these failures. Now if we have all of these things and we don't do the 100%, then the only latency or lag that you'll have, assuming that you don't crush your entire system with too much traffic, is the amount of time it takes to launch instances from these copied AMIs.
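The 50/50 scenario can be checked with a quick calculation of how loaded the surviving regions are once one region disappears. The sketch assumes capacity and load are expressed in the same unit (for example requests per second) and that all traffic fails over immediately. The function name is made up for illustration.

```python
import math
from typing import Dict


def utilization_after_failure(
    capacity: Dict[str, float], total_load: float, failed_region: str
) -> float:
    """Return load divided by remaining capacity after one region fails.

    A result above 1.0 means the surviving regions are overloaded until
    more instances come online.
    """
    surviving = sum(c for region, c in capacity.items() if region != failed_region)
    if surviving == 0:
        return math.inf  # Nothing left to serve traffic.
    return total_load / surviving


if __name__ == "__main__":
    load = 100.0
    half_each = {"us-east-1": 50.0, "us-west-2": 50.0}
    full_each = {"us-east-1": 100.0, "us-west-2": 100.0}
    print(utilization_after_failure(half_each, load, "us-east-1"))
    print(utilization_after_failure(full_each, load, "us-east-1"))
```

With half the capacity in each region, Oregon runs at twice its capacity after Virginia fails. With full capacity in each region, it runs at exactly 1.0, which is why the lecture suggests going slightly above that for sensitive applications.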
We should also be looking at using CloudFormation based deploys for parity. Part of the problem with doing multiple regions is that there's a fear of having inconsistency between our two sets of stacks in our different regions. If we're using CloudFormation or these templated deploys, then we will avoid those kinds of situations, because CloudFormation will be consistent across multiple regions, assuming all those services are available in each region. So if you go over to the other CloudAcademy courses for Advanced CloudFormation, you can learn how to do those advanced deploys and model any kind of cloud as a CloudFormation template.
So for storage, we also have some best practices that we can use to survive regional failures. Even S3 can fail sometimes in terms of access. S3 is not prone to losing objects, but sometimes it is not available. If we use S3 replication, which is a built-in feature that you can check the box on when you create a new bucket, S3 will automatically copy all files that you write or put to the bucket to a different region. That is excellent, because we can make sure our files stay accessible even when we have a regional outage. Now this can get expensive for extremely file intensive applications, so you should use this wisely and make sure that you're not over-engineering the problem. However, if you use S3 replication, you can stay online through entire region failures even if S3 dies in that region.
We should also be looking at backing up and regionally copying EBS volumes. This one is a little bit trickier. If you have lots of reads and writes coming out of databases, some people use this as a technique to copy database snapshots across regions. This is more of a backup and recovery technique than it is a high availability technique, because EBS is not very good at doing synchronous replication across regions, and you can have missing data if one of your snapshots was taken considerably before the failure. However, this is a good way to do things if you are manually constructing instances or volumes and you don't want to create an entire AMI. You can snapshot the EBS volume and copy the snapshot across the regions.
So make sure your application is BASE and not ACID. When we're talking about storage, which is stateful, we need to make sure that if we have some level of inconsistency for a brief period of time across regions, the application is not depending on consistent behavior across regions. This gives you all kinds of things, like being able to handle latency as we copy data across the two regions, but it also helps in the event of failure, where the amount of time that is required to synchronize the two regions is higher. That is, if we're down for three hours, we may not resynchronize our S3 or our data backups for three hours. If I'm deploying a BASE application and not an ACID one, then I will be able to recover after those three hours, because the application already knows how to handle eventually consistent behavior.
So we also need to design repair daemons for the recovery process. This is a key step that a lot of businesses forget. Netflix is actually very good at this, and you should read their open source documentation and all of their white papers if you're interested. The way that you would do this is, for instance, through the failure detection that is part of our big three techniques that we talked about earlier. Once the system that you use for detecting failures realizes that there's been a failure, it should also notice the end of the failure, when the other system comes back online, and kick off the repair. Say Virginia goes down and Oregon keeps servicing all requests. Oregon will need to recopy all of the data or transactions, or any other changes or storage material that came through during the time that Virginia was down, and do a little bit of extra work to resynchronize.
So when we're looking at regional risk for database best practices, we need to be looking at scalable cluster DBs or Amazon Web Services services. When we're looking at something like a cluster DB, that would be an Elasticsearch, or a Riak, or a Cassandra. These are databases that can already handle distribution across multiple nodes, and they're already using replicas and BASE consistency and not ACID. So they're already good at handling partitioning across multiple instances or servers. I wouldn't suggest deploying a single cluster database across multiple regions. However, if you're already designing your system this way, these databases can handle BASE behaviors, so you can use, for instance, AWS Lambda or some other copying daemon to replicate your data and your transactions across multiple regions. You can also use AWS services to do this. The simplest way is using DynamoDB with a Lambda function to replicate any transactions that come across DynamoDB Streams into the other region.
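To make the DynamoDB Streams plus Lambda idea a little more concrete, here is a minimal one-way replication sketch in Python. It assumes the stream is configured to include new images (so each INSERT or MODIFY record carries a NewImage), and that replica_client behaves like a boto3 DynamoDB client bound to the other region. The function name and the way the client and table name are passed in are hypothetical choices for illustration.

```python
def replicate_stream_records(event, replica_client, table_name):
    """Apply DynamoDB Streams records to a replica table; return records applied."""
    applied = 0
    for record in event.get("Records", []):
        change = record["dynamodb"]
        event_name = record["eventName"]
        if event_name in ("INSERT", "MODIFY"):
            # NewImage is already in DynamoDB attribute-value format,
            # so it can be written to the replica unchanged.
            replica_client.put_item(TableName=table_name, Item=change["NewImage"])
        elif event_name == "REMOVE":
            replica_client.delete_item(TableName=table_name, Key=change["Keys"])
        else:
            continue  # unknown event type: skip rather than guess
        applied += 1
    return applied
```

Inside an actual Lambda handler, replica_client would typically be created once with boto3.client("dynamodb", region_name=...) for the other region. This sketch copies in one direction only; copying in both directions also needs a way to stop a replicated write from being replicated back again, plus the write conflict handling discussed in this lesson.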
So we also want to be deploying multi-master and multi-regionally if we want true high availability. This is the active/active model versus active/passive. You are likely already familiar with RDS Multi-AZ, or multi-availability zone. That is an active/passive model, where one instance is serving all requests at all times until something bad happens on that server, and then we fail over. That's great, but it doesn't work for multi-region for a number of reasons. If we need true high availability, we need to have both data centers able to service requests at all times. That way we are getting as much synchronization in both directions as possible, and when something breaks, your database doesn't have a 5, 10, or 15 minute lag while you're switching everything over from active to passive.
We should also be using asynchronous replication and write conflict resolution. As we look to go multi-region, we are introducing higher and higher latency between our systems. So if we have a multi-master, multi-regional database, there is a high chance that somebody could write to the same object twice at the same time in two different regions, and both regions will think that they have the original write. You need to have a system in place where the application is aware of, for instance, timestamps on the writes, and it picks which one wins. Figuring out which system needs to be the master in an eventually consistent situation is a large swathe of computer science, but it can be accomplished using something as simple as a last-updated timestamp and Lambdas copying in two different directions across regions.
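The timestamp approach amounts to a "last writer wins" rule. Here is a minimal Python sketch of that rule. The VersionedWrite type and its field names are hypothetical, and the tie-break on region name is an added assumption so that both regions reach the same answer when two timestamps are identical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedWrite:
    key: str
    value: str
    updated_at: float  # e.g. seconds since the epoch, set by the writer
    region: str


def resolve_conflict(a: VersionedWrite, b: VersionedWrite) -> VersionedWrite:
    """Pick the winning write for the same key: the latest timestamp wins.

    Ties fall back to comparing region names, so every region that runs
    this function on the same pair of writes picks the same winner.
    """
    if a.key != b.key:
        raise ValueError("writes are for different keys")
    return max(a, b, key=lambda w: (w.updated_at, w.region))


east = VersionedWrite("user:1", "alice@old.example", 100.0, "us-east-1")
west = VersionedWrite("user:1", "alice@new.example", 101.0, "us-west-2")
print(resolve_conflict(east, west).value)  # alice@new.example
```

Last writer wins silently discards the losing write and depends on reasonably synchronized clocks, so it suits data where dropping one of two simultaneous updates is acceptable.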
We should also be looking at buffering writes to a queue to repair after recovery. Imagine a situation in which we're operating one of these multi-master, multi-region databases that's low latency and high availability, again across Oregon and Virginia, which are several thousand miles apart. Say the Virginia data center goes down for about three hours, and the Oregon database is servicing all requests. Even if during normal operation we have correct asynchronous replication and write conflict resolution, some of the replication may fail as the Oregon database tries to emit its changes to the Virginia data center during that downtime. If we're using a write buffer and a repair after recovery system, then the Oregon data center will have a queue of the writes that it needs to send over to Virginia. That queue is a natural part of the async replication and write conflict resolution process, but if we're using a real queue, it can also handle a three hour delay and help with the repair after recovery. Our switching systems should also be aware of when to flush that queue, after the other system comes back online, so we can resynchronize the databases.
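The write buffer idea can be sketched in Python as a queue that holds outgoing writes until the other region accepts them. This is only an in-memory illustration under stated assumptions: the ReplicationBuffer class and its send callback are hypothetical, the callback is assumed to raise ConnectionError while the remote region is down, and a real system would want a durable queue so the backlog survives a restart.

```python
from collections import deque


class ReplicationBuffer:
    """Queue writes for a remote region and deliver them in order."""

    def __init__(self, send):
        self._send = send        # callable that delivers one write remotely
        self._pending = deque()  # oldest write is at the left

    def replicate(self, write):
        """Queue a write, then try to deliver everything that is pending."""
        self._pending.append(write)
        return self.flush()

    def flush(self):
        """Send pending writes in order; stop at the first failure.

        Returns the number of writes delivered. A write is removed from
        the queue only after the remote side has accepted it.
        """
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # remote still down: keep the backlog for later
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)
```

During the three hour outage, replicate keeps accepting writes and the backlog grows. When the failure detection system sees Virginia come back, it calls flush to drain the queue in order, and the usual conflict resolution applies on the receiving side.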
So that closes up our techniques lesson, where we learned the different techniques that we can use for three different levels of failure. We also talked about the main areas that we should be focusing on when we're thinking about our big three areas of high availability techniques.
Now we'll be going on to a planning demo where I'll walk you through a sample application or two, where we design for high availability for a system that we're planning to build.
About the Author
Nothing gets me more excited than the AWS Cloud platform! Teaching cloud skills has become a passion of mine. I have been a software and AWS cloud consultant for several years. I hold all 5 possible AWS Certifications: Developer Associate, SysOps Administrator Associate, Solutions Architect Associate, Solutions Architect Professional, and DevOps Engineer Professional. I live in Austin, Texas, USA, and work as development lead at my consulting firm, Tuple Labs.