In this brief course, our AWS expert, Stuart Scott, answers the question: What's the difference between high availability and fault tolerance?
They both ultimately have the same goal, to keep your systems up and running should something fail within your architecture, but there is a difference and this can cause some confusion. In this course, you'll get a clear idea of what each one is, and how they differ.
What's the difference between high availability and fault tolerance?
This is a question that gets asked a lot. I hear it from people who have had years of experience within the IT industry, and from those who are new and just starting out. Either way, there is clearly some confusion between the two, and understandably so. They both ultimately have the same goal, to keep your systems up and running should something fail within your architecture, but there is a difference.
High Availability can be defined as maintaining a percentage of uptime that preserves operational performance, and so it can be closely aligned to an SLA. In fact, AWS has many SLAs for its services, where it implements its own level of resilience and management to maintain that level of high availability.
You may also have your own SLAs for services you provide for your customers, but let’s look at a scenario to help explain the difference. Let’s assume that you have an application which has to run across a minimum of two EC2 instances to meet, let’s say, an SLA of 99.9%, which allows for a downtime of 43.83 minutes per month. We could then architect our infrastructure like this.
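As a quick check on that downtime figure, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes an average month of 365.25 / 12 days, which is what reproduces the 43.83 minutes quoted above, and the function name is made up for this example.

```python
# Minimal sketch: monthly downtime budget for a given uptime SLA.
# Assumption: an "average" month of 365.25 / 12 days (about 30.44 days).

MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # 43,830 minutes


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Return the monthly downtime budget, in minutes, for an uptime SLA."""
    return MINUTES_PER_MONTH * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% uptime -> {minutes:.2f} minutes of downtime per month")
```

Each extra "nine" in the SLA cuts the downtime budget by a factor of ten, which is why higher targets tend to need more resilient designs.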
Within a region we could use two different Availability Zones, and in each AZ we could have two EC2 instances, all of which are associated with an Elastic Load Balancer. So in this example we have different elements contributing to a highly available solution: the use of two AZs, and additional EC2 instances. If an instance fails, we still have plenty of compute resources, and if an entire AZ fails, we still have the minimum of two instances needed to maintain the required SLA.
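To make that reasoning concrete, here is a minimal sketch in Python that models the layout as plain data and counts what is left after a failure. It does not call any AWS API. The AZ and instance names are hypothetical placeholders, and the helper functions are invented for this illustration.

```python
# Minimal sketch of the single-region design: two AZs, two instances in each.
# AZ and instance names are hypothetical placeholders, not real AWS identifiers.

MIN_INSTANCES = 2  # minimum number of running instances the application needs

region = {
    "az-a": ["i-a1", "i-a2"],
    "az-b": ["i-b1", "i-b2"],
}


def running_instances(layout, failed_azs=(), failed_instances=()):
    """Count instances still running after the given AZs and instances fail."""
    return sum(
        1
        for az, instances in layout.items()
        if az not in failed_azs
        for instance in instances
        if instance not in failed_instances
    )


def meets_minimum(layout, failed_azs=(), failed_instances=()):
    """True if enough instances survive to keep the application running."""
    return running_instances(layout, failed_azs, failed_instances) >= MIN_INSTANCES


if __name__ == "__main__":
    # A single instance fails: three instances remain.
    print(meets_minimum(region, failed_instances=["i-a1"]))  # True
    # An entire AZ fails: exactly the minimum of two remains.
    print(meets_minimum(region, failed_azs=["az-a"]))  # True
    # An AZ fails and then one more instance fails: only one remains.
    print(meets_minimum(region, failed_azs=["az-a"], failed_instances=["i-b1"]))  # False
```

The last case is the limit of this design: it rides out the loss of one AZ, but a second failure on top of that drops it below the minimum.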
Now, let’s look at Fault Tolerance, which expands on High Availability to offer a greater level of protection should components begin to fail in your infrastructure. There are usually additional cost implications due to the greater level of resiliency offered, but the upside is that your uptime percentage increases and there is no interruption of service should one or more components fail. With that in mind, we could argue that having two AZs with two EC2 instances in each is fault-tolerant at the AZ level, as operations would be maintained through the loss of an AZ because we’d still have the minimum number of instances running. Should another failure occur, however, the SLA would be impacted.
So let’s look at how we can take the high availability scenario that we just sketched out and adapt it to a more fault-tolerant design approach.
Previously we had our single-region approach, so to increase the uptime of this solution we could deploy the app across an additional AWS region. We could literally mirror the environment from a single region to a second region. This means that should an EC2 instance fail, we still have compute, and should an AZ fail, again, we still have enough compute capacity, but now we can also maintain operation should an entire region fail. If it does, we can then still suffer further EC2 outages, or an AZ outage, in the secondary region and still maintain the minimum requirement of having two EC2 instances at all times.
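A minimal sketch of this mirrored layout in Python is shown here, checking what survives when the primary region is lost and further failures follow in the secondary. It is only a model and does not call any AWS API. The region, AZ and instance names are hypothetical, and the helper functions are invented for this illustration.

```python
# Minimal sketch of the mirrored two-region design.
# Region, AZ and instance names are hypothetical placeholders.
from itertools import combinations

MIN_INSTANCES = 2  # minimum number of running instances the application needs


def build_region(name):
    """Two AZs per region, two instances per AZ, mirroring the original layout."""
    return {
        f"{name}-az{z}": [f"{name}-az{z}-i{n}" for n in (1, 2)]
        for z in (1, 2)
    }


deployment = {name: build_region(name) for name in ("primary", "secondary")}


def surviving(layout, failed_regions=(), failed_azs=(), failed_instances=()):
    """Count instances still running after the given failures."""
    count = 0
    for region, azs in layout.items():
        if region in failed_regions:
            continue
        for az, instances in azs.items():
            if az in failed_azs:
                continue
            count += sum(1 for i in instances if i not in failed_instances)
    return count


if __name__ == "__main__":
    lost = ["primary"]
    secondary = [i for az in deployment["secondary"].values() for i in az]

    # The whole primary region fails: four instances remain.
    print(surviving(deployment, failed_regions=lost))  # 4

    # Primary region fails, then one AZ in the secondary fails: two remain.
    print(surviving(deployment, failed_regions=lost,
                    failed_azs=["secondary-az1"]))  # 2

    # Primary region fails, then any two secondary instances fail.
    print(all(
        surviving(deployment, failed_regions=lost, failed_instances=pair)
        >= MIN_INSTANCES
        for pair in combinations(secondary, 2)
    ))  # True
```

In this model the design absorbs a region failure followed by either an AZ outage or up to two instance outages in the secondary region. A region failure, an AZ outage and a further instance failure together would leave only one instance.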
This of course offers far greater uptime compared to the previous highly available solution we had with a single region, but it comes at the increased cost of running two active environments, which between them can tolerate the failure of any single component. Remember, we need to have this secondary region running in order to avoid any downtime should the primary region fail.
So from this we can surmise that fault-tolerant systems are intrinsically highly available, but as we have seen, a highly available solution is not necessarily completely fault-tolerant.
The level of High Availability or Fault Tolerance you want to implement is down to you, and it really depends on the business impact when components begin to fail. Do bear in mind, it’s not if a failure occurs, it’s when a failure occurs.
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.
Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.