Over the course of a few hours this past September 20, some of the Internet’s most popular sites, including Netflix, Airbnb, and IMDb, along with other AWS customers, suffered major latency and even some outages. The proximate cause? Amazon’s status dashboard told the story. The official note announcing the AWS outage read:
“Between 2:13 AM and 8:15 AM PDT we experienced high error rates for API requests in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.”
A six-hour AWS outage will almost certainly translate to catastrophic failure for someone. This is especially true considering that this outage affected as many as 22 AWS services, including DynamoDB, CloudWatch, Auto Scaling, Simple Email Service (SES), Simple Notification Service (SNS), Simple Queue Service (SQS), CloudFormation, Lambda, SWF, and WorkSpaces (all in the N. Virginia region).
What caused the AWS outage?
The outage first surfaced early that morning, when error rates in Amazon DynamoDB started increasing. Within a short time, most of the other major services in the US-EAST-1 region were dragged in.
The root cause was identified as a problem with the DynamoDB metadata service that manages partitioning. An unexpected network disruption briefly affected the ability of DynamoDB’s storage servers to communicate with the metadata service. When the network issue was resolved, many storage servers simultaneously tried to reload their metadata. This usually goes off seamlessly, but in this instance the extra traffic pushed the metadata service’s response times beyond the retrieval and transmission time allowed by the storage servers, causing the storage servers to reject any further requests. After many unsuccessful attempts to reduce the load and increase the capacity of the metadata service, the servers had to be shut down.
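The failure pattern described above, where many clients retry at the same moment and overwhelm a recovering service, is commonly softened on the client side with capped exponential backoff plus random jitter. The Python sketch below is illustrative only: it is not taken from Amazon's account of how DynamoDB behaves, and the function name `backoff_delays` and its parameters are assumptions made for this example.

```python
import random


def backoff_delays(attempts, base=0.1, cap=10.0, rng=None):
    """Return retry delays (in seconds) using capped exponential backoff
    with full jitter. Hypothetical helper, for illustration only."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles with every attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random delay between 0 and the ceiling so that
        # clients that failed at the same moment do not retry in lockstep.
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Two simulated storage servers that lost contact at the same time
    # end up with different retry schedules.
    print(backoff_delays(5, rng=random.Random(1)))
    print(backoff_delays(5, rng=random.Random(2)))
```

Spreading retries out this way does not add capacity to the overloaded service; it only keeps the retry traffic from arriving in synchronized waves.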
The impact cascaded through the AWS system, dragging down other services that use DynamoDB to store their internal tables. After six hours of firefighting, AWS engineers significantly increased the capacity of the metadata service, successfully reactivated it, and brought the storage servers back up to full operation.
What was done in the aftermath of the AWS outage?
Amazon has taken several preventive actions to avoid a recurrence of similar events. The capacity of the metadata service has already been increased significantly, and stricter monitoring has been put in place to track membership size and arrive at the correct capacity. For the longer term, Amazon plans to segment the DynamoDB service so that each instance of the metadata service serves only a portion of the storage server fleet.
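To make the segmentation idea concrete, here is a minimal Python sketch of one way a fleet could be split so that each metadata instance (called a "cell" here) serves only its own share of storage servers. Amazon has not published how it implements this; the hash-based scheme, the function names `assign_cell` and `group_by_cell`, and the `storage-N` server names are all assumptions for illustration.

```python
import hashlib


def assign_cell(server_id, num_cells):
    """Map a storage server to one metadata cell (hypothetical scheme)."""
    if num_cells <= 0:
        raise ValueError("num_cells must be positive")
    # A stable hash keeps the assignment identical across processes and
    # restarts (Python's built-in hash() is randomized for strings).
    digest = hashlib.sha256(server_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def group_by_cell(server_ids, num_cells):
    """Group servers by cell so each metadata instance sees only its share."""
    cells = {cell: [] for cell in range(num_cells)}
    for server_id in server_ids:
        cells[assign_cell(server_id, num_cells)].append(server_id)
    return cells


if __name__ == "__main__":
    fleet = [f"storage-{n}" for n in range(12)]
    for cell, members in group_by_cell(fleet, 3).items():
        print(cell, members)
```

The point of the design is a smaller blast radius: an overloaded cell affects only the servers assigned to it. Note that a plain modulo reshuffles most assignments whenever the number of cells changes, so a production system would likely need something like consistent hashing instead.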
Lessons learned from the AWS outage
Outages are bound to happen, whether your infrastructure is in an on-premises data center or the cloud. But to minimize your risk, your architecture should be built with a philosophy of “failure is bound to happen”. Netflix, the media giant that relies heavily on AWS for its operation, quickly recovered from this crisis. The company attributes its resilience to what it calls “chaos engineering”.
Drawing on its experience from past AWS outages, Netflix regularly deploys its Simian Army: software that deliberately attempts to disrupt its systems. Chaos Monkey randomly shuts down instances in their production system. Chaos Gorilla simulates an availability-zone failure, and Latency Monkey introduces latency on the network. Because it constantly tests itself against failures, Netflix barely blinked this time around, quickly redirecting traffic from the impacted AWS region to data centers in an unaffected area.
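The idea behind a Chaos Monkey style experiment can be shown with a toy simulation: remove one randomly chosen instance and check whether the remaining fleet still meets a minimum healthy count. The Python sketch below is not Netflix's code and terminates nothing real; the function `chaos_round`, the `min_healthy` threshold, and the instance names are assumptions for illustration, and instance names are assumed to be unique.

```python
import random


def chaos_round(instances, min_healthy, rng=None):
    """Simulate terminating one random instance and report whether the
    survivors still satisfy the minimum healthy count.

    `instances` is a list of unique instance names (hypothetical).
    """
    if not instances:
        raise ValueError("no instances to choose from")
    rng = rng or random.Random()
    # Pick the victim at random, as a chaos tool would.
    victim = rng.choice(instances)
    survivors = [name for name in instances if name != victim]
    # The experiment passes only if enough capacity remains.
    return victim, survivors, len(survivors) >= min_healthy


if __name__ == "__main__":
    fleet = ["web-1", "web-2", "web-3"]
    victim, survivors, ok = chaos_round(fleet, min_healthy=2)
    print(f"terminated {victim}; survivors {survivors}; still healthy: {ok}")
```

Running rounds like this regularly turns "we think we can lose an instance" into something that is checked continuously rather than discovered during an outage.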
Netflix also maintains active-active replication for critical data. Though it costs them 25% more on their AWS bill, it serves them (and their customers) very well in just the kind of emergency we’re talking about.
Capacity planning and stricter monitoring of newer services are a must. In this case, an important element of the problem was the increased metadata generated by a new feature called the Global Secondary Index (GSI). GSI allows users to access a table using an alternate key. With GSI, the number of partitions per table increased significantly for some very large tables. With a larger volume of data, the processing time inside the metadata service for some membership requests began to exceed the retrieval time allowed by the storage servers. Given the limited capacity of the metadata service, this quickly became an outage. According to Amazon:
“We did not have detailed enough monitoring for this dimension (membership size), and didn’t have enough capacity allocated to the metadata service to handle these much heavier requests.”
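As a rough illustration of monitoring the dimension Amazon mentions, the Python sketch below flags tables whose estimated membership-request time is approaching the caller's timeout. Everything here is hypothetical: the linear cost model (time grows in proportion to partition count), the function name `membership_alerts`, the `headroom` fraction, and the sample numbers are assumptions, not figures from Amazon.

```python
def membership_alerts(partitions_per_table, seconds_per_partition,
                      timeout, headroom=0.5):
    """Return tables whose estimated membership-request time exceeds a
    fraction (`headroom`) of the timeout. All inputs are hypothetical."""
    alerts = {}
    for table, partitions in partitions_per_table.items():
        # Assumed linear model: more partitions means a proportionally
        # larger membership list to process and transmit.
        estimated = partitions * seconds_per_partition
        if estimated > timeout * headroom:
            alerts[table] = estimated
    return alerts


if __name__ == "__main__":
    sizes = {"small": 10, "large": 5000}
    print(membership_alerts(sizes, seconds_per_partition=0.001, timeout=4.0))
```

Alerting at a fraction of the timeout, rather than at the timeout itself, leaves room to add capacity before requests actually start failing.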
Amazon quickly apologized to customers, while noting that DynamoDB has effectively enjoyed 100 percent uptime in the past three years.
“We apologize for the impact to affected customers. While we are proud of the last three years of availability on DynamoDB (it’s effectively been 100%), we know how critical this service is to customers, both because many use it for mission-critical operations and because AWS services also rely on it. For us, availability is the most important feature of DynamoDB, and we will do everything we can to learn from the event and to avoid a recurrence in the future.”
Amazon did its best given the problem it faced, and it was careful to provide helpful and reassuring communication during the AWS outage. Customers of a massive service like AWS should also shoulder their share of the responsibility by designing and operating their infrastructures more like Netflix does.
If you want to deepen your understanding of how DynamoDB works, try this Cloud Academy course.