Over the course of a few hours this past September 20, some of the Internet’s most popular sites, including Netflix, Airbnb, and IMDb, along with other AWS customers, suffered major latency and even some outages. The proximate cause? Amazon’s status dashboard told the story. The official note announcing the AWS outage read:
“Between 2:13 AM and 8:15 AM PDT we experienced high error rates for API requests in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.”
A six-hour AWS outage will almost certainly translate to catastrophic failure for someone. This is especially true considering that this outage had an impact on as many as 22 AWS services, including DynamoDB, CloudWatch, Auto-Scaling, Simple Email Service (SES), Simple Notification Service (SNS), Simple Queue Service (SQS), CloudFormation, Lambda, SWF, and WorkSpaces (all in the N. Virginia region).
What caused the AWS outage?
The outage was first detected early that morning, when error rates in Amazon DynamoDB started increasing. Within a short time, most of the other major services in the US-EAST-1 region were dragged in.
The root cause was identified as a problem with Amazon’s DynamoDB metadata service for partitioning.
An unexpected network disruption briefly affected the ability of DynamoDB’s storage servers to communicate with its metadata service. When the network issue was resolved, many storage servers simultaneously tried to reload their metadata. While this usually goes off seamlessly, in this instance the extra traffic caused the metadata service’s responses to exceed the retrieval and transmission time allowed by the storage servers, which then rejected any further requests. After many unsuccessful attempts to bring down the load and increase the capacity of the metadata service, the servers had to be shut down.
The impact cascaded through the AWS system, dragging down other services that use DynamoDB to store their internal tables. After six hours of firefighting, AWS engineers significantly increased the capacity of the metadata service, successfully reactivated it, and brought the storage servers back up to full operation.
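The reconnect surge described above is a classic "thundering herd": many clients retry at the same moment and overwhelm a shared dependency. One common mitigation, shown here as a minimal sketch and not as anything Amazon has said it uses, is exponential backoff with random jitter so that retries spread out over time. The function names, the 1,000-server fleet, and the timing values are all assumptions chosen for illustration.

```python
import random


def backoff_delays(attempts, base=0.1, cap=10.0, rng=None):
    """Return 'full jitter' retry delays.

    Delay n is drawn uniformly from [0, min(cap, base * 2**n)] seconds.
    """
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        delays.append(rng.uniform(0, ceiling))
    return delays


def peak_load(retry_times, window=0.1):
    """Largest number of retries that land in any single time window."""
    buckets = {}
    for t in retry_times:
        key = int(t / window)
        buckets[key] = buckets.get(key, 0) + 1
    return max(buckets.values())


if __name__ == "__main__":
    rng = random.Random(42)
    servers = 1000  # hypothetical fleet size

    # Without jitter: every server retries exactly 1.0 s after recovery.
    synchronized = [1.0 for _ in range(servers)]

    # With jitter: each server draws its own delay for a fifth attempt,
    # anywhere between 0 and 1.6 s.
    jittered = [backoff_delays(5, rng=rng)[-1] for _ in range(servers)]

    print("peak without jitter:", peak_load(synchronized))
    print("peak with jitter:", peak_load(jittered))
```

In the synchronized case every retry lands in the same 100 ms window, while the jittered retries are spread across many windows, so the shared service sees a much lower peak.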
What was done in the aftermath of the AWS outage?
Amazon has taken several preventive actions to avoid a recurrence of similar events. The capacity of the metadata service has already been increased significantly, and stricter monitoring has been put in place to track membership size and arrive at the correct capacity. For the longer term, Amazon plans to segment the DynamoDB service so that instances of the metadata service each serve only portions of the storage server fleet.
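To make the segmentation idea concrete, here is a small sketch of how a fleet could be divided so that each metadata service instance (called a "cell" here) serves only a slice of the storage servers. This is an illustration under stated assumptions, not Amazon's design: the hash-based assignment, the cell count, and the names `cell_for` and `affected_servers` are all hypothetical.

```python
import hashlib


def cell_for(server_id, num_cells):
    """Map a storage server to one metadata-service cell using a stable hash."""
    digest = hashlib.sha256(server_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def affected_servers(servers, num_cells, failed_cell):
    """Servers that lose metadata access when a single cell is overloaded."""
    return [s for s in servers if cell_for(s, num_cells) == failed_cell]


if __name__ == "__main__":
    fleet = [f"storage-{i}" for i in range(10_000)]  # hypothetical fleet
    hit = affected_servers(fleet, num_cells=8, failed_cell=3)
    print(f"{len(hit)} of {len(fleet)} servers affected")
```

With a single shared metadata service, an overload touches every storage server. With the fleet split across several cells, the same overload is confined to the servers assigned to the failing cell.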
Lessons learned from the AWS outage
Outages are bound to happen, whether your infrastructure is in an on-premises data center or the cloud. But to minimize your risk, your architecture should be built with a philosophy of “failure is bound to happen”. Netflix, the media giant that relies heavily on AWS for its operation, quickly recovered from this crisis. They attribute their resilience to what they call chaos engineering.
With its experience from past AWS outages, Netflix regularly deploys its Simian Army: software that deliberately attempts to disrupt its systems. Chaos Monkey randomly shuts down parts of their production system. Chaos Gorilla simulates an availability-zone failure, and Latency Monkey introduces latency on the network. Because it constantly tests itself with failures, Netflix barely blinked this time around, as it quickly redirected traffic from the impacted AWS region to datacenters in an unaffected area.
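The idea behind this kind of fault injection can be sketched with a toy model. The code below is not Netflix's Simian Army; it is a hypothetical simulation in which the `Fleet` class, its methods, and the region names are all assumptions. It shows the two ingredients the paragraph describes: deliberately injected failures and routing that falls back to healthy capacity.

```python
import random


class Fleet:
    """Toy model of service instances spread over regions (names hypothetical)."""

    def __init__(self, instances_per_region):
        self.healthy = {
            region: set(range(count))
            for region, count in instances_per_region.items()
        }

    def kill_random_instance(self, rng):
        """Chaos Monkey-style fault: terminate one random healthy instance."""
        candidates = [(r, i) for r, ids in self.healthy.items() for i in ids]
        if not candidates:
            return None
        region, instance = rng.choice(candidates)
        self.healthy[region].discard(instance)
        return region, instance

    def fail_region(self, region):
        """Larger fault: take out every instance in one region."""
        self.healthy[region] = set()

    def route(self, preferred):
        """Send traffic to the preferred region, else to any region with capacity."""
        if self.healthy.get(preferred):
            return preferred
        for region in sorted(self.healthy):
            if self.healthy[region]:
                return region
        raise RuntimeError("no healthy capacity anywhere")


if __name__ == "__main__":
    fleet = Fleet({"us-east-1": 3, "us-west-2": 3})
    fleet.kill_random_instance(random.Random(7))
    fleet.fail_region("us-east-1")
    print("traffic now served from:", fleet.route("us-east-1"))
```

Running faults like these regularly, rather than waiting for a real outage, is what exercises the fallback path in `route` often enough to trust it.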
Netflix also maintains active-active replication for critical data. Though it costs them 25% more on their AWS bill, it serves them (and their customers) very well in just the kind of emergency we’re talking about.
Capacity planning and stricter monitoring of newer services are a must. In this case, an important element of the problem was the increased metadata generated by a new feature called the Global Secondary Index (GSI). GSI allows users to access a table using an alternate key. With GSI, the number of partitions per table increased significantly for some very large tables. With a larger volume of data, the processing time inside the metadata service for some membership requests began to exceed the retrieval time allowed by the storage servers. Due to the limited capacity of the metadata service, this quickly became an outage. According to Amazon:
“We did not have detailed enough monitoring for this dimension (membership size), and didn’t have enough capacity allocated to the metadata service to handle these much heavier requests.”
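As a rough illustration of what monitoring this dimension could look like, the sketch below flags tables whose estimated metadata response time is approaching the storage servers' timeout. Everything in it is an assumption made for the example: the linear cost model, the table names, the per-partition cost, and the 80% warning threshold. It is not a description of Amazon's monitoring.

```python
def at_risk_requests(membership_sizes, ms_per_partition, timeout_ms, warn_ratio=0.8):
    """Return tables whose estimated metadata response time nears the timeout.

    Assumes response time grows linearly with membership size, which is a
    simplification made for this example.
    """
    flagged = {}
    for table, partitions in membership_sizes.items():
        estimated_ms = partitions * ms_per_partition
        if estimated_ms >= warn_ratio * timeout_ms:
            flagged[table] = estimated_ms
    return flagged


if __name__ == "__main__":
    # Hypothetical partition counts per table.
    sizes = {"small-table": 40, "large-table-with-gsi": 1800}
    print(at_risk_requests(sizes, ms_per_partition=0.5, timeout_ms=1000))
```

A check of this kind would raise an alert while there is still headroom, instead of after requests have already started timing out.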
Amazon quickly apologized to customers, while noting that DynamoDB has effectively enjoyed 100 percent uptime in the past three years.
“We apologize for the impact to affected customers. While we are proud of the last three years of availability on DynamoDB (it’s effectively been 100%), we know how critical this service is to customers, both because many use it for mission-critical operations and because AWS services also rely on it. For us, availability is the most important feature of DynamoDB, and we will do everything we can to learn from the event and to avoid a recurrence in the future.”
Amazon did its best given the problem it faced, and it was careful to provide helpful and reassuring communication during the AWS outage. Still, customers of a massive service like AWS should shoulder their share of the responsibility by designing and operating their infrastructure more like Netflix.
If you want to get a jump start on DynamoDB, check out Cloud Academy’s Working with Amazon DynamoDB Course.
Learn how to create Amazon DynamoDB tables, add indexes, and query your data in the Introduction to DynamoDB Hands-on Lab.