The DynamoDB-Caused AWS Outage: What We Have Learned

Over the course of a few hours this past September 20, some of the Internet’s most popular sites, including Netflix, Airbnb, and IMDb, along with other AWS customers, suffered major latency and even some outages. The proximate cause? Amazon’s status dashboard told the story of this AWS outage.
The official note announcing the AWS outage read:

“Between 2:13 AM and 8:15 AM PDT we experienced high error rates for API requests in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.”

A six-hour AWS outage will almost certainly translate to catastrophic failure for someone. This is especially true considering that this outage affected as many as 22 AWS services, including DynamoDB, CloudWatch, Auto Scaling, Simple Email Service (SES), Simple Notification Service (SNS), Simple Queue Service (SQS), CloudFormation, Lambda, SWF, and WorkSpaces (all in the N. Virginia region).

What caused the AWS outage?

The outage began early that morning, when error rates in Amazon DynamoDB started increasing. Within a short time, most of the other major services in the US-EAST-1 region were dragged in. The root cause was identified as a problem with the metadata service that DynamoDB uses for partitioning.

An unexpected network disruption briefly affected the ability of DynamoDB’s storage servers to communicate with the metadata service. When the network issue was resolved, many storage servers simultaneously tried to reload their metadata. This usually goes off seamlessly, but in this instance the extra traffic pushed the metadata service’s response times past the retrieval and transmission time allowed by the storage servers, causing those storage servers to reject any further requests. After many unsuccessful attempts to bring down the load and increase the capacity of the metadata service, the servers needed to be shut down.
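The failure mode described above, where many clients retry at the same instant and overwhelm a shared service, is often mitigated with randomized exponential backoff. The following is a minimal sketch of that general technique, not a description of how DynamoDB’s storage servers actually behave; the function name `backoff_delays` and its parameters are assumptions made for illustration.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Uses "full jitter": each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that fail at the
    same moment are unlikely to retry at the same moment.
    All names and defaults here are hypothetical.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Two hypothetical storage servers that lost contact at the same
    # time end up with different retry schedules.
    print(backoff_delays(5, rng=random.Random(1)))
    print(backoff_delays(5, rng=random.Random(2)))
```

Spreading retries out this way gives a recovering service a chance to work through its backlog instead of facing the whole fleet at once.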

The impact cascaded through the AWS system, dragging down other services that use DynamoDB to store their internal tables. After six hours of firefighting, AWS engineers significantly increased the capacity of the metadata service, successfully reactivated it, and brought the storage servers back up to full operation.

What was done in the aftermath of the AWS outage?

Amazon has taken several preventive actions to avoid a recurrence of similar events. The capacity of the metadata service has already been increased significantly, and stricter monitoring has been put in place to track membership size and arrive at the correct capacity. For the longer term, Amazon plans to segment the DynamoDB service so that each instance of the metadata service serves only a portion of the storage server fleet.
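To make the idea of segmentation concrete, here is a minimal sketch of one way a fleet could be split so that each metadata instance serves only part of it. This is an illustration under assumptions, not Amazon’s design: the function `assign_cell`, the server naming scheme, and the choice of hash-based assignment are all hypothetical.

```python
import hashlib
from collections import Counter


def assign_cell(server_id: str, num_cells: int) -> int:
    """Map a storage server to one of num_cells metadata instances.

    A stable hash keeps the assignment deterministic, so a server
    always talks to the same (hypothetical) metadata cell.
    """
    digest = hashlib.sha256(server_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


if __name__ == "__main__":
    # Hypothetical fleet of 1,000 storage servers split across 4 cells.
    fleet = [f"storage-{i}" for i in range(1000)]
    load = Counter(assign_cell(s, 4) for s in fleet)
    print(sorted(load.items()))
```

With a layout like this, an overloaded metadata instance affects only the servers assigned to it rather than the entire fleet.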

Lessons learned from the AWS outage

Outages are bound to happen, whether your infrastructure is in an on-premises data center or the cloud. To minimize your risk, build your architecture around the philosophy that “failure is bound to happen”. Netflix, the media giant that relies heavily on AWS for its operation, quickly recovered from this crisis. It attributes its resilience to what it calls chaos engineering.

Drawing on its experience from past AWS outages, Netflix regularly deploys its Simian Army: software that deliberately attempts to disrupt its systems. Chaos Monkey randomly shuts down parts of the production system. Chaos Gorilla simulates an availability-zone failure, and Latency Monkey introduces latency on the network. Because it constantly tests itself against failures, Netflix barely blinked this time around, quickly redirecting traffic from the impacted AWS region to data centers in an unaffected area.
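The core idea behind a tool like Chaos Monkey can be sketched in a few lines: pick a running instance at random, remove it, and check whether the service still has enough healthy capacity. The example below is a toy simulation on an in-memory set of instance names; it is not Netflix’s code and touches no real infrastructure. The names `pick_victim` and `survives_loss` are hypothetical.

```python
import random


def pick_victim(instances, rng=None):
    """Choose one instance at random to terminate."""
    rng = rng or random.Random()
    # Sort first so the choice depends only on the seed, not set order.
    return rng.choice(sorted(instances))


def survives_loss(instances, min_healthy, rng=None):
    """Simulate losing one random instance.

    Returns the victim and whether at least min_healthy
    instances remain afterwards.
    """
    victim = pick_victim(instances, rng)
    remaining = set(instances) - {victim}
    return victim, len(remaining) >= min_healthy


if __name__ == "__main__":
    fleet = {"web-a", "web-b", "web-c"}  # hypothetical instance names
    victim, ok = survives_loss(fleet, min_healthy=2)
    print(f"terminated {victim}; service healthy: {ok}")
```

Running a check like this regularly surfaces hidden single points of failure before a real outage does.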

Netflix also maintains active-active replication for critical data. Though it costs them 25% more on their AWS bill, it serves them (and their customers) very well in just the kind of emergency we’re talking about.

Capacity planning and stricter monitoring of newer services are a must. In this case, an important element of the problem was the increased metadata generated by a new feature called the Global Secondary Index (GSI). GSI allows users to access a table using an alternate key. With GSI, the number of partitions per table increased significantly for some very large tables. With a larger volume of data, the processing time inside the metadata service for some membership requests began to exceed the retrieval time allowed by the storage servers. Due to the limited capacity of the metadata service, this quickly became an outage. According to Amazon:

“We did not have detailed enough monitoring for this dimension (membership size), and didn’t have enough capacity allocated to the metadata service to handle these much heavier requests.”
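The relationship between membership size and the storage servers’ timeout can be illustrated with a small check. This is only a sketch under simplifying assumptions: it treats processing time as proportional to the number of partitions, and the function name `membership_risk`, the per-partition cost, and the warning ratio are all hypothetical values chosen for illustration.

```python
def membership_risk(partitions, ms_per_partition, timeout_ms, warn_ratio=0.7):
    """Classify a membership request against the client timeout.

    Assumes (for illustration only) that processing time grows
    linearly with the number of partitions in the request.
    """
    estimated_ms = partitions * ms_per_partition
    if estimated_ms >= timeout_ms:
        return "exceeds"  # the storage server would give up waiting
    if estimated_ms >= warn_ratio * timeout_ms:
        return "warn"     # close enough to the limit to alert on
    return "ok"


if __name__ == "__main__":
    # Hypothetical numbers: 1 ms per partition, 1,000 ms timeout.
    for partitions in (100, 800, 1200):
        print(partitions, membership_risk(partitions, 1.0, 1000.0))
```

Monitoring a metric like this would flag tables drifting toward the limit well before their requests start timing out.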

Amazon quickly apologized to customers, while noting that DynamoDB has effectively enjoyed 100 percent uptime in the past three years.

“We apologize for the impact to affected customers. While we are proud of the last three years of availability on DynamoDB (it’s effectively been 100%), we know how critical this service is to customers, both because many use it for mission-critical operations and because AWS services also rely on it. For us, availability is the most important feature of DynamoDB, and we will do everything we can to learn from the event and to avoid a recurrence in the future.”

Amazon did its best given the problem it faced, and it was careful to provide helpful and reassuring communication about the AWS outage. Customers of a massive service like AWS should also shoulder their share of the responsibility by designing and operating their infrastructures more like Netflix does.

If you want to get a jump start on DynamoDB, check out Cloud Academy’s Working with Amazon DynamoDB Course.

Learn how to create Amazon DynamoDB tables, add indexes, and query your data in the Introduction to DynamoDB Hands-on Lab.

Written by

Chandan Patra

Cloud Computing and Big Data professional with 10 years of experience in pre-sales, architecture, design, build and troubleshooting with best engineering practices. Specialities: Cloud Computing - AWS, DevOps(Chef), Hadoop Ecosystem, Storm & Kafka, ELK Stack, NoSQL, Java, Spring, Hibernate, Web Service

