Balancing Partitions in Large Tables
Difficulty: Intermediate
Duration: 1h 32m
Students: 20,918
Ratings: 4.6/5
Description

Please note this course is outdated and has been replaced with the following courses:


This course provides an introduction to working with Amazon DynamoDB, a fully managed NoSQL database service from Amazon Web Services. We begin with a description of DynamoDB and compare it to other database platforms. The course continues by walking you through designing tables and reading and writing data, which works somewhat differently than in other databases you may be familiar with. We conclude with more advanced topics, including secondary indexes and how DynamoDB handles very large tables.

Course Objectives

You will gain the following skills by completing this course:

  • How to create DynamoDB tables.
  • How to read and write data.
  • How to use queries and scans.
  • How to create and query secondary indexes.
  • How to work with large tables. 

Intended Audience

You should take this course if you have:

  • An understanding of basic AWS technical fundamentals.
  • Awareness of basic database concepts, such as tables, rows, indexes, and queries.
  • A basic understanding of computer programming. The course includes some programming examples in Python.

Prerequisites 

See the Intended Audience section.

This Course Includes

  • Expert-guided lectures about Amazon DynamoDB.
  • 1 hour and 31 minutes of high-definition video. 
  • Expert-level instruction from an industry veteran. 

What You'll Learn

  • DynamoDB Basics: A basic and foundational overview of DynamoDB.
  • Creating DynamoDB Tables: How to create DynamoDB tables and understand key concepts.
  • Reading and Writing Data: How to use the AWS Console and API to read and write data.
  • Queries and Scans: How to use queries and scans with the AWS Console and API.
  • Secondary Indexes: How to work with secondary indexes.
  • Working with Large Tables: How to use partitioning in large tables.

If you have thoughts or suggestions for this course, please contact Cloud Academy at support@cloudacademy.com.

Transcript

This video will talk about why it's important to make sure that your partitions are balanced when your tables get large. The best way to do that is to select a good partition key. Let's walk through an example that explains why.

Say that we've reserved 300 read capacity units and 100 write capacity units for our table. If the table has four partitions, then each partition will be able to use 75 read capacity units and 25 write capacity units. This works great if you only have a partition key and no sort key: every partition key maps to exactly one item in the table, so the table's data will be nicely balanced. It also works fine if you have a roughly equal number of records for each partition key.
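To make that arithmetic concrete, here is a small Python sketch of the even split. The partition count is illustrative; DynamoDB manages partitioning internally and doesn't expose these numbers directly.

    # Illustrative only: DynamoDB splits a table's provisioned capacity
    # roughly evenly across its partitions.
    table_read_capacity = 300    # read capacity units (RCUs) for the table
    table_write_capacity = 100   # write capacity units (WCUs) for the table
    partition_count = 4          # assumed number of partitions

    reads_per_partition = table_read_capacity / partition_count    # 75.0
    writes_per_partition = table_write_capacity / partition_count  # 25.0

    print(f"Each partition gets {reads_per_partition:.0f} RCUs "
          f"and {writes_per_partition:.0f} WCUs")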

If this table held our order line items and each order had just a few line items, then this would also be fine. The counts don't have to be perfectly equal, but as long as they're close, the partitions will stay pretty balanced on average. It really starts to break down when some partition keys have massively more items than others.

Say Walmart is one of our customers, so their orders have thousands of line items while all of the other orders have far fewer. This could cause the table to end up looking like this: the third partition ends up being so much larger than the rest that it won't fit on the slide.

In this example, one partition is smaller than the rest and one is much larger than the rest, but each still gets one quarter of the reserved read capacity and one quarter of the reserved write capacity. When a partition runs out of provisioned capacity units, DynamoDB reads and writes stop working in that partition; instead you get a provisioned throughput exceeded exception. AWS claims that this resets every second, but in my experience, once a partition gets into that situation it can take a few minutes or longer to get back to normal, even if traffic levels drop quickly.
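As a rough sketch of what this looks like from application code, here is one way to catch and retry that error with boto3, the course's Python SDK. The table and key names are hypothetical, and note that the AWS SDKs already retry this error a few times internally before surfacing it to your code.

    import time
    import boto3
    from botocore.exceptions import ClientError

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("OrderLineItems")  # hypothetical table name

    def get_item_with_backoff(key, max_attempts=5):
        """Read an item, backing off when its partition is out of capacity."""
        for attempt in range(max_attempts):
            try:
                return table.get_item(Key=key).get("Item")
            except ClientError as error:
                if error.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                    raise
                time.sleep(2 ** attempt)  # exponential backoff before retrying
        raise RuntimeError("Provisioned throughput still exceeded after retries")

    item = get_item_with_backoff({"OrderId": "12345", "LineNumber": 1})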

The good news is that AWS gives you some burst capacity for free, which can help with short, mild spikes in traffic. Amazon provides a little bit of burst capacity for each of your tables, which you can consume very quickly when a big burst of traffic arrives. But AWS also uses that burst capacity for its own background maintenance tasks, so you can't always rely on it, and it doesn't tell you how much burst capacity you have available or how much you've used.

In practice, burst capacity can sometimes make things more confusing. I've seen provisioned throughput exceeded exceptions appear suddenly when traffic patterns hadn't changed; the exceptions would continue for a few minutes or a few hours and then go away just as suddenly. I've inferred that this might have been because the table was relying on burst capacity without my realizing it, and that burst capacity wasn't available when it was needed because other system tasks were using it.

In the short term, one way to work around these exceptions is to over-provision your table's capacity. In this example, if we've been running into provisioned throughput exceeded exceptions when reading and writing data, then quadrupling the read capacity units and the write capacity units for the entire table will give the busiest partition enough headroom. But this can be quite expensive when you have a large table.
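Here is a sketch of that workaround with boto3, assuming the example table and the 300 RCU / 100 WCU figures from earlier; quadrupling both values raises every partition's share.

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Over-provision the whole table so the hottest partition's share is
    # large enough. Table name and numbers are illustrative.
    dynamodb.update_table(
        TableName="OrderLineItems",
        ProvisionedThroughput={
            "ReadCapacityUnits": 1200,   # 4 x 300
            "WriteCapacityUnits": 400,   # 4 x 100
        },
    )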

Say your table has 64 partitions. If you're only hitting throughput issues on one or two of them, that means you'll have to allocate massive amounts of throughput to sixty-some partitions that don't even need it. The only true solution is to choose a partition key that ensures the data will be balanced nicely.

In our Order Line Items table, rather than using order ID as the partition key and line number as the sort key, we could use a new artificial partition key such as a line item ID. Every partition key, that is, every line item ID, would map to exactly one line item, so the data would never get out of balance. The downside is that this adds complexity to your table design, and it means you'd have to redesign your queries and the indexes that support those queries.
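As a sketch of what that redesigned table might look like with boto3 (the table name and capacity values are hypothetical):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Artificial partition key: one line item per LineItemId, no sort key,
    # so items spread evenly across partitions.
    dynamodb.create_table(
        TableName="OrderLineItems",
        AttributeDefinitions=[
            {"AttributeName": "LineItemId", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "LineItemId", "KeyType": "HASH"},
        ],
        ProvisionedThroughput={
            "ReadCapacityUnits": 300,
            "WriteCapacityUnits": 100,
        },
    )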

In this example, we would need to add a secondary index just to look up the line items that are included in each order. We'd also have to adjust our existing secondary indexes. We'd have to get rid of our local secondary indexes because the partition key has changed: with no sort key, there's only one line item per partition key, so there's no value in local secondary indexes that help retrieve items within a single partition key.

We can convert those to global secondary indexes instead, which would let us query, for example, for the unshipped items in an order. The underlying principle to remember is that DynamoDB can scale out as you add more data, but the design of your table matters a lot.
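Here is a sketch of adding that kind of global secondary index with boto3 and then querying it to fetch an order's line items; the index, attribute, and capacity values are all hypothetical.

    import boto3
    from boto3.dynamodb.conditions import Key

    client = boto3.client("dynamodb")

    # Add a global secondary index keyed on OrderId so we can still fetch
    # all of the line items that belong to one order.
    client.update_table(
        TableName="OrderLineItems",
        AttributeDefinitions=[
            {"AttributeName": "OrderId", "AttributeType": "S"},
            {"AttributeName": "LineNumber", "AttributeType": "N"},
        ],
        GlobalSecondaryIndexUpdates=[
            {
                "Create": {
                    "IndexName": "OrderId-index",
                    "KeySchema": [
                        {"AttributeName": "OrderId", "KeyType": "HASH"},
                        {"AttributeName": "LineNumber", "KeyType": "RANGE"},
                    ],
                    "Projection": {"ProjectionType": "ALL"},
                    "ProvisionedThroughput": {
                        "ReadCapacityUnits": 100,
                        "WriteCapacityUnits": 50,
                    },
                }
            }
        ],
    )

    # Once the index is active, querying it returns every line item in an order.
    table = boto3.resource("dynamodb").Table("OrderLineItems")
    response = table.query(
        IndexName="OrderId-index",
        KeyConditionExpression=Key("OrderId").eq("12345"),
    )
    line_items = response["Items"]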

If you expect your data to stay within one partition, that is, to stay under 10 GB and under 1,000 capacity units, then these table design issues shouldn't matter much. But if you expect your data will eventually be split across many partitions, you'll need to make sure it stays balanced across the different partition keys, or you'll start running into problems as you scale.

About the Author

Ryan is the Storage Operations Manager at Slack, a messaging app for teams. He leads the technical operations for Slack's database and search technologies, which use Amazon Web Services for global reach.

Prior to Slack, Ryan led technical operations at Pinterest, one of the fastest-growing social networks in recent memory, and at Runscope, a debugging and testing service for APIs.

Ryan has spoken about patterns for modern application design at conferences including Amazon Web Services re:Invent and O'Reilly Fluent. He has also been a mentor for companies participating in the 500 Startups incubator.