Introduction to Partitioning

Overview
Difficulty: Intermediate
Duration: 1h 32m
Students: 9770
Rating: 4.7/5

Description

Course Description

This course provides an introduction to working with Amazon DynamoDB, a fully-managed NoSQL database service provided by Amazon Web Services. We begin with a description of DynamoDB and compare it to other database platforms. The course continues by walking you through designing tables, and reading and writing data, which is somewhat different than other databases you may be familiar with. We conclude with more advanced topics including secondary indexes and how DynamoDB handles very large tables.

Course Objectives

You will gain the following skills by completing this course:

  • How to create DynamoDB tables.
  • How to read and write data.
  • How to use queries and scans.
  • How to create and query secondary indexes.
  • How to work with large tables. 

Intended Audience

You should take this course if you have:

  • An understanding of basic AWS technical fundamentals.
  • Awareness of basic database concepts, such as tables, rows, indexes, and queries.
  • A basic understanding of computer programming. The course includes some programming examples in Python.

Prerequisites 

See the Intended Audience section.

This Course Includes

  • Expert-guided lectures about Amazon DynamoDB.
  • 1 hour and 31 minutes of high-definition video. 
  • Expert-level instruction from an industry veteran. 

What You'll Learn

Video Lecture | What You'll Learn
DynamoDB Basics | A basic and foundational overview of DynamoDB.
Creating DynamoDB Tables | How to create DynamoDB tables and understand key concepts.
Reading and Writing Data | How to use the AWS Console and API to read and write data.
Queries and Scans | How to use queries and scans with the AWS Console and API.
Secondary Indexes | How to work with Secondary Indexes.
Working with Large Tables | How to use partitioning in large tables.

If you have thoughts or suggestions for this course, please contact Cloud Academy at support@cloudacademy.com.

Transcript

The final segment of this course will discuss how to work with DynamoDB tables that get very large.

When a table gets too large for DynamoDB to handle efficiently, it doesn't get slow or stop you from adding more data. Instead, DynamoDB transparently splits your table into smaller segments, or partitions. When we say that DynamoDB can scale infinitely large, this is how it does that. There is no practical limit to the size of a table. I've seen AWS customers create tables with trillions of rows which take up terabytes of storage space. DynamoDB will split your table into partitions when the size of the data grows larger than 10 gigabytes, or when the total number of read and write capacity units exceeds a certain level.

This is why the primary key for a DynamoDB item is called a partition key. The partition key is used to determine which partition an item should be stored in. DynamoDB knows how to map partition keys to the physical partitions that hold the data on their servers so you don't need to think too much about the physical partitions themselves.

When you have a table with a compound primary key like we did for our order line items table, then all of the records with the same partition key are stored in the same physical partition, on the same DynamoDB server. One order might have 10 line items which will all be stored together in one partition. Another order might have five line items. They're stored together too, maybe in the same physical partition or maybe in a different one. It's often helpful to be aware of how many partitions your table will have.

Amazon does not make this number visible to customers, either in the web console or through the API. You can sometimes get the exact number by contacting AWS Premium Support, but you can also estimate it yourself. Amazon does publish the formula for how many partitions your table will have, but it takes some thought to understand the nuances. It's a function of the size of the table and the amount of I/O capacity that you've provisioned for it. The number of partitions can be estimated by choosing whichever of these two values is larger: the table size (the storage space) divided by 10 gigabytes, or the sum of the read capacity units divided by 3,000 plus the write capacity units divided by 1,000. Round the result up to a whole number, as the examples that follow do.
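The estimate described above can be written as a small Python function. This is only a sketch of the formula as stated in the transcript. The function name `estimate_partitions` and its parameter names are made up for illustration, and rounding up is an assumption inferred from the examples in this lecture.

```python
import math


def estimate_partitions(size_gb, rcu, wcu):
    """Estimate the partition count from table size and provisioned capacity.

    Hypothetical helper: follows the formula given in the transcript,
    rounding each term up (an assumption based on the lecture's examples).
    """
    by_size = math.ceil(size_gb / 10)                   # 10 GB per partition
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)  # capacity per partition
    return max(1, by_size, by_throughput)               # a table always has at least one


# A 380 GB table with modest capacity is sized by storage, not throughput.
print(estimate_partitions(size_gb=380, rcu=3000, wcu=1000))
```

The `max(1, ...)` guard simply reflects that even an empty table with minimal capacity lives in one partition.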

Let's go through some examples to see how this formula works in practice. Here's an example of a small, busy table. The table size is under 10 gigabytes so it should fit into one partition. But we've allocated 15,000 read capacity units and 2000 write capacity units. So it's actually going to be split into seven partitions.

Here's an example of a larger table where the number of partitions is a function of the table size. Based on the number of capacity units provisioned it shouldn't need more than a couple of partitions, but the table is 380 gigabytes large so it's going to use 38 partitions.

Be careful doing a bulk import when you create a table. If you provision too much write capacity, your table will be split into a bunch of partitions even before you have any data. In this example, the table has no data and very few read capacity units, but 10,000 write capacity units. Because of this, the table is actually going to be split into more than 10 partitions just because of the number of write capacity units that were provisioned in order to import data quickly. Partitions can be split, so the table's number of partitions can grow, but they can never be joined or reduced. So in this example, you'd be stuck with at least 11 partitions for all time.

As another example, be careful when you're doing a bulk read operation like exporting data. If you add too much read capacity, your table may repartition unexpectedly. In this example, the table had 4 gigabytes of data, 600 read capacity units, and 200 write capacity units, so it fit within one partition just fine. But when we added 30,000 read capacity units in order to do an export very quickly, that actually caused the table to be split into 11 different partitions.
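The bulk import and bulk export cases share one point: the partition count only ever goes up. The sketch below tracks that over a sequence of settings changes. The function names are hypothetical, and the 100 read capacity units in the import case is an assumed value standing in for "very few", which the lecture does not quantify.

```python
import math


def required_partitions(size_gb, rcu, wcu):
    """Partitions needed for one snapshot of size and capacity (hypothetical helper)."""
    return max(1, math.ceil(size_gb / 10), math.ceil(rcu / 3000 + wcu / 1000))


def partitions_over_time(history):
    """Return the partition count after each (size_gb, rcu, wcu) step.

    The count never decreases, because partitions are never merged.
    """
    counts = []
    current = 1
    for size_gb, rcu, wcu in history:
        current = max(current, required_partitions(size_gb, rcu, wcu))
        counts.append(current)
    return counts


# Export case: normal settings, a temporary read capacity boost, then back to normal.
export_history = [(4, 600, 200), (4, 30000, 200), (4, 600, 200)]
print(partitions_over_time(export_history))  # [1, 11, 11]

# Import case: empty table created with 10,000 WCU (RCU of 100 is an assumption).
import_history = [(0, 100, 10000), (4, 600, 200)]
print(partitions_over_time(import_history))  # [11, 11]
```

Lowering the capacity afterwards does not bring the count back down, which is why the temporary boost has a lasting effect.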

One last thing to keep in mind. If your table grows naturally then the number of partitions will usually wind up being a power of two. That's because when the partitions grow naturally they each split independently. So you'll split from one partition into two then as those two grow it will go from two partitions into four, and so on.

Let's walk through how the splitting happens, which will make it clearer why it often ends up being a power of two. We'll start with a small table here with just a few rows. Since it's small, this table has just one partition. Amazon doesn't give us much visibility into our partitions or their names or IDs, so let's just call this one "Unicorn Partition." Over time the data in unicorn partition will get larger, or you might add more provisioned capacity. At some point, your partition will hit the limits of what DynamoDB can handle in a single partition. When that happens, DynamoDB will start copying your data into two new partitions. Let's call these cat and bee. This happens transparently. You don't have to do anything to trigger this and, in fact, you don't get any notification that it's happening. But it shouldn't affect your application, so it's not something that you generally need to be concerned with. Eventually, the data will finish migrating to the new partitions and DynamoDB will clean up the old partition. Now you're left with two partitions, cat and bee, each with half as much data as before. At this point, each partition should have exactly half of the data in the table, right? Well, not exactly half the data. Rather, each partition will have half of the partition key space. Half of the partition keys will be assigned to the left partition and the other half will be assigned to the right partition. As users, we don't know exactly how DynamoDB chooses which data goes into which partitions.

Behind the scenes, DynamoDB uses a hash algorithm to choose which partition each item belongs in. That's why partition keys are sometimes called hash keys. Because Amazon doesn't share the hash algorithm with its customers, I'm using Greek letters here to denote that the assignment isn't clear just from the values you put in your partition keys. If this table held our order line items, maybe orders 1 to 1,000 would be in the left partition, and 1,001 to 2,000 would be in the right one. Or maybe even-numbered order IDs would be on the left, and odd-numbered order IDs would be on the right. All we know is that half of the partition keys will be assigned to each partition. And again, they'll start growing and hitting the limits of a single partition, although maybe not at exactly the same time. As each partition hits its limits, it will start splitting into even more partitions. DynamoDB will create two new partitions and migrate the data out of cat partition. Each of the new partitions will be assigned half of the partition keys from the cat partition but, again, we don't know exactly which partition keys will go into which of the new partitions. Then it will clean up cat partition, leaving us with three partitions. And as bee partition starts to grow, it will hit its limits too and it will get split into smaller partitions as well.
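To make the idea of splitting the key space concrete, here is a toy model in Python. It is not how DynamoDB is implemented: the real hash algorithm is not published, so MD5, the 32-bit hash space, and all function and partition names below are stand-ins chosen purely for illustration.

```python
import hashlib

HASH_SPACE = 2 ** 32  # assumed hash space size, chosen only for illustration


def key_hash(partition_key):
    """Map a partition key to an integer in [0, HASH_SPACE).

    MD5 is a stand-in here, not DynamoDB's actual hash function.
    """
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def split(partition, left_name, right_name):
    """Split one (name, low, high) hash range into two halves of its key space."""
    _, low, high = partition
    mid = (low + high) // 2
    return (left_name, low, mid), (right_name, mid, high)


def locate(partition_key, partitions):
    """Return the name of the partition whose hash range contains the key."""
    h = key_hash(partition_key)
    for name, low, high in partitions:
        if low <= h < high:
            return name
    raise ValueError("hash ranges do not cover the key space")


unicorn = ("unicorn", 0, HASH_SPACE)
cat, bee = split(unicorn, "cat", "bee")
# Later, cat hits its limits and splits again, leaving three partitions.
cat_left, cat_right = split(cat, "cat-left", "cat-right")
partitions = [cat_left, cat_right, bee]

for order_id in range(1, 6):
    print(order_id, locate(order_id, partitions))
```

Because placement depends on the hash rather than the key's value, neighbouring order IDs can land in different partitions, and each split halves the key range rather than the stored data.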

This process can happen over and over, which is how DynamoDB can handle such massively large tables. Now, if your partitions grow evenly, keep in mind that each one will split around the same time. Remember that each partition can hold 10 gigabytes of data, so a table with 19 or 20 gigabytes will be stored in two partitions. But when the data grows beyond 20 gigabytes, and it's distributed evenly across both partitions, each of them will have grown to just over 10 gigabytes, so both of them will split around the same time. Your data will now be 20 or 21 gigabytes, but you'll have 4 partitions, not just 3.

So how many partitions will your table have? Well, it will usually be a power of two. Say you have a table with 400 gigabytes of data. That's not going to be 40 partitions. It's going to be rounded up to 64. Sounds good, right? Well, as you start to have many partitions you can actually start running into problems. These problems can happen when your data or your read and write traffic is not evenly distributed across all your partitions.
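The rounding to a power of two can be sketched as follows. It assumes the table grew naturally with evenly distributed data, so that all partitions split at about the same time. The function name and its `partition_limit_gb` parameter are hypothetical.

```python
import math


def partitions_with_even_growth(size_gb, partition_limit_gb=10):
    """Estimate partitions for a table that grew naturally and evenly.

    Assumes every partition splits in two at about the same time, so the
    count doubles until it covers the storage (hypothetical helper).
    """
    needed = max(1, math.ceil(size_gb / partition_limit_gb))
    count = 1
    while count < needed:
        count *= 2  # each round of splits doubles the partition count
    return count


print(partitions_with_even_growth(19))   # 2
print(partitions_with_even_growth(21))   # 4
print(partitions_with_even_growth(400))  # 64
```

A table whose partitions split at different times, or whose count was driven by provisioned capacity, can end up with a number that is not a power of two.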

In the earlier videos, you've seen us provision read capacity and write capacity for our table. Well, when a table is split into partitions, each partition is given an equal slice of read capacity and write capacity. If your workload isn't evenly distributed across those partitions, you're going to run into problems. The next video will explain how those problems can happen, and what you can do about it.
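The equal slicing of capacity is simple arithmetic, sketched here with a hypothetical helper. The numbers reuse the export example from earlier in this lecture: a table provisioned with 600 read and 200 write capacity units that ended up with 11 partitions.

```python
def per_partition_capacity(rcu, wcu, partitions):
    """Divide provisioned capacity equally across partitions (hypothetical helper)."""
    if partitions < 1:
        raise ValueError("a table has at least one partition")
    return rcu / partitions, wcu / partitions


rcu_each, wcu_each = per_partition_capacity(600, 200, 11)
print(f"{rcu_each:.1f} RCU and {wcu_each:.1f} WCU per partition")
```

Under this model, the more partitions a table has, the smaller the share of capacity available to any single partition key.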

About the Author

Ryan is the Storage Operations Manager at Slack, a messaging app for teams. He leads the technical operations for Slack's database and search technologies, which use Amazon Web Services for global reach.

Prior to Slack, Ryan led technical operations at Pinterest, one of the fastest-growing social networks in recent memory, and at Runscope, a debugging and testing service for APIs.

Ryan has spoken about patterns for modern application design at conferences including Amazon Web Services re:Invent and O'Reilly Fluent. He has also been a mentor for companies participating in the 500 Startups incubator.