image
Understanding Secondary Indexes
Start course
Difficulty
Intermediate
Duration
1h 32m
Students
20917
Ratings
4.6/5
Description

Please note this course is outdated and has been replaced with the following courses:

 

This course provides an introduction to working with Amazon DynamoDB, a fully-managed NoSQL database service provided by Amazon Web Services. We begin with a description of DynamoDB and compare it to other database platforms. The course continues by walking you through designing tables, and reading and writing data, which is somewhat different than other databases you may be familiar with. We conclude with more advanced topics including secondary indexes and how DynamoDB handles very large tables.

Course Objectives

You will gain the following skills by completing this course:

  • How to create DynamoDB tables.
  • How to read and write data.
  • How to use queries and scans.
  • How to create and query secondary indexes.
  • How to work with large tables. 

Intended Audience

You should take this course if you have:

  • An understanding of basic AWS technical fundamentals.
  • Awareness of basic database concepts, such as tables, rows, indexes, and queries.
  • A basic understanding of computer programming. The course includes some programming examples in Python.

Prerequisites 

See the Intended Audience section.

This Course Includes

  • Expert-guided lectures about Amazon DynamoDB.
  • 1 hour and 31 minutes of high-definition video. 
  • Expert-level instruction from an industry veteran. 

What You'll Learn

Video Lecture What You'll Learn
DynamoDB Basics A basic and foundational overview of DynamoDB.
Creating DynamoDB Tables How to create DynamoDB tables and understand key concepts.
Reading and Writing Data How to use the AWS Console and API to read and write data.
Queries and Scans How to use queries and scans with the AWS Console and API.
Secondary Indexes How to work with Secondary Indexes.
Working with Large Tables How to use partitioning in large tables.

If you have thoughts or suggestions for this course, please contact Cloud Academy at support@cloudacademy.com.

Transcript

In the last lesson, we discussed how you can only perform a query against an indexed column unless you're willing to scan through the entire table. So all of the queries we route filtered the data that they returned by the partition key, part of the primary key. But what if you want to filter your data by another attribute in a table? We've shown how to look up individual orders by their order ID, but sometimes you want to find all of the orders for a customer. Or we've seen how to look at order line items by order ID, but what if you want to find all the unshipped items? Or all the recent orders of a particular product? Well, that's where secondary indexes come in.

DynamoDB lets you create additional indexes so that you can write queries to search your data by other attributes. If you've worked with relational databases, you've probably used indexes with those. But there are a couple of big differences in how indexes operate in DynamoDB. First, each query can only use one index. If you want to query and match on two different columns, you need to create an index that can do that properly. Second, when you write your queries you need to specify exactly which index should be used for each query. It's not like a relational database that has a query analyzer, which can automatically decide which indexes to use for our query. Here, you need to be explicit and tell DynamoDB what index to use.

DynamoDB has two different kinds of secondary indexes. Global secondary indexes lets you query across the entire table to find any record that matches a particular value. By contrast, local secondary indexes can only help find data within a single partition key. Both of these are useful, but the way you work with them is very different.

Lets walk through some examples that will help make it easier to understand when to use each type. This is a simplified view of our orders table. The order ID is the partition key for the table. But what if we want to be able to search for orders by customer? Say to find all the orders for customer ID 40343. That's when we would add a global secondary index on the customer ID attribute. By setting up the index like that, we could find all of the orders for that customer 40343, for example.

Behind the scenes, these global secondary indexes are just like DynamoDB tables of their own. A global secondary index has its own partition key and also contains other attributes like the order ID, which is needed to match the records to items in the main table. Global secondary indexes use storage space that's separate from the main table, and in fact you have to configure provision throughput for each global secondary index, just like you do for a table. When you query the index, DynamoDB retrieves the data that's included in the index. So if you queried this secondary index for orders from customer 40343, DynamoDB would return three order IDs. But you probably want your query to return all the attributes from the main table. You can include those attributes in the index. This is called projecting attributes into the index.

By default, all of the attributes in the table are projected into the index. This way, when you query the index you can retrieve all of those attributes along with the keys. But this uses more space and more throughput than an index that only includes its own keys and the keys of the table. So you have the option to choose what to project into the index. You can project everything, you can project nothing except for the keys, or you can specify exactly which attributes should be projected into the index.

Global secondary indexes can also have sort keys, not just partition keys. This would help if, for example, we often wanted to find the most recent orders for a given customer. In that case, we could add a sort key to this secondary index, which would be the order date. If we did that, then within each partition key, within each customer ID, the records would be sorted by the order date. This would make it possible to query for the five most recent orders for a particular customer.

Since global secondary indexes are so similar to tables, technically you could actually avoid using them altogether and just create them yourself by making separate tables. All the querying would look exactly the same as it does with a global secondary index, but the big advantage to using indexes is that DynamoDB keeps the indexes updated for you automatically. When you insert or modify data in the main table, DynamoDB will automatically ensure that the index receives the same updates at the same time. It's sort of like having a handful of tables that are automatically linked together.

Here's another example using our order line items table. As a reminder, this table has a composite primary key. Together, the order ID and the line number make up the unique identifier for each item. Let's say that we wanted to add an index on the product ID column so that we could quickly find the orders that include a particular product.

Here's how that index would be organized. The product ID attribute would be the partition key for the index. So now you can query for all orders matching a particular product ID. Say we wanted to find unshipped orders for a given product. We could filter this and scan through all of the orders for a product ID to find the ones that are unshipped, but we could also add the status column as a sort key. The sort key would become an optional way to match a second column in the queries. So that would allow us to find unshipped orders for a particular product, and here's how that would look. The index's partition key would stay the product ID, and the sort key would be status. This would make it very quick and easy to find all shipped or all unshipped orders for a product. But it still wouldn't let you query for all the unshipped items across all orders because queries must specify a single partition key. If you wanted to find all unshipped items across all orders, you'd need to add another index with status as the partition key.

We've shown some examples of how to use global secondary indexes, but how would you use local secondary indexes? Well, local secondary indexes live within each partition key and help you filter data within that partition. This means that they're not useful unless you have a compound primary key. If you don't, then there would never be more than one item with a given partition key, so there's no reason you want a local secondary index.

Let's take another look at our order line items table. Here's the table again, grouped by its partition key. What if you wanted to find all of the unshipped items for a specific order? Well, you could query for all the line items in that order and use a filter to exclude the items that have already been shipped. But filtering tends to be expensive, and so instead we might want to use a local secondary index. If we add a local secondary index on the status attribute, then DynamoDB can use the index to identify exactly which items have been shipped and which ones have not.

Another way to think about local secondary indexes is that they're similar to global indexes, except that the indexes partition key must be the same as the table's partition key. If your queries will work by limiting the results to a single partition key value, then you should use a local secondary index. If you need to query across all partition key values, then use a global secondary index.

I want to point out a couple other differences between the two types of indexes. Both types of indexes can be created at the same time the table is created, but only global secondary indexes can be added or changed or removed later. If you want to add a local secondary index to an existing table, or remove a local secondary index, you would need to delete the table and start over. And when you create a global secondary index, you have to revision separate throughput capacity for the index.

By contrast, local secondary indexes share their provision throughput with the main table. Although you don't have to provision any extra throughput for the local secondary indexes separately, remember that you are using additional capacity units in order to update indexes or every time you query the index. So you'll still need to make sure that you have enough capacity available for your workload.

There's also a limit to the number of indexes on each table. You can only create five global secondary indexes and five local secondary indexes for each table. This is a hard limit, and as far as I know it can't be changed. In the next two videos we'll go through the process of creating secondary indexes and querying them.

About the Author

Ryan is the Storage Operations Manager at Slack, a messaging app for teams. He leads the technical operations for Slack's database and search technologies, which use Amazon Web Services for global reach.

Prior to Slack, Ryan led technical operations at Pinterest, one of the fastest-growing social networks in recent memory, and at Runscope, a debugging and testing service for APIs.

Ryan has spoken about patterns for modern application design at conferences including Amazon Web Services re:Invent and O'Reilly Fluent. He has also been a mentor for companies participating in the 500 Startups incubator.