Big Data Storage
Course two of the Big Data Specialty learning path focuses on storage. In this course, we outline the key storage options for big data solutions. We determine data access and retrieval patterns, and some of the use cases that suit particular data patterns such as evaluating mechanisms for capture, update, and retrieval of catalog entries. We learn how to determine appropriate data structure and storage formats, and how to determine and optimize the operational characteristics of a Big Data storage solution.
Amazon Aurora is now MySQL and PostgreSQL-compatible.
- Recognize and explain big data access and retrieval patterns.
- Recognize and explain appropriate data structure and storage formats.
- Recognize and explain the operational characteristics of a Big Data storage solution.
This course is intended for students looking to increase their knowledge of the AWS storage options available for Big Data solutions.
While there are no formal prerequisites for this course, students will benefit from having a basic understanding of cloud storage solutions. Our courses on AWS storage fundamentals and AWS database fundamentals will give you a solid foundation for taking this present course.
This Course Includes
- Over 90 minutes of high-definition video.
- Real-Life Scenarios using AWS Reference Architecture
What You'll Learn
- Course Intro: What to expect from this course.
- Amazon DynamoDB: How you can use Amazon DynamoDB in Big Data scenarios.
- Amazon DynamoDB Reference Architecture: A real-life model using DynamoDB
- Amazon Relational Database Service: A look at how Amazon RDS works and how you can use it in Big Data scenarios.
- Amazon Relational Database Service Reference Architecture: A real-life model using RDS.
- Amazon Redshift: An overview of how Amazon Redshift works and how you can use it in Big Data scenarios.
- Amazon Redshift Reference Architecture: A real-life model using Redshift.
Welcome to Big Data on AWS. Storing data using Amazon DynamoDB. At the end of this module, you will be able to describe in detail how Amazon DynamoDB can be used to store data within a Big Data solution.
So let's have a look at the first of the Amazon storage services we will cover, Amazon DynamoDB, and discuss how it works and how you can use it in a Big Data scenario. Amazon DynamoDB is primarily designed to store data, and to serve access to this data to an application or a downstream analytical process. DynamoDB is a NoSQL data storage service, which means it does not operate on a relational model, and therefore cannot be queried using ANSI standard SQL. It's similar to other NoSQL solutions, such as MongoDB. When choosing a Big Data storage solution from within the available Amazon service offerings, it is important to determine whether the data sources we are primarily storing contain structured, semi-structured, or unstructured data. This will typically drive the decision on which AWS service is best for that data pattern or use case.
Amazon DynamoDB is primarily designed to manage semi-structured data. DynamoDB tables do not have a fixed schema. They are schemaless, so each data item can have a different number of attributes. The data items in a table do not need to have the same attributes, or even the same number of attributes, which is very different from a relational database, which does require the attributes in a table to be consistent.
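To make the schemaless idea concrete, here is a minimal sketch in plain Python, with no AWS calls involved. The table and attribute names (People, PersonID, and so on) are made up for illustration. It simply shows two items that would be valid in the same table even though they share only the key attribute.

```python
# Two items from a hypothetical "People" table. Only the primary key
# attribute (PersonID) appears in both; every other attribute is optional.
people = [
    {"PersonID": 101, "LastName": "Smith", "FirstName": "Fred", "Phone": "555-4321"},
    {"PersonID": 102, "Company": "Example Ltd", "Address": {"City": "Wellington"}},
]


def attribute_names(items):
    """Return the set of attribute names used by each item."""
    return [set(item) for item in items]


names = attribute_names(people)
shared = set.intersection(*names)  # attributes present on every item
print(shared)
```

A relational table would force both rows to carry the same columns. Here the only thing the two items must agree on is the key.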
Amazon DynamoDB is targeted at providing transactional processing patterns at a speed and a scale that relational databases can not achieve. When choosing a Big Data processing solution from within the available Amazon service offerings, it is important to determine whether you need the latency of response from the process to be in seconds, minutes, or hours. This will typically drive the decision on which Amazon service is the best for that processing pattern or use case. Amazon DynamoDB is primarily designed to deliver transaction orientated processing, but at such a low level of latency, that it provides near real-time processing.
Amazon DynamoDB is a NoSQL database in the cloud suitable for anyone needing a reliable and fully managed NoSQL solution. DynamoDB is designed to provide automated storage scaling and low latency. It is particularly useful when your application must read and store massive amounts of data, and you need speed and reliability. With DynamoDB, you can create database tables that can store and retrieve any amount of data, and serve any level of request traffic. You can scale up or scale down your table's throughput capacity without downtime or performance degradation.
Amazon DynamoDB is based on a platform as a service style architecture, where you determine the throughput or the capacity you require, and the architecture and components are automatically provisioned, installed, and configured for you. You have no need or ability to change the way these architectural components are deployed. Some of the other Amazon Big Data services have a container that the service sits within, for example, a DB instance within Amazon RDS. Amazon DynamoDB doesn't. The container is effectively the combination of the account and the region you are provisioning the DynamoDB tables within. Within the account, there is a concept of a table. In DynamoDB, data is stored in tables, which are similar in concept to tables in other database management systems. Within a table, there is a concept of an item and attributes, which we will discuss later. Apart from the tables, the only other architectural component we need to be aware of is the partition key.
All of your DynamoDB data is stored on solid-state disks (SSDs). DynamoDB automatically spreads the data and traffic for your tables over a sufficient number of servers to handle your throughput and storage requirements, while maintaining consistent and fast performance. So, we have no control over the number of servers, instance sizes, or storage IOPS, like we do with other Amazon Big Data services. For DynamoDB, the focus and pricing is based on how many reads and writes you provision, and we will discuss the read and write provisioned throughput concept in more detail shortly. DynamoDB automatically replicates the data across three availability zones in the Amazon region, similar to Amazon S3, and therefore provides built-in high availability and data durability.
So let's have a look at how you provision a DynamoDB table. Well, as always, Amazon has made it incredibly simple for you to provision the capability. You choose a table name, a primary key, and an optional sort key if you require one, and then Amazon DynamoDB will create everything else for you. So as you can see, it is a very simple architecture on the surface, with a large number of moving parts under the covers, which we never get to see, and we do not need to care about. The world of NoSQL databases is emerging fast, and there are a large number of different technologies and solutions that live under the NoSQL banner.
These are typically grouped into four different NoSQL database patterns. A key-value database works by matching keys with values, similar to a dictionary. Data is stored as a key, for example, the answer to life, together with a matching value, for example, 42. This data can be retrieved later by supplying the key, and the value, i.e., 42, will be returned. There is no structure or relation in the data stored, apart from the fact that there is a key and a value that relates to that key. In a document NoSQL database, data, which is a collection of key-value pairs, is stored as a document. A document store is quite similar to a key-value store, but the main difference is that the values stored, referred to as documents, provide some structure and encoding for the stored values.
Examples are storing XML or JSON objects. In a column-orientated NoSQL database, data is stored in cells grouped in columns of data, rather than rows of data. They are effectively two-dimensional arrays, whereby each key row has one or more key-value pairs attached to it. This approach allows for very large and semi-structured data to be kept and used. The graph-based NoSQL models represent the data in a completely different way than the previous three patterns. They use tree-like structures, i.e., graphs, with nodes and edges connecting each other through relations.
These databases are commonly used by applications where it is necessary to establish clear boundaries for connections. For example, when you register to a social network of any sort, your friends' connections to you, and their friends' friends' relation to you, are much easier to work out using graph-based database management systems. Amazon DynamoDB supports key-value data structures. Each item or row is a key-value pair, with a primary key as the only required attribute for items in a table, which uniquely identifies each item. DynamoDB is schemaless. Each item can have any number of attributes or columns.
Amazon DynamoDB also supports document data structures, allowing you to store entire JSON format documents in DynamoDB, up to the 400KB maximum item size. One of the interesting things is that today NoSQL databases are probably better described as not only SQL, rather than NoSQL. So how do NoSQL and relational databases compare? NoSQL databases typically do not enforce a schema. A partition key is generally used to retrieve values, column sets, or semi-structured JSON or XML.
The relational model, on the other hand, normalizes data into tabular structures known as tables, which consist of rows and columns. A schema strictly defines the tables, columns, indexes, relationships between the tables, and other database elements. DynamoDB does not follow the typical ACID properties of Atomicity, Consistency, Isolation, and Durability. Instead, most NoSQL databases offer a concept of eventual consistency, in which database changes are propagated to all nodes eventually, typically within milliseconds, so queries for data might not return updated data immediately, or might return data that is out of date, a problem that is known as stale reads. NoSQL databases are designed to scale out using distributed clusters of low-cost hardware, to increase throughput without increasing latency. Relational databases, on the other hand, are typically easiest to scale up with faster hardware, or via a grid-style architecture, which is implemented to enable relational tables to span a distributed system.
In DynamoDB, object-based APIs allow developers to easily store and retrieve data via web service calls, over HTTP or HTTPS connections. These connections are stateless. Requests to store and retrieve data from a relational database are communicated using queries, which conform to a structured query language (SQL), via a stateful connection. These queries are parsed and executed by the relational database management system. NoSQL databases generally offer tools to manage the database and scaling. Custom applications are the primary interface to the underlying data, rather than readily available query tools. In the relational database world, there is a rich set of SQL tools available to create, manage, and query relational data. Similar to other database management systems, DynamoDB stores data in tables.
A table is a collection of data. For example, you could create a table named people, where you can store information about your customers, your prospects, your employees, or anyone else of interest. You can also have a products table to store information about the products that the customers have purchased, the products prospects have viewed on your website, or the products employees have placed on a shelf within a warehouse.
In DynamoDB, you are limited to 256 tables per region. Each table contains multiple items. An item is a group of attributes that is uniquely identifiable amongst all the other items. In a people table, each item represents one person. For a products table, each item would represent one product. Items are similar to rows in a relational database system.
In DynamoDB, there is no limit to the number of items that you can store in a table. Each item is composed of one or more attributes. An attribute is a fundamental data element, something that does not need to be broken down any further. For example, a product item might have attributes, such as product ID, product name, product type, and so on. An item in a people table can contain attributes, such as person ID, last name, first name, address and so on. Attributes in DynamoDB are similar in many ways to columns in other relational systems. Attributes are the lowest level of data. They are something that does not need to be broken down further; however, attributes can be nested, which means you can store attributes within an attribute, up to 32 levels deep.
For example, within a person item, you could have an attribute of address, which holds a nested set of attributes, such as street, city, and postal code. When you create a table, in addition to the table name, you must specify the primary key of the table. As in other databases, a primary key in DynamoDB uniquely identifies each item in the table, so that no two items can have the same key. When you add, update, or delete an item in the table, you must specify the primary key attribute value for that item. The key values are required. You cannot create an item without having a primary key for it. So in this example, the primary key is the person ID.
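The primary key rules just described can be sketched with a small in-memory stand-in. This is not DynamoDB code: `InMemoryTable` and its method names are hypothetical, chosen only to mirror the ideas of putting, getting, and deleting items by key. Replacing an existing item when the same key is written again matches how a plain put behaves in DynamoDB.

```python
class InMemoryTable:
    """Toy stand-in for a table with a simple primary key (illustration only)."""

    def __init__(self, key_name):
        self.key_name = key_name
        self._items = {}  # primary key value -> item

    def put_item(self, item):
        if self.key_name not in item:
            raise ValueError(f"missing primary key attribute {self.key_name!r}")
        # Storing by key guarantees uniqueness: a second put with the same
        # key replaces the earlier item rather than creating a duplicate.
        self._items[item[self.key_name]] = dict(item)

    def get_item(self, key_value):
        return self._items.get(key_value)

    def delete_item(self, key_value):
        self._items.pop(key_value, None)

    def __len__(self):
        return len(self._items)


people = InMemoryTable("PersonID")
people.put_item({
    "PersonID": 101,
    "LastName": "Smith",
    # A nested attribute: address holds its own set of attributes.
    "Address": {"Street": "1 Main St", "City": "Wellington", "PostalCode": "6011"},
})
people.put_item({"PersonID": 101, "LastName": "Smith-Jones"})  # same key: replaces
```

After the second put there is still exactly one item for person 101, and an item without a PersonID is rejected outright.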
DynamoDB supports two different types of primary keys. The first is the partition key, a simple primary key composed of one attribute known as the partition key. DynamoDB uses the partition key's value as an input to an internal hash function. The output from the hash function determines the partition where the item is stored. Partitioning of the data is one of the techniques DynamoDB uses to provide fast performance when querying data. With a simple primary key, no two items in a table can have the same partition key value. The second is the partition key and sort key. This is the composite primary key, composed of two attributes, the first being the partition key, and the second being the sort key. With this composite key, DynamoDB still uses the partition key value as an input to an internal hash function, and the output from the hash function determines the partition where the data is stored. All items with the same partition key are stored together in the same partition, and they are stored in sort order by the sort key value. With a composite primary key, it is possible for two items to have the same partition key value, but those two items must have different sort key values.
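Here is a rough sketch of that placement logic in plain Python. Several things are assumptions made for illustration: the fixed count of four partitions, the use of MD5 as the hash (DynamoDB's internal hash function is not something we get to see), and the artist and song title values. The point is only to show a partition key choosing the partition and a sort key ordering the items inside it.

```python
import hashlib
from bisect import insort

NUM_PARTITIONS = 4  # assumption for the sketch; DynamoDB manages the real number


def partition_for(partition_key: str) -> int:
    """Map a partition key value to a partition number using a stable hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


partitions = {p: [] for p in range(NUM_PARTITIONS)}


def put(artist: str, song_title: str) -> None:
    # Items are kept ordered by (partition key, sort key) within a partition.
    insort(partitions[partition_for(artist)], (artist, song_title))


def query(artist: str):
    """Return all song titles for one artist, in sort key order."""
    return [title for a, title in partitions[partition_for(artist)] if a == artist]


put("No One You Know", "My Dog Spot")
put("No One You Know", "Call Me Today")
put("The Acme Band", "Still in Love")

print(query("No One You Know"))
```

Because the hash depends only on the partition key, every item for one artist lands in the same partition, and a query for that artist only has to look there.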
So in the music example, you could create a composite primary key of artist and song title. All the data for the artist would be stored in the same partition, and then sorted by the song title. The combination of artist and song title must always be unique, as it is the key. Note: The partition key is also known as the hash attribute, and the sort key is also known as the range attribute. Each primary key attribute must be a scalar type, which means it can only hold a single value, and only string, number, or binary values are allowed.
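As a sketch of how those key choices are expressed, the helper below builds the parameter dictionary for a CreateTable request. The helper name `create_table_params` and the default of five read and five write units are assumptions for illustration. The dictionary keys (`TableName`, `AttributeDefinitions`, `KeySchema`, `ProvisionedThroughput`), the `HASH` and `RANGE` key types, and the `S`, `N`, and `B` type codes are the ones the DynamoDB API uses.

```python
SCALAR_KEY_TYPES = {"S", "N", "B"}  # string, number, binary


def create_table_params(table_name, partition_key, sort_key=None,
                        read_units=5, write_units=5):
    """Build CreateTable parameters. Keys are (name, type) tuples."""
    key_attrs = [(partition_key, "HASH")]      # partition key = hash attribute
    if sort_key is not None:
        key_attrs.append((sort_key, "RANGE"))  # sort key = range attribute
    for (name, attr_type), _role in key_attrs:
        if attr_type not in SCALAR_KEY_TYPES:
            raise ValueError(f"{name}: key attributes must be S, N, or B")
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": name, "AttributeType": attr_type}
            for (name, attr_type), _role in key_attrs
        ],
        "KeySchema": [
            {"AttributeName": name, "KeyType": role}
            for (name, _attr_type), role in key_attrs
        ],
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


music = create_table_params("Music", ("Artist", "S"), ("SongTitle", "S"))
```

Only the key attributes are declared up front. If boto3 is installed and credentials are configured, a dictionary like this could be passed on as `boto3.client("dynamodb").create_table(**music)`, but nothing in the sketch itself talks to AWS.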
In DynamoDB, you can query the data in a table using the primary key. You can also create one or many secondary indexes on a table to allow you to query the data using these alternate keys. DynamoDB does not require that you use these secondary indexes, but they give your applications more flexibility when it comes to querying your data. So if we look at this third example, we might want to query the music data by music genre and album title. To do this, we would create a secondary index based on these two attributes, which would allow us to query this data based on those attributes.
You do not need to create a secondary index to be able to query the data based on these attributes; however, if you do not define a secondary index, the query will have to scan the entire table. And for a table with millions of items, this would consume a large amount of provisioned read throughput, and take a long time to complete. DynamoDB supports two types of indexes: a local secondary index, which is an index that has the same partition key as the table, but a different sort key, and a global secondary index, which is an index with a partition key and a sort key that can be different from those on the table. Every secondary index is associated with exactly one table from which it obtains its data. This is called the base table for that index.
It's important to note that you can define a maximum of five global and five local secondary indexes per table. When you define a secondary index, the data for that index is stored separately from the data in the base table. It's also important to note that when applications write an item to a table, DynamoDB automatically copies the correct subset of the attributes to any local secondary indexes in which those attributes should appear. Your AWS account is charged for the storage of the item in the base table, and also for storage of attributes in any local secondary indexes on that table.
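The difference between scanning and using an index can be sketched in plain Python. Everything here is illustrative: the sample items, the `scan` and `build_index` helper names, and the choice of which attributes to copy into the index. It shows the index as a separate structure holding a subset of attributes, keyed by genre and album title.

```python
from collections import defaultdict

music = [
    {"Artist": "No One You Know", "SongTitle": "My Dog Spot",
     "Genre": "Country", "AlbumTitle": "Hey Now"},
    {"Artist": "No One You Know", "SongTitle": "Somewhere Down The Road",
     "Genre": "Country", "AlbumTitle": "Somewhat Famous"},
    {"Artist": "The Acme Band", "SongTitle": "Still in Love",
     "Genre": "Rock", "AlbumTitle": "The Buck Starts Here"},
    {"Artist": "The Acme Band", "SongTitle": "Look Out, World",
     "AlbumTitle": "The Buck Starts Here"},  # no Genre attribute
]


def scan(items, genre, album):
    """Full scan: every item is examined, whether or not it matches."""
    examined, matches = 0, []
    for item in items:
        examined += 1
        if item.get("Genre") == genre and item.get("AlbumTitle") == album:
            matches.append(item)
    return matches, examined


def build_index(items, key_attrs, projected):
    """Copy the projected attributes into a separate structure keyed by key_attrs."""
    index = defaultdict(list)
    for item in items:
        if all(attr in item for attr in key_attrs):  # items lacking the index key are skipped
            key = tuple(item[attr] for attr in key_attrs)
            index[key].append({attr: item[attr] for attr in projected if attr in item})
    return dict(index)


matches, examined = scan(music, "Country", "Hey Now")
genre_album_index = build_index(music, ("Genre", "AlbumTitle"), ("Artist", "SongTitle"))
hits = genre_album_index[("Country", "Hey Now")]
```

The scan touches all four items to find one match, while the index lookup goes straight to it. The price is the extra copy of the projected attributes, which is the storage you are charged for.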
After you create a secondary index on a table, you can read data from the index in much the same way as you do from the table. By using secondary indexes, your applications can efficiently use many different query patterns, in addition to accessing the data by the primary key values. Amazon DynamoDB supports many different data types for attributes within a table.
The data types are typically categorized as one of three data types: scalar, document, or set. A scalar type can represent exactly one value. The scalar types are: number, string, binary, boolean, and null. A document type can represent a complex structure with nested attributes, such as you would find in a JSON document. These document types are: list and map.
A document list type attribute can store an ordered collection of values. Lists are enclosed in square brackets, and are similar to a JSON array. A document map type attribute can store an unordered collection of name value pairs. Maps are enclosed in curly braces, and are similar to a JSON object.
Maps are ideal for storing JSON documents in DynamoDB. A set type can represent multiple scalar values. The set types are: string set, number set, and binary set. Each value within a set must be unique, and all elements within a set must be of the same type. For example, with a number set, all values must be numbers. DynamoDB doesn't offer the wide range of data types that many relational databases do, so there are a few things to note when using the different data types.
String type attributes are constrained by the maximum DynamoDB item size limit of 400KB. The document data types can be nested within each other to represent complex data structures up to 32 levels deep. Binary type attributes can store any binary data, such as compressed text, encrypted data, or images.
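To see how these types fit together, here is a small sketch that converts ordinary Python values into the typed attribute-value form DynamoDB uses in its JSON API, where each value is tagged with a type code such as `S`, `N`, `L`, `M`, or `SS`. The function name `to_attribute_value` is made up for this example. Storing a datetime as an ISO 8601 string is one design choice among several, since DynamoDB has no native date type.

```python
import base64
from datetime import datetime, timezone


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def to_attribute_value(value):
    """Convert a Python value to a DynamoDB-style typed attribute value."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):          # check bool before number: bool is an int in Python
        return {"BOOL": value}
    if _is_number(value):
        return {"N": str(value)}         # numbers travel as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, bytes):
        return {"B": base64.b64encode(value).decode("ascii")}
    if isinstance(value, datetime):
        return {"S": value.isoformat()}  # assumption: dates stored as ISO 8601 strings
    if isinstance(value, (set, frozenset)):
        if not value:
            raise ValueError("sets must not be empty")
        if all(isinstance(v, str) for v in value):
            return {"SS": sorted(value)}
        if all(_is_number(v) for v in value):
            return {"NS": [str(v) for v in sorted(value)]}
        if all(isinstance(v, bytes) for v in value):
            return {"BS": [base64.b64encode(v).decode("ascii") for v in sorted(value)]}
        raise TypeError("all elements of a set must be of the same type")
    if isinstance(value, list):
        return {"L": [to_attribute_value(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")


item = to_attribute_value({
    "Name": "Fred",
    "Tags": {"a", "b"},
    "Scores": [1, 2.5],
    "Joined": datetime(2016, 1, 1, tzinfo=timezone.utc),
})
```

Sets are sorted here only so the output is predictable. A set has no meaningful order in DynamoDB.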
One data type to watch out for is dates and timestamps. For dates and timestamps, you will need to represent those as strings or numbers in order to store them in DynamoDB. Provisioned throughput controls how fast your DynamoDB environment performs, and how much you get charged. Provisioned throughput refers to the level of read and write capacity that you want Amazon to reserve for your tables. You are charged for the total amount of throughput that you configure for your tables, plus the total amount of storage space used by your data.
During creation of a table, you specify your required read and write capacity needs, and Amazon DynamoDB automatically partitions the table and reserves the appropriate amount of resources to meet your throughput requirements. You can increase your provisioned throughput as often as you want, but be careful, as you can only decrease it four times per day. In DynamoDB, you specify provisioned throughput requirements in terms of capacity units. A unit of read capacity enables you to perform one strongly consistent read per second, or two eventually consistent reads per second, of items up to 4KB in size. Larger items will require more capacity.
So for example, if the item is greater than 4KB, say 40KB, then you will need ten read capacity units. To make it slightly trickier, you can have an eventually consistent read or a strongly consistent read. As your DynamoDB data is stored across multiple AZs in a region, when you do a write, DynamoDB will update the data stored in each AZ. The data will eventually be consistent across all the storage locations, usually within one second or less.
But there is a point in time where that data is not consistent. When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data. This is called an eventually consistent read. When you request the other kind of read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful. This is called a strongly consistent read. A strongly consistent read might not be available in the case of a network delay or outage.
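The two kinds of read can be illustrated with a toy model. To be clear, this is not how DynamoDB replicates data internally. The `ReplicatedItemStore` class, its leader and follower copies, and the explicit `replicate` step are all assumptions made so the stale-read window can be seen in a few lines.

```python
class ReplicatedItemStore:
    """Toy model: one leader copy plus followers that lag until replicate() runs."""

    def __init__(self, replicas=3):
        self._copies = [dict() for _ in range(replicas)]  # index 0 is the leader
        self._pending = []

    def write(self, key, value):
        self._copies[0][key] = value        # the leader has the new value at once
        self._pending.append((key, value))  # followers catch up later

    def replicate(self):
        """Apply all pending writes to the follower copies."""
        for key, value in self._pending:
            for copy in self._copies[1:]:
                copy[key] = value
        self._pending.clear()

    def read(self, key, consistent_read=False, replica=1):
        # A strongly consistent read always uses the leader copy. An eventually
        # consistent read may be served by a follower that is still behind.
        source = self._copies[0] if consistent_read else self._copies[replica]
        return source.get(key)


store = ReplicatedItemStore()
store.write("song-1", "My Dog Spot")
stale = store.read("song-1")                         # follower has not caught up
fresh = store.read("song-1", consistent_read=True)   # leader has the write
store.replicate()
later = store.read("song-1")                         # now consistent everywhere
```

Straight after the write, the eventually consistent read returns nothing while the strongly consistent read returns the new value. Once replication has run, both agree.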
Note: DynamoDB uses eventually consistent reads unless you specify otherwise. Read operations provide a ConsistentRead parameter. If you set this parameter to true, DynamoDB will use strongly consistent reads during the operation. If you use eventually consistent reads, you'll get twice the throughput in terms of reads per second. A unit of write capacity enables you to perform one write per second for items up to 1KB in size.
You can calculate the number of units of read and write capacity you need by estimating the number of reads or writes you need to do per second, and multiplying that by the size of your items, rounded up to the next capacity unit boundary (4KB for reads, 1KB for writes). For tables with secondary indexes, DynamoDB will consume additional capacity units.
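That arithmetic can be written down as a short sketch. The function names are made up for this example, and charging one extra full write per index updated is a simplifying assumption. The 4KB and 1KB unit sizes and the halving for eventually consistent reads come from the description above.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # one read capacity unit covers up to 4KB
WRITE_UNIT_BYTES = 1024      # one write capacity unit covers up to 1KB


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """Estimate the read capacity units needed for a steady read rate."""
    units_per_read = math.ceil(item_size_bytes / READ_UNIT_BYTES)
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        total = math.ceil(total / 2)  # eventually consistent reads cost half
    return total


def write_capacity_units(item_size_bytes, writes_per_second, indexes_updated=0):
    """Estimate write capacity units; each index updated adds another write."""
    units_per_write = math.ceil(item_size_bytes / WRITE_UNIT_BYTES)
    return units_per_write * writes_per_second * (1 + indexes_updated)


print(read_capacity_units(40 * 1024, 1))                              # one 40KB read per second
print(read_capacity_units(40 * 1024, 1, strongly_consistent=False))   # same read, eventually consistent
print(write_capacity_units(1024, 1, indexes_updated=1))               # 1KB write touching one index
```

A 40KB item needs ten units for a strongly consistent read and five for an eventually consistent one. The `indexes_updated` argument is where writes to indexed attributes show up in the estimate.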
For example, if you wanted to add a single 1KB item to a table, and that item contained an indexed attribute, you would need two write capacity units, one for writing to the table, and another for writing to the index. Although DynamoDB performance can scale up as your needs grow, your performance is limited to the amount of read and write throughput that you provision for each table. If you expect a spike in DynamoDB use, you will need to provision more throughput in advance, or database requests will fail with a ProvisionedThroughputExceededException error. DynamoDB stores data in partitions. A partition is an allocation of storage for a table, backed by solid-state drives, and automatically replicated across multiple Availability Zones within an Amazon region.
Partition management is handled entirely by DynamoDB. You do not have to configure or manage the partitions. When you create a new table, DynamoDB allocates the table's partitions according to the provisioned throughput settings that you specify. A single partition can hold up to 10GB of data, and can support a maximum of 3,000 read capacity units, or 1,000 write capacity units. If you increase a table's provisioned throughput, and the table's current partitioning scheme cannot accommodate your new requirements, DynamoDB will double the current number of partitions. To write an item to the table, DynamoDB uses the value of the partition key as an input to an internal hash function. The output value from the hash function determines the partition in which the item will be stored. To read an item from the table, you must specify the partition key value for the item. DynamoDB uses this value as an input to its hash function, returning the partition on which the item can be found.
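The per-partition limits above suggest a rough way to estimate how many partitions a table might need. Treat this strictly as a back-of-the-envelope sketch: DynamoDB decides the real number internally, and combining the size and throughput limits in this way is an assumption, as is the helper name `estimate_partitions`.

```python
import math

PARTITION_SIZE_GB = 10        # storage limit per partition
PARTITION_READ_UNITS = 3000   # read capacity limit per partition
PARTITION_WRITE_UNITS = 1000  # write capacity limit per partition


def estimate_partitions(table_size_gb, read_units, write_units):
    """Estimate partitions from whichever limit is hit first (assumed formula)."""
    by_size = math.ceil(table_size_gb / PARTITION_SIZE_GB)
    by_throughput = math.ceil(
        read_units / PARTITION_READ_UNITS + write_units / PARTITION_WRITE_UNITS
    )
    return max(1, by_size, by_throughput)


small = estimate_partitions(table_size_gb=8, read_units=1000, write_units=500)
large = estimate_partitions(table_size_gb=25, read_units=6000, write_units=1000)
```

One practical consequence is that provisioned throughput is shared out across partitions, so a table that estimates at three partitions gives each one roughly a third of the total.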
To assist the partitioning algorithm, Amazon recommends that you choose a partition key that can have a large number of distinct values relative to the number of items in the table. Indexes are also partitioned, just like base tables. You need to load your data into Amazon DynamoDB before you can query or use it. Data is normally loaded into DynamoDB via the application that is using the NoSQL database as its back-end service.
Alternatively, you can use the AWS console, or the command line interface (CLI), to import data from Amazon S3 into a DynamoDB table. Another option is to use AWS Data Pipeline, which uses a combination of Amazon S3 and Amazon EMR to load the data into DynamoDB.
There are also a number of third party tools, such as RazorSQL, that allow you to load data into DynamoDB. To access the DynamoDB data, your application or reporting tool needs to send an HTTPS request to DynamoDB. The request contains the name of the DynamoDB operation to perform, along with the parameters. DynamoDB executes requests immediately, and returns an HTTPS response, containing the results of the operation.
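To show what such a request looks like, here is a sketch that assembles, but does not send, a GetItem call. The helper name `build_get_item_request`, the region, and the table and key values are assumptions for illustration. The `X-Amz-Target` header naming the operation and the `application/x-amz-json-1.0` content type are what the DynamoDB low-level API expects. A real request would also need an AWS Signature Version 4 `Authorization` header, which is left out here.

```python
import json


def build_get_item_request(region, table_name, key):
    """Assemble an unsigned GetItem request as a plain dictionary."""
    body = {"TableName": table_name, "Key": key}
    return {
        "method": "POST",
        "url": f"https://dynamodb.{region}.amazonaws.com/",
        "headers": {
            "Content-Type": "application/x-amz-json-1.0",
            # The operation name travels in a header, not in the URL.
            "X-Amz-Target": "DynamoDB_20120810.GetItem",
        },
        "body": json.dumps(body),
    }


request = build_get_item_request(
    region="us-east-1",
    table_name="Music",
    key={"Artist": {"S": "No One You Know"}, "SongTitle": {"S": "My Dog Spot"}},
)
print(request["headers"]["X-Amz-Target"])
```

In practice an SDK builds and signs this for you, but it helps to see that underneath it is an ordinary HTTPS POST with a JSON body.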
You cannot access DynamoDB using a standard ODBC or JDBC connection or via a standard SQL query, which means your standard BI tools cannot directly access DynamoDB data sources. There are, however, third party offerings, like CData, that provide an ODBC style driver that will talk to DynamoDB, effectively becoming an interpreter, and other BI tool vendors, such as Sisense, have also created a DynamoDB access engine for their tool.
Amazon are planning to enable access to DynamoDB data from Amazon QuickSight in the future. DynamoDB is a non-relational NoSQL database, and does not support table joins. Instead, applications read data from one table at a time. So if you wish to join data, you will need to retrieve the data from each table, and join it in your application or reporting tool.
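An application-side join of that kind might look something like the sketch below. The two lists stand in for items already retrieved from two separate tables. The table contents, the `PersonID` join attribute, and the `join_in_application` helper are all hypothetical.

```python
def join_in_application(left_items, right_items, key):
    """Join two lists of items on a shared attribute using a lookup dictionary."""
    lookup = {item[key]: item for item in right_items}  # index one side by the join key
    joined = []
    for left in left_items:
        right = lookup.get(left[key])
        if right is not None:              # inner join: skip items with no match
            joined.append({**right, **left})
    return joined


# Items as they might come back from two separate table reads.
orders = [
    {"OrderID": 1, "PersonID": 101, "Product": "Long black"},
    {"OrderID": 2, "PersonID": 102, "Product": "Craft beer"},
    {"OrderID": 3, "PersonID": 999, "Product": "Kiwifruit"},  # no matching person
]
people = [
    {"PersonID": 101, "LastName": "Smith"},
    {"PersonID": 102, "LastName": "Jones"},
]

report = join_in_application(orders, people, "PersonID")
```

The work that a relational database would do inside a join now happens in your code, so it is worth keeping the number of items read from each table as small as you can.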
There is an open source framework in the AWS Labs GitHub repository, named emr-dynamodb-connector. This connector allows you to access data in Amazon DynamoDB using Apache Hadoop, Apache Hive, or Apache Spark in Amazon EMR. You can process data directly in DynamoDB using these frameworks, or join data in DynamoDB with data in Amazon S3, Amazon RDS, or other storage layers that can be accessed by Amazon EMR.
There are a number of limits within the Amazon DynamoDB service you need to be aware of. The default limits listed on the left in the table are set by AWS globally. You can request that AWS increase these quotas for a specific account, in a specific region, by requesting a limit increase.
Three important limitations within Amazon DynamoDB are the maximum record size of 400KB, the 256 table limit, and the limit of 10 indexes per table. These limits are all per region. There are a number of use cases where Amazon DynamoDB is the perfect storage solution, and a number where an alternate Amazon solution would potentially be a better fit. Amazon DynamoDB delivers seamless throughput and storage scaling via the API and the AWS Management Console.
There is virtually no limit on how much throughput or storage you can dial up at a time. Amazon DynamoDB provides a predictable, low latency response time for storing and accessing data at any scale. Amazon DynamoDB can store both structured and semi-structured data, but is designed to store any amount of semi-structured data, and to be able to read, write, and modify it quickly, efficiently, and with predictable performance.
If you want to access large volumes of data via a traditional BI tool, then Amazon Redshift provides a more effective Amazon Big Data service, as it allows the BI tools to use their standard SQL query capabilities to access this data. If you need to run advanced analytic algorithms, then Amazon EMR or Amazon Machine Learning are better solutions than Amazon DynamoDB.
About the Author
Shane has been immersed in the world of data, analytics and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you to create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and Kiwifruit. However, you're more likely to see him partake of a good long black or an even better craft beer.