AWS Data Pipeline
In course one of the AWS Big Data Specialty Data Collection learning path we explain the various data collection methods and techniques for determining the operational characteristics of a collection system. We explore how to define a collection system able to handle the frequency of data change and the type of data being ingested. We identify how to enforce data properties such as order, data structure, and metadata, and to ensure the durability and availability for our collection approach
- Recognize and explain the operational characteristics of a collection system.
- Recognize and explain how a collection system can be designed to handle the frequency of data change and the type of data being ingested.
- Recognize and identify properties that may need to be enforced by a collection system.
This course is intended for students looking to increase their knowledge of data collection methods and techniques with big data solutions.
While there are no formal prerequisites, students will benefit from having a basic understanding of analytics services available in AWS. Please take a look at our Analytics Fundamentals for AWS
This Course Includes
- 45 minutes of high-definition videos
- Live hands-on demos
What You'll Learn
- Introduction to Collecting Data: In this lesson, we'll prepare you for what we'll be covering in the course; the Big Data collection services of AWS Data Pipeline, Amazon Kinesis, and AWS Snowball.
- Introduction to Data Pipeline: In this lesson, we'll discuss the basics of Data Pipeline.
- AWS Data Pipeline Architecture: In this lesson, we'll go into more detail about the architecture that underpins the AWS Data Pipeline Big Data Service.
- AWS Data Pipeline Core Concepts: In this lesson, we'll discuss how we define data nodes, access, activities, schedules, and resources.
- AWS Data Pipeline Reference Architecture: In this lesson, we'll look at a real-life scenario of how data pipeline can be used.
- Introduction to AWS Kinesis: In this lesson, we'll take a top-level view of Kinesis and its uses.
- Kinesis Streams Architecture: In this lesson, we'll look at the architecture that underpins Kinesis.
- Kinesis Streams Core Concepts: In this lesson, we'll dig deeper into the data records.
- Kinesis Streams Firehose Architecture: In this lesson, we'll look at firehose architecture and the differences between it and Amazon Kinesis Streams.
- Firehose Core Concepts: Let's take a deeper look at some details about the Firehose service.
- Kinesis Wrap-Up: In this summary, we'll look at the differences between Kinesis and Firehose.
- Introduction to Snowball: Overview of the Snowball Service.
- Snowball Architecture: Let's have a look at the architecture that underpins the AWS Snowball big data service
- Snowball Core Concepts: In this lesson, we'll look at the details of how Snowball is engineered to support data transfer.
- Snowball Wrap-Up: A brief summary of Snowball and our course.
Okay, let's have a look at the core concepts that underpin the Amazon Kinesis Streams big data service. An Amazon Kinesis Stream is an ordered sequence of data records. Each record in the stream has a sequenced number that is assigned by Kinesis Streams.
A record is the unit of data stored in the Amazon Kinesis Stream. A record is composed of a sequence number, partition key, and data blob.
A data blob is the data of interest your data producer adds to a stream. The maximum size of a data blob, the data payload before Base64 encoding, is one megabyte. A partition key is used to segregate and route records to different shards of a stream.
The Kinesis Streams server segregates the data records belonging to a stream into multiple shards using the partition key associated with each data record to determine which shard a given data record belongs to. Partition keys are unicoded strings with a maximum length of 256 bytes. An MD5 hash function is used to map partition keys to a 128-bit integer value and to map associated data records to shards. A partition key is specified by your data producer while adding data to an Amazon Kinesis Stream.
For example, assuming you have a stream of two shards, Shard One and Shard Two, you can configure your data producer to use two partition keys, Key A and Key B, so that all records within Key A are added to Shard One, and all records with Key B are added to Shard Two. A sequence number is a unique identifier for each record. Sequence numbers are assigned by Amazon Kinesis when a data producer calls PutRecord or PutRecords operation to add data to an Amazon Kinesis stream. Sequence numbers for the same partition key generally increase over time. The longer the time period between PutRecord or PutRecords requests, the larger the sequence number becomes. A shard is a group of data records in a stream.
When you create a stream, you specify the number of shards for the stream. Each shard can support up to five transactions per second for reads and up to a maximum total data read rate of two megabytes per second. And up to 1,000 records per second for writes and up to a maximum total data write rate of one megabyte per second, including partition keys. The total capacity of a stream is the sum of the capacity as shards. You can increase or decrease the number of shards in a stream as needed.
However, note that you are charged on a per shard basis. Before you create a stream, you need to determine an initial size for the stream. After you create the stream, you can dynamically scale your shard capacity up or down using the AWS management console or the update shard count API. You can make up dates while there is an Amazon Kinesis Stream application consuming data from the stream.
You can calculate the initial number of shards you need to provision using the formula at the bottom of the screen. Kinesis Streams support changes to the data record retention period for your stream. A Kinesis Stream is an ordered sequence of data records meant to be written to and read from in real time. Data records are, therefore, stored in shards in your stream temporarily. The time period from when a record is added to when it is no longer accessible is called the retention period.
And Kinesis Streams stores records from 24 hours by default up to 168 hours. Note though, additional charges apply for streams where the retention periods sit above 24 hours. Kinesis Streams supports re-sharding which enables you to adjust the number of shards in your stream in order to adapt to changes in the rate of data flow through the stream. There are two types of re-sharding operations.
A shard split and a shard merge. As the name suggests, in a shard split, you divide a single shard into two shards. In a shard merge, you combine the two shards into a single shard. You cannot split into more than two shards in a single operation. And you cannot merge more than two shards in a single operation. The shard or pair of shards that the re-sharding operation acts on are referred to as parent shards. The shard or pair of shards that result from the re-sharding operation are referred to as child shards. After you call a re-sharding operation, you need to wait for the stream to become active again.
Remember, Kinesis Streams is a real-time data streaming service which is to say that your application should assume that the data is continuously flowing through the shards in your stream. When you re-shard, data records that were flowing to the parent shards are rerouted to flow to the child shards based on the hash key values that the data record partition keys map to.
However, any data records that were in the parent shards before the re-shards remain in those shards. In other words, the parent shards do not disappear when the re-shard occurs. They persist along with the data they contained prior to the re-shard. A producer puts data records into Kinesis Streams. For example, a web server sending log data to a Kinesis Stream is a producer.
A consumer processes the data records from a stream. To put data into the stream, you must specify the name of the stream, a partition key, and the data blob to be added to the stream. The partition key is used to determine which shard in the stream the data record is added to. All the data in the shard is sent to the same worker that is processing the shard. Which partition key you use depends on your application logic. The number of partition keys should typically be much greater than the number of shards. This is because the partition key is used to determine how to map a data record to a particular shard. If you have enough partition keys, the data can be evenly distributed across the shards in a stream.
A consumer gets records from the Kinesis Stream. A consumer, known as an Amazon Kinesis Streams application, processes the data records from a stream. You need to create your own consumer applications. Each consumer must have a unique name that is scoped to the AWS account and region used by the application. This name is used as a name for the control table in Amazon Dynamo DB and the name space for Amazon CloudWatch metrics. When your application starts up, it creates an Amazon Dynamo DB table to store the application state, connects to the specified stream and then starts consuming data from the stream. You can view the Kinesis Streams metrics using the CloudWatch console. You can deploy the consumer to an EC2 instance. You can use the Kinesis Client Library, or KCL, to simplify parallel processing of the stream by a fleet of workers running on a fleet of EC2 instances. The KCL simplifies writing code to read from the shards in the stream and it ensures that there is a worker allocated to every shard in the stream.
The KCL also provides help with volt tolerance by providing check pointing capabilities. Each consumer reads from a particular shard using a shard iterator. A shard iterator represents the position in the stream from which the consumer will read. When they start reading from a stream, consumers get a shard iterator which can be used to change where the consumers read from the stream. When the consumer performs a read operation, it receives a batch of data records based on the position specified by the shard iterator. There are a number of limits within the Amazon Kinesis Streams service you need to be aware of.
Amazon Kinesis imposes limits on the resources you can allocate and the rate at which you can allocate resources. The displayed limits apply to a single AWS account. If you require additional capacity, you can use the standard Amazon process to increase the limits of your account where the limit is flagged as adjustable. Note the maximum number of shards differ between US-East, US-West, and EU, compared to all the other regions.
About the Author
Shane has been emerged in the world of data, analytics and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact its often hard to shut him up when he is talking about the innovative solutions that AWS can help you to create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington New Zealand, a place famous for Hobbits and Kiwifruit. However your more likely to see him partake of a good long black or an even better craft beer.