Amazon Kinesis Data Streams
This course is part one of two on how to stream data using Amazon Kinesis Data Streams.
In this course, you will learn about the Kinesis service, how to get data in and out of an Amazon Kinesis Data Stream, and how streaming data is modeled.
We'll also cover Kinesis Producer Applications and Kinesis Consumer Applications and how they communicate with a Kinesis Data Stream.
You'll also learn about the limits and costs of streaming data with Amazon Kinesis, as well as how data can be secured and protected inside a stream.
- Obtain a foundational understanding of Kinesis Data Streams
- Learn how to get data in and out of a stream and how to use streams to get data from its source to a destination
- Learn about the limits of Kinesis Data Streams and how to avoid them
- Understand the cost structure of the service
- Learn how to protect and secure data within a stream
This course is intended for people who want to learn how to stream data into the AWS cloud using Amazon Kinesis Data Streams.
To get the most out of this course, you should have a basic knowledge of the AWS platform.
Hello! I'm Stephen Cole, a trainer here at Cloud Academy and I'd like to welcome you to this lecture presenting the Elements of an Amazon Kinesis Data Stream.
When someone refers to an Amazon Kinesis Data Stream, in general, they are talking about how large volumes of data move from one place to another at high speed. The truth is that this is a generalization about what happens. Streaming data is a complex process with several parts. The word streaming is descriptive but not entirely accurate. In some ways, it's like describing flying as air travel. Nobody just gets on a plane and flies away. Not even pilots.
For passengers, a whole process comes before the flying part: buying a ticket, getting a boarding pass, going to--and getting through--an airport, handling baggage and carry-on items, and boarding the plane.
That's just half of the passenger experience. After landing and getting off of the airplane, that process happens in reverse except, of course, it isn't exactly the same because you're in a different place and airport security controls are focused on people entering, not exiting.
So, streaming is more than moving lots of data at high speed.
Streams have to be provisioned. Information has to be collected and formatted correctly. Data has to be put into the stream and, just as important, needs to be retrieved. This is not complicated but it is complex.
In this lecture, I am going to examine, in some detail, what a stream is and how to get data in and out of it. Before I start, if you follow along with me or, at some point, decide to practice on your own, you will incur charges while using Amazon Kinesis as there is no free tier for the service.
Kinesis Data Streams charges for open shards and for Data Records stored in a stream. At a small scale, the costs are low. While I was creating this lecture I looked up the pricing details. In the region I'm using, it's only one and a half cents per shard per hour.
The PUT payload units are priced at one point four cents per million. Both of these are in US dollars. As such, it won't cost very much to experiment with Kinesis Data Streams. The thing to remember is, when done with your learning exercises, delete the stream or streams you have created. Pennies add up to dollars, and, over time, dollar amounts can be embarrassing when the bill comes due.
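To see how those two prices combine, here is a rough sketch that estimates a bill from the figures I just quoted. The function name is hypothetical, and the 25 KB size of a PUT payload unit is an assumption I've added for the example; check the current pricing page for your region before relying on any of it.

```python
import math

SHARD_HOUR_USD = 0.015             # per shard per hour, as quoted in the lecture
PUT_UNITS_PER_MILLION_USD = 0.014  # per million PUT payload units, as quoted
PAYLOAD_UNIT_BYTES = 25 * 1024     # assumption: one PUT payload unit covers up to 25 KB


def estimate_cost_usd(shards, hours, records, avg_record_bytes):
    """Estimate shard-hour plus PUT payload unit charges in US dollars."""
    units_per_record = max(1, math.ceil(avg_record_bytes / PAYLOAD_UNIT_BYTES))
    shard_cost = shards * hours * SHARD_HOUR_USD
    put_cost = records * units_per_record / 1_000_000 * PUT_UNITS_PER_MILLION_USD
    return shard_cost + put_cost


# Three shards left running for a 30-day month with no traffic at all.
forgotten = estimate_cost_usd(shards=3, hours=24 * 30, records=0, avg_record_bytes=0)
print(f"${forgotten:.2f}")  # $32.40
```

Even with nothing flowing through it, a forgotten three-shard stream keeps charging for its open shards.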
Speaking from experience, forgetting to delete or turn off provisioned capacity of any service is embarrassing. It's a lesson I won't forget and one that I hope you don't have to learn on your own. Creating a Kinesis Data Stream is a relatively simple process. It can be done from the AWS Console, programmatically, or using the AWS CLI.
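For the programmatic route, a minimal Python sketch might look like the following. It assumes `client` is a boto3 Kinesis client, for example `boto3.client("kinesis")`, with credentials already configured. The helper names and the `DryRunClient` stand-in are hypothetical; the stand-in only exists so the sketch runs without touching AWS or costing anything.

```python
def create_and_describe(client, name="myStream", shard_count=3):
    """Create a stream, wait for it to exist, and return its shard list."""
    client.create_stream(StreamName=name, ShardCount=shard_count)
    # Stream creation is asynchronous, so wait before describing it.
    client.get_waiter("stream_exists").wait(StreamName=name)
    return client.describe_stream(StreamName=name)["StreamDescription"]["Shards"]


def summarize(client, name="myStream"):
    """Return the stream details without the per-shard information."""
    return client.describe_stream_summary(StreamName=name)["StreamDescriptionSummary"]


def cleanup(client, name="myStream"):
    """Delete the stream so it stops accruing charges."""
    client.delete_stream(StreamName=name)


class DryRunClient:
    """Stand-in for a boto3 Kinesis client; records calls instead of making them."""

    def __init__(self):
        self.calls = []

    def create_stream(self, **kwargs):
        self.calls.append(("create_stream", kwargs))

    def get_waiter(self, name):
        return self

    def wait(self, **kwargs):
        pass

    def describe_stream(self, **kwargs):
        return {"StreamDescription": {"Shards": [{"ShardId": "shardId-000000000000"}]}}

    def describe_stream_summary(self, **kwargs):
        return {"StreamDescriptionSummary": {"StreamName": kwargs["StreamName"]}}

    def delete_stream(self, **kwargs):
        self.calls.append(("delete_stream", kwargs))


client = DryRunClient()  # swap in boto3.client("kinesis") to run against AWS
shards = create_and_describe(client)
summary = summarize(client)
cleanup(client)
print(client.calls)
```

Pairing the create call with a cleanup helper is deliberate: it makes deleting the stream part of the exercise rather than something to remember later.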
Here is the command to create a Kinesis Data Stream called myStream with 3 shards using the AWS CLI. The AWS CLI command describe-stream will display the details of the stream and include information about its available shards.
Normally, I find JSON difficult to read but, in this case, I think seeing the output of describe-stream in JSON will make it easier to understand the stream's internals. Here's the full output.
The shard details are at the top and, at the bottom, are details about the stream itself. If you're looking for information about a Kinesis Data Stream without the shard details, the AWS CLI command is describe-stream-summary.
It looks like this. It's the same output minus information about the shards. Since I want to discuss the shards in the stream, here's the shard detail again. Looking at the shards, there are some details that are important to know and understand.
First, each shard has its own unique ID. Next, notice that the first shard has a StartingHashKey of 0 and an EndingHashKey with a long number ending in 0484. The next shard has a StartingHashKey that is a long number one greater than the EndingHashKey from the first shard. I'm not reading that number out loud, sorry. Instead, I can see that it ends with 0485.
Likewise, the third shard has a StartingHashKey that is one greater than the second shard's EndingHashKey. If I were to add a fourth shard by splitting an existing one, the two resulting shards would divide the original shard's range between them. This is an overly-detailed way of saying that every shard has a unique hash key range that does not overlap.
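The arithmetic behind those long numbers can be sketched in a few lines. This assumes a newly created stream divides the 128-bit hash space evenly among its shards; the function name is hypothetical. With 3 shards, the first range ends in 0484 and the second starts in 0485, matching what I described.

```python
HASH_SPACE = 2 ** 128  # partition keys hash to 128-bit integers


def even_hash_ranges(shard_count):
    """Split the 128-bit hash space into contiguous, non-overlapping ranges."""
    width = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * width
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else start + width - 1
        ranges.append((start, end))
    return ranges


for start, end in even_hash_ranges(3):
    print(start, end)
```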
This hash key has an important role inside the shard. When a Data Record is put into a Kinesis Data Stream, it includes--as part of its payload--a Partition Key. Kinesis Data Streams processes the Partition Key and creates a hash value from it. It is this hash value that determines to which shard a Data Record is written.
Because of this, records that share a Partition Key--such as data from a single source or single location--will always end up in the same shard. It might sound like I'm repeating myself but this is important to understand.
To reiterate, a Producer application puts data into a stream as a series of Data Records. A Data Record is a unit of information stored in a stream and is composed of a Partition Key, a Sequence Number, and a Data Blob.
The Data Record's Partition Key determines the shard within the stream where the Data Record will be written. Kinesis calculates an MD5 hash value of the Partition Key and, based on this value, decides to which shard the record will be written.
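Here is a small sketch of that routing step. It assumes the hash space is split evenly across shards, as in a freshly created stream, and the function names and the sample key are hypothetical. Kinesis does this work on the service side; the code only mirrors the idea.

```python
import hashlib

HASH_SPACE = 2 ** 128


def hash_key(partition_key):
    """MD5 the partition key and read the digest as a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key, shard_count=3):
    """Return the index of the shard whose range contains the key's hash.

    Assumes the hash space is split evenly, as in a freshly created stream.
    """
    width = HASH_SPACE // shard_count
    # min() keeps the few hashes above the last even boundary in the last shard.
    return min(hash_key(partition_key) // width, shard_count - 1)


# The same key always maps to the same shard.
print(shard_for("sensor-17") == shard_for("sensor-17"))  # True
```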
The Data Record's Sequence Number is an identifier that is unique within each shard. This ensures that data is stored in the order in which it was written until it expires. The capacity for each shard is limited. You can put up to 1,000 records--up to 1 megabyte of data--in a shard per second.
To maximize throughput, it's best to distribute Data Records across all available shards. To do this, the AWS recommendation is to use logic that creates random partition keys. This will help ensure records are distributed evenly across shards in a stream.
When there are too many records being assigned to a single shard, it becomes hot while other shards remain cold. Something that is worth mentioning is that the size of the partition key is not included in the 1 megabyte limit for a single record's payload. It is, however, counted as part of the throughput limit.
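To make the hot-shard effect concrete, this sketch compares random partition keys with a single fixed key on a 3-shard stream. It reuses the even-split assumption from earlier, and the key names are hypothetical.

```python
import hashlib
import uuid
from collections import Counter

SHARDS = 3
WIDTH = 2 ** 128 // SHARDS


def shard_for(partition_key):
    """Map a partition key to a shard index via its MD5 hash."""
    value = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    return min(value // WIDTH, SHARDS - 1)


random_keys = Counter(shard_for(str(uuid.uuid4())) for _ in range(3000))
fixed_key = Counter(shard_for("device-1") for _ in range(3000))

print("random keys:", dict(random_keys))  # roughly 1,000 records per shard
print("fixed key:  ", dict(fixed_key))    # all 3,000 records on one shard
```

The trade-off is ordering: random keys spread the load, but records from one source no longer land together in a single shard.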
Finally, the data record's Data Blob is a sequence of Base64 encoded bytes that can be up to 1 megabyte in size. This blob is both opaque and immutable to Kinesis Data Streams. This means Kinesis does not inspect, interpret, or change the data in any way.
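Because the blob is opaque to Kinesis, the producer and the consumer have to agree on its format themselves. Here is one way to sketch that, assuming JSON as the format and reading "1 megabyte" as 1 MiB; both choices, and the function name, are mine for the example.

```python
import base64
import json

MAX_BLOB_BYTES = 1024 * 1024  # assumption: "1 megabyte" read as 1 MiB


def to_blob(event):
    """Serialize an event to bytes and enforce the per-record size limit."""
    blob = json.dumps(event).encode("utf-8")
    if len(blob) > MAX_BLOB_BYTES:
        raise ValueError(f"blob is {len(blob)} bytes; limit is {MAX_BLOB_BYTES}")
    return blob


event = {"device": "sensor-17", "temp_c": 21.5}
blob = to_blob(event)

# Base64 is how the raw bytes travel inside a JSON request or response.
wire_form = base64.b64encode(blob).decode("ascii")
decoded = json.loads(base64.b64decode(wire_form))
print(decoded == event)  # True
```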
As I recently mentioned, Data Records can be put into a Data Stream at a maximum rate of 1,000 records per second up to a total of 1 megabyte per second. This limit determines how many shards need to be provisioned per stream. If more throughput is needed, it is possible to add more shards to a stream. This process is called resharding.
Resharding can be done from the AWS Console, the command line, or programmatically. For a use case that streams 3,000 data records per second, 3 shards are needed.
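The shard count follows from whichever write limit is reached first. A quick sketch of that calculation, reading 1 megabyte as 1 MiB (an assumption) and using a hypothetical function name:

```python
import math

RECORDS_PER_SHARD_PER_SEC = 1000
BYTES_PER_SHARD_PER_SEC = 1024 * 1024  # assumption: 1 megabyte read as 1 MiB


def shards_needed(records_per_sec, bytes_per_sec):
    """Smallest shard count that satisfies both per-shard write limits."""
    by_count = math.ceil(records_per_sec / RECORDS_PER_SHARD_PER_SEC)
    by_bytes = math.ceil(bytes_per_sec / BYTES_PER_SHARD_PER_SEC)
    return max(1, by_count, by_bytes)


print(shards_needed(3000, 1_500_000))       # 3: the record count is the binding limit
print(shards_needed(500, 5 * 1024 * 1024))  # 5: here the byte rate is
```

This assumes records are spread evenly across shards; a skewed partition key would need more headroom.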
When writing to a Kinesis Data Stream, a Data Record is put into exactly one shard for storage and processing. It will stay there until it expires. Once inside a shard, Data Records are immutable; they cannot be edited or deleted. If a Data Record needs to be updated, a new record can be added to the stream to replace it.
When creating a stream, the default retention period is set to 24 hours. This is also the minimum size of the retention period. Until November of 2020, the maximum retention period for data in a stream was 168 hours. This is 7 days. With the November 2020 change, Data Records can be retained in a Kinesis Data Stream for up to 8,760 hours. This is 365 days.
Originally, the default expiration was 24 hours and could be extended up to 7 days for an additional charge. This is still true and works the same way as it did before the update. Data Records stored beyond 24 hours and up to 7 days are billed at an additional rate for each shard hour.
Now, after 7 days, data stored in a stream is billed per gigabyte per month for up to a year. The retention period is configured when creating a stream and can be updated using the IncreaseStreamRetentionPeriod() and DecreaseStreamRetentionPeriod() API calls.
There is also a charge for retrieving data older than 7 days from a Kinesis Data Stream using the GetRecords() API call. There is no charge for long-term data retrieval when using an Enhanced Fan-Out Consumer with the SubscribeToShard() API call. Enhanced Fan-Out is something I'll talk about shortly.
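The retention tiers I just described can be summarized in a short sketch. The function name and tier labels are hypothetical, and the labels describe how the oldest retained data is billed rather than quoting any price.

```python
MIN_HOURS = 24
EXTENDED_MAX_HOURS = 168    # 7 days
LONG_TERM_MAX_HOURS = 8760  # 365 days


def retention_tier(hours):
    """Name the billing tier that a stream's retention period reaches."""
    if not MIN_HOURS <= hours <= LONG_TERM_MAX_HOURS:
        raise ValueError("retention must be between 24 and 8,760 hours")
    if hours == MIN_HOURS:
        return "default (included)"
    if hours <= EXTENDED_MAX_HOURS:
        return "extended (extra charge per shard hour)"
    return "long-term (charged per gigabyte per month)"


for hours in (24, 72, 168, 169, 8760):
    print(hours, retention_tier(hours))
```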
After a data record has been consumed, it remains in the stream until it expires. This allows the stream to be reprocessed or replayed as needed. It also means that multiple applications can have consumers that process the same stream.
I want to talk about a Data Stream's limits. Some of this will be a review. Each data record has a maximum size limit of 1 megabyte, each shard can accept 1,000 records per second, and the default retention period is 24 hours.
The size of the data record cannot be increased but the retention period can be extended up to 7 days for an additional charge and, then again, up to a year.
I want to address, at a high level, some of the limitations that Producer and Consumer applications have. Producers, when they put records into a stream, are limited to writing 1 megabyte per second per shard or 1,000 writes per shard per second.
If there are 5 shards, there will be 5 megabytes per second or 5,000 records per second of aggregate throughput available. However, exceeding the write limits for a single shard will return an error.
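The difference between aggregate capacity and per-shard capacity is easy to miss, so here is a sketch of it. It models one second of writes and reports which shards went over their own limits; the function name is hypothetical and 1 megabyte is read as 1 MiB.

```python
MAX_RECORDS = 1000
MAX_BYTES = 1024 * 1024  # assumption: 1 megabyte read as 1 MiB


def throttled_shards(writes):
    """Given one second of writes as (shard_index, size_bytes) pairs,
    return the shards whose own per-second limits were exceeded."""
    counts, sizes = {}, {}
    for shard, size in writes:
        counts[shard] = counts.get(shard, 0) + 1
        sizes[shard] = sizes.get(shard, 0) + size
    return sorted(s for s in counts if counts[s] > MAX_RECORDS or sizes[s] > MAX_BYTES)


# 5 shards, 3,000 small records in one second: well under the 5,000 aggregate,
# but 1,200 of them land on shard 0.
skewed = [(0, 100)] * 1200 + [(s, 100) for s in (1, 2, 3, 4) for _ in range(450)]
print(throttled_shards(skewed))  # [0]
```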
There are 2 types of consumers for Amazon Kinesis Data Streams: the original shared-throughput Consumer and Enhanced Fan-Out. Some people refer to the original Consumer as either the Classic Consumer or the Standard Consumer. With the original shared-throughput Consumer, each shard supports 2 megabytes per second of read throughput.
There's also a limit of 5 API calls per second per shard across all consumers. Each read request can return up to 10 megabytes of data or up to 10,000 records. If a request returns that much, subsequent requests made within the next 5 seconds will be throttled and throw an exception.
Thinking about this and doing the math, this is logical. If one request returns 10 megabytes, that's 5 seconds worth of data from a shard.
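That math, written out as a sketch with a hypothetical function name and megabytes read as MiB:

```python
READ_BYTES_PER_SEC = 2 * 1024 * 1024  # per shard; assumption: megabyte read as MiB


def seconds_of_read_budget(bytes_returned):
    """How many seconds of a shard's read throughput one response used up."""
    return bytes_returned / READ_BYTES_PER_SEC


print(seconds_of_read_budget(10 * 1024 * 1024))  # 5.0
```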
One of the advantages of Kinesis Data Streams is that it is possible to attach multiple Consumers to the same stream. More than that, these Consumer applications can be different. One application can aggregate records in the data stream, batch them, and write the batch to S3 for long-term retention.
A second application can enrich the records and write them into an Amazon DynamoDB table. At the same time, a third application can filter the stream and write a subset of the data into a different Kinesis Data Stream.
Using the standard Consumer, applications share the same 2 megabyte per second read throughput. Because of this, no more than two or three applications can efficiently connect to the data stream simultaneously.
To achieve greater outbound throughput across multiple applications using the Standard Consumer, data must be spread across multiple streams. If a developer needs 10 gigabytes per second of read throughput to support five different applications, they will need multiple streams with duplicate data.
To address this, AWS added two features to Kinesis Data Streams: Enhanced Fan-Out and an HTTP/2 data retrieval API. HTTP/2 is a major revision to the HTTP network protocol that introduces a new method for framing and transporting data between clients and servers. It’s a binary protocol that enables new features designed to decrease latency and increase throughput.
Each Enhanced Fan-Out Consumer gets 2 megabytes per second per shard of throughput. This sounds like the standard Consumer but it is different. The standard Consumer retrieves data by making a GetRecords() API request. Data Records are being pulled out of the shard. All standard Consumers share the same 2 megabyte per second throughput.
With Enhanced Fan-Out, the Consumer application will first register itself with Kinesis Data Streams. Once registered, it can make a SubscribeToShard() request and data is pushed to the application at a rate of 2 megabytes per second for five minutes.
After five minutes, the application will need to make another SubscribeToShard() request to keep receiving records. Once subscribed to the shard, Data Records are pushed from the shard to the consumer automatically. Every Enhanced Fan-Out Consumer gets the full 2 megabytes per second of throughput.
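The register, subscribe, and resubscribe cycle could be sketched like this. It assumes `client` behaves like a boto3 Kinesis client; the helper names, the placeholder ARN, and the `DryRunClient` stand-in are hypothetical, and the stand-in is only there so the sketch runs without AWS. A real consumer would also wait for the registration to become active and would handle errors, which I've left out.

```python
def register(client, stream_arn, name):
    """Register an Enhanced Fan-Out consumer and return its ARN."""
    response = client.register_stream_consumer(StreamARN=stream_arn, ConsumerName=name)
    return response["Consumer"]["ConsumerARN"]


def consume(client, consumer_arn, shard_id, handle, subscriptions=2):
    """Subscribe repeatedly, resuming each time from the last position seen."""
    position = {"Type": "LATEST"}
    for _ in range(subscriptions):
        response = client.subscribe_to_shard(
            ConsumerARN=consumer_arn, ShardId=shard_id, StartingPosition=position
        )
        # Records are pushed over the open connection until the subscription ends.
        for event in response["EventStream"]:
            batch = event["SubscribeToShardEvent"]
            for record in batch["Records"]:
                handle(record)
            position = {
                "Type": "AFTER_SEQUENCE_NUMBER",
                "SequenceNumber": batch["ContinuationSequenceNumber"],
            }


class DryRunClient:
    """Stand-in for a boto3 Kinesis client so the sketch runs without AWS."""

    def __init__(self):
        self.positions = []

    def register_stream_consumer(self, StreamARN, ConsumerName):
        return {"Consumer": {"ConsumerARN": f"{StreamARN}/consumer/{ConsumerName}"}}

    def subscribe_to_shard(self, ConsumerARN, ShardId, StartingPosition):
        self.positions.append(StartingPosition["Type"])
        n = str(len(self.positions))
        batch = {"Records": [{"SequenceNumber": n, "Data": b"payload"}],
                 "ContinuationSequenceNumber": n}
        return {"EventStream": [{"SubscribeToShardEvent": batch}]}


client = DryRunClient()
arn = register(client, "arn:aws:kinesis:region:account:stream/myStream", "demo-app")
seen = []
consume(client, arn, "shardId-000000000000", seen.append)
print(client.positions)  # ['LATEST', 'AFTER_SEQUENCE_NUMBER']
```

Carrying the continuation sequence number into the next subscription is what lets the consumer pick up where the expired one left off.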
There's no sharing of bandwidth. This means that 5 consumers would get 10 megabytes per second per shard of throughput, 10 consumers will get 20 megabytes per second per shard, and so on. The 2-megabyte per second per shard limit is gone. It can do this because Amazon Kinesis Data Streams pushes data to consumers using HTTP/2. Pushing data to consumers also removes the 5 API calls per shard per second limit. This increases the ability for consumer applications to scale and substantially reduces latency.
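The shared versus dedicated arithmetic looks like this as a sketch. It assumes standard Consumers split a shard's read allowance evenly, which is a simplification, and the function name is hypothetical.

```python
SHARD_READ_MB_PER_SEC = 2


def per_consumer_mb(consumers, enhanced_fan_out):
    """Read throughput each consumer can expect from one shard."""
    if enhanced_fan_out:
        return SHARD_READ_MB_PER_SEC          # dedicated to each consumer
    return SHARD_READ_MB_PER_SEC / consumers  # one allowance shared by all


for n in (1, 5, 10):
    shared = per_consumer_mb(n, enhanced_fan_out=False)
    dedicated_total = per_consumer_mb(n, enhanced_fan_out=True) * n
    print(n, shared, dedicated_total)
```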
With the standard Consumer, there's a limit of 5 requests per second per shard. Since a second is 1,000 milliseconds, this means that there are approximately 200 milliseconds between each request. A sixth consumer for a shard would add an additional one second of latency. This means that, with the standard Consumer, the latency is between 200 and 1,000 milliseconds.
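Here is where the 200 and 1,000 millisecond figures come from, as a sketch. It assumes the 5 calls per second are divided evenly among the consumers polling a shard; the function name is hypothetical.

```python
CALLS_PER_SEC_PER_SHARD = 5


def polling_interval_ms(consumers):
    """Shortest interval at which each consumer can poll one shard if the
    5 calls per second are divided evenly among them."""
    return 1000 * consumers / CALLS_PER_SEC_PER_SHARD


print(polling_interval_ms(1))  # 200.0
print(polling_interval_ms(5))  # 1000.0
```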
The push model with Enhanced Fan-Out reduces this to an average of 70 milliseconds. This is substantially faster. For perspective, the average human eye takes about 100 milliseconds to blink. For this performance boost, there is an increased cost. Be aware of that.
Also, by default, there's a soft limit of 20 consumer applications that can be registered per stream. However, this number can be increased by creating a support ticket with AWS. Each consumer application can only be registered with one data stream at a time.
There are use cases for both the Standard Consumer and Enhanced Fan-Out. Use the standard Consumer if there are fewer than five consuming applications that can tolerate a 200 millisecond latency.
Another factor to consider is that the standard Consumer also costs less than Enhanced Fan-Out. If minimizing costs is important, use the standard Consumer. Use Enhanced Fan-Out when five to ten applications are simultaneously consuming the same stream and latency of around 70 milliseconds is important. Enhanced Fan-Out only makes sense if you can tolerate higher costs.
I want to spend a moment to review the elements of a Kinesis Data Stream that I've covered in this lecture. Streams can be created using the AWS Console, programmatically using an SDK, or from the AWS CLI.
Charges start to accumulate as soon as a stream has been provisioned. When creating a stream, it has to have a unique name and at least one shard. Each shard in a stream has a Shard ID, a Hash Key range, and a starting Sequence Number.
The Shard ID is unique to the shard, the Hash Key ranges do not overlap between shards, and the Sequence Number is used to keep Data Records in order. When a Producer writes a Data Record to a stream, an MD5 Hash is created based on the Partition Key. This hash key value determines to which shard the Data Record is written.
Producers, when they put records into a stream, have a pair of limits. Data can be written at a rate of 1 megabyte per second per shard or there can be 1,000 writes per shard per second. There are two types of Consumers available to read a Kinesis Data Stream: a Standard Consumer and an Enhanced Fan-Out Consumer.
The Standard Consumer uses a polling method to get data from a stream. For reads, each shard supports up to 5 API calls per second and up to 2 megabytes per second, and each call can return up to 10,000 records.
Enhanced Fan-Out uses a push mechanism to send data to Consumers. There can be up to 20 Consumer Applications registered to a stream and each Consumer gets 2 megabytes per second per shard of throughput. This throughput comes at a cost. Be aware.
Okay, that wraps up this lecture. I hope you found it interesting and informative. I'm Stephen Cole for Cloud Academy. Thanks for watching!
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that there are no shortcuts to certification but it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.