This course is part one of two on how to stream data using Amazon Kinesis Data Streams.
In this course, you will learn about the Kinesis service, how to get data in and out of an Amazon Kinesis Data Stream, and how streaming data is modeled.
We'll also cover Kinesis Producer Applications and Kinesis Consumer Applications and how they communicate with a Kinesis Data Stream.
You'll also learn about the limits and costs of streaming data with Amazon Kinesis, as well as how data can be secured and protected inside a stream.
Learning Objectives
- Obtain a foundational understanding of Kinesis Data Streams
- Learn how to get data in and out of a stream and how to use streams to get data from its source to a destination
- Learn about the limits of Kinesis Data Streams and how to avoid them
- Understand the cost structure of the service
- Learn how to protect and secure data within a stream
Intended Audience
This course is intended for people who want to learn how to stream data into the AWS cloud using Amazon Kinesis Data Streams.
Prerequisites
To get the most out of this course, you should have a basic knowledge of the AWS platform.
Amazon Kinesis is a family of streaming services available from AWS. Currently, there are four services available. They are Kinesis Data Streams, Kinesis Data Firehose, Kinesis Video Streams, and Kinesis Data Analytics.
Kinesis Data Streams is a real-time data streaming service that can capture and process gigabytes of text-based data per second from millions of sources and store it durably. Data sources include website clickstreams, event streams, financial transactions, social media feeds, application logs and sensor data from IoT devices. This enables things such as analytics, dashboards, anomaly detection, and dynamic pricing to happen in real time.
That's a fine description, but I'm not sure how clear it is. Let me put it another way. Amazon Kinesis Data Streams is a type of fast queue that can handle millions of requests per second. According to the documentation, it's possible to have 10,000 shards in a stream, and each shard can process 1,000 records per second. That's ten million pieces of information. Per second.
AWS built Amazon Kinesis because, historically, increasing capacity to manage the volume of inbound streaming data has been difficult. When people think about scaling, it's usually about growing to accommodate demand. However, scaling is an elastic process: capacity should also shrink as demand decreases.
I'll start with a little history of the origins of streaming data. Once upon a time--because all good stories seem to start that way--data collection was done with a single server. This server processed as much data as it could until, one day, it needed a queue to keep up with demand.
At first, multiple processes put data in the queue and the server would process the data. The server was the consumer or worker. This consumer processed the queue and fed a database. Eventually, the one consumer was overwhelmed by the queue and more consumers were needed. This addition of more consumers put a strain on the database. This, in turn, required bigger and faster databases. Some databases evolved into Data Warehouses and Data Lakes.
At some point, larger and faster data stores became impractical. Consumer applications were refactored to write data in batches to the data store. Having batches meant there might be partial or duplicate data written to the data store. Special error handling was required.
Then, because the volume of inbound data overwhelmed the queue, the queue itself was broken up into smaller pieces called shards to keep up with demand. Streaming data didn't just appear one day as if by magic. Poof! It was organic and evolved to meet the needs of data collection.
Amazon Kinesis was built to address the needs of high-speed data ingestion in the AWS Cloud. Amazon Kinesis Data Streams can continuously capture gigabytes of data per second from sources such as mobile clients, website click streams, social media feeds, logs, and events with millisecond response times.
Kinesis Data Streams is not complicated but it is complex. It has a fair number of moving parts and it's important to understand all of them if you're going to use Data Streams efficiently and effectively.
I want to talk about how Kinesis Data Streams works and, to do that, I need to introduce some vocabulary. If you've worked with streaming data before, either with a custom framework or using something like Apache Kafka, some of these words will be familiar to you.
The important words and phrases that you need to learn are: Shard, Data Record, Partition Key, Sequence Number, Data Blob, Producer, and Consumer.
I'm going to start with how data moves through a Kinesis Data Stream.
Data from various sources is collected and formatted using a Kinesis Producer. The Producer--also called a Producer Application--puts data into a Kinesis Data Stream as Data Records in real time. The process of putting data into a stream is also referred to as data ingestion.
There are multiple ways to create Producer applications. You can use the Kinesis APIs and the AWS SDK as well as the Kinesis Agent, the Kinesis Agent for Windows, and the KPL, the Kinesis Producer Library. Additionally, some AWS services--Amazon CloudWatch, for example--can also be Producers for Kinesis Data Streams.
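To make that concrete, here's a minimal sketch of a Producer written against the Kinesis API with the AWS SDK for Python (boto3). The stream name and the payload are placeholders I've made up for illustration.

```python
# A minimal Producer sketch using boto3; "example-stream" and the event
# payload are hypothetical placeholders.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"device_id": "sensor-42", "temperature": 21.7}

response = kinesis.put_record(
    StreamName="example-stream",                # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),     # the SDK Base64-encodes the blob on the wire
    PartitionKey=event["device_id"],            # determines which shard receives the record
)

# The service returns the shard the record landed in and its sequence number.
print(response["ShardId"], response["SequenceNumber"])
```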
The Kinesis Agent is a stand-alone Java software application for Linux that offers an easy way to collect and send data to Kinesis Data Streams. The agent continuously monitors a set of files and sends new data to your stream. The agent handles file rotation, checkpointing, and retry upon failures. It delivers data in a reliable, timely, and simple manner.
The Kinesis Agent for Windows efficiently and reliably gathers, parses, transforms, and streams logs, events, and metrics to Kinesis Data Streams. The Kinesis Agent for Windows can also be used with Kinesis Data Firehose, Amazon CloudWatch, and CloudWatch Logs.
The Kinesis Producer library--or KPL--is an easy-to-use, highly configurable library from AWS designed to help programmers write data to a Kinesis data stream.
The KPL acts as an intermediary between a producer application and the Kinesis Data Streams APIs. Essentially, it is a layer of abstraction that includes error handling, automatic retries, data aggregation, and sending monitoring data to Amazon CloudWatch.
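The KPL itself is a Java library, so as a rough analogy, here's a hedged sketch of the kind of batching it automates, written with the plain SDK call PutRecords. The stream name and events are hypothetical, and unlike the KPL, retrying failed records is left to you.

```python
# Not the KPL itself: a plain-SDK sketch of batching records with PutRecords.
# The stream name and payloads are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis")

events = [{"device_id": f"sensor-{i}", "reading": i * 0.5} for i in range(100)]

records = [
    {
        "Data": json.dumps(e).encode("utf-8"),
        "PartitionKey": e["device_id"],
    }
    for e in events
]

# PutRecords accepts up to 500 records per call.
response = kinesis.put_records(StreamName="example-stream", Records=records)

# Unlike the KPL, you are responsible for retrying any failed records yourself.
print("Failed records:", response["FailedRecordCount"])
```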
Several AWS services can stream data directly into a Kinesis Data Stream. Those services include Amazon CloudWatch, CloudWatch Logs, Amazon EventBridge, the Amazon Database Migration Service, and even another Kinesis Data Stream.
For example, instead of installing an agent on an EC2 instance to send performance data to a Kinesis Data Stream, it can be sent directly to the stream from Amazon CloudWatch.
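As a hedged illustration of that idea, here's one way CloudWatch Logs can be pointed at a stream with a subscription filter. The log group, stream ARN, and IAM role below are hypothetical, and the role has to allow CloudWatch Logs to write to the stream.

```python
# A sketch of wiring CloudWatch Logs to a Kinesis Data Stream with a
# subscription filter. The log group, stream ARN, and role ARN are
# hypothetical; the role must allow CloudWatch Logs to call kinesis:PutRecord.
import boto3

logs = boto3.client("logs")

logs.put_subscription_filter(
    logGroupName="/example/application-logs",
    filterName="to-kinesis",
    filterPattern="",  # an empty pattern forwards every log event
    destinationArn="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    roleArn="arn:aws:iam::123456789012:role/CWLtoKinesisRole",
)
```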
No matter what type of Producer is used, data is put into a Kinesis Data Stream as a Data Record. Each Data Record has three parts. There is a Partition Key, a Sequence Number, and a Data Blob composed of Base64 encoded data.
The Partition Key determines to which shard inside the stream each Data Record is written.
Inside a stream, each individual Data Record is put into one--and only one--Shard. As I just mentioned, the Partition Key determines to which shard a Data Record is written. Data Records stay inside the shard until they expire.
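If it helps to see the mechanics, here's a small sketch of how that mapping works: Kinesis takes the MD5 hash of the Partition Key as a 128-bit integer and places the record in the shard whose hash key range contains that value. The stream name is a placeholder.

```python
# A minimal sketch of how a Partition Key maps to a shard, assuming a
# hypothetical stream named "example-stream". Kinesis hashes the key with MD5
# into a 128-bit integer and picks the shard whose hash key range contains it.
import hashlib
import boto3

kinesis = boto3.client("kinesis")

def shard_for_key(stream_name: str, partition_key: str) -> str:
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for shard in kinesis.list_shards(StreamName=stream_name)["Shards"]:
        # Skip shards that have already been closed.
        if "EndingSequenceNumber" in shard["SequenceNumberRange"]:
            continue
        start = int(shard["HashKeyRange"]["StartingHashKey"])
        end = int(shard["HashKeyRange"]["EndingHashKey"])
        if start <= hash_value <= end:
            return shard["ShardId"]
    raise ValueError("No open shard covers this hash value")

print(shard_for_key("example-stream", "sensor-42"))
```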
Streams consist of one or more shards, and each shard provides a fixed amount of throughput capacity for a stream. Because of this, the total number of shards in a stream determines the available throughput. When more throughput is needed, add one or more shards. When there's too much available throughput, remove shards.
The process of adding or removing shards in a stream is called resharding. Resharding is how Kinesis Data Streams scales up and down. This process is not done automatically. It must be done programmatically and requires constant monitoring to be done effectively and efficiently.
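For example, here's what a resharding call might look like with the UpdateShardCount API. The stream name and target shard count are hypothetical; UNIFORM_SCALING tells the service to split or merge shards evenly across the stream.

```python
# A sketch of resharding a hypothetical stream with UpdateShardCount.
import boto3

kinesis = boto3.client("kinesis")

kinesis.update_shard_count(
    StreamName="example-stream",     # hypothetical stream name
    TargetShardCount=4,              # hypothetical target
    ScalingType="UNIFORM_SCALING",   # split or merge shards evenly
)
```

While the operation runs, the stream moves into the UPDATING state, which I'll come back to shortly.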
Consumers--also called Consumer Applications--access the shards in a stream to process the streaming data and send it to its destination. Destinations include Data Lakes, Data Warehouses, durable storage, or even another stream.
To effectively use streaming data, it's important to understand the structure of a Kinesis Data Stream. As I mentioned earlier, a Producer takes data from a source and puts it into a stream as a Data Record. Inside the stream, it goes into a single shard. Data Records are never split between shards.
There are both theoretical and practical reasons for this. The biggest one is that the process of putting Data Records into a shard is not random. Better stated, it is deliberate: the Partition Key determines which shard each Data Record goes to.
Similar data, therefore, is grouped together in one or more shards. This makes it possible for Consumers, those applications that process streaming data, to work with the correct data. I'll get to Consumers shortly but, before I do that, I want to look at the data stream a little closer.
Data Records sit in the stream--like a buffer--for 24 hours by default. This time period can be extended for an additional cost. Data streams are immutable. This is an important concept. Once inside a stream, a Data Record cannot be deleted or edited; it can only expire.
This allows streams to be used by multiple Consumer applications and replayed as needed. This is why I said that streams act like a buffer. Producer applications put data into the stream in real time. However, the data in the stream does not need to be consumed in real time. It can be, it just does not have to be. As I've said already, data is available until it expires. Then, it is gone forever.
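As a quick illustration, here's how the retention period might be extended with the SDK. The stream name is a placeholder, and remember that longer retention adds to the cost.

```python
# A minimal sketch of extending retention beyond the 24-hour default for a
# hypothetical stream. Longer retention incurs an additional charge.
import boto3

kinesis = boto3.client("kinesis")

# Extend retention to 7 days (168 hours).
kinesis.increase_stream_retention_period(
    StreamName="example-stream",
    RetentionPeriodHours=168,
)
```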
Back to getting data out of the stream. A Consumer application--and there can be more than one--accesses a stream to process it. Data in a stream does not care whether or not it has been consumed before it expires. I suppose it's like a surly teenager brooding in its room.
Once the Data Record expires, the data is dropped from the stream and it's gone forever. There's no error or warning if it has not been consumed. Goodbye sulking teenager.
Maybe it's more of a Zen thing. Data in the stream doesn't care if it's been processed or not. It just exists in the stream until it doesn't. This behavior is different from a queue in the Amazon Simple Queue Service.
Records in an SQS Queue are marked as invisible while they are being accessed and removed from the queue when processing is complete. Messages that can't be processed by the SQS consumer are put into a dead-letter queue for analysis or manual processing. This means that, when using Kinesis Data Streams, error handling and message awareness have to be part of the consumer application.
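To show what that looks like in practice, here's a minimal Consumer sketch using the plain SDK: it reads one shard starting from the oldest available record. The stream name is hypothetical, and checkpointing, retries, and error handling are all left to you.

```python
# A minimal Consumer sketch for a hypothetical stream; TRIM_HORIZON starts at
# the oldest record still retained in the shard.
import boto3

kinesis = boto3.client("kinesis")

shard_id = kinesis.list_shards(StreamName="example-stream")["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName="example-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in response["Records"]:
    # Data arrives as raw bytes; checkpointing and error handling are up to you.
    print(record["SequenceNumber"], record["Data"])
```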
A stream can be in one of four states: CREATING, DELETING, ACTIVE, UPDATING. Streams in the CREATING state are being provisioned. Similarly, streams in the DELETING state are being removed from an account. When ACTIVE, the stream has been provisioned and is ready for read and write operations.
Oddly enough, the AWS documentation also says that, in an ACTIVE state, a stream is ready to be deleted. Now, I've worked in--and around--information technology for a long time. Sometimes, reading things like this confuses me. It's like saying being in a MARRIED state means you're ready to be divorced. I suppose that is true. However, my understanding is that only ACTIVE streams can be deleted, and that's what the documentation should probably have said.
The UPDATING state means that a resharding operation is in process. Read and write operations will continue to work while the stream is in this state.
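If you want to check the state of a stream yourself, a call like this returns it; the stream name is a placeholder.

```python
# A minimal sketch of checking a stream's current state.
import boto3

kinesis = boto3.client("kinesis")

summary = kinesis.describe_stream_summary(StreamName="example-stream")
print(summary["StreamDescriptionSummary"]["StreamStatus"])  # CREATING, DELETING, ACTIVE, or UPDATING
```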
A Kinesis Data Stream is a collection of one or more shards. A shard can be thought of in two primary ways: it is a unit of scale, and it is also a unit of parallelism. As a unit of scale, each shard has a fixed amount of read and write capacity. If more throughput is needed, add more shards. If there's too much, remove shards.
As a unit of parallelism, more shards can be added to a stream to allow for the concurrent writing of Data Records. While a stream--as a whole--can be thought of as a buffer--each shard is a type of high-speed ordered queue within a stream. A stream is a collection of these queues.
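Here's a back-of-the-envelope way to estimate how many of those queues--shards--a workload might need, using the per-shard limits from the documentation: 1 megabyte per second and 1,000 records per second for writes, and 2 megabytes per second for reads. The workload numbers below are made up for illustration.

```python
# A rough shard estimate from per-shard limits; the workload figures are
# hypothetical.
import math

write_mb_per_sec = 6.5      # expected inbound data volume
records_per_sec = 4_000     # expected inbound record count
read_mb_per_sec = 9.0       # expected outbound data across all consumers

shards_needed = max(
    math.ceil(write_mb_per_sec / 1.0),     # 1 MB/s write limit per shard
    math.ceil(records_per_sec / 1_000),    # 1,000 records/s write limit per shard
    math.ceil(read_mb_per_sec / 2.0),      # 2 MB/s read limit per shard
)
print(shards_needed)  # 7 for these example numbers
```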
When a Data Record is written to a Kinesis Data Stream, it is automatically replicated across three Availability Zones. This makes streams highly available and durable.
I want to quickly review what I've covered in this lecture. Kinesis Data Streams is one of the features of Amazon Kinesis. It is a massively scalable and durable real-time data streaming service. It is used to collect data from sources such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, location tracking events, and IoT devices.
Data collected is available within milliseconds inside a stream to enable applications to do things like analytics, dashboards, anomaly detection, and dynamic pricing in real time. A Kinesis Data Stream is made up of Shards.
Producer applications put data into streams as Data Records. Data Records have three main parts: a Partition Key, a Sequence Number, and a Data Blob. Data inside a Data Record is Base64 encoded text.
Data Records are put into shards based on their Partition Key and ordered by their Sequence Number. A Data Record can go into one and only one shard. A shard is both a Unit of Scale and a Unit of Parallelism. A single shard can support a data write rate of 1 megabyte per second or 1,000 records per second, whichever limit is reached first.
AWS automatically replicates stream data across three Availability Zones for durability and high availability. Consumer applications process Data Records stored in Kinesis Data Streams.
That's it for this lecture. I hope it was informative and helpful. I'm Stephen Cole for Cloud Academy and thank you for watching!
Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that there are no shortcuts to certification but it is possible to find the right path and course of study.
Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.
Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.
In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.