In this course, we will explore the analytics tools provided by AWS, including Elastic MapReduce (EMR), Data Pipeline, Elasticsearch, Kinesis, Amazon Machine Learning, and QuickSight, which is still in preview mode.
We will start with an overview of data science and analytics concepts to give beginners the context they need to be successful in the course. The second part of the course will focus on the AWS offering for analytics, that is, how AWS structures its portfolio across the different processes and steps of big data and data processing.
As a fundamentals course, the requirements are kept simple so you can focus on understanding the different services from AWS. However, a basic understanding of the following topics is necessary:
- Since we are talking about technology and computing services, general IT knowledge is necessary: the basics of programming logic and algorithms, plus study or work experience in the IT field.
- We will give you an overview of data science concepts, but your journey will be smoother if these concepts are already familiar to you.
- It is not mandatory, but it would be helpful to have general knowledge of AWS, specifically how to access your account and how to use services such as S3 and EC2.
The following two courses from our portfolio can help you better understand the basics of AWS if you are just starting out:
If you have thoughts or suggestions for this course, please contact Cloud Academy at firstname.lastname@example.org.
Welcome to the AWS Analytics Fundamentals course. In this video, we will cover the Amazon Kinesis family, composed of Kinesis Streams, Kinesis Firehose, and Kinesis Analytics. By the end of this video, you'll be able to create a Kinesis stream and understand the main concepts behind streaming data on AWS.
The Amazon Kinesis service is composed of a set of streaming services focused on massive-scale processing of real-time data. But first of all, what exactly is a stream in this context? A stream is a constant data flow generated in real time. Usually, when we talk about streaming, we first think of video streaming. But here we are talking about data instead of videos: data streams. Take, for example, a website clickstream, where different online users click on different areas and links of the website. A script captures these clicks in the background and sends them to a streaming service, which must support the massive volume of data generated and provide an easy way for other applications to consume it. Those applications can then detect important information such as heat maps of the most frequently accessed areas on screen, popular posts, and the best places for advertisements.
All this real-time data needs to be quickly processed and analyzed, and the results saved back to a storage engine, sent as notifications to a set of recipients, or displayed in a dashboard. In this context, Kinesis delivers an elastic streaming service to support these data-intensive applications. It abstracts the infrastructure layers away from developers, so we can focus on the application and analytics layers without the burden of controlling the queuing, organization, and flow of all the streaming data.
The Amazon Kinesis family is currently composed of three services: Kinesis Streams, the first service launched and so far the most used, and Kinesis Firehose and Kinesis Analytics, both introduced at re:Invent 2015. The data streaming service, Kinesis Streams, was launched in late 2013 for massive-scale real-time data streaming. Firehose focuses on delivering streaming data directly to a storage destination, such as S3 or Redshift. And Kinesis Analytics allows you to run SQL queries on real-time streaming data. We will now go into the details of each service.
Amazon Kinesis Streams enables you to build custom applications that process or analyze streaming data. You can have an unlimited number of sources writing data to a stream. Each source is called a producer in Kinesis terms. To consume data from Kinesis, we take advantage of the KCL, the Kinesis Client Library. With the KCL, we can build Amazon Kinesis applications and use streaming data to power real-time dashboards, generate alerts, implement dynamic pricing and advertising, and a lot more.
We'll go through the main characteristics very quickly so we can focus on the important concepts and our tutorial. It's real-time, as the core of a streaming service should be. It's easily consumed through two libraries: the KCL, used to get items from the stream, and the KPL, the Kinesis Producer Library, used to put items on the stream. Different applications can get data from the same stream. If you want to perform different actions on your data, such as transforming it for insertion into Redshift while also inserting the same stream data into Elasticsearch to query it and build visualizations with Kibana, you can do it. You can control the elasticity of your stream by adjusting its throughput and, as usual, there is no upfront cost. You pay as you go. In case of an Availability Zone failure, your data is not lost, as it's replicated by default to three facilities in the same region.
We will now talk about some very important concepts. Please pay attention, as certification questions are based on them. They are also very important for understanding how Kinesis Streams works. We start with the shard, which is the base throughput unit of a Kinesis stream, that is, the capacity of your stream to receive and deliver data. Each shard provides a capacity of 1 megabyte per second of input and 2 megabytes per second of output, and supports up to 1,000 PUT records per second.
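To make the shard limits concrete, here is a minimal Python sketch that estimates how many shards a workload needs. The function name `shards_needed` and the workload figures in the note below are hypothetical, chosen only for illustration; the per-shard limits are the ones quoted above.

```python
import math

# Per-shard limits as quoted in the text.
WRITE_MB_PER_SEC = 1.0        # input capacity per shard
READ_MB_PER_SEC = 2.0         # output capacity per shard
WRITE_RECORDS_PER_SEC = 1000  # PUT records per second per shard


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    by_read = math.ceil(read_mb_per_sec / READ_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_write, by_records, by_read)
```

Under these assumptions, a workload writing 2.5 MB per second in 500 records per second, with consumers reading 5 MB per second, would need 3 shards, while a workload of many tiny records (4,500 per second) would need 5 shards even at a low byte rate.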
Resharding is the process of adjusting your stream to your data. You can add capacity by adding shards, a process called splitting, or you can reduce the number of shards, which is called merging. Resharding is considered an advanced operation. Once started, it happens dynamically. A data record, as the name says, defines the format of the data that you can put into a stream.
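One rough way to picture splitting and merging: each shard owns a contiguous range of a 128-bit hash key space (a detail this video does not go into), a split cuts one range in two, and a merge joins two adjacent ranges. The sketch below only does that arithmetic locally. The helper names are hypothetical and nothing here calls AWS.

```python
MAX_HASH_KEY = 2**128 - 1  # upper end of the 128-bit hash key space


def split_range(start, end):
    """Split one shard's hash key range evenly into two child ranges."""
    if start >= end:
        raise ValueError("range is too small to split")
    new_start = (start + end) // 2 + 1  # first hash key of the upper child
    return (start, new_start - 1), (new_start, end)


def merge_ranges(lower, upper):
    """Merge two adjacent hash key ranges back into one."""
    if lower[1] + 1 != upper[0]:
        raise ValueError("only adjacent ranges can be merged")
    return (lower[0], upper[1])
```

Splitting the full range `(0, MAX_HASH_KEY)` gives two halves that meet at `2**127`, and merging those halves returns the original range.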
A data record is composed of a sequence number, a partition key, and a data blob. The data blob is your content, and its payload can be up to one megabyte. Payload means the data itself that you put there. The sequence number is automatically generated by Kinesis Streams, and the partition key is a key used to identify your data and route it between shards.
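As an illustration of those three parts, here is a small Python sketch of a data record as a producer might model it locally. The class name `DataRecord` is hypothetical, and the code assumes "one megabyte" means 1,048,576 bytes.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BLOB_BYTES = 1024 * 1024  # assumption: one megabyte taken as 1,048,576 bytes


@dataclass(frozen=True)
class DataRecord:
    partition_key: str                     # chosen by the producer
    data: bytes                            # the data blob (payload)
    sequence_number: Optional[str] = None  # filled in by the service, not by us

    def __post_init__(self):
        if not self.partition_key:
            raise ValueError("a partition key is required")
        if len(self.data) > MAX_BLOB_BYTES:
            raise ValueError("data blob exceeds the one megabyte limit")
```

The producer supplies only the partition key and the blob; the sequence number stays empty until the stream accepts the record.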
By default, data records remain on the stream for 24 hours, and this can be extended to up to 7 days. Pay attention to the default and maximum time that data records remain in the stream, as this is important information for questions covering streaming troubleshooting on the AWS exams. For example, a question could state that after a weekend, the content of the stream had disappeared. If the defaults were kept on the stream, this means that the records were removed by design after the 24-hour period.
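The weekend scenario can be checked with a few lines of Python. This is only a sketch of the retention rule as described above; the function name and the example dates are hypothetical.

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION_HOURS = 24
MAX_RETENTION_HOURS = 7 * 24  # the 7-day maximum mentioned in the text


def is_still_readable(put_time, read_time, retention_hours=DEFAULT_RETENTION_HOURS):
    """Return True if a record put at put_time is still in the stream at read_time."""
    if not DEFAULT_RETENTION_HOURS <= retention_hours <= MAX_RETENTION_HOURS:
        raise ValueError("retention must be between 24 hours and 7 days")
    return read_time - put_time < timedelta(hours=retention_hours)


# Hypothetical example: a record written on a Friday evening, read on Monday morning.
friday_evening = datetime(2016, 1, 15, 18, 0)
monday_morning = datetime(2016, 1, 18, 9, 0)
```

With the default retention, `is_still_readable(friday_evening, monday_morning)` is false because 63 hours have passed; with retention raised to 168 hours it is true.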
The partition key is important for routing or segmenting your data among the shards. For example, if you have two shards, you can add records with the keys shard A and shard B so that data with the same key is routed to the same shard. This also helps preserve the delivery order of your data to the clients. The partition key is assigned to the data record by the data producer when it puts your records on the stream.
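Behind the scenes, Kinesis hashes the partition key with MD5 into a 128-bit integer and picks the shard whose hash key range contains it. The sketch below mimics that routing locally. The helper names, the shard IDs, and the even two-shard layout are assumptions made for illustration.

```python
import hashlib

HALF = 2**127

# Assumed layout: two shards that split the 128-bit hash key space evenly.
TWO_SHARDS = {
    "shardId-000000000000": (0, HALF - 1),
    "shardId-000000000001": (HALF, 2**128 - 1),
}


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key, shard_ranges=TWO_SHARDS):
    """Return the ID of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for shard_id, (start, end) in shard_ranges.items():
        if start <= value <= end:
            return shard_id
    raise ValueError("no shard covers this hash key")
```

Records with the same key always land on the same shard. Note that two different keys are not guaranteed to land on different shards, since the hash decides the placement, not the key's name.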
Using the KPL, you define the partition key you want to use. The sequence number is a unique identifier for your record, assigned automatically when the producer calls the PutRecord or PutRecords API to add new records to the stream. The sequence number increases over time, so if there is a long interval between API calls, the number might be much larger than in previous put requests.
The Kinesis process. The picture below describes a common Kinesis Streams use case. First we have to create our stream and dimension it properly to handle the right amount of data. To create the Amazon Kinesis stream we can use either the Amazon Kinesis management console or the CreateStream API call. Then we need to develop the producer to continuously put data into our Amazon Kinesis stream. There is no way to put data through the console. You need to code an agent or a producer to insert the data into your stream.
In the figure we can see the different sources running producers, from mobile devices to servers and desktops. The producers put data into our stream using the API calls PutRecord for a single record or PutRecords for a batch insert. The data records are then inserted into the shards according to the partition key defined by the producer.
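A producer that uses the batch call could look roughly like the sketch below. It assumes `client` is a boto3 Kinesis client (for example `boto3.client("kinesis")`) or any object with the same `put_records` method; the function names and the `key_for` callback are hypothetical. PutRecords accepts at most 500 records per request, so the payloads are sent in chunks.

```python
MAX_RECORDS_PER_CALL = 500  # PutRecords limit per request


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def put_in_batches(client, stream_name, payloads, key_for):
    """Send payloads (bytes) with PutRecords and return the number of failed records.

    key_for is a caller-supplied function that maps a payload to its partition key.
    """
    failed = 0
    for batch in chunked(payloads, MAX_RECORDS_PER_CALL):
        response = client.put_records(
            StreamName=stream_name,
            Records=[{"Data": p, "PartitionKey": key_for(p)} for p in batch],
        )
        failed += response.get("FailedRecordCount", 0)
    return failed
```

A real producer would also retry the records reported as failed; that is left out here to keep the sketch short.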
The EC2 instances run the consumers, developed with the KCL, which read the data and process it, sending it, for example, to rest on S3 or DynamoDB so it can be further analyzed on Redshift or EMR. This is the basic process for a streaming application.
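To give a feel for what a consumer does, here is a minimal sketch that reads one batch from a single shard. Note that it uses the low-level GetShardIterator and GetRecords calls rather than the KCL, which handles checkpointing and load balancing for you. It assumes `client` is a boto3 Kinesis client or a compatible object; the function name is hypothetical.

```python
def read_shard_once(client, stream_name, shard_id, limit=100):
    """Read one batch of records from the oldest available position in a shard.

    Returns the list of data blobs and the iterator to use for the next read.
    """
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest record still retained
    )["ShardIterator"]
    response = client.get_records(ShardIterator=iterator, Limit=limit)
    blobs = [record["Data"] for record in response["Records"]]
    return blobs, response.get("NextShardIterator")
```

A long-running consumer would keep calling GetRecords with the returned iterator; the KCL wraps that loop so you only write the record-processing logic.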
Now we are going to create a Kinesis stream and, later, insert streamed data into it. So first of all, let's create a stream. We log on to the AWS console and go to Kinesis. If this is your first time, you'll see a different page from mine. Let's choose Kinesis Streams. Here I already have three streams created. Let's create another one with one shard. Remember that we have already seen information about the shards. The shards are the throughput unit, and each one is measured at one megabyte per second for writes and two megabytes per second for reads. The maximum transaction rate is up to 5 reads per second and 1,000 writes per second. So the number of shards you need depends on your data. If the number of records you have to include per second changes, you have to increase or decrease the number of shards. Let's hit Create and that's it. That's all you need to create a stream. Simple, right? The status is Creating.
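The same step can be done through the API instead of the console. The sketch below assumes `client` is a boto3 Kinesis client or a compatible object; the function name, polling interval, and retry count are hypothetical choices. It creates the stream and then polls until the status changes from CREATING to ACTIVE.

```python
import time


def create_stream_and_wait(client, stream_name, shard_count=1,
                           poll_seconds=5, max_polls=60):
    """Create a stream and block until it reports the ACTIVE status."""
    client.create_stream(StreamName=stream_name, ShardCount=shard_count)
    for _ in range(max_polls):
        description = client.describe_stream(StreamName=stream_name)
        status = description["StreamDescription"]["StreamStatus"]
        if status == "ACTIVE":
            return status
        time.sleep(poll_seconds)  # still CREATING, try again shortly
    raise TimeoutError("stream did not become ACTIVE in time")
```

Producers can only write once the stream is ACTIVE, which is why the demo moves on to the producer while the console still shows Creating.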
In the meantime, let's proceed with the producer. The producer is the client that inserts data into our stream. We are going to use the Kinesis agent, which was developed by AWS. The internals of the agent, as well as its installation, will not be covered in this session, as this is a fundamentals training. I'm logged in to an EC2 instance. This EC2 instance has a role attached with access to Kinesis.
What we want to do now is insert data into the Kinesis stream. First of all, let me show you the configuration of the AWS Kinesis agent. This is a simple JSON file. What's really interesting to us is this information here, these three lines. The file pattern defines where my data is stored locally. What this agent does is collect the data that is written to this file, app.log, and ship it to this stream, the Kinesis stream TestCA. So TestCA, this first stream here, will receive the data that we write to this log. So let's add some data.
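The configuration file itself is not reproduced in this transcript, so the following is only a sketch of what such a flow definition can look like, built in Python and printed as JSON. The `flows`, `filePattern`, and `kinesisStream` keys are the agent's documented settings; the log path `/tmp/app.log*` is an assumption, since the video does not say where app.log lives.

```python
import json


def agent_config(file_pattern, stream_name):
    """Build a minimal Kinesis agent configuration with one flow."""
    return {
        "flows": [
            {
                "filePattern": file_pattern,   # local files the agent tails
                "kinesisStream": stream_name,  # stream that receives each line
            }
        ]
    }


# Hypothetical path; only the file name app.log and the stream TestCA come from the demo.
print(json.dumps(agent_config("/tmp/app.log*", "TestCA"), indent=2))
```

For a Firehose delivery stream, the flow uses a `deliveryStream` key in place of `kinesisStream`.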
I've just created this short Bash script that will insert 1,000 records into our app.log file. One, two, three thousand inserted now, so the data is in the file. Theoretically, our test stream should already have some streamed data, so let's take a look at the monitoring section of our stream. We have one open shard. Let me look at the put record requests we had in the last hour and the last three hours. Since the metrics use a five-minute period, no operations have shown up so far. Let's wait a while. Now we can see that some records have appeared here.
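The script used in the demo is not shown, so here is a hedged Python stand-in that does the same job: it appends a batch of test lines to a log file that the agent is watching. The function name and the line format `record-N` are hypothetical.

```python
from pathlib import Path


def append_records(log_path, count=1000):
    """Append `count` test lines to the log file and return how many were written."""
    with Path(log_path).open("a", encoding="utf-8") as log:
        for i in range(count):
            log.write(f"record-{i}\n")
    return count
```

Calling `append_records("app.log")` three times reproduces the 3,000 lines from the demo, because the file is opened in append mode.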
Let's look, for example, at the put requests. They have now increased to 111. So we can see that our stream is receiving data. For this demo, we will not consume it, as the client library requires a more complex setup, but you can see here that we can easily ingest data with the AWS agent, or you can code your own agent with the KPL, the Kinesis Producer Library.
Kinesis Firehose is a simple way to get data directly from your application to S3 and Redshift. With Firehose, ingesting data directly from a stream becomes very simple. You don't need any consumer to get data from the stream and insert it into an S3 bucket, for example. Firehose does this automatically. As you can see in the figure, we only need to take care of developing the producer that puts the data into the delivery stream, which is a special type of stream. Once the data is in the stream, it is automatically sent to the final destination. Once it is stored, you can use other tools to consume it, visualize your data with third-party BI tools, or process it with another analytics framework.
The main features are the easy setup to configure your stream and point it to S3 or Redshift, the ability to load data in near real time, automatic scaling without intervention, support for multiple back ends as destinations, CloudWatch metrics integration, and automatic data encryption. On Kinesis Firehose, you create a different kind of stream, which we call a delivery stream. A delivery stream does not require consumers, as it automatically delivers its contents to the back end you set on creation. Currently, only an S3 bucket or a Redshift database is supported.
It's also important to note that in Firehose we do not have shard configuration or partition keys. A record on Firehose is a data blob of up to one megabyte in size. So in your put request you only pass your blob to the stream and that's it. The only code you need to write is for the data producers that feed the delivery stream. As I said before, you do not need to set the shard configuration: capacity is automatically increased or reduced according to your stream data.
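A Firehose producer can therefore be very small. The sketch below assumes `client` is a boto3 Firehose client (for example `boto3.client("firehose")`) or a compatible object; the function name is hypothetical. Each record carries only a data blob, with no partition key.

```python
def send_lines(client, delivery_stream_name, lines):
    """Put each text line on the delivery stream and return the number sent."""
    sent = 0
    for line in lines:
        client.put_record(
            DeliveryStreamName=delivery_stream_name,
            # Firehose concatenates records as-is, so add the newline ourselves.
            Record={"Data": (line + "\n").encode("utf-8")},
        )
        sent += 1
    return sent
```

Adding the trailing newline is a design choice for this sketch: without a delimiter, the records would run together in the delivered file.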
Now we're going to create our Firehose delivery stream. So first of all, log in to your AWS account and go to Kinesis. If it's your first time on the Kinesis service, you'll see this page. We'll go to Kinesis Firehose and create the delivery stream. First of all, we need to define the destination. As mentioned, there are currently only two supported destinations: an S3 bucket or a Redshift database. Let's choose an S3 bucket. We have to define the name. Let's use TestCA, just for testing. We need to select a bucket. Let's use one of our existing buckets. The prefix is an optional field that will be added in front of the name of every file created. We click Next, which takes us to the configuration of our delivery stream. First of all, the buffer. What's the buffer? The buffer tells the delivery stream when to move the contents of the stream to the S3 bucket, that is, when to create a new file. By default, every five megabytes the data is removed from the stream and an S3 file is created, or every 300 seconds, which is 5 minutes, all the records in the stream are shipped to S3.
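The buffer rule, flush on size or on time, whichever comes first, can be simulated in a few lines. This is a local sketch, not Firehose's implementation: the class name is hypothetical, five megabytes is assumed to mean 5 × 1,048,576 bytes, and the time limit is only checked when a record arrives, whereas the real service flushes on its own timer.

```python
class DeliveryBuffer:
    """Collect records and flush when either the size or the time limit is reached."""

    def __init__(self, max_bytes=5 * 1024 * 1024, max_seconds=300):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._chunks = []
        self._size = 0
        self._opened_at = None  # time of the first record in the current buffer

    def add(self, data, now):
        """Add a record at time `now` (seconds); return the flushed bytes or None."""
        if self._opened_at is None:
            self._opened_at = now
        self._chunks.append(data)
        self._size += len(data)
        too_big = self._size >= self.max_bytes
        too_old = now - self._opened_at >= self.max_seconds
        return self.flush() if too_big or too_old else None

    def flush(self):
        """Return everything buffered so far as one object and reset the buffer."""
        blob = b"".join(self._chunks)
        self._chunks, self._size, self._opened_at = [], 0, None
        return blob
```

Each flush corresponds to one new file in the bucket, which is why a low-volume stream produces a file roughly every five minutes and a busy one produces a file every five megabytes.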
We can also compress and encrypt the content. By default, it is left uncompressed and unencrypted. We need to create or update an IAM role to allow our Firehose delivery stream to add data to the S3 bucket. We don't have one, so we have to create it. Good, it has been created automatically. Now we go back to the configuration and click Next. Here we have the review: we created a delivery stream named TestCA that will send our content to our S3 bucket. We defined that every five minutes or every five megabytes the contents will be flushed to S3 as a new object, uncompressed and without encryption, and we created a role.
So now we create the delivery stream. With the delivery stream created, we now have to create the producer that puts data into it. As we did before, we are going to use the Kinesis agent provided by AWS to ship data to our delivery stream. In the configuration here, we have defined the file pattern, so again, we will write data to app.log and it will be sent automatically to the TestCA delivery stream. Let me take a look at the size of our app.log: 2,000 lines. Let's create some more entries. I create 1,000 entries, 1,000 more, and more again. I think that's enough. Now we should have 5,000. Good.
Now we are going to verify our delivery stream. We open our delivery stream, TestCA, and go to the monitoring tab. We can see that something happened here in the last minute. Incoming bytes have grown a lot, and so have incoming records. Remember how many we created? We created 3,000. And here we can see that 3,000 entries were just inserted into our delivery stream. That's really nice. It shows that it's working. But what really matters is the output. We must see something arriving in our S3 bucket. We have some information here that something was delivered to S3, but let's confirm it. This bucket should have a folder named logs. Yes. Inside it, as indicated, the delivery stream creates a folder structure based on the date, starting with the year.
This is done to improve searches: since the stream would usually create thousands and thousands of files here, it adds this kind of structure to improve search performance. It's a best practice. And here we can see that it has in fact created two files so far. Here, as you can see, is the file D21 and its data. So it's really getting the data we created on our EC2 instance, sent to the delivery stream and then to the S3 bucket. These are very flexible ways to get data into AWS.
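The folder structure follows the UTC delivery time in the form year, month, day, hour, placed after the prefix chosen at creation. The sketch below builds that path locally. The function name and the example timestamp are hypothetical, and the `logs/` prefix is inferred from the folder seen in the demo.

```python
from datetime import datetime, timezone


def delivery_folder(prefix, when):
    """Return the S3 folder for objects delivered at `when` (timezone-aware)."""
    utc_time = when.astimezone(timezone.utc)
    return f"{prefix}{utc_time:%Y/%m/%d/%H}/"


# Hypothetical delivery time, used only to show the shape of the path.
example = delivery_folder("logs/", datetime(2016, 1, 21, 14, 30, tzinfo=timezone.utc))
```

Here `example` is `logs/2016/01/21/14/`, so all files delivered in the same hour share one folder.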
Kinesis Analytics is a service that allows you to run SQL queries over your streams. The service is not yet generally available; it was announced at AWS re:Invent 2015 in October. Besides the ability to run SQL queries over streams, Kinesis Analytics also allows you to build real-time applications by sending the results of your queries to visualization and other data analytics tools. Scalability is managed internally, so you don't need to worry about throughput or performance. As the service is not yet available to the public, we cannot provide a tour of the console.
That's it for this video. We hope you have enjoyed it and learned a little bit about the Kinesis streaming services. See you in the next video.
Fernando has solid experience in infrastructure and application management in heterogeneous environments and has been working with cloud-based solutions since the beginning of the cloud revolution. Currently at Beck et al. Services, Fernando helps enterprises make a safe journey to the cloud, architecting and migrating workloads from on-premises to public cloud providers.