Amazon Kinesis: managed real-time event processing

As Big Data evolves, more tools and technologies aimed at helping enterprises cope are coming online. Live data needs special attention, because delayed processing can affect its value: a Twitter trend attracts more attention while it is tied to something happening right now; a logging system alert is only useful while the error still exists. To tame huge volumes of time-sensitive streaming data, AWS created Amazon Kinesis.
Amazon Kinesis is a fully managed, real-time, event-driven processing system that offers highly elastic, scalable infrastructure. It is designed to process massive amounts of real-time data generated from social media, logging systems, click streams, IoT devices, and more.
The open source Apache Kafka project actually shares some functionality with Amazon Kinesis. While Kafka is very fast (and free), it is still a bundled tool that needs installation, management, and configuration. If you would prefer to avoid the extra administrative burden and already have some AWS Cloud investment, then Kinesis may just be your new best friend.

Amazon Kinesis Architecture

(Figure: Amazon Kinesis building blocks)

Data Records

Data Records contain the information from a single event. A data record consists of a sequence number, a partition key, and a data blob.

  • Sequence numbers are assigned by Kinesis. Event consumers process the records of a shard in sequence-number order.
  • A partition key is an identifier chosen by the event submitter. Kinesis hashes it to produce a hash key, which determines the shard to which the data record belongs.
  • Data blobs are the actual payload objects, containing content like log records, tweets, and RFID records. Data blobs do not have a particular format and can be as large as 50 KB.

Streams

Streams are the core building block of the Amazon Kinesis service. Data records are written to streams by event producers and read by event consumers. A stream is composed of one or more shards, and each shard is a logical subset of the data within the stream. By default, events in a stream are stored for 24 hours.
Kinesis is meant for real-time data processing, and in real-time processing a stale record has little value. Amazon Kinesis streams are identified by Amazon Resource Names (ARNs).

Shards

Shards are the units to which event producers write data records and from which event consumers read them. Each shard receives data records according to its hash key range: Kinesis takes the partition key from each data record, hashes it to a 128-bit hash key, and assigns the record to the shard whose range contains that key.
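The mapping from partition key to shard can be sketched in a few lines of Python. This is an illustration only, not AWS code: it assumes MD5 as the hash function (the one AWS documents for partition keys) and assumes the hash key space is divided evenly between shards. The function names `hash_key`, `even_shard_ranges`, and `shard_for` are hypothetical.

```python
import hashlib
from typing import List, Tuple

HASH_SPACE = 2 ** 128  # number of distinct 128-bit hash keys


def hash_key(partition_key: str) -> int:
    """Hash a partition key to a 128-bit integer (MD5 is assumed here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_shard_ranges(shard_count: int) -> List[Tuple[int, int]]:
    """Divide the hash key space into contiguous, inclusive ranges."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key: str, ranges: List[Tuple[int, int]]) -> int:
    """Return the index of the shard whose range contains the key's hash."""
    h = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return index
    raise ValueError("hash key is outside every shard range")


if __name__ == "__main__":
    ranges = even_shard_ranges(4)
    for topic in ["aws", "kinesis", "bigdata"]:
        print(topic, "-> shard", shard_for(topic, ranges))
```

Because the shard is chosen from the hash of the key, every record with the same partition key lands on the same shard.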
The Kinesis user is responsible for shard allocation, and the number of shards determines the application throughput. According to AWS Kinesis documentation:

Each open shard can support up to 5 read transactions per second, up to a maximum total of 2 MB of data read per second. Each shard can support up to 1000 write transactions per second, up to a maximum total of 1 MB data written per second.

Shards are elastic in nature. You can increase or decrease the number of shards according to your load.
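As a rough sizing aid, the per-shard limits quoted above can be turned into a shard estimate. The sketch below is an assumption-laden illustration: the function name `shards_needed` is hypothetical, 1 MB is treated as 1000 KB, and the limits are the ones quoted in this post, which may differ from current AWS limits.

```python
import math

# Per-shard limits as quoted in this post (verify against current AWS documentation).
WRITE_KB_PER_SEC = 1000       # 1 MB written per second
WRITE_RECORDS_PER_SEC = 1000  # write transactions per second
READ_KB_PER_SEC = 2000        # 2 MB read per second


def shards_needed(records_per_sec, record_kb, read_kb_per_sec=0):
    """Estimate the shard count as the tightest of the three limits."""
    by_write_volume = math.ceil(records_per_sec * record_kb / WRITE_KB_PER_SEC)
    by_write_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    by_read_volume = math.ceil(read_kb_per_sec / READ_KB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_write_volume, by_write_count, by_read_volume)


if __name__ == "__main__":
    # 100 records per second of 50 KB each is 5000 KB/s of writes.
    print(shards_needed(100, 50))
```

The estimate takes the maximum across limits because a stream is throttled as soon as any single limit is exceeded.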

Kinesis Consumers

Kinesis consumers are typically Kinesis applications running on clusters of EC2 instances. A Kinesis consumer uses the Amazon Kinesis Client Library to read data from a stream’s shards. From the application's point of view, the stream pushes data records to it.
When Kinesis applications are created, they are automatically assigned to a stream, and the stream, in turn, associates the consumers with one or more shards. Consumers perform only lightweight tasks on data records before submitting them to Amazon DynamoDB, EMR, S3, or even a different Kinesis stream for further processing.
Consider a real-life Kinesis example: a Twitter data analysis application. Tweets are the data records, and all tweets together form the stream (i.e. the Twitter Firehose). The tweets are segregated by topic, so each topic name can be used as a partition key. All tweets belonging to a given set of topics are grouped together to form a shard.

Kinesis Operations

This post uses the Java API of the Amazon Kinesis client (the same operations are also exposed through the other AWS SDKs). The following operations are performed using the Kinesis client API:

Add Data Record to Stream

Producers call PutRecord to push data to a stream. Each record should be smaller than 50 KB. The producer creates a PutRecordRequest, passes {streamName, partitionKey, data} as input, and submits it with putRecord. To force strict ordering of records, call setSequenceNumberForOrdering and pass the sequence number of the previous record.
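The PutRecord flow can be sketched in Python. This is a minimal illustration, not the Java code the section describes: `put_event` is a hypothetical helper, `client` is assumed to behave like a boto3 Kinesis client (`boto3.client("kinesis")`), and `InMemoryStream` is a made-up stand-in so the sketch runs without AWS credentials. The keyword names `StreamName`, `PartitionKey`, `Data`, and `SequenceNumberForOrdering` are the ones boto3's `put_record` accepts.

```python
# Per-record limit quoted in this post; check the current service limits before relying on it.
MAX_RECORD_BYTES = 50 * 1000


def put_event(client, stream_name, partition_key, payload, previous_sequence_number=None):
    """Send one record and return (shard_id, sequence_number) from the response."""
    if len(payload) > MAX_RECORD_BYTES:
        raise ValueError("payload exceeds the per-record size limit")
    params = {"StreamName": stream_name, "PartitionKey": partition_key, "Data": payload}
    if previous_sequence_number is not None:
        # Ask Kinesis to order this record after the previous one.
        params["SequenceNumberForOrdering"] = previous_sequence_number
    response = client.put_record(**params)
    return response["ShardId"], response["SequenceNumber"]


class InMemoryStream:
    """Hypothetical stand-in for a Kinesis client, used to run the sketch offline."""

    def __init__(self):
        self.records = []

    def put_record(self, **params):
        self.records.append(params)
        # Real sequence numbers are long numeric strings; a counter is enough here.
        return {"ShardId": "shardId-000000000000", "SequenceNumber": str(len(self.records))}


if __name__ == "__main__":
    stream = InMemoryStream()
    _, first = put_event(stream, "tweets", "aws", b"first tweet")
    _, second = put_event(stream, "tweets", "aws", b"second tweet",
                          previous_sequence_number=first)
    print(first, second)
```

Passing the client in as an argument keeps the helper testable: the same function works with a real client or with the in-memory stand-in.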

Get Records from Shards

Retrieving records (up to 1 MB) from a shard requires a shard iterator. Create a GetRecordsRequest object that carries the shard iterator and pass it to the getRecords method. Obtain the next shard iterator from the returned GetRecordsResult and use it for the next call to getRecords.
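The iterator loop described here can be sketched as follows. Again this is an illustration in Python rather than the Java API: `read_shard` is a hypothetical helper, `client` is assumed to behave like a boto3 Kinesis client (whose `get_shard_iterator` and `get_records` calls use the keyword names shown), and `InMemoryShard` is a made-up stand-in that returns no next iterator once its records are exhausted, the way a closed shard would.

```python
def read_shard(client, stream_name, shard_id, max_calls=10, batch_limit=100):
    """Read a shard from its oldest record, following the next-iterator chain."""
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest available record
    )["ShardIterator"]
    records = []
    calls = 0
    # An open shard keeps returning iterators, so the loop is bounded by max_calls.
    # A real consumer would also pause between calls to respect the read limits.
    while iterator is not None and calls < max_calls:
        result = client.get_records(ShardIterator=iterator, Limit=batch_limit)
        records.extend(result["Records"])
        iterator = result.get("NextShardIterator")
        calls += 1
    return records


class InMemoryShard:
    """Hypothetical stand-in for a Kinesis client backed by a list of records."""

    def __init__(self, records):
        self._records = list(records)

    def get_shard_iterator(self, StreamName, ShardId, ShardIteratorType):
        return {"ShardIterator": "0"}

    def get_records(self, ShardIterator, Limit):
        start = int(ShardIterator)
        end = min(start + Limit, len(self._records))
        next_iterator = str(end) if end < len(self._records) else None
        return {"Records": self._records[start:end], "NextShardIterator": next_iterator}


if __name__ == "__main__":
    shard = InMemoryShard([{"Data": b"tweet-%d" % i} for i in range(5)])
    for record in read_shard(shard, "tweets", "shardId-000000000000", batch_limit=2):
        print(record["Data"])
```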

Resharding Streams

Resharding a stream splits or merges shards to match the changing event flow into the Kinesis stream. A single resharding operation either splits one shard into two or merges two shards into one. As AWS Kinesis bills you per shard, merging two shards halves the cost of that pair, while splitting a shard doubles its cost. Resharding is an administrative process that can be triggered by CloudWatch monitoring metrics.
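In terms of hash key ranges, a split divides one shard's range into two adjacent ranges and a merge joins two adjacent ranges back into one. The sketch below models only that range arithmetic; it does not call AWS, the function names are hypothetical, and splitting at the midpoint is just one possible choice of split point.

```python
def split_range(start, end):
    """Split an inclusive hash key range at its midpoint into two adjacent ranges."""
    if end <= start:
        raise ValueError("range is too small to split")
    mid = start + (end - start) // 2
    return (start, mid), (mid + 1, end)


def merge_ranges(left, right):
    """Merge two adjacent inclusive ranges into one."""
    if left[1] + 1 != right[0]:
        raise ValueError("only adjacent ranges can be merged")
    return (left[0], right[1])


if __name__ == "__main__":
    whole = (0, 2 ** 128 - 1)           # a single shard covering every hash key
    lower, upper = split_range(*whole)  # one shard becomes two
    print(lower, upper)
    print(merge_ranges(lower, upper) == whole)  # merging restores the original range
```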

Kinesis Connectors

Amazon Kinesis offers three connector types: S3 Connector, Redshift Connector, and DynamoDB connector.

Kinesis Pricing Model

Amazon Kinesis uses a pay-as-you-go pricing model based on two factors: Shard Hours and PUT Payload Units.

  • Shard Hour. In Kinesis, a shard provides a capacity of 1 MB/sec data input and 2 MB/sec data output and can support up to 1000 records per second. Users are charged for each shard at an hourly rate. The number of shards you need depends on your throughput requirements.
  • PUT Payload Unit. PUT Payload Units are billed at a rate per million units. In Kinesis, one PUT Payload Unit is a 25 KB chunk, and each record is rounded up to a whole number of units. So, for example, if your record size is 30 KB, you are charged for 2 PUT Payload Units. If your data record is 1 MB, you are charged for 40 PUT Payload Units.

In the AWS standard region, a shard hour currently costs $0.015. So, for example, let’s say that your producer produces 100 records per second and each data record is 50 KB. This translates to 5 MB/second (5000 KB/second) of input to your Kinesis stream. As each shard supports 1 MB/second (1000 KB/second) of input, we need 5 shards. Our shard cost will therefore be $0.075 per hour (0.015 * 5), and 24 hours of processing will cost $1.80.
Moreover, we need 2 PUT Payload Units for each data record (1 PUT Payload Unit = one 25 KB chunk). At 100 records per second, we are charged 200 PUT Payload Units per second, which is 720,000 per hour and 17,280,000 per day. At $0.014 per million PUT Payload Units, the cost will be $0.24192 per day (17,280,000 / 1,000,000 * 0.014).
So we will be charged a total of $2.04192 per day (1.80 + 0.24192) for our data processing.
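The arithmetic of this example can be re-derived with a short script. It is a sketch under the figures used in this post ($0.015 per shard hour, $0.014 per million PUT Payload Units, 25 KB per unit, 1 MB treated as 1000 KB); the function names are hypothetical and current AWS prices may differ. Note that 100 records per second at 2 units each comes to 200 PUT Payload Units per second.

```python
import math

# Rates as used in this post; treat them as assumptions, not current prices.
SHARD_HOUR_USD = 0.015
USD_PER_MILLION_PUT_UNITS = 0.014
PUT_UNIT_KB = 25
SHARD_WRITE_KB_PER_SEC = 1000
SECONDS_PER_DAY = 24 * 60 * 60


def put_units(record_kb):
    """PUT Payload Units per record: 25 KB chunks, rounded up."""
    return max(1, math.ceil(record_kb / PUT_UNIT_KB))


def daily_cost(records_per_sec, record_kb):
    """Return (shards, put_units_per_day, shard_cost_usd, put_cost_usd) for one day."""
    # Sized by write volume only; the records-per-second limit is ignored in this sketch.
    shards = max(1, math.ceil(records_per_sec * record_kb / SHARD_WRITE_KB_PER_SEC))
    shard_cost = shards * SHARD_HOUR_USD * 24
    units_per_day = put_units(record_kb) * records_per_sec * SECONDS_PER_DAY
    put_cost = units_per_day / 1_000_000 * USD_PER_MILLION_PUT_UNITS
    return shards, units_per_day, shard_cost, put_cost


if __name__ == "__main__":
    shards, units, shard_cost, put_cost = daily_cost(100, 50)
    print("shards:", shards)
    print("PUT Payload Units per day:", units)
    print("total per day: $%.5f" % (shard_cost + put_cost))
```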

A few Amazon Kinesis use cases:

  • Real-time data processing.
  • Application log processing.
  • Complex Directed Acyclic Graph (DAG) processing.

With the power of real-time data processing through a managed service from AWS, Amazon Kinesis is a perfect tool for storing and analyzing data from social media streams, website clickstreams, financial transaction logs, application or server logs, sensors, and much more.
Check out our intermediate course on Amazon Kinesis to learn all about it!
Have you used Kinesis yet? Why not share your experience?

Written by

Cloud Computing and Big Data professional with 10 years of experience in pre-sales, architecture, design, build and troubleshooting with best engineering practices. Specialities: Cloud Computing - AWS, DevOps (Chef), Hadoop Ecosystem, Storm & Kafka, ELK Stack, NoSQL, Java, Spring, Hibernate, Web Service
