Amazon Kinesis Analytics - In-depth Review
Amazon Kinesis Analytics - In-depth Review

We dive into the internals of Amazon Kinesis Analytics service, describing in detail the individual parts that make up the service. We'll introduce you to the key features and core components of the Kinesis Analytics service. We then take an in-depth look at the following components of Amazon Kinesis Analytics: Applications, Input Streams, SQL Queries, Pumps, Output Streams, Tumbling Windows, and Sliding Windows.


Let's now focus our attention on Amazon Kinesis Analytics. Amazon Kinesis Analytics enables you to quickly author SQL code that continuously reads, processes and stores data in near real time. With Amazon Kinesis Analytics, you can ingest in real time billions of small data points. Each and every individual data point can then be aggregated to provide intelligent business insights, which in turn can be used to continually optimize and improve business processes. Working with Kinesis Analytics requires you to perform the following three steps. Step one, configure an input stream. Step two, offer SQL queries to perform analysis. Step three, configure an output stream and write out the analysis results. We'll drill into each of these steps in greater detail as we move forward in this course. But before we dive into the specifics of the Kinesis Analytics service, let's step back and consider what might motivate us in the first place to justify using a real-time analytic service. As seen here within this graphic, the value of data diminishes over time. The ability to maintain peak performance of a business is often related to the ability to make timely decisions. The earlier we can make informed and actionable decisions, the quicker we can adjust and maintain optimal performance, and hence highlights the importance of being able to process data in near to real time. The type of decision making we can make is based on the age of the data itself. Considering this, we can see that data processed within real time allows us to make preventative and/or predictive decisions. Data processed within seconds allows us to make actionable decisions. Data processed within minutes to hours allows us to make reactive decisions. Data processed within days to months allows us to perform business intelligence-type historical reporting. Let's now consider some sources and generators of real-time streaming data. Mobile apps, applications that continually collect your current GPS position, streaming this in real time back to a centralized service. Click streams, open source tools, such as Open Web Analytics and Piwik, which when integrated track and stream webpage user behavior. Application logs, application logging, continually collecting the behavior and operating performance of server applications. IoT, the Internet of Things, sensors that are connected to the internet, which collect information about the local environment and stream back to a centralized service, for example the IoT thermostats. Social, social apps such as Twitter and Facebook, user posts and commentary. Moving on, let's now discuss how Amazon Kinesis Analytics can be utilized within the AWS platform and integrated with other AWS services. We'll now cover off a few high level examples, each of which highlights a different use case in which Kinesis Analytics can be used to process an incoming data feed in real time, and whereby the resulting analytical insights are then disseminated to a particular endpoint. Time Series Analytics to Kibana Dashboard. In this example, Kinesis Analytics is used to process and derive time series-based analytics, the outcomes of which are published into an Elasticsearch-hosted Kibana-based dashboard. This type of architecture is great for creating key performance indicators over Time Series data. Mobile web app to RedShift with QuickSight. This example involves a mobile web application that uses the AWS JavaScript SDK to send application clickstream data into a Kinesis stream. The Kinesis stream is an input stream processed by Kinesis Analytics application. Analytics are fed out into an output Kinesis stream, which is next transformed via a Lambda ETL function. Outputs are then pushed into a RedShift database. AWS QuickSight is used to create business reports. IoT real-time monitoring. This example involves a physical IoT thermostat collecting and streaming temperature readings to an AWS IoT topic. An AWS IoT Rule is configured to process the IoT topic and publish the data into a Kinesis input stream, which itself is then used as input into a Kinesis application. The Kinesis application outputs are then delivered to an output Kineses stream. The output stream is processed by subscribing Lambda function, which converts the data into custom metrics, published into CloudWatch, allowing us to react to data through implemented alarms. Amazon Kinesis Analytics provides the following key benefits. Real-time processing, this is a fundamental characteristic and differentiator of this service as when compared with other analytics tools. Fully managed, this is a fully managed service. AWS takes care of all the maintenance and operational aspects of the service, allowing you to maintain focus on the real-time analysis of data. Automatic elasticity, AWS would adjust and scale the underlying infrastructure as and when required to ensure the continual analysis regardless of volume and velocity of the incoming data. Standard SQL, this service supports standard SQL and as such allows you to reuse your existing SQL skills. Let's now begin to detail each of the key concepts as used and required by Kinesis Analytics deployment. For starters, the first thing you must configure is an application. A Kinesis Analytics application consists of three main subcomponents. One, an input stream. Input streams typically come from streaming data sources such as Kinesis streams, but it can also come from reference data sources stored in an S3 bucket. Two, SQL processing logic, a series of SQL statements that process input and produce output. The SQL code will typically perform aggregations and generate insights. SQL statements are authored in ANSI-compliant SQL. Amazon Kinesis Analytics implements the ANSI 2008 SQL standard with extensions. Three, an output stream. Output streams can be configured to hold intermediate results that are used to feed into other queries or be used to stream out the final results. Output streams can be configured to write out the destinations such as S3, Redshift, Elasticsearch and/or other Kinesis streams. In the next few slides, we'll begin to examine each of these subcomponents in further detail. At its core, a Kinesis Analytics application is designed to continuously read and process incoming data source streams. Example of data source streams are those implemented using either the Amazon Kinesis Stream or Amazon Kinesis Firehose services. Another optional input that a Kinesis Analytics application can be configured with is that of a static reference table. A reference table in the context of a Kinesis Analytics application is one in which provides enrichment to incoming data. Reference tables are stored and pulled from S3 buckets. The enrichment process is performed at query time using SQL joins. A data schema is associated and applied to all incoming data. The data schema can also be auto detected by the Kinesis Analytics application. However, you have the ability to either entirely override this by specifying your own schema or to customize and refine the auto derived schema. The data schema for an application's input stream defines the structure of the data itself. This is similar to the process of using DDL statements in a relational database such as a create table statement. Each and every input must be associated with a data schema. As already mentioned, the default schema is automatically derived at configuration time. Although the schema is inferred during the start of data ingestion, the Kinesis Analytics provides a schema editor which gives you control to manipulate and customize the schema to either a lax or tightened control over the inputs. Your SQL querying statements that you author represent the most important part of your Kinesis Analytics application as they generate the actual analytics that you wish to derive. Your analytics are implemented using one or several SQL statements, used to process and manipulate input and produce output. You author SQL statements that query streams to derive analytics. This process can involve intermediary steps, whereby the outputs of one query feed into a second in application stream. This process can be repeated multiple times until a final desired result is achieved persisted to an output stream. Multiple SQL statements can be utilized to provide you flexibility, allowing you to model and implement different data processing patterns such as fan out, fan in, aggregation, enhancement and/or filtering. The AWS Kinesis Analytics Console provides amongst other features a query editor where we can author, format and save our analytics queries. The query editor provides a very useful SQL template feature, whereby you can start from a predefined SQL template. AWS provides several SQL templates covering many of the common requirements when needing to perform analysis over streaming data. Example templates include the following. Continuous filter, performs a continuous filter based on a where condition. Aggregate function in a tumbling time window, aggregates rows over a 10-second tumbling window for a specified column. Multi-step application, use parallel or serial processing steps, intermediate in application streams are useful for building multi-step applications. Let's now take a quick look at a full set of SQL statements used to define a sample Kinesis Analytics application. This particular example is based on the aggregate function in a tumbling time window template. One, for starters, the template defines the data structure of the output destination stream. Two, next, a pump is created which directs its outputs to the destination stream created in the first statement. Three, finally, the pump is populated with an insert statement, which itself is defined by a select statement on the input source stream. A couple of points worth mentioning. The input source stream, by default, is named source SQL stream 001. The source stream represents your configured Kinesis stream or Kinesis Firehose stream. The output destination stream, by default, is named destination SQL stream. The select statement is always used in the context of an insert statement, that is when you select rows from one in application stream, you insert results into another in application stream. The insert statement is always used in the context of a pump, that is we use pumps to write to an in application stream. A pump is the mechanism used to make an insert statement continuous. As just briefly mentioned, you can implement a pipeline involving intermediary steps in which the outputs of one SQL query are used to populate and produce another in application stream. This process involves the concept of pumps. Pumps are used to take query outputs derived from a source stream and then store them into a separate in application stream. This process is a two-step process in which you first define a new in application stream and then pump data into it. Let's step through this process. The first step is to define a new stream into which the pump will be used to pump data into. We define the data structure of our new stream. In this example, we're defining a new stream named stage1stream which has four columns. The second step is to define a new pump. Here, we've named it stage1pump. This pump inserts data into the stage1stream. The inserted data is sourced from a select query performed on an input stream called application stream. When working with and analyzing streaming data, we need to employ time-based boundaries over the stream. With this in mind, in application streams are tagged with a special meta column called rowtime. Rowtime records the timestamp when Kinesis Analytics first inserts a row into the first in application stream. The same rowtime value is kept associated with the record as it moves between any intermediary streams within an application. Note, rowtime is used to implement aggregations over tumbling time-based windows. We cover this off in more detail when we get to the query window slide. Kinesis Analytics enables you to perform continuous queries. Continuous queries are perfect for scenarios where you need to be continually alerted of data changes that match the parameters within your query's filter. Kinesis Analytics takes care of running your SQL queries continuously on data while it's in transit, sending results to your required destinations. Once a continuous query is started, you'll get notified of all the data changes that fall into your query filter if any. For example, suppose you have an IoT device that streams heart rate information once per minute and you want to be notified when a heart rate change greater than 25% occurs. You can use the following query in your application code. This query runs continuously and emits records when a heart change greater than 25% is detected. As mentioned earlier, SQL queries in your application code are executed continuously over in application streams. To derive result sets from continually changing and unbounded data, you often need to bound your SQL queries using a time or row-based window. Windows in this context define the start and end of a query. SQL queries can be bounded by either time or row-based windows. Let's now discuss each of these individually. Time-based windows. A time-based windowed query is defined by specifying the window boundaries in terms of time. For example, we could define either a one-minute time-based window or something longer such as a 10-minute based window. Time-based windows utilize the timestamp column in your in application stream that is monotonically increasing. We earlier spoke of the rowtime column that Kinesis Analytics automatically tags on each coming record. Rowtime can be used to calculate time-based windows. Row-based windows. A row-based windowed query is defined by specifying the window size in terms of number of rows. Kinesis Analytics supports three types of windows, tumbling, sliding and custom. We'll now go over each of these individually. A tumbling window is a fixed size window where the bound to the window do not overlap with the window either directly before or directly after. The start of a new tumbling window begins with the end of the old window. Tumbling windows are often used to create time range-based reports. For example, you can use a tumbling window to compute the average number of clicks for a given webpage in the last 10 minutes. A sliding window is a fixed size window where the bound to the window do overlap with the window either directly before or directly after. Sliding windows emit new results any time a new row enters the window. A record can be part of multiple windows and can be processed with each window. Sliding windows are useful for maintaining KPIs and trend-based data reports. Custom windows are useful for event correlation. Web app sessionization is a good example where custom windows can be used. Sessionization is the process whereby a user's activity is tracked and recorded using a session id allocated against that user. In this case, a custom window will be defined based on the user's session id and would allow all actions completed by a user during a particular session to be reported on.

About the Author
Learning Paths

Jeremy is a Content Lead Architect and DevOps SME here at Cloud Academy where he specializes in developing DevOps technical training documentation.

He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 25+ years. In recent times, Jeremy has been focused on DevOps, Cloud (AWS, Azure, GCP), Security, Kubernetes, and Machine Learning.

Jeremy holds professional certifications for AWS, Azure, GCP, Terraform, Kubernetes (CKA, CKAD, CKS).