Fundamentals of Stream Processing
Difficulty: Intermediate
Duration: 25m
Students: 3452
Ratings: 4.9/5
Description

In this course, we take a look at streaming data, why it's important, and how Amazon Kinesis is used to stream data into the AWS cloud.

You'll learn what data streaming is, the problems it solves, and how Amazon Kinesis addresses them.

We'll also cover, at a very high level, what services exist inside Amazon Kinesis.  These are Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, and Kinesis Video Streams.

Learning Objectives

  • Understand the fundamentals of stream processing
  • Learn about the features of Amazon Kinesis
  • Learn about the services that make up Amazon Kinesis

Intended Audience

This course is intended for people who want to learn about streaming data, why it's important, and how Amazon Kinesis is used to send data into the AWS cloud.

Prerequisites

  • This course assumes no prior knowledge of Amazon Kinesis or streaming data.
  • A general understanding of the AWS cloud is recommended.
Transcript

Stream processing, Machine Learning, and Artificial Intelligence are popular topics inside cloud computing.  More and more companies seem to be using modern stream processing tools.  Cloud providers like AWS are releasing better and more powerful streaming products, and specialists are increasingly in high demand.

However, what--exactly--is stream processing?  The answer is complex but not complicated.  It is a large and varied topic.  Streaming data gained importance because not all data is created equal and its value changes over time.

Some information has value that can be measured in years.  Other data has great value but only at the moment it is produced.

Before stream processing existed, large volumes of data were usually stored either in a database or on an enterprise-class server and processed all at the same time.  The analysis of this data was performed using what we now call batch processing because, as it sounds, it was done in a single "batch."

With batch processing, data is collected, stored, and analyzed in chunks of a fixed size on a regular schedule.  The schedule depends on the frequency of data collection and the related value of the insight gained.  It's this value that is at the center of stream processing.
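
To make that concrete, here is a minimal, hypothetical Python sketch (not part of the original lecture) of the fixed-size batch pattern: events accumulate at rest, and analysis only runs once the chunk is full.  The batch size and event fields are invented for illustration.

```python
# Hypothetical sketch: the batch pattern, where data waits at rest until a
# fixed-size chunk has accumulated and is then analyzed all at once.
BATCH_SIZE = 1000      # process only once this many events have accumulated
batch = []             # events wait here, at rest, until the batch is full

def handle_event(event):
    batch.append(event)
    if len(batch) >= BATCH_SIZE:
        run_batch_analysis(batch)   # insight is delayed until the batch fills
        batch.clear()

def run_batch_analysis(events):
    # e.g., compute a total over the whole chunk at once
    total = sum(e["amount"] for e in events)
    print(f"Processed {len(events)} events, total amount: {total}")
```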

As I mentioned already, some information is nearly timeless.  Think of learning the alphabet, the order of operations in algebra, or the names of people and places.  This data is slow to change--if at all--and its value remains relatively constant.

However, some information is only valuable at the moment it's being accessed and processed.  Time-critical data is used for preventative maintenance or to react to one or more events in real time.

Consider those moments in your life when someone made a comment that left you speechless.  It's the perfect time for a witty comeback or retort and you have nothing to say. After walking away, you think of that perfect response but it is too late.  The moment is gone forever. 

There's no word that I know of, in English, to describe this phenomenon but, in French, the phrase is "l'esprit de l'escalier" and, in German, "Treppenwitz."  Depending on the translation, the words mean “the spirit of the staircase” or “the wit--or joke--of the staircase."  

That is, after you've walked away from the situation and started down a flight of stairs, that's when you think of the exact right thing to say.  It's way too late to be funny, witty, or otherwise engaging.

This describes how some data loses value over time.  As transactions happen, in the moment, that's when the data has value.  It might be a recommendation engine that suggests an additional item, sentiment analysis that determines how a person feels about a product, or anomaly detection for IoT hardware failures.

Processing this type of data minutes, hours, or even days later becomes a type of l'esprit de l'escalier or Treppenwitz: a staircase joke.  Except, it's not funny; sales are lost, people are angry or frustrated, and devices fail.

The l'esprit de l'escalier problem is really an issue of latency and the value of data over time.

After latency, there are two other issues related to batch processing that impact the value of data.  

The first issue that I'd like to mention involves session states.  In this context, think of a session as a collection of events or transactions that are related.  Batch processing systems split data into time intervals that are consistent and evenly spaced.  

This creates a steady workload that is predictable.  The trouble, here, is that, while predictable, it has no intelligence.  Sessions that begin in one batch might end in a different one.  This makes the analysis of related transactions difficult.  

The second issue is that batch processing systems are also designed to wait until a specific amount of data is accumulated before processing starts.  

Batch architectures have been optimized to process large amounts of data at a single time.  So, an analysis job might have to wait for extended periods of time because the queue needs to be full before processing can begin.

While the size of each batch job is uniform, the time period covered by each batch of data is inconsistent.

Stream processing, then, was created to address these issues of latency, session boundaries, and inconsistent batch intervals.

The term streaming is used to describe information as it flows continuously without a beginning or an end.

It is never-ending and provides a constant feed of events that can be acted upon without the need to be downloaded first.

A simple analogy is how water flows through a river or creek. Water comes from various sources, at varying speeds and volumes, and flows into a single, continuous, combined stream.

Similarly, data streams are generated by all types of sources, in various formats and volumes. 

These sources can be applications, networking devices, server log files, website activity, banking transactions, and location data.  

All of them can be aggregated in real time to respond and perform analytics from a single source of truth.

Stream processing, then, is acting on--or reacting to--data while it is in motion.  Computation happens in the moment data is produced or received.

When it receives an event from the stream, a stream processing application reacts to it. This reaction might be to trigger an action, update an aggregate or similar statistic, or cache the event for future reference.

Multiple data streams can be processed simultaneously and Consumers, applications that process the data from a stream, can create new data streams.
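
As a hypothetical illustration (not part of the original lecture), a consumer might look something like this Python sketch: it reacts to each event as it arrives, updates a running aggregate, caches the event, and triggers an action when a condition is met.  The event fields and the alert threshold are invented.

```python
# Hypothetical sketch: a consumer reacts to each event the moment it arrives.
running_total = 0.0      # an aggregate statistic kept up to date per event
recent_events = []       # a small cache of events for future reference

def on_event(event):
    global running_total
    running_total += event["amount"]     # update the aggregate
    recent_events.append(event)          # cache the event for later use
    if event["amount"] > 10_000:         # condition worth acting on immediately
        send_alert(event)                # trigger an action

def send_alert(event):
    print(f"ALERT: large transaction {event['id']} for {event['amount']}")
```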

An example of stream processing involves credit card fraud alerting.

Speaking from experience, I've been on trips where, after using my credit card in a city far from home, I've received a text message on my phone asking if a recent transaction was legitimate.  

This type of processing requires evaluating data in real time.  This includes evaluating the transaction, sending alerts, receiving a response, and acting on that response.

Instead of batches of a fixed size, a data stream is a collection of related events or transactions. 

Typically, a stream application has three main parts: Producers, a Data Stream, and Consumers.

Producers collect events or transactions and put them into a Data Stream.   The Data Stream, itself, stores the data.  Consumers access streams, read data, then act on it. 
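
As a rough sketch only (not something shown in the original lecture), these three parts might map onto Amazon Kinesis Data Streams as follows, using the boto3 client.  The stream name "transactions", the shard ID, and the record fields are assumptions made up for illustration.

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis")
STREAM = "transactions"   # assumed, pre-existing stream name

# Producer: collect an event and put it into the Data Stream.
def produce(event):
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["customer_id"]),  # keeps related events together
    )

# Consumer: read records from one shard of the stream and act on each one.
def consume(shard_id="shardId-000000000000"):
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM,
        ShardId=shard_id,
        ShardIteratorType="LATEST",
    )["ShardIterator"]
    while iterator:
        response = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in response["Records"]:
            handle(json.loads(record["Data"]))   # react to the event
        iterator = response.get("NextShardIterator")
        time.sleep(1)   # stay well under the per-shard read limits

def handle(event):
    print("received:", event)
```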

There are a number of benefits to using streaming data frameworks.

Some data naturally comes as a never-ending stream of events and is best processed while it is in flight.

Batch processing is built around a data-at-rest architecture. Before processing can begin, the collection has to be stopped and the data must be stored.

Each subsequent batch of collected data brings the need to create aggregates that span multiple batches.

In contrast to this, streaming architectures handle never-ending data flows naturally and with grace.  Using streams, patterns can be detected, results inspected, and multiple streams can be examined simultaneously.

Sometimes, the volume of data is larger than the existing storage capacity.  

Yes, in the cloud, there seems to be no limit to available storage but that storage comes with a price tag.  

It also comes with a human failing.  Sometimes, people are afraid to delete data because they might need it later.  

Using streams, raw data is processed in real-time and you retain only the information and insight that is useful.  

Stream processing naturally fits with time-series data and the detection of patterns over time. 

For example, detecting a pattern such as the length of a web session in a continuous stream of data would be difficult to do in batches.

Time series data, such as that produced by IoT sensors, is continuous and fits naturally into a streaming data architecture.
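
As a hypothetical illustration of the web-session example above (not part of the original lecture), a streaming consumer can keep a little state per user and detect when a session ends as the events flow by.  The timeout and field names are invented.

```python
# Hypothetical sketch: measuring web-session length over a continuous stream.
SESSION_TIMEOUT = 30 * 60      # close a session after 30 minutes of silence
sessions = {}                  # user_id -> (first_seen, last_seen) timestamps

def on_page_view(user_id, timestamp):
    first_seen, last_seen = sessions.get(user_id, (timestamp, timestamp))
    if timestamp - last_seen > SESSION_TIMEOUT:
        emit_session_length(user_id, last_seen - first_seen)   # session ended
        first_seen = timestamp                                 # start a new one
    sessions[user_id] = (first_seen, timestamp)

def emit_session_length(user_id, seconds):
    print(f"user {user_id} session lasted {seconds:.0f} seconds")
```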

There's almost no lag time between when events happen, insights are derived, and actions are taken.

Actions and analytics are up-to-date and reflect the state of the data while it is still fresh, meaningful, and valuable.

Streaming reduces the need for large and expensive shared databases. 

When using a streaming framework, each stream processing application maintains its own data and state, and--because of this--stream processing fits naturally inside a microservices architecture.

Something that I think is important to mention is that batch processing is still required.  Stream processing complements batch computing.

Month-end billing is still best done using some sort of batch process.  The value of billing data remains substantial and predictable.  Large scale reporting does not need expensive, high-speed, low-latency compute engines.   It's just that, as a consumer, I don't want to have to wait 30-45 days to learn about possible fraud.

Stream processing is used to collect, process, and query data in either real time or near real time to detect anomalies, generate awareness, or gain insight.  

Real-time data processing is needed because, for some types of information, the data has actionable value in the moment it is collected, and its value diminishes rapidly over time.

Stream processing can provide actionable insights within milliseconds to seconds of a recorded event.

So how important is stream processing? 

A better question, perhaps, is how important and/or useful is it to have immediate insight into how the business is operating, how customers feel, or what devices are online and in use? 

Consider real-time trading in commodities; a fraction-of-a-second advantage can translate into millions in profit or loss.

What about major consumer product companies doing global launches of products where millions of people log in at the same time to purchase?  On days like Black Friday or Cyber Monday people expect fast and consistent responses. 

Not every transaction requires an immediate response, but there are many that do.

Businesses that specialize in e-commerce, finance, healthcare, and security need immediate responses and this is the target market for stream processing. 

The problem is that companies need the ability to recognize that something important has happened and they need to be able to act on it in a meaningful and immediate way. 

Immediacy matters because data can be highly perishable and its shelf life can be measured in milliseconds.

This brings me to the end of this lecture.  Thank you for watching and letting me be part of your cloud journey.

If you have any feedback, positive or negative, please contact us at support@cloudacademy.com; your feedback is greatly appreciated.

For Cloud Academy, I'm Stephen Cole.  Thank you!

About the Author
Students: 35369
Courses: 20
Learning Paths: 16

Stephen is the AWS Certification Specialist at Cloud Academy. His content focuses heavily on topics related to certification on Amazon Web Services technologies. He loves teaching and believes that there are no shortcuts to certification but it is possible to find the right path and course of study.

Stephen has worked in IT for over 25 years in roles ranging from tech support to systems engineering. At one point, he taught computer network technology at a community college in Washington state.

Before coming to Cloud Academy, Stephen worked as a trainer and curriculum developer at AWS and brings a wealth of knowledge and experience in cloud technologies.

In his spare time, Stephen enjoys reading, sudoku, gaming, and modern square dancing.