Introduction to Streaming Data

Designing a streaming data pipeline presents many challenges, particularly around specific technology requirements. When designing a cloud-based solution, an architect is no longer faced with the question, “How do I get this job done with the technology we have?” but rather, “What is the right technology to support my use case?”

In this blog post, we will cover some initial scoping steps and then walk through an example. To learn more about data streaming, check out Building a Solution Using Artificial Intelligence and IOT.


The first questions

Regardless of exact details, four initial questions can be asked to start narrowing down what needs to be done:

1. What are the 3Vs of data?

This question is key to understanding the data itself. The 3Vs are Volume, Velocity, and Variety: how much data there is (Volume), how fast it arrives (Velocity), and how standardized it is (Variety). With this question, a non-descriptive statement such as “We have 10 GB/day of data” is refined into “We have 1,000,000 10 KB web-app logs a day that come in from our front-end web app.”
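As a rough illustration, here is a quick back-of-envelope check using the hypothetical numbers from that refined statement. It shows how a precise 3Vs statement translates directly into aggregate volume and velocity figures:

logs_per_day = 1_000_000      # refined statement: one million web-app logs a day
avg_log_size_kb = 10          # ~10 KB each

total_gb_per_day = logs_per_day * avg_log_size_kb / 1_000_000   # KB -> GB (decimal)
logs_per_second = logs_per_day / 86_400                          # seconds in a day

print(f"Volume:   ~{total_gb_per_day:.0f} GB/day")
print(f"Velocity: ~{logs_per_second:.1f} logs/second on average")

Running this gives roughly 10 GB/day and about 12 logs per second on average, which is exactly the kind of concrete figure a downstream technology choice can be checked against.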

2. Where does the data need to go?

Data movement, especially between locations, can be a difficult challenge. This question is particularly relevant to the IoT field, where data is often generated at a remote site rather than in the home data center. Even for non-IoT use cases, understanding whether the data sits on an application server, a core database, or even an FTP server is key. Knowing where the pipeline starts and stops will allow you to hook into the existing systems with ease.

3. What format and condition is the data in?

This question overlaps somewhat with the first one (the 3Vs of data) but lets us define with more granularity how much massaging is needed. Is the data in Avro, CSV, or already in a SQL database? If it is CSV data, are there improper commas in the free-text fields? This question allows us to understand the requirements at the “start” of the pipe.
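As a minimal sketch, the snippet below shows how a proper CSV parser copes with commas inside a quoted free-text field; the field names and the sample log line are hypothetical. Truly unquoted commas are a messier, condition-of-the-data problem that usually has to be fixed upstream or handled with heuristics:

import csv
import io

# Hypothetical web-app log line: the free-text "message" field contains commas
# but is quoted, so a CSV parser keeps it as a single field.
raw = 'id,timestamp,message\n42,2023-01-01T00:00:00Z,"login failed, retrying, attempt 2"\n'

reader = csv.DictReader(io.StringIO(raw))
for row in reader:
    print(row["id"], row["message"])   # -> 42 login failed, retrying, attempt 2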

4. What do we need to do with the data?

Very rarely does a streaming pipeline simply need to pick up and drop off data. Often the data needs to be enriched against a database, transformed, and cleaned before being dropped off. This question is potentially the most complicated to plan for, as it can involve deep introspection into the contents and syntax of the data itself.
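As a rough, hypothetical sketch of that clean-enrich-transform pattern (the lookup table, field names, and unit conversion are all assumptions, and an in-memory SQLite database stands in for a real reference database):

import sqlite3

# Hypothetical reference data used to enrich incoming records.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE machines (machine_id TEXT PRIMARY KEY, site TEXT)")
db.execute("INSERT INTO machines VALUES ('m-001', 'plant-a')")

def process(record: dict) -> dict:
    # Clean: reject records missing required fields.
    if "machine_id" not in record or "temp_c" not in record:
        raise ValueError("incomplete record")
    # Enrich: look up the site for this machine.
    row = db.execute(
        "SELECT site FROM machines WHERE machine_id = ?", (record["machine_id"],)
    ).fetchone()
    # Transform: convert units and attach the enrichment before drop-off.
    return {
        "machine_id": record["machine_id"],
        "site": row[0] if row else "unknown",
        "temp_f": record["temp_c"] * 9 / 5 + 32,
    }

print(process({"machine_id": "m-001", "temp_c": 21.5}))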

Streaming Data Pipeline

Streaming data vs batch processing

The distinction between batch processing and streaming is contentious and, in practice, increasingly outdated. Although there are academic differences between batch and stream processing, from a practical perspective architects should ask themselves, “How much lag is acceptable between data availability and output?”

Being able to define a requirement as “three seconds between data landing on the FTP server and being processed into MySQL” is much more descriptive and useful than “micro-batching.” Some technologies lend themselves to lower latencies (e.g., NiFi, Spark Streaming, Beam) and others to higher latencies (e.g., MapReduce), but they should be thought of as points on a sliding scale rather than as distinctly different approaches.

Walking through an example

Recently, one use case we worked through was helping a manufacturing company connect its machines to the cloud for centralized monitoring. At first this can seem like a relatively large task, but it becomes much more manageable when broken down with the scoping questions. (Disclaimer: details and numbers have been changed for privacy reasons.)

Scoping questions

What are the 3Vs of the data?

Production logs will be made available every 4-5 minutes and average 500 KB in size. They are highly standardized, as they are created by the instrument’s control software. Across all of the instruments, the expected data rate is below 200 MB/hour.

This gives us insight into what our system will need to handle. An incoming message size of 500 KB may be a problem for some systems, but the overall data flow rate is manageable. Standardized messages are generally easier to handle than non-standard ones.
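A quick sanity check on those figures; the instrument count below is an assumption for illustration, not a number from the actual project:

file_size_mb = 0.5        # ~500 KB per production log
interval_min = 4.5        # one log every 4-5 minutes
instruments = 25          # hypothetical fleet size

per_instrument_mb_hr = file_size_mb * (60 / interval_min)   # ~6.7 MB/hour
fleet_mb_hr = per_instrument_mb_hr * instruments             # ~167 MB/hour

print(f"Per instrument: ~{per_instrument_mb_hr:.1f} MB/hour")
print(f"Fleet total:    ~{fleet_mb_hr:.0f} MB/hour (stated budget: 200 MB/hour)")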

Where does the data need to go?

Files are generated on machines that sit on the factory floor. These instruments have an SFTP server on them that allows remote access. Ultimately, the data needs to land in the central quality tracking system, which has an RDBMS at its core and exposes a JDBC interface for integration.

Both the pick-up and drop-off points are easily accessible through common, easy-to-use interfaces.
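A minimal Python sketch of that pick-up and drop-off path might look like the following. The hostname, credentials, paths, and table are hypothetical; paramiko is one common SFTP client; and SQLite stands in here for the JDBC-fronted quality database:

import paramiko   # third-party SSH/SFTP client (pip install paramiko)
import sqlite3    # stand-in for the quality system's RDBMS in this sketch

# Hypothetical connection details for one instrument.
HOST, USER, PASSWORD = "machine-01.factory.local", "monitor", "secret"

# Pick up: connect to the instrument's SFTP server and list its log files.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(HOST, username=USER, password=PASSWORD)
sftp = ssh.open_sftp()

db = sqlite3.connect("quality.db")
db.execute("CREATE TABLE IF NOT EXISTS raw_logs (filename TEXT, payload BLOB)")

for filename in sftp.listdir("/logs"):
    with sftp.open(f"/logs/{filename}") as f:
        payload = f.read()
    # Drop off: land the raw file; parsing and filtering happen downstream.
    db.execute("INSERT INTO raw_logs VALUES (?, ?)", (filename, payload))

db.commit()
sftp.close()
ssh.close()

In production this loop would run on a schedule (or be handled by a tool such as NiFi), track which files have already been ingested, and write through the quality system’s actual JDBC interface rather than a local database.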

What format and condition is the data in?

The data is in a standardized XML format, generated consistently once per run.

What do we need to do with the data?

There is a lot of junk data in the machine logs. We need to strip out everything except a few fields pertaining to quality and throughput. The results need to be made available as quickly as possible to detect any errors or problems on the manufacturing floor.
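A minimal sketch of that filtering step, assuming hypothetical element names in the machine log and using the standard-library XML parser:

import xml.etree.ElementTree as ET

# Hypothetical machine log: keep only the quality/throughput fields, drop the rest.
raw = """
<run id="8675">
  <diagnostics>...hundreds of verbose entries we do not need...</diagnostics>
  <quality><defect_rate>0.02</defect_rate></quality>
  <throughput><units_per_hour>118</units_per_hour></throughput>
</run>
"""

root = ET.fromstring(raw)
record = {
    "run_id": root.get("id"),
    "defect_rate": float(root.findtext("quality/defect_rate")),
    "units_per_hour": int(root.findtext("throughput/units_per_hour")),
}
print(record)

Only the extracted record, not the full log, would then be written to the quality tracking system.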

Next steps

This blog post should be considered a primer on thinking about streaming data problems. For a deeper dive into case studies and a practical example, check out Calculated System’s Streaming Data Example or Automated Manufacturing Case Study.

Amazon Kinesis makes it easy to collect, process, and analyze real-time streaming data. Cloud Academy offers a Getting Started with Amazon Kinesis Learning Path: objective-driven learning modules made up of Courses, Quizzes, Hands-on Labs, and Exams. In this Learning Path, you will learn about the Amazon Kinesis components and use cases and understand how to create solutions and solve business problems.
