Introduction to Data Pipeline

This course is designed to show you how to use the AWS Data Pipeline service for data collection requirements. 

Learning Objectives

  • Recognize and explain the operational characteristics of a data collection system.
  • Recognize and explain how a collection system can be designed to handle the frequency of data change and the type of data being ingested.
  • Recognize and identify properties that may need to be enforced by a collection system.
  • Recognize and explain AWS Data Pipeline core concepts.

Intended Audience

This course is intended for students looking to increase their knowledge of data collection methods and techniques with big data solutions.

What You'll Learn

  • Introduction to Data Pipeline: In this lesson, we'll discuss the basics of Data Pipeline.
  • AWS Data Pipeline Architecture: In this lesson, we'll go into more detail about the architecture that underpins the AWS Data Pipeline Big Data Service.
  • AWS Data Pipeline Core Concepts: In this lesson, we'll discuss how we define data nodes, access, activities, schedules, and resources.
  • AWS Data Pipeline Reference Architecture: In this lesson, we'll look at a real-life scenario of how Data Pipeline can be used.

Welcome to Big Data on AWS. We're looking at collecting data with AWS Data Pipeline. At the end of this module, you will be able to describe in detail how AWS Data Pipeline can be used to collect data within a big data solution. We kick off by covering the Amazon services that enable us to collect data before moving on to services that enable us to store, process, analyze, and visualize big data.

We will start off by looking at AWS Data Pipeline, then move on to a module on Amazon Kinesis, and round it out with a module on AWS Snowball. AWS Data Pipeline is primarily designed to collect data from a data source and move it to a target data source. AWS Data Pipeline also allows you to process data as you move it, so it is often used as the core service within a big data analytics solution or as a modern extract, transform, and load (ETL) capability. AWS Data Pipeline helps you create complex data workloads that are fault tolerant, repeatable, and highly available. It executes the scheduling, retry, and failure logic for these workflows as a highly scalable and fully managed service.
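The source-to-target flow described above is expressed in a pipeline definition made up of objects: data nodes, an activity, and a schedule. As a minimal sketch (the bucket names, paths, and schedule period below are illustrative placeholders, not values from the course), a daily copy pipeline in the Python equivalent of AWS Data Pipeline's JSON definition format might look like:

```python
# A minimal sketch of an AWS Data Pipeline definition, written as the
# Python equivalent of the service's JSON pipeline-definition format.
# Bucket names, paths, and the schedule period are illustrative.
pipeline_definition = {
    "objects": [
        {   # Default object: settings inherited by every other object
            "id": "Default",
            "scheduleType": "cron",
            "failureAndRerunMode": "CASCADE",
        },
        {   # Schedule: run the pipeline once a day
            "id": "DailySchedule",
            "type": "Schedule",
            "period": "1 day",
            "startAt": "FIRST_ACTIVATION_DATE_TIME",
        },
        {   # Source data node: where the input data lives
            "id": "InputNode",
            "type": "S3DataNode",
            "directoryPath": "s3://example-source-bucket/input/",
        },
        {   # Target data node: where the results land
            "id": "OutputNode",
            "type": "S3DataNode",
            "directoryPath": "s3://example-target-bucket/output/",
        },
        {   # Activity: copy from source to target on the schedule
            "id": "CopyInputToOutput",
            "type": "CopyActivity",
            "input": {"ref": "InputNode"},
            "output": {"ref": "OutputNode"},
            "schedule": {"ref": "DailySchedule"},
        },
    ]
}
```

Notice that the activity does not name buckets directly; it references the data nodes and schedule by `id`, which is what makes the workflow repeatable and lets the service drive the retry and scheduling logic.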

When choosing a big data processing solution from within the available AWS service offerings, it is important to determine whether you need the latency response from the process to be in seconds, minutes, or hours. This will typically drive the decision on which AWS service is best for your processing pattern or use case. AWS Data Pipeline's processing pattern depends on how you define your pipeline; effectively, the closest processing pattern is batch execution, or the typical ETL workflow paradigm. One of the interesting things when we look at storage patterns is that AWS Data Pipeline does not store persistent data itself.

And, like many of the other Amazon big data services, AWS Data Pipeline needs to be deployed as part of a larger solution, where you define a target big data solution that will store the results of the pipeline process. AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it's stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR. AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available.
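When registering a pipeline definition programmatically, the `boto3` `datapipeline` client's `put_pipeline_definition` call expects each object flattened into a list of key/value fields, with references to other objects expressed as `refValue` entries rather than nested dicts. A hedged sketch of that conversion (the helper name `to_pipeline_objects` and the sample object are illustrative; only the output shape follows the API):

```python
def to_pipeline_objects(definition):
    """Flatten JSON-style pipeline objects into the field-list shape
    expected by boto3's datapipeline put_pipeline_definition call."""
    pipeline_objects = []
    for obj in definition["objects"]:
        fields = []
        for key, value in obj.items():
            if key == "id":
                continue  # id is carried at the object level, not as a field
            if isinstance(value, dict) and "ref" in value:
                # References to other pipeline objects use refValue
                fields.append({"key": key, "refValue": value["ref"]})
            else:
                fields.append({"key": key, "stringValue": str(value)})
        pipeline_objects.append(
            {"id": obj["id"], "name": obj.get("name", obj["id"]), "fields": fields}
        )
    return pipeline_objects

# With AWS credentials configured, registration would then look like:
#   client = boto3.client("datapipeline")
#   resp = client.create_pipeline(name="demo-pipeline", uniqueId="demo-001")
#   client.put_pipeline_definition(
#       pipelineId=resp["pipelineId"],
#       pipelineObjects=to_pipeline_objects(definition),
#   )
#   client.activate_pipeline(pipelineId=resp["pipelineId"])
```

Nothing runs until `activate_pipeline` is called; until then the definition is only validated and stored, which is consistent with Data Pipeline holding no persistent data of its own.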

You don't have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure-notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premises data silos.

About the Author

Shane has been immersed in the world of data, analytics, and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.

He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you to create, or how cool the latest AWS feature is.

Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and Kiwifruit. However, you're more likely to see him partake of a good long black or an even better craft beer.