In course one of the AWS Big Data Specialty Data Collection learning path, we explain the various data collection methods and techniques for determining the operational characteristics of a collection system. We explore how to define a collection system able to handle the frequency of data change and the type of data being ingested. We identify how to enforce data properties such as order, data structure, and metadata, and how to ensure the durability and availability of our collection approach.
Learning Objectives
- Recognize and explain the operational characteristics of a collection system.
- Recognize and explain how a collection system can be designed to handle the frequency of data change and the type of data being ingested.
- Recognize and identify properties that may need to be enforced by a collection system.
Intended Audience
This course is intended for students looking to increase their knowledge of data collection methods and techniques with big data solutions.
Prerequisites
While there are no formal prerequisites, students will benefit from having a basic understanding of the analytics services available in AWS. Please take a look at our Analytics Fundamentals for AWS course.
This Course Includes
- 45 minutes of high-definition videos
- Live hands-on demos
What You'll Learn
- Introduction to Collecting Data: In this lesson, we'll prepare you for what we'll be covering in the course: the Big Data collection services of AWS Data Pipeline, Amazon Kinesis, and AWS Snowball.
- Introduction to Data Pipeline: In this lesson, we'll discuss the basics of Data Pipeline.
- AWS Data Pipeline Architecture: In this lesson, we'll go into more detail about the architecture that underpins the AWS Data Pipeline Big Data Service.
- AWS Data Pipeline Core Concepts: In this lesson, we'll discuss how we define data nodes, access, activities, schedules, and resources.
- AWS Data Pipeline Reference Architecture: In this lesson, we'll look at a real-life scenario of how data pipeline can be used.
- Introduction to AWS Kinesis: In this lesson, we'll take a top-level view of Kinesis and its uses.
- Kinesis Streams Architecture: In this lesson, we'll look at the architecture that underpins Kinesis.
- Kinesis Streams Core Concepts: In this lesson, we'll dig deeper into the data records.
- Kinesis Streams Firehose Architecture: In this lesson, we'll look at the Firehose architecture and the differences between it and Amazon Kinesis Streams.
- Firehose Core Concepts: Let's take a deeper look at some details about the Firehose service.
- Kinesis Wrap-Up: In this summary, we'll look at the differences between Kinesis and Firehose.
- Introduction to Snowball: Overview of the Snowball Service.
- Snowball Architecture: Let's have a look at the architecture that underpins the AWS Snowball big data service.
- Snowball Core Concepts: In this lesson, we'll look at the details of how Snowball is engineered to support data transfer.
- Snowball Wrap-Up: A brief summary of Snowball and our course.
Okay, let's have a look at the data architecture that underpins the AWS Data Pipeline big data service. In AWS Data Pipeline, data nodes and activities are the core components in the architecture. A data node is the location of input data for a task or the location where output data is to be stored.
Activities are a definition of work to perform on a scheduled basis using a computational resource and typically consuming data from input data nodes and storing the results on output data nodes. You effectively daisy chain these objects together to create a workflow process.
These workflow processes are called pipelines, and the pipeline definition specifies the end-to-end business logic of your data management processes. Schedules define the timing of a scheduled event, such as when an activity runs. Preconditions are a component of your pipeline definition that contain conditional statements which must be true before an activity can run. For example, a precondition can check whether source data is present before a pipeline activity attempts to copy it. The task runner polls for tasks and then performs those tasks. For example, a task runner could copy log files to Amazon S3 and launch Amazon EMR clusters.
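To make these core components concrete, here is a minimal sketch of a pipeline definition in the JSON object format that AWS Data Pipeline accepts, written as a Python dictionary. The object IDs, S3 paths, and schedule values are illustrative assumptions, not values from the course.

```python
import json

# Illustrative pipeline definition: a schedule, input and output S3 data
# nodes, a precondition on the input data, and a copy activity that daisy
# chains them together. All IDs and S3 paths are made-up placeholders.
pipeline_definition = {
    "objects": [
        {   # Schedule: run the pipeline once a day, starting on activation.
            "id": "DailySchedule", "type": "Schedule",
            "period": "1 day", "startAt": "FIRST_ACTIVATION_DATE_TIME"
        },
        {   # Precondition: only run if source data is actually present.
            "id": "InputDataExists", "type": "S3PrefixNotEmpty",
            "s3Prefix": "s3://example-bucket/incoming/"
        },
        {   # Input data node: the location the activity reads from.
            "id": "InputNode", "type": "S3DataNode",
            "directoryPath": "s3://example-bucket/incoming/",
            "precondition": {"ref": "InputDataExists"},
            "schedule": {"ref": "DailySchedule"}
        },
        {   # Output data node: the location the results are stored in.
            "id": "OutputNode", "type": "S3DataNode",
            "directoryPath": "s3://example-bucket/processed/",
            "schedule": {"ref": "DailySchedule"}
        },
        {   # Activity: the scheduled unit of work consuming the input node
            # and writing to the output node. The compute resource it runs
            # on ("runsOn") is added in the next sketch.
            "id": "CopyIncomingData", "type": "CopyActivity",
            "input": {"ref": "InputNode"},
            "output": {"ref": "OutputNode"},
            "schedule": {"ref": "DailySchedule"}
        }
    ]
}

# Save the definition so it can be registered with the service,
# for example via the AWS CLI or an SDK call.
with open("pipeline-definition.json", "w") as f:
    json.dump(pipeline_definition, f, indent=2)
```

Each object is just an ID, a type, and a set of fields, and references between objects are what turn individual components into a workflow.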
Task runner is installed and runs automatically on resources created by your pipeline definitions, or on your own dedicated compute server. You can write a custom task runner application, or you can use the task runner application that is provided by the AWS Data Pipeline service. Task runners use the resources you have defined to execute the activities. A resource is the computational resource that performs the work that a pipeline activity specifies. AWS Data Pipeline supports the use of EC2 instances and EMR clusters as those resources.
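Continuing the same illustrative definition, a resource is declared as just another pipeline object, and an activity points at it with a runsOn reference; the instance type and termination window below are assumptions for the sketch, not values from the course.

```python
# Illustrative resource object and the activity that runs on it. AWS Data
# Pipeline provisions this EC2 instance for each scheduled run, and the task
# runner installed on it polls for and executes the activity.
ec2_resource = {
    "id": "WorkerInstance", "type": "Ec2Resource",
    "instanceType": "t1.micro",           # small instance for a small copy job
    "terminateAfter": "30 Minutes",       # cap the runtime to limit cost
    "schedule": {"ref": "DailySchedule"}
}

copy_activity = {
    "id": "CopyIncomingData", "type": "CopyActivity",
    "input": {"ref": "InputNode"},
    "output": {"ref": "OutputNode"},
    "runsOn": {"ref": "WorkerInstance"},  # execute on the EC2 resource above
    "schedule": {"ref": "DailySchedule"}
}
```

Swapping the Ec2Resource for an EmrCluster object is how the same pattern scales out to cluster-based activities.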
Before we go into each of the options in detail, let's have a quick look at how AWS makes things easier for you. One of the great things about AWS is that they always try to make things easy for you. So when you go to create a new AWS Data Pipeline definition in the console, there are a number of pre-baked blueprints that cover the scenarios where a data pipeline is commonly used. You can configure your pipeline using pre-defined templates, import a definition defined previously, or build a pipeline definition from scratch using the Architect feature. You also define the schedule on which you wish this data pipeline to execute, either running the pipeline as soon as it is activated or scheduling it to run on a regular basis.
You also need to define whether to log the pipeline execution and where to log it, as well as any required security permissions. As I have mentioned, you can use the pre-defined templates that AWS has created for common data pipelines; for example, loading data from Amazon S3 into Amazon Redshift. Selecting one of these templates will create additional parameter options for you to complete, and will then generate the pipeline definition as if you had manually created it in the Architect yourself. This is an example of the parameters required to copy MySQL data in Amazon RDS into Amazon Redshift.
One thing to note: this pipeline launches an Amazon EC2 instance, a t1.micro in this case, in your account on every scheduled execution of the pipeline as the resource performing the activity, so watch out for the cost implications. When you select to run a job on AWS EMR, you will need to define the EMR steps you wish to execute. We cover EMR steps, as well as the cluster node instance types, in the big data processing course. Again, note that this pipeline launches an Amazon EMR cluster, an m1.medium in this case, in your account on every scheduled execution of the pipeline as the resource performing the activity; a sketch of what these EMR objects can look like in a definition follows below.
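As flagged above, here is a minimal sketch of an EmrActivity with a step and the EmrCluster resource it runs on; the step JAR, arguments, instance types, and counts are illustrative placeholders rather than values from the course.

```python
# Illustrative EMR-based objects for a pipeline definition. The pipeline
# launches this cluster on every scheduled run, so size it with cost in mind.
emr_cluster = {
    "id": "ProcessingCluster", "type": "EmrCluster",
    "masterInstanceType": "m1.medium",
    "coreInstanceType": "m1.medium",
    "coreInstanceCount": "2",
    "terminateAfter": "2 Hours",          # avoid leaving the cluster running
    "schedule": {"ref": "DailySchedule"}
}

emr_activity = {
    "id": "RunHadoopStep", "type": "EmrActivity",
    # An EMR step is a comma-separated string: the JAR to run, then its arguments.
    "step": "s3://example-bucket/jars/wordcount.jar,"
            "s3://example-bucket/incoming/,s3://example-bucket/processed/",
    "runsOn": {"ref": "ProcessingCluster"},
    "schedule": {"ref": "DailySchedule"}
}
```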
Again, watch out for the cost implications of this compute. Once you have defined your pipeline, you can view and edit it in the Architect screen, which allows you to see the workflow of your activities in a graphical way, as well as edit each data node or activity parameter using the options on the right-hand side of your screen. The Architect is a great way to quickly see what activities your pipeline is executing, as well as to review or edit the parameters for each activity. For those of you who are familiar with traditional ETL-style tools, this flow-based visualization will be very familiar.
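Finally, everything shown in the console can also be scripted. Below is a minimal sketch using the boto3 SDK to create, define, and activate a pipeline; the pipeline name, role names, log location, and region are assumptions for illustration, and only the first two definition objects are shown.

```python
import boto3

# Hypothetical example of registering and activating a pipeline with boto3.
client = boto3.client("datapipeline", region_name="us-east-1")

# 1. Create an empty pipeline shell.
pipeline = client.create_pipeline(
    name="daily-s3-copy",
    uniqueId="daily-s3-copy-001",   # idempotency token
)
pipeline_id = pipeline["pipelineId"]

# 2. Attach a definition. The API takes objects as flat lists of key/value
#    fields; refValue is used where the JSON format would use {"ref": "..."}.
definition = client.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "cron"},
                {"key": "schedule", "refValue": "DailySchedule"},
                {"key": "pipelineLogUri", "stringValue": "s3://example-bucket/logs/"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            ],
        },
        {
            "id": "DailySchedule",
            "name": "DailySchedule",
            "fields": [
                {"key": "type", "stringValue": "Schedule"},
                {"key": "period", "stringValue": "1 day"},
                {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
            ],
        },
        # ... remaining data node, activity, precondition, and resource objects ...
    ],
)

# 3. Only activate if the definition validated cleanly.
if not definition["errored"]:
    client.activate_pipeline(pipelineId=pipeline_id)
else:
    print(definition["validationErrors"])
```

Note that the SDK's object format uses flat key/stringValue/refValue fields rather than the nested JSON shown earlier, but the objects and references themselves are the same.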
Shane has been immersed in the world of data, analytics, and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you to create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and kiwifruit. However, you're more likely to see him partake of a good long black or an even better craft beer.