AWS Data Pipeline
In course one of the AWS Big Data Specialty Data Collection learning path, we explain the various data collection methods and techniques for determining the operational characteristics of a collection system. We explore how to define a collection system able to handle the frequency of data change and the type of data being ingested. We identify how to enforce data properties such as order, data structure, and metadata, and how to ensure the durability and availability of our collection approach.
Intended audience: This course is intended for students looking to increase their knowledge of data collection methods and techniques with Big Data solutions.
While there are no formal prerequisites, students will benefit from having a basic understanding of the analytics services available in AWS. Recommended course: Analytics Fundamentals https://cloudacademy.com/amazon-web-services/analytics-fundamentals-for-aws-course/
Learning objectives:
- Recognize and explain the operational characteristics of a collection system.
- Recognize and explain how a collection system can be designed to handle the frequency of data change and type of data being ingested.
- Recognize and identify properties that may need to be enforced by a collection system.
This course includes:
- 45 minutes of high-definition videos
- Live hands-on demos
What You'll Learn:
- Introduction to Collecting Data: In this lesson we'll prepare you for what we'll be covering in the course: the Big Data collection services of AWS Data Pipeline, Amazon Kinesis, and AWS Snowball.
- Introduction to Data Pipeline: In this lesson we'll discuss the basics of Data Pipeline.
- AWS Data Pipeline Architecture: In this lesson we'll go into more detail about the architecture that underpins the AWS Data Pipeline Big Data Service.
- AWS Data Pipeline Core Concepts: In this lesson we'll discuss how we define data nodes, access, activities, schedules and resources.
- AWS Data Pipeline Reference Architecture: In this lesson we'll look at a real-life scenario of how AWS Data Pipeline can be used.
- Introduction to AWS Kinesis: In this lesson we'll take a top-level view of Kinesis and its uses.
- Kinesis Streams Architecture: In this lesson we'll look at the architecture that underpins Kinesis.
- Kinesis Streams Core Concepts: In this lesson we'll dig deeper into the data records.
- Kinesis Firehose Architecture: In this lesson we'll look at the Firehose architecture and the differences between it and Amazon Kinesis Streams.
- Firehose Core Concepts: Let's take a deeper look at some details about the Firehose service.
- Kinesis Wrap-Up: In this summary we'll look at the differences between Kinesis and Firehose.
- Introduction to Snowball: Overview of the Snowball Service.
- Snowball Architecture: Let's have a look at the architecture that underpins the AWS Snowball big data service.
- Snowball Core Concepts: In this lesson we'll look at the details of how Snowball is engineered to support data transfer.
- Snowball Wrap-Up: A brief summary of Snowball and our course.
Let's have a look at the core concepts that underpin the AWS Data Pipeline Big Data Service. A pipeline definition is how you communicate your business logic to AWS Data Pipeline. It contains the information for all the components required for your pipeline to execute successfully: the data nodes to access, the activities to perform, the schedule times to run, the resources on which to execute the activities, any preconditions required before running, and the ways in which you want to be alerted about the status of your pipeline.
From your pipeline definition, AWS Data Pipeline determines the tasks that will occur, schedules them, and assigns them to task runners. If a task is not completed successfully, AWS Data Pipeline retries the task according to your instructions and, if necessary, reassigns it to another task runner. If the task fails repeatedly, you can configure the pipeline to notify you. In AWS Data Pipeline, a data node defines the location and type of data that a pipeline activity uses as an input or an output. AWS Data Pipeline supports DynamoDB, SQL (including Amazon RDS), Redshift, and S3 data nodes. When you select one of the data nodes in your pipeline definition, you define the relevant parameters for connecting to that data source. The parameters required are, of course, different for each data source.
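To make this concrete, here is a minimal sketch of how an S3 data node might appear in a pipeline definition, built here as a Python dict mirroring the JSON definition format. The id, name, and bucket path are illustrative placeholders, not values from the course.

```python
# A minimal sketch of an S3 data node entry as it would appear in a
# pipeline definition's "objects" list. The id, name, and path below
# are illustrative placeholders.
s3_input_node = {
    "id": "MyS3Input",
    "type": "S3DataNode",
    "name": "MyS3Input",
    # Connection parameters differ per data-node type; for S3 the key
    # parameter is the directory (or file) path to read from.
    "directoryPath": "s3://example-bucket/input/",
}

# The full definition is a list of such objects.
pipeline_definition = {"objects": [s3_input_node]}
```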
Remember, data nodes are used for accessing both the source of the data and the target where the data will be loaded. AWS Data Pipeline supports accessing databases via JDBC, as well as Amazon RDS databases and Amazon Redshift. In AWS Data Pipeline, an activity is a pipeline component that defines the work to perform. AWS Data Pipeline provides several pre-packaged activities that accommodate common scenarios, such as moving data from one location to another or running Hive queries.
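As a hedged sketch, a pre-packaged activity such as CopyActivity wires an input data node to an output data node and names the resource that runs the work; all ids below are hypothetical.

```python
# Sketch of a pre-packaged CopyActivity in a pipeline definition.
# It references an input data node, an output data node, and the
# resource that executes the work. All ids are illustrative.
copy_activity = {
    "id": "MyCopyActivity",
    "type": "CopyActivity",
    "input": {"ref": "MyS3Input"},        # data node to read from
    "output": {"ref": "MyS3Output"},      # data node to write to
    "runsOn": {"ref": "MyEc2Resource"},   # EC2 resource running the copy
}
```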
Activities are extensible, so you can run your own custom scripts to support endless combinations. When you define your pipeline, you can choose to execute it on Activation or create a schedule to execute it on a regular basis. Schedules define when your pipeline activities run and the frequency with which these services expect your data to be available. All schedules must have a start date and a frequency.
For example, every day starting on January 1st, 2013 at 3:00 PM. Schedules can optionally have an end date, after which the AWS Data Pipeline service does not execute any activities. When you associate a schedule with an activity, the activity runs on that schedule. When you associate a schedule with a data source, you are telling the AWS Data Pipeline service that you expect the data to be updated on that schedule. For example, if you define an Amazon S3 data source with an hourly schedule, the service expects that the data source contains new files every hour. You can define multiple schedule objects in your pipeline definition file and associate the desired schedule with the correct activity via its schedule field.
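The daily-schedule example above might be sketched like this, again as a Python dict mirroring the JSON definition format; the ids are placeholders.

```python
# Sketch of a Schedule object for the example in the text: run every
# day, starting 1 January 2013 at 3:00 PM. The id is a placeholder.
daily_schedule = {
    "id": "MyDailySchedule",
    "type": "Schedule",
    "period": "1 day",
    "startDateTime": "2013-01-01T15:00:00",
    # An optional endDateTime would stop execution after that time.
}

# An activity opts in to the schedule via its schedule field:
activity = {
    "id": "MyCopyActivity",
    "type": "CopyActivity",
    "schedule": {"ref": "MyDailySchedule"},
}
```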
Task Runner is a task agent application that polls AWS Data Pipeline for scheduled tasks and executes them on EC2 instances, Amazon EMR clusters, or other computational resources. Task Runner is the default implementation provided by AWS Data Pipeline. When Task Runner is installed and configured, it polls AWS Data Pipeline for tasks associated with pipelines that you have activated. When a task is assigned to Task Runner, it performs the task and reports its status back to AWS Data Pipeline. The diagram illustrates how AWS Data Pipeline and a Task Runner interact to process a scheduled task.
A task is a discrete unit of work that the AWS Data Pipeline service shares with a Task Runner, though you can provide your own customized Task Runner capability if you wish. There are two ways you can use Task Runner to process your pipeline. One is to enable AWS Data Pipeline to install Task Runner for you on resources that are launched and managed by the AWS Data Pipeline web service.
The second is to install Task Runner on a computational resource that you manage, such as a long-running EC2 instance or an on-premises server. In AWS Data Pipeline, a precondition is a pipeline component containing conditional statements that must be true before an activity can run. For example, a precondition can check whether source data is present before a pipeline activity attempts to copy it.
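As a sketch of that example, a precondition checking that source data is present (here the pre-packaged S3KeyExists type) might look like the following; the ids and the key path are illustrative assumptions.

```python
# Sketch of an S3KeyExists precondition: the activity referencing it
# will not run until the given key is present. Names are illustrative.
precondition = {
    "id": "MyInputExists",
    "type": "S3KeyExists",
    "s3Key": "s3://example-bucket/input/ready.flag",
}

# The activity gates itself on the precondition via a reference:
gated_activity = {
    "id": "MyCopyActivity",
    "type": "CopyActivity",
    "precondition": {"ref": "MyInputExists"},
}
```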
AWS Data Pipeline provides several pre-packaged preconditions that accommodate common scenarios, such as whether a database table exists or whether an Amazon S3 key is present. However, preconditions are extensible and allow you to run your own custom scripts to support endless combinations. There are two types of preconditions: system-managed preconditions and user-managed preconditions. System-managed preconditions are run by the AWS Data Pipeline web service on your behalf and do not require a computational resource. User-managed preconditions only run on the computational resource that you have specified. You can create, access, and manage your pipelines using any of the following interfaces.
The AWS Management Console provides a drag-and-drop web interface that you can use to create and manage pipeline definitions. The AWS Command Line Interface provides commands for a broad set of AWS services, including AWS Data Pipeline, and is supported on Windows, Mac, and Linux. The AWS SDKs provide language-specific APIs and take care of many of the connection details, such as calculating signatures, handling request retries, and error handling. The Query API provides low-level APIs that you call using HTTPS requests. Using the Query API is the most direct way to access AWS Data Pipeline, but it requires that your application handle low-level details such as generating the hash to sign the request and error handling.
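As one hedged illustration of the SDK route (here the Python SDK, boto3 -- an assumption, since the course does not name a specific SDK), the typical create/define/activate sequence might look like this. The function takes the client as a parameter and is not invoked here, since a real call needs AWS credentials.

```python
# Hedged sketch: create a pipeline, push its definition, and activate
# it via an SDK client (assumed to be a boto3 "datapipeline" client).
def create_and_activate(client, name, definition_objects):
    """Create a pipeline, upload a definition, and activate it.

    definition_objects is the list of pipeline objects (data nodes,
    activities, schedules, etc.) in the SDK's expected shape.
    """
    created = client.create_pipeline(name=name, uniqueId=name)
    pipeline_id = created["pipelineId"]
    client.put_pipeline_definition(
        pipelineId=pipeline_id,
        pipelineObjects=definition_objects,
    )
    client.activate_pipeline(pipelineId=pipeline_id)
    return pipeline_id
```

In a real script you would build the client with `boto3.client("datapipeline")` and pass it in; injecting the client also makes the sequence easy to exercise against a stub.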
AWS Data Pipeline works with the following services to access and store data: Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3. AWS Data Pipeline works with the Amazon EC2 compute service and the Amazon EMR processing service as resources you can leverage to transform data. Templates are provided to enable you to easily use these as part of your pipeline definitions. There are a number of limits within the AWS Data Pipeline service that you need to be aware of. AWS Data Pipeline has both account limits and web service limits. AWS Data Pipeline imposes limits on the resources that you can allocate and the rate at which you can allocate resources. The displayed limits apply to a single AWS account. If you require additional capacity, you can use the standard Amazon process to increase the limits for your account where the limit is flagged as 'adjustable'.
As I mentioned, AWS Data Pipeline has both account limits and web service limits. AWS Data Pipeline limits the rate at which you can call the web service API. These limits also apply to AWS Data Pipeline agents that call the web service API on your behalf, such as the Console, the CLI, and the Task Runner. The limits apply to a single AWS account. There is also a burst rate, which lets you save up web service calls during periods of inactivity and expend them all in a short amount of time.
For example, CreatePipeline has a regular rate of one call per second. If you don't call the service for 30 seconds, you will have 30 calls saved up. You could then call the web service 31 times in a single second; because this is below the burst limit and keeps your average calls at the regular rate limit, your calls are not throttled. If you exceed the rate limit or the burst limit, your web service call fails and returns a throttling exception. AWS Data Pipeline scales to accommodate a huge number of concurrent tasks, and you can configure it to automatically create the resources necessary to handle large workloads.
These automatically created resources are under your control and count against your AWS account resource limits for other services. For example, if you configure AWS Data Pipeline to automatically create a 20-node Amazon EMR cluster to process data and your AWS account has an EC2 instance limit set to 20, you may inadvertently exhaust your available backfill resources. As a result, consider these resource restrictions in your design, or increase your account's limits accordingly.
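The burst-rate arithmetic from the throttling discussion above can be sketched as a simple token bucket; the burst limit of 100 is an illustrative assumption, not a figure from the course.

```python
# Minimal token-bucket sketch of the burst behaviour described above:
# a regular rate of 1 call per second, with unused capacity saved up
# to a cap. The burst limit of 100 is an illustrative assumption.
REGULAR_RATE = 1     # calls accrued per second
BURST_LIMIT = 100    # maximum saved-up calls (assumed)

def available_calls(idle_seconds, saved=0):
    """Calls available after idling, capped at the burst limit."""
    return min(saved + idle_seconds * REGULAR_RATE, BURST_LIMIT)

# 30 idle seconds save up 30 calls; with the 1 call accrued in the
# second you start calling, 31 calls go through without throttling.
saved = available_calls(30)            # 30
calls_in_next_second = saved + REGULAR_RATE  # 31
```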
About the Author
Shane has been immersed in the world of data, analytics, and business intelligence for over 20 years. For the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you to create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth: Wellington, New Zealand, a place famous for Hobbits and kiwifruit. However, you're more likely to see him partake of a good long black or an even better craft beer.