In course one of the AWS Big Data Specialty Data Collection learning path, we explain the various data collection methods and techniques for determining the operational characteristics of a collection system. We explore how to define a collection system able to handle the frequency of data change and the type of data being ingested. We identify how to enforce data properties such as order, data structure, and metadata, and how to ensure the durability and availability of our collection approach.
Learning Objectives
- Recognize and explain the operational characteristics of a collection system.
- Recognize and explain how a collection system can be designed to handle the frequency of data change and the type of data being ingested.
- Recognize and identify properties that may need to be enforced by a collection system.
Intended Audience
This course is intended for students looking to increase their knowledge of data collection methods and techniques with big data solutions.
Prerequisites
While there are no formal prerequisites, students will benefit from having a basic understanding of the analytics services available in AWS. Please take a look at our Analytics Fundamentals for AWS course.
This Course Includes
- 45 minutes of high-definition videos
- Live hands-on demos
What You'll Learn
- Introduction to Collecting Data: In this lesson, we'll prepare you for what we'll be covering in the course: the Big Data collection services of AWS Data Pipeline, Amazon Kinesis, and AWS Snowball.
- Introduction to Data Pipeline: In this lesson, we'll discuss the basics of Data Pipeline.
- AWS Data Pipeline Architecture: In this lesson, we'll go into more detail about the architecture that underpins the AWS Data Pipeline Big Data Service.
- AWS Data Pipeline Core Concepts: In this lesson, we'll discuss how we define data nodes, access, activities, schedules, and resources.
- AWS Data Pipeline Reference Architecture: In this lesson, we'll look at a real-life scenario of how AWS Data Pipeline can be used.
- Introduction to AWS Kinesis: In this lesson, we'll take a top-level view of Kinesis and its uses.
- Kinesis Streams Architecture: In this lesson, we'll look at the architecture that underpins Kinesis.
- Kinesis Streams Core Concepts: In this lesson, we'll dig deeper into the data records.
- Kinesis Streams Firehose Architecture: In this lesson, we'll look at the Firehose architecture and the differences between it and Amazon Kinesis Streams.
- Firehose Core Concepts: Let's take a deeper look at some details about the Firehose service.
- Kinesis Wrap-Up: In this summary, we'll look at the differences between Kinesis and Firehose.
- Introduction to Snowball: Overview of the Snowball Service.
- Snowball Architecture: Let's have a look at the architecture that underpins the AWS Snowball big data service.
- Snowball Core Concepts: In this lesson, we'll look at the details of how Snowball is engineered to support data transfer.
- Snowball Wrap-Up: A brief summary of Snowball and our course.
Okay, let's have a look at the architecture that underpins the Amazon Kinesis Firehose big data service. While still under the Kinesis moniker, the Amazon Kinesis Firehose architecture is different from that of Amazon Kinesis Streams. Amazon Kinesis Firehose is still based on a platform-as-a-service style architecture.
When you determine the throughput capacity you require, the architectural components are automatically provisioned and configured for you. You have no need, or ability, to change the way these architectural components are deployed. Amazon Kinesis Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service (S3), Amazon Redshift, or Amazon Elasticsearch Service.
With Kinesis Firehose, you do not need to write applications or manage resources. You configure your data producers to send data to Kinesis Firehose, and it automatically delivers the data to the destination that you specify. You can also configure Amazon Kinesis Firehose to transform your data before delivery. Unlike some of the other Amazon big data services, which have a container that the service sits within (for example, a DB instance with Amazon RDS), Amazon Kinesis Firehose doesn't.
The container is effectively the combination of the account and the region you provision the Kinesis delivery streams within. The delivery stream is the underlying entity of Kinesis Firehose. You use Kinesis Firehose by creating a Kinesis Firehose delivery stream and then sending data to it, which means each delivery stream is effectively defined by the target system that receives the streamed data. Data producers send records to Kinesis Firehose delivery streams. For example, a web service sending log data to a Kinesis Firehose delivery stream is a data producer.
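To make the producer side concrete, here is a minimal boto3 sketch of sending a single record to an existing delivery stream. The stream name and log payload are hypothetical examples, not values from the course.

```python
import json
import boto3

# Assumes credentials are configured and a delivery stream named
# "web-log-stream" (hypothetical) already exists in this region.
firehose = boto3.client("firehose", region_name="us-east-1")

log_event = {"path": "/index.html", "status": 200, "latency_ms": 42}

# Firehose expects the record payload as bytes; a trailing newline keeps
# records separated when they are concatenated into objects at the destination.
firehose.put_record(
    DeliveryStreamName="web-log-stream",
    Record={"Data": (json.dumps(log_event) + "\n").encode("utf-8")},
)
```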
Each delivery stream stores data records for up to 24 hours in case the delivery destination is unavailable. The Kinesis Firehose destination is the data store where the data will be delivered. Amazon Kinesis Firehose currently supports Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service as destinations. Within the delivery stream is a data flow, which is effectively the transfer and load process. The data flow is predetermined based on which target data store you configure your delivery stream to load data into.
So, for example, if you are loading into Amazon Redshift, the data flow defines the process of landing the data in an S3 bucket and then invoking the COPY command to load the Redshift table. Kinesis Firehose can also invoke an AWS Lambda function to transform incoming data before delivering it to the selected destination. You can configure a new Lambda function using one of the Lambda blueprints AWS provides, or choose an existing Lambda function. Before we go into each of the options in detail, let's have a quick look at how AWS makes things easier for you.
One of the great things about AWS is that they always try to make things easy for you. So when you go to create a new Amazon Kinesis Firehose definition in the console, there are a number of pre-built destinations that help you stream data into an AWS big data storage solution. As you can see, you can select one of the three data services currently available as a target: S3, Redshift, or Elasticsearch. Selecting one of these destinations will present additional parameter options for you to complete to assist in creating the data flow.
If we choose Amazon S3 as the destination, the relevant parameters are displayed to be completed. If we choose Amazon Redshift as the destination, you can see we get a different set of parameters, as you would expect. Note that we are required to define both an S3 bucket and a Redshift target database in this scenario, as Amazon Kinesis Firehose leverages the Amazon Redshift COPY capability to load the data. Going back to the Amazon S3 scenario, we also have the ability to invoke an AWS Lambda function as part of the loading process to transform the data on the way through.
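When such a transform is enabled, Firehose hands batches of records to the Lambda function and expects each record back with a status. The sketch below is a minimal example of that documented record contract; it assumes the incoming records are JSON, and the enrichment step it performs is purely hypothetical.

```python
import base64
import json

def lambda_handler(event, context):
    """Minimal Firehose transformation sketch: each incoming record carries a
    recordId and base64-encoded data, and each returned record must report a
    result of "Ok", "Dropped", or "ProcessingFailed"."""
    output = []
    for record in event["records"]:
        # Assumes the producer sent JSON; adjust parsing for other formats.
        payload = json.loads(base64.b64decode(record["data"]))
        payload["transformed"] = True  # hypothetical enrichment step

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```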
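If you prefer to script the setup rather than use the console, the same destination parameters can be supplied through the Firehose API. The following boto3 sketch creates a delivery stream with an S3 destination; the stream name, bucket ARN, IAM role ARN, and buffering values are placeholders, not values from the course.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="web-log-stream",
    DeliveryStreamType="DirectPut",  # producers call PutRecord directly
    ExtendedS3DestinationConfiguration={
        # Role that grants Firehose permission to write to the bucket.
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::example-firehose-landing-bucket",
        "Prefix": "web-logs/",
        # Deliver whenever 5 MB accumulates or 300 seconds pass.
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)
```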
Shane has been immersed in the world of data, analytics, and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and Kiwifruit. However, you're more likely to see him partake of a good long black or an even better craft beer.