The course is part of this learning path
AWS Data Pipeline
In course one of the AWS Big Data Specialty Data Collection learning path, we explain the various data collection methods and the techniques for determining the operational characteristics of a collection system. We explore how to define a collection system that can handle the frequency of data change and the type of data being ingested. We identify how to enforce data properties such as order, data structure, and metadata, and how to ensure the durability and availability of our collection approach.
Intended audience: This course is intended for students looking to increase their knowledge of data collection methods and techniques with Big Data solutions.
While there are no formal prerequisites, students will benefit from having a basic understanding of the analytics services available in AWS. Recommended courses - Analytics Fundamentals https://cloudacademy.com/amazon-web-services/analytics-fundamentals-for-aws-course/
- Recognize and explain the operational characteristics of a collection system.
- Recognize and explain how a collection system can be designed to handle the frequency of data change and type of data being ingested.
- Recognize and identify properties that may need to be enforced by a collection system.
This course includes:
- 45 minutes of high-definition videos
- Live hands-on demos
What You'll Learn:
- Introduction to Collecting Data: In this lesson we'll prepare you for what we'll be covering in the course: the Big Data collection services of AWS Data Pipeline, Amazon Kinesis, and AWS Snowball.
- Introduction to Data Pipeline: In this lesson we'll discuss the basics of Data Pipeline.
- AWS Data Pipeline Architecture: In this lesson we'll go into more detail about the architecture that underpins the AWS Data Pipeline Big Data Service.
- AWS Data Pipeline Core Concepts: In this lesson we'll discuss how we define data nodes, access, activities, schedules and resources.
- AWS Data Pipeline Reference Architecture: In this lesson we'll look at a real-life scenario of how Data Pipeline can be used.
- Introduction to AWS Kinesis: In this lesson we'll take a top-level view of Kinesis and its uses.
- Kinesis Streams Architecture: In this lesson we'll look at the architecture that underpins Kinesis.
- Kinesis Streams Core Concepts: In this lesson we'll dig deeper into the data records.
- Kinesis Streams Firehose Architecture: In this lesson we'll look at the Firehose architecture and the differences between it and Amazon Kinesis Streams.
- Firehose Core Concepts: Let's take a deeper look at some details about the Firehose service.
- Kinesis Wrap-Up: In this summary we'll look at the differences between Kinesis and Firehose.
- Introduction to Snowball: Overview of the Snowball Service.
- Snowball Architecture: Let's have a look at the architecture that underpins the AWS Snowball big data service.
- Snowball Core Concepts: In this lesson we'll look at the details of how Snowball is engineered to support data transfer.
- Snowball Wrap-Up: A brief summary of Snowball and our course.
Okay, let's have a look at the architecture that underpins the Amazon Kinesis Streams big data service. Amazon Kinesis Streams is based on a platform-as-a-service style architecture: you determine the throughput capacity you require, and the architecture and components are automatically provisioned, installed, and configured for you. You have no need or ability to change the way these architectural components are deployed. Some of the other Amazon big data services have a container that the service sits within, for example a DB instance with Amazon RDS. Amazon Kinesis doesn't.
The container is effectively the combination of the account and the region you are provisioning the Kinesis stream within.
An Amazon Kinesis Stream is an ordered sequence of data records. A record is the unit of data in an Amazon Kinesis Stream. Each record in the stream is composed of a sequence number, a partition key, and a data blob. A data blob is the data of interest your data producer adds to a stream.
The data records in the stream are distributed into shards. A shard is the base throughput unit of an Amazon Kinesis Stream. One shard provides a capacity of one megabyte per second of data input and two megabytes per second of data output, and can support up to 1,000 PUT records per second. You specify the number of shards needed when you create a stream. The data capacity of your stream is a function of the number of shards that you specify for that stream: the total capacity of the stream is the sum of the capacity of its shards. If your data rate changes, you can increase or decrease the number of shards allocated to your stream.
The producers continuously push data to Kinesis Streams, and the consumers process the data in real time.
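Since the capacity of a stream is the sum of the capacity of its shards, the shard count for a given workload can be estimated with simple arithmetic. The following is a minimal sketch in Python using only the per-shard figures quoted above; the function names `shards_needed` and `stream_capacity` are illustrative and are not part of any AWS library.

```python
import math

# Per-shard limits as quoted in this lesson.
SHARD_WRITE_MB_PER_SEC = 1.0         # data input
SHARD_READ_MB_PER_SEC = 2.0          # data output
SHARD_WRITE_RECORDS_PER_SEC = 1000   # PUT records


def shards_needed(write_mb_per_sec, read_mb_per_sec, records_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write = math.ceil(write_mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_read = math.ceil(read_mb_per_sec / SHARD_READ_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_write, by_read, by_records)


def stream_capacity(shard_count):
    """Total stream capacity: the sum of the capacity of its shards."""
    return {
        "write_mb_per_sec": shard_count * SHARD_WRITE_MB_PER_SEC,
        "read_mb_per_sec": shard_count * SHARD_READ_MB_PER_SEC,
        "write_records_per_sec": shard_count * SHARD_WRITE_RECORDS_PER_SEC,
    }


# Hypothetical workload: 3.5 MB/s in, 4 MB/s out, 2,500 records/s.
print(shards_needed(3.5, 4.0, 2500))
print(stream_capacity(4))
```

In this hypothetical workload the write throughput is the binding constraint, so it alone determines the shard count; the read and record-rate limits are satisfied with room to spare.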
For example, a web service sending log data to a stream is a producer. Consumers receive records from Amazon Kinesis Streams and process them. These consumers are known as Amazon Kinesis Streams applications. Consumers can store their results using an AWS service, such as Amazon DynamoDB, Amazon Redshift, or Amazon S3. An Amazon Kinesis application is a data consumer that reads and processes data from an Amazon Kinesis Stream, and typically runs on a fleet of EC2 instances. You need to build your applications using either the Amazon Kinesis API or the Amazon Kinesis Client Library (KCL).
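To make the producer and consumer roles concrete, here is a toy in-memory model in Python. It is a sketch only and does not use the Kinesis API or the KCL: `ToyStream`, `produce_logs`, and `consume` are hypothetical names, and the single integer counter is a simplification of how a real stream assigns sequence numbers across shards.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Record:
    """The three parts of a record described in this lesson."""
    sequence_number: int
    partition_key: str
    data: bytes  # the data blob


class ToyStream:
    """In-memory stand-in for a stream; NOT the Kinesis API."""

    def __init__(self):
        self._records = []
        self._next_seq = count(1)  # simplified sequence numbering

    def put(self, partition_key, data):
        record = Record(next(self._next_seq), partition_key, data)
        self._records.append(record)
        return record.sequence_number

    def read_after(self, sequence_number=0):
        """Return records, in order, that follow the given sequence number."""
        return [r for r in self._records if r.sequence_number > sequence_number]


def produce_logs(stream, host, lines):
    """Producer: a web service pushing its log lines to the stream."""
    for line in lines:
        stream.put(partition_key=host, data=line.encode("utf-8"))


def consume(stream, checkpoint=0):
    """Consumer: count records per partition key after the checkpoint."""
    counts = {}
    for record in stream.read_after(checkpoint):
        counts[record.partition_key] = counts.get(record.partition_key, 0) + 1
        checkpoint = record.sequence_number
    return counts, checkpoint


stream = ToyStream()
produce_logs(stream, "web-1", ["GET /", "GET /about"])
produce_logs(stream, "web-2", ["POST /login"])
counts, checkpoint = consume(stream)
print(counts, checkpoint)
```

The consumer returns a checkpoint so that a later call only sees records it has not yet processed. A real consumer would then store its results in a service such as Amazon DynamoDB, Amazon Redshift, or Amazon S3.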
Before we go into each option in detail, let's have a quick look at how AWS makes things easier for you. When you create a new Amazon Kinesis Stream definition in the console, there are only a couple of simple parameters to complete: we just need to enter a stream name and the number of shards, and then we are ready to go.
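The same two parameters are all that is needed when creating a stream programmatically. The sketch below is a thin Python wrapper that assumes it is handed a boto3 Kinesis client, whose `create_stream` call takes `StreamName` and `ShardCount`. The wrapper name `create_stream`, the validation, and the example stream name are illustrative assumptions, not something shown in the lesson.

```python
def create_stream(client, stream_name, shard_count):
    """Create a stream from the two parameters the console asks for.

    `client` is assumed to be a boto3 Kinesis client, for example:
        import boto3
        client = boto3.client("kinesis")
    """
    if not stream_name:
        raise ValueError("stream_name must not be empty")
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # Only a name and a shard count are required.
    client.create_stream(StreamName=stream_name, ShardCount=shard_count)


# Example call (requires AWS credentials and the boto3 package):
#   create_stream(boto3.client("kinesis"), "example-log-stream", 2)
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object, without touching a real AWS account.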
About the Author
Shane has been immersed in the world of data, analytics and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you to create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and Kiwifruit. However, you're more likely to see him partake of a good long black or an even better craft beer.