AWS Data Pipeline
In course one of the AWS Big Data Specialty Data Collection learning path, we explain the various data collection methods and techniques for determining the operational characteristics of a collection system. We explore how to define a collection system able to handle the frequency of data change and the type of data being ingested. We identify how to enforce data properties such as order, data structure, and metadata, and how to ensure the durability and availability of our collection approach.
- Recognize and explain the operational characteristics of a collection system.
- Recognize and explain how a collection system can be designed to handle the frequency of data change and the type of data being ingested.
- Recognize and identify properties that may need to be enforced by a collection system.
This course is intended for students looking to increase their knowledge of data collection methods and techniques with big data solutions.
While there are no formal prerequisites, students will benefit from having a basic understanding of the analytics services available in AWS. Please take a look at our Analytics Fundamentals for AWS course.
This Course Includes
- 45 minutes of high-definition videos
- Live hands-on demos
What You'll Learn
- Introduction to Collecting Data: In this lesson, we'll prepare you for what we'll be covering in the course: the Big Data collection services AWS Data Pipeline, Amazon Kinesis, and AWS Snowball.
- Introduction to Data Pipeline: In this lesson, we'll discuss the basics of Data Pipeline.
- AWS Data Pipeline Architecture: In this lesson, we'll go into more detail about the architecture that underpins the AWS Data Pipeline Big Data Service.
- AWS Data Pipeline Core Concepts: In this lesson, we'll discuss how we define data nodes, access, activities, schedules, and resources.
- AWS Data Pipeline Reference Architecture: In this lesson, we'll look at a real-life scenario of how data pipeline can be used.
- Introduction to AWS Kinesis: In this lesson, we'll take a top-level view of Kinesis and its uses.
- Kinesis Streams Architecture: In this lesson, we'll look at the architecture that underpins Kinesis.
- Kinesis Streams Core Concepts: In this lesson, we'll dig deeper into the data records.
- Kinesis Streams Firehose Architecture: In this lesson, we'll look at the Firehose architecture and the differences between it and Amazon Kinesis Streams.
- Firehose Core Concepts: Let's take a deeper look at some details about the Firehose service.
- Kinesis Wrap-Up: In this summary, we'll look at the differences between Kinesis and Firehose.
- Introduction to Snowball: Overview of the Snowball Service.
- Snowball Architecture: Let's have a look at the architecture that underpins the AWS Snowball big data service.
- Snowball Core Concepts: In this lesson, we'll look at the details of how Snowball is engineered to support data transfer.
- Snowball Wrap-Up: A brief summary of Snowball and our course.
Okay, let's have a look at the core concepts that underpin the AWS Snowball big data service. AWS Snowball has been engineered to support data transfer over RJ45, SFP+ Copper, or SFP+ Optical 10-gigabit Ethernet. The Snowball client is software that you install on a local host computer and use to efficiently identify, compress, encrypt, and transfer data from the directories you specify to the Snowball appliance.
The Snowball client is a standalone terminal application that you run on your local workstation to do your data transfer. You don't need any coding knowledge to use the Snowball client. It provides all the functionality you need to transfer data, including handling errors and writing logs to your local workstation for troubleshooting and auditing. The workstation is the computer, server, or virtual machine that hosts your mounted data source. You'll connect the Snowball to this workstation to transfer your data. Because the workstation is considered the bottleneck for transferring data between the Snowball and the data source, we highly recommend that your workstation be a powerful computer, able to meet high demands for processing, memory, and networking.
AWS recommends that your workstation be a computer dedicated to the task of running the Snowball client or the Amazon S3 Adapter for Snowball while you are transferring data. Each instance of the client or the adapter requires up to seven gigabytes of dedicated RAM for memory-intensive tasks, such as performing encryption. AWS Snowball Edge is a hundred-terabyte data transfer device with onboard storage and compute capabilities. You can use Snowball Edge to move large amounts of data into and out of AWS, as a temporary storage tier for large local data sets, or to support local workloads in remote or offline locations.
AWS Snowball Edge appliances have about 73 terabytes of usable space. The Snowball Edge connects to your existing applications and infrastructure using standard storage interfaces, streamlining the data transfer process and minimizing set-up and integration. Snowball Edge appliances can be clustered together to form a local storage tier and process your data on premises, helping to ensure your applications continue to run even when they're not able to access the cloud. Clusters have anywhere from five to ten AWS Snowball Edge appliances, called nodes. When you receive the nodes from your regional carrier, you will need to choose one node as the leader node and the others as secondary nodes. This choice is up to you.
The total available storage is 45 terabytes per node in the cluster. Thus, a five-node cluster has 225 terabytes of available storage space. In contrast, there's a hundred terabytes of available storage in a standalone Snowball Edge. Clusters with more than five nodes have even more storage space. A quorum represents the minimum number of Snowball Edge devices in a cluster that must be communicating with each other to maintain some level of operation.
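As a quick sanity check on those figures, clustered capacity scales linearly at 45 terabytes per node. A minimal sketch using the numbers quoted in this course (the function name is mine, not part of any AWS tooling):

```python
def cluster_capacity_tb(nodes, per_node_tb=45):
    """Usable storage for a Snowball Edge cluster, using the per-node
    figure quoted in this course (45 TB per clustered node)."""
    if not 5 <= nodes <= 10:
        raise ValueError("Snowball Edge clusters have five to ten nodes")
    return nodes * per_node_tb

print(cluster_capacity_tb(5))   # five-node cluster -> 225
print(cluster_capacity_tb(10))  # ten-node cluster -> 450
```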
There are two levels of quorum for Snowball Edge clusters: a read/write quorum and a read quorum. Let's say you've uploaded your data to a cluster of Snowball Edge devices. With all devices healthy, you have a read/write quorum in your cluster. If one node goes offline, you have reduced the operational capacity of the cluster; however, you can still read and write to the cluster. In that sense, with all but one node operating, the cluster still has a read/write quorum. In this same example, if an external power failure took out two of the nodes in your cluster, any additional or ongoing write operations would fail.
But any data that was successfully written to the cluster can still be accessed and read. This situation is called a read quorum. You can write data to an unlocked cluster by using the Amazon S3 Adapter for Snowball or the NFS mount point through the leader node, and it will distribute the data amongst the other nodes. As a best practice for large data transfers involving multiple jobs, AWS recommends that you separate your data into a number of smaller, manageable data transfer segments.
If you separate the data this way, you can transfer each segment one at a time, or multiple segments in parallel. For example, you can make nine segments of 10 terabytes each for an AWS Snowball Edge appliance. When you're done with your cluster, you'll ship all nodes back to AWS. Once they receive the returned cluster nodes, they will perform a complete erasure of each Snowball appliance. This erasure follows the National Institute of Standards and Technology (NIST) 800-88 standard.
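Planning those segments is simple arithmetic. A small illustrative helper (sizes in terabytes; this is my sketch, not part of the Snowball tooling):

```python
import math

def plan_segments(total_tb, segment_tb=10):
    """Split a transfer of total_tb into segment_tb-sized chunks,
    returning (start, end) offsets in terabytes."""
    count = math.ceil(total_tb / segment_tb)
    return [(i * segment_tb, min((i + 1) * segment_tb, total_tb))
            for i in range(count)]

# 90 TB split into 10 TB segments gives the nine segments mentioned above.
print(len(plan_segments(90)))  # -> 9
```

Each segment can then be run as its own job, serially or in parallel, which also keeps any retry after a failure small.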
The Snowball client is a standalone terminal application that you run on your local server to unlock the appliance and get credentials, logs, and status information. You can also use the client for administrative tasks for a cluster. The Amazon S3 Adapter for Snowball allows you to programmatically transfer data to and from the AWS Snowball Edge appliance using Amazon S3 REST API actions. This Amazon S3 REST API support is limited to a subset of actions. You can use this subset of actions with one of the AWS SDKs to transfer data programmatically.
You can also use the subset of supported AWS Command Line Interface (AWS CLI) commands for Amazon S3 to transfer data programmatically. Using the file interface, you can drag and drop files from your computer onto Amazon S3 buckets on the Snowball Edge. The file interface exposes a network file system, or NFS, mount point for each bucket on your AWS Snowball Edge appliance. You can mount the file share from your NFS client using standard Linux, Microsoft Windows, or Mac commands.
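Because the file interface is just an NFS share, moving data onto the appliance needs nothing more than standard file operations. A minimal sketch — the real mount point would be the NFS path exposed by the Edge, so a temporary directory stands in for it here:

```python
import pathlib
import shutil
import tempfile

# In practice this would be the NFS mount point exposed by the Snowball
# Edge file interface (one per bucket); a temp dir stands in for it here.
workdir = pathlib.Path(tempfile.mkdtemp())
mount = workdir / "snowball-bucket"
mount.mkdir()

src = workdir / "report.csv"
src.write_text("id,value\n1,42\n")   # sample file to transfer

shutil.copy2(src, mount / src.name)  # an ordinary copy; no SDK involved
print(sorted(p.name for p in mount.iterdir()))  # -> ['report.csv']
```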
You can use standard file operations to access the file share. The local compute functionality is AWS Lambda powered by AWS Greengrass, which can automatically run Python code in response to Amazon S3 PUT object API calls to the AWS Snowball Edge appliance. These API calls are made through the Amazon S3 Adapter for Snowball. There are a number of limits within the AWS Snowball service you need to be aware of. The primary limit is that AWS Snowball appliances are available in only a small number of regions across the Amazon network. Those regions are listed in the table on the screen.
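The local compute path described above boils down to a Lambda handler fired on S3 PUTs. A hedged sketch — the handler name is mine, and I'm assuming the event mirrors the standard Amazon S3 notification shape:

```python
# Hypothetical handler for the Greengrass-powered Lambda on a Snowball Edge.
# Assumes the event mirrors the standard Amazon S3 notification format.
def handler(event, context=None):
    keys = [record["s3"]["object"]["key"]
            for record in event.get("Records", [])]
    # Real code would process each object here (resize, index, etc.).
    return {"processed": keys}

sample_event = {"Records": [{"s3": {"object": {"key": "photos/img001.jpg"}}}]}
print(handler(sample_event))  # -> {'processed': ['photos/img001.jpg']}
```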
Also note that Snowball doesn't support international shipping, so if you're based in New Zealand like I am, then you're pretty much out of luck. The last thing to know is that, for security purposes, data transfers must be completed within 90 days of the Snowball being prepared.
Shane has been immersed in the world of data, analytics, and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased cost. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and Kiwifruit. However, you're more likely to see him partake of a good long black or an even better craft beer.