
Adding DataPipeline to the mix



Data management is a key part of the infrastructure of most organizations, especially those dealing with large data stores. For example, imagine a team involved in scientific analysis of data: they probably require one system to store the raw data, another to analyze chunks of data quickly and cost-efficiently, and long-term archival storage to keep both the raw data and the results of their computations. In cases like that, it's important to deploy an automated system that can move data efficiently with integrated automatic backups.

In this course, experienced System Administrator and Cloud Expert David Clinton will talk about implementing such a data management and backup system using EBS, S3, and Glacier, taking advantage of the S3 Lifecycle feature and of Data Pipeline to automate data transfers among the various pieces of the infrastructure. This system can be enabled easily and cheaply, as is shown in the last lecture of the course.

Who should take this course

As this is a beginner-to-intermediate course, some basic knowledge of AWS is expected. A basic knowledge of programming is also needed to follow along with the Glacier lecture. In any case, even those who are complete newcomers to these topics should be able to grasp at least the key concepts.

If you want to learn more about the AWS solutions discussed in this course, you might want to check our other AWS courses. Also, if you want to test your knowledge of the basic topics covered in this course, we strongly suggest taking our AWS questions. You will learn more about every single service cited in this course.

If you have thoughts or suggestions for this course, please contact Cloud Academy at support@cloudacademy.com.


Hi, and welcome to CloudAcademy.com's video series on Data Management. In this video we're going to explore backing up data resources using the AWS Data Pipeline service. Let's create a new pipeline.

We'll call it mybackups and leave the description blank for now. We'll only change one detail from the default, and that is we'll disable logging, simply for simplicity's sake.

Now the console gives us a graphic representation of the elements of this pipeline. First of all, let's add an activity.

The configuration for this default activity icon can be found in the panel on the right side. There we could rename it, but we'll leave it the way it is for now.

Very importantly, we'll select an activity type. In this case it's a copy activity, since we're going to be backing up some data.

We'll need to copy from an input file to an output location, an output address. The input file will be a new data node. The output address will also be a new data node, and both have now been added to our console. Let's just separate them. For some reason the console always leaves them piled one on top of the other. So we now have data node one, which will be the input to the activity, which will then output the contents to data node two. Now in data node one, let's select a type.

It's going to be an S3 data node because we're going to be copying data from S3. The only field we have to add for our simple example is a file path, so that Data Pipeline knows where this data is. We're going to enter the address of our S3 bucket and the data file that we intend to back up. Just for simplicity's sake I've copied and pasted it, because it is a long address and it's kind of hard for you to see exactly what it looks like. It starts with s3:// and then the Elastic Beanstalk address that you're given, followed at the end by a slash and then the word information, which happens to be the name of the file I'd like to copy. We're not going to add anything else to default data node one. We will, however, have to add some information to default data node two.

First of all, what type is this? Again, an S3 data node, because we're simply copying. We're not doing anything to this data, just copying it from one place to another.

We'll again add the file path where this data will go. Once again, for simplicity's sake, I've copied and pasted the information just because it is a lot of typing. And that's it for our data nodes.
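What we've just clicked together in the console can also be written down as a pipeline definition. The following is only a rough sketch of how the two data nodes and the copy activity might look in that form. The bucket names and the object ids are hypothetical placeholders, since the real addresses aren't readable in the video.

```python
import json

# Hypothetical S3 paths: the video pastes in its own long bucket address.
INPUT_PATH = "s3://example-source-bucket/information"
OUTPUT_PATH = "s3://example-backup-bucket/information"


def s3_data_node(node_id, file_path):
    """Build an S3DataNode object that points at a single file."""
    return {"id": node_id, "name": node_id,
            "type": "S3DataNode", "filePath": file_path}


def copy_activity(activity_id, input_id, output_id):
    """Build a CopyActivity that reads one data node and writes another."""
    return {"id": activity_id, "name": activity_id, "type": "CopyActivity",
            "input": {"ref": input_id},     # references are written as {"ref": id}
            "output": {"ref": output_id}}


definition = {"objects": [
    s3_data_node("DataNode1", INPUT_PATH),
    s3_data_node("DataNode2", OUTPUT_PATH),
    copy_activity("CopyData", "DataNode1", "DataNode2"),
]}

print(json.dumps(definition, indent=2))
```

The point of the sketch is the shape: each box in the console becomes one object, and the arrows between boxes become ref entries.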

Next we'll have to create a schedule. The name of the schedule, let's say, will be myschedule. The type is schedule; that's the only option we're given. It will start at the first activation date.

The period will be every one day. Now let's go back to activities for a moment and add two more fields. The first will be runs on, where we will select default resource one. The second new field will be on success, where we will select default action one.

Let's now open resources. All we have to change in the resources definition is the type. We'll move that to EC2 resource.

Now let's click on others. We'll look at default action one and select type SNS alarm, which will send a message to our account when the activity succeeds. A similar action attached to an on fail field would cover failures.
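To see how these pieces hang together, here is a minimal sketch of the same wiring as definition objects, plus a small helper that looks for references pointing at objects that don't exist. The ids are hypothetical, the helper find_dangling_refs is our own and not part of any AWS tool, and the action is left as a stub because its remaining fields are filled in during the next step.

```python
# Hypothetical ids standing in for the console's default object names.
objects = [
    {"id": "MySchedule", "name": "myschedule", "type": "Schedule",
     "period": "1 day", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
    {"id": "Resource1", "name": "Resource1", "type": "Ec2Resource",
     "schedule": {"ref": "MySchedule"}},
    # Stub only: subject, message and topic ARN are added in the next step.
    {"id": "Action1", "name": "Action1", "type": "SnsAlarm"},
    {"id": "DataNode1", "name": "DataNode1", "type": "S3DataNode"},
    {"id": "DataNode2", "name": "DataNode2", "type": "S3DataNode"},
    {"id": "CopyData", "name": "CopyData", "type": "CopyActivity",
     "input": {"ref": "DataNode1"}, "output": {"ref": "DataNode2"},
     "schedule": {"ref": "MySchedule"},
     "runsOn": {"ref": "Resource1"},      # the "runs on" field
     "onSuccess": {"ref": "Action1"}},    # the "on success" field
]


def find_dangling_refs(objs):
    """Return (object id, field, target) for every ref to an unknown id."""
    known = {obj["id"] for obj in objs}
    missing = []
    for obj in objs:
        for field, value in obj.items():
            if isinstance(value, dict) and "ref" in value:
                if value["ref"] not in known:
                    missing.append((obj["id"], field, value["ref"]))
    return missing


print(find_dangling_refs(objects))
```

A check like this catches a mistyped id before the definition ever reaches the service. It does not replace the validation that Data Pipeline itself runs when a pipeline is saved.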

We'll have to add some descriptive message. Let's say "Pipeline activity", and a subject, say, "Alert". These are just to make it easier for us to identify what this message is referring to. The rather complex element of this is the topic ARN. To create a topic ARN, go down in the Amazon Web Services Dashboard to mobile services, and then SNS, click on SNS. Let's make sure up in the top right corner that the region we're currently set to is the region where the rest of our services are being hosted, Eastern United States. And in this case that's fine.

Let's click on create new topic. Let's give the topic a name and a display name, Pipeline activity perhaps, and create the topic.

This is the topic ARN that Pipeline is looking for to populate that field. Let's come back into the Pipeline panel, and paste that topic ARN into the topic ARN field. We appear to be pretty much ready to go. I'm now going to click on save Pipeline. It will obviously save our configuration, but more importantly it will alert us to any errors or warnings that are associated with our configuration. And to be honest, I don't think I've ever clicked on save Pipeline and not encountered errors. And probably this time will be no exception.
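Since the topic ARN is the fiddly part, it may help to see what it is made of. The sketch below splits an ARN into its pieces, checks that the topic lives in the same region as the pipeline, and then builds the alarm object. The ARN, the account number and the topic name are made-up placeholders. The region us-east-1 is our assumption for "Eastern United States", and the role name is assumed to be the console's default.

```python
def parse_topic_arn(arn):
    """Split an SNS topic ARN of the form arn:partition:sns:region:account:topic."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sns":
        raise ValueError(f"not an SNS topic ARN: {arn!r}")
    return {"partition": parts[1], "region": parts[3],
            "account": parts[4], "topic": parts[5]}


# Placeholder values: substitute the ARN shown on your own SNS topic page.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:PipelineActivity"
PIPELINE_REGION = "us-east-1"   # assumed region for the rest of the services

topic = parse_topic_arn(TOPIC_ARN)
if topic["region"] != PIPELINE_REGION:
    raise SystemExit("topic and pipeline are in different regions")

alarm = {
    "id": "Action1", "name": "Action1", "type": "SnsAlarm",
    "subject": "Alert",
    "message": "Pipeline activity",
    "topicArn": TOPIC_ARN,
    "role": "DataPipelineDefaultRole",   # assumed default role name
}

print(alarm)
```

The region check mirrors what we did by eye in the top right corner of the console.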

In fact there are warnings. The default object in our configuration has a warning. That means that the Pipeline will probably work, but there's a configuration detail that should be or could be a little bit better.

Default resource one also has a warning, which you can read and take action on if you'd like. But remarkably, there are no errors. And the template should work as designed.

Besides the console, you can also interface with Data Pipeline by way of the AWS command line interface or the AWS Data Pipeline command line interface, which is a CLI written in Ruby that will make JSON calls to control and initiate activities with Data Pipeline. You can use the AWS software development kits, the SDKs, which will allow you to access the Amazon APIs that are related to Data Pipeline, or use the web service API directly, again using JSON-based programming tools.
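If you go the SDK route, one wrinkle is worth knowing about: the API describes each pipeline object as an id, a name and a list of key/value fields, where a value is either a plain string or a reference to another object. The sketch below converts definition-style objects into that shape using only the standard library. It is an illustration under our own assumptions, with hypothetical ids and paths, and it makes no call to AWS. The result is meant to resemble what the Python SDK's put_pipeline_definition call takes as pipelineObjects.

```python
import json


def to_api_objects(objects):
    """Convert definition-style objects into id/name/fields entries."""
    api_objects = []
    for obj in objects:
        fields = []
        for key, value in obj.items():
            if key in ("id", "name"):
                continue                      # carried at the top level instead
            if isinstance(value, dict) and "ref" in value:
                fields.append({"key": key, "refValue": value["ref"]})
            else:
                fields.append({"key": key, "stringValue": str(value)})
        api_objects.append({"id": obj["id"],
                            "name": obj.get("name", obj["id"]),
                            "fields": fields})
    return api_objects


# Hypothetical objects, trimmed to two for brevity.
definition_objects = [
    {"id": "DataNode1", "name": "DataNode1", "type": "S3DataNode",
     "filePath": "s3://example-source-bucket/information"},
    {"id": "CopyData", "name": "CopyData", "type": "CopyActivity",
     "input": {"ref": "DataNode1"}},
]

api_objects = to_api_objects(definition_objects)
print(json.dumps(api_objects, indent=2))
```

Keeping the conversion in a plain function like this means it can be tested without credentials, and the actual upload stays a separate step.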

About the Author
David Clinton
Linux SysAdmin

David taught high school for twenty years, worked as a Linux system administrator for five years, and has been writing since he could hold a crayon between his fingers. His childhood bedroom wall has since been repainted.

Having worked directly with all kinds of technology, David derives great pleasure from completing projects that draw on as many tools from his toolkit as possible.

Besides being a Linux system administrator with a strong focus on virtualization and security tools, David writes technical documentation and user guides, and creates technology training videos.

His favorite technology tool is the one that should be just about ready for release tomorrow. Or Thursday.