CloudAcademy

Data Pipeline

The course is part of this learning path

AWS Big Data – Specialty Certification Preparation for AWS

Overview
Difficulty: Beginner
Duration: 1h 20m
Students: 2130

Description

In this course, we will explore the Analytics tools provided by AWS, including Elastic MapReduce (EMR), Data Pipeline, Elasticsearch, Kinesis, Amazon Machine Learning, and QuickSight, which is still in preview mode.

We will start with an overview of Data Science and Analytics concepts to give beginners the context they need to be successful in the course. The second part of the course will focus on the AWS offering for Analytics, that is, how AWS structures its portfolio across the different processes and steps of big data processing.

As this is a fundamentals course, the requirements are kept simple so you can focus on understanding the different AWS services. However, a basic understanding of the following topics is necessary:

- As we are talking about technology and computing services, general IT knowledge is necessary: the basics of programming logic and algorithms, plus some learning or working experience in the IT field.
- We will give you an overview of data science concepts, but if these concepts are already familiar to you, it will make your journey smoother.
- It is not mandatory, but it would be helpful to have general knowledge of AWS, specifically how to access your account and services such as S3 and EC2.

If you are just getting started with AWS, I would recommend two courses from our portfolio that can help you better understand the basics:

AWS Technical Fundamentals
AWS Networking Fundamentals

If you have thoughts or suggestions for this course, please contact Cloud Academy at support@cloudacademy.com.

Transcript

Welcome to the AWS Analytics Fundamentals course. In this video, we are going to talk about AWS Data Pipeline. By the end of this video, you will understand the basics of AWS Data Pipeline, know the difference between Data Pipeline and other workflow services, and be familiar with the console settings.

AWS Data Pipeline is an AWS service that provides data-driven workflows to automate big data jobs. In these workflows, a pipeline is composed of the data nodes involved, the tasks or business logic to run, and a schedule on which that business logic executes. For example, imagine you have a website that has been configured to ship its web access logs to S3. You could define a job that runs every hour, gets these logs from S3, processes them on an EMR cluster, and loads the EMR results back into a SQL database for further lookups.
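To make that structure concrete, here is a minimal sketch of what such a pipeline definition could look like in the JSON definition format, written as a Python dictionary. All bucket names, IDs, and the EMR step are hypothetical placeholders; this only illustrates the data node / activity / schedule building blocks, not a production definition.

```python
import json

# Hypothetical pipeline definition: hourly web-log processing on EMR.
# Schedule, S3DataNode, EmrCluster and EmrActivity are standard Data Pipeline
# object types; the paths, IDs and step below are made up for illustration.
pipeline_definition = {
    "objects": [
        {   # Schedule: run the business logic once every hour
            "id": "HourlySchedule", "type": "Schedule",
            "period": "1 hours", "startAt": "FIRST_ACTIVATION_DATE_TIME"
        },
        {   # Data node: where the web access logs land in S3
            "id": "RawLogs", "type": "S3DataNode",
            "directoryPath": "s3://example-bucket/weblogs/",
            "schedule": {"ref": "HourlySchedule"}
        },
        {   # Resource: a temporary EMR cluster created for each run
            "id": "LogCluster", "type": "EmrCluster",
            "terminateAfter": "2 Hours",
            "schedule": {"ref": "HourlySchedule"}
        },
        {   # Activity: the business logic, here only a placeholder EMR step
            "id": "ProcessLogs", "type": "EmrActivity",
            "input": {"ref": "RawLogs"},
            "runsOn": {"ref": "LogCluster"},
            "step": "command-runner.jar,placeholder-for-your-job",
            "schedule": {"ref": "HourlySchedule"}
        }
    ]
}

print(json.dumps(pipeline_definition, indent=2))
```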

AWS Data Pipeline in this case could be responsible for the whole schedule: launching the EMR cluster, running the needed resources to save the content back into a relational database, and returning a report to you. In this specific context, the Data Pipeline service handles the job scheduling, execution, and retry logic, the tracking of dependencies between the different tasks, sending failure or success notifications, and creating and managing the temporary compute resources. This is very handy when you have daily, weekly, or even monthly tasks that are done over and over again, so Data Pipeline greatly reduces complexity and automates repetitive tasks for you.
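As a hedged illustration of how that retry logic and those notifications can be expressed, the fragment below adds hypothetical retry and SNS settings to the activity from the previous sketch. The field names (maximumRetries, onFail) are standard Data Pipeline fields, but the topic ARN, role, and IDs are placeholders.

```python
# Hypothetical fragment of a pipeline definition showing retries and notifications.
# "LogCluster" refers to the EmrCluster object from the earlier sketch.
retry_and_notify_objects = [
    {   # Alarm sent if the activity ultimately fails after all retries
        "id": "FailureAlarm", "type": "SnsAlarm",
        "topicArn": "arn:aws:sns:us-east-1:111122223333:pipeline-alerts",
        "subject": "Pipeline task failed",
        "message": "The hourly log-processing task failed after all retries.",
        "role": "DataPipelineDefaultRole"
    },
    {   # Activity with retry logic and a failure notification attached
        "id": "ProcessLogs", "type": "EmrActivity",
        "runsOn": {"ref": "LogCluster"},
        "step": "command-runner.jar,placeholder-for-your-job",
        "maximumRetries": "3",              # retry the step up to three times
        "onFail": {"ref": "FailureAlarm"}   # notify through SNS on final failure
    }
]

print(retry_and_notify_objects)
```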

As we are talking about workflows, you might wonder what the difference is between AWS Data Pipeline and the well-known Amazon Simple Workflow Service, or SWF. Both are workflow mechanisms, but the difference lies in the purpose of each. The Simple Workflow Service is a compute-based workflow: for example, a credit card approval process in a web store checkout.

In this case, when you buy something and submit your request for payment and approval, a new workflow is started in the background. It checks the stock for your items, then gets your credit card details and submits them to the payment gateway. If an approval is received, it hands your order over to the shipping department and ends the workflow, sending you back the information that your request has been approved and is now in the shipping area. This is an example of a compute-based workflow.
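Purely as an illustration of this compute-based, task-oriented style (and not of the actual SWF API), here is a small plain-Python sketch of the checkout flow described above. Every function and field name is hypothetical; in a real system SWF would coordinate these steps across workers.

```python
# Illustrative, hypothetical sketch of a compute-based (task-oriented) workflow:
# the orchestration is about the sequence of tasks, not about moving data.

def check_stock(order):
    # Placeholder: verify the ordered items are available
    return all(item.get("in_stock", False) for item in order["items"])

def charge_card(order):
    # Placeholder: submit card details to a payment gateway
    return {"approved": True, "auth_id": "AUTH-0001"}

def ship_order(order, auth):
    # Placeholder: hand the order over to the shipping department
    return {"status": "in_shipping", "auth_id": auth["auth_id"]}

def checkout_workflow(order):
    """Run the checkout steps in order, stopping on any failure."""
    if not check_stock(order):
        return {"status": "rejected", "reason": "out of stock"}
    auth = charge_card(order)
    if not auth["approved"]:
        return {"status": "rejected", "reason": "payment declined"}
    return ship_order(order, auth)

if __name__ == "__main__":
    sample_order = {"items": [{"sku": "book-123", "in_stock": True}]}
    print(checkout_workflow(sample_order))
```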

On the other hand, a data-driven workflow focuses on the data, not on the task itself. It moves the data across several platforms, transforming it on the way to its final destination. Let's keep our web log analytics example. In this case, our data-driven workflow runs at scheduled times: each hour it gets the generated logs from S3, starts a task that creates an EMR cluster, submits a job to EMR, waits until the job finishes, and, when it has finished, sends us a notification through SNS, for example telling us that the job has completed.

Now let's look at another example. In the image below, we are using AWS Data Pipeline to handle the log shipping process. This is a little bit different from the previous example. This pipeline has two tasks, or activities, that are run by the task runner at the scheduled times. As you can see, each hour it copies the EC2 logs from a specific server, such as a web server, to an S3 bucket, so they are available to all the AWS services. And once a week, another task is run to create an EMR cluster and submit a batch job to analyze the logs on S3 and deliver a report on top URL accesses, the most common origins based on IP address, and so on. The AWS Data Pipeline service itself is free of charge: you can build your workflow without incurring any charges, but you pay for the resources you use.
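A hedged sketch of how those two activities and their different schedules might be declared is shown below. The command, paths, and IDs are hypothetical, the referenced Ec2Resource and EmrCluster objects are assumed to be defined elsewhere, and the EMR step is only a placeholder.

```python
# Hypothetical objects for the two-activity pipeline described above:
# an hourly log copy to S3 and a weekly EMR batch analysis.
log_shipping_objects = [
    {"id": "HourlySchedule", "type": "Schedule",
     "period": "1 hours", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
    {"id": "WeeklySchedule", "type": "Schedule",
     "period": "1 weeks", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
    {   # Hourly: copy web server logs from the EC2 instance to S3
        "id": "ShipLogs", "type": "ShellCommandActivity",
        "command": "aws s3 cp /var/log/httpd/ s3://example-bucket/weblogs/ --recursive",
        "runsOn": {"ref": "WebServerResource"},   # Ec2Resource defined elsewhere
        "schedule": {"ref": "HourlySchedule"}
    },
    {   # Weekly: analyze the accumulated logs on an EMR cluster
        "id": "WeeklyReport", "type": "EmrActivity",
        "step": "command-runner.jar,placeholder-for-report-job",
        "runsOn": {"ref": "ReportCluster"},       # EmrCluster defined elsewhere
        "schedule": {"ref": "WeeklySchedule"}
    }
]

print(log_shipping_objects)
```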

Now we're going to explore the Amazon console for the Data Pipeline service. Okay, so here we are in the AWS console. Let's play around a little bit with the Data Pipeline service, so I click here. If it's the first time you are accessing it, it will show a Get Started page, which is different from this one. Here I already have one pipeline working, one pipeline which has been configured. Let's just create a new pipeline. As always, you have to give your pipeline a name: TestPipeline. You can choose a template from AWS, or import a definition, which is an advanced setting, or, the easiest and most common way, use the Architect. Let's choose this option here. You define a schedule or set it to run on pipeline activation; I'll set it to run on activation, just for testing. You can enable logging. For production workloads, it's highly recommended to keep logging enabled and point it to an S3 bucket, but as we're just testing, I'll keep it disabled. You can also add tags, and now we can edit in the Architect.
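For reference, the same console steps (create a pipeline, attach a definition with logging configured, validate it, and activate it) can also be driven from code. This is a hedged boto3 sketch: the names, bucket, roles, and region are placeholders, and the definition contains only the Default object and a schedule rather than a complete pipeline.

```python
import boto3

# Hedged sketch of the console steps via the API; names and paths are placeholders.
dp = boto3.client("datapipeline", region_name="us-east-1")

# 1. Create the (empty) pipeline, as the "Create new pipeline" button does.
pipeline_id = dp.create_pipeline(
    name="TestPipeline", uniqueId="test-pipeline-demo-001"
)["pipelineId"]

# 2. Attach a definition. Here just the Default object with logging enabled
#    and an hourly schedule; a real pipeline would add data nodes, activities,
#    and resources as shown earlier.
objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "HourlySchedule"},
        {"key": "pipelineLogUri", "stringValue": "s3://example-bucket/pipeline-logs/"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
    ]},
    {"id": "HourlySchedule", "name": "HourlySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 hours"},
        {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
    ]},
]
dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)

# 3. Validate and activate, as the console does when you save and activate.
dp.validate_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)
```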

The Architect allows us to create pipelines in an easy way. Here we can see that everything starts with the default configuration. This contains the settings we set before, like the on-demand schedule, and also the fact that we have no logging configured. I'm going to change here to a pipeline which is more complete, so we can explore the resources a bit more deeply. This pipeline does a very simple task: it basically copies S3 data from one bucket to another. Let me just open it here and edit it, so we can see the Architect view of a real pipeline. As we can see, in this pipeline, besides the default configuration, we have activities, we have the data nodes, a schedule, resources, and others. What do they mean? If you look at the graphical diagram, you'll see that in the middle we have the CopyActivity, which is the starting point, so we first need to have the activity set up. In the activity we define the schedule on which it will run, and also the input and output data nodes. So we first define in the activity all the resources we are going to need. Then we configure the data nodes, which are the nodes responsible for the input data and the output data: where I will get my data from and where I will store my data. In our case, our data nodes contain an S3 bucket configuration, the source and the destination bucket.

Our CopyActivity then has a resource which tells it what to do with the data. In this case, we have a copy data instance: this EC2 resource, this instance, will do the copy and perform this task for us. And we have the "others", which in this case are represented by an alarm: an SNS alarm which, on a successful copy, will send a message to us.
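Put together, the pipeline explored above could be expressed roughly as the definition below. This is a hedged sketch in the JSON definition format written as a Python dictionary; the bucket paths, topic ARN, schedule, and roles are placeholders, not the values used in the demo.

```python
import json

# Hypothetical definition of the S3-to-S3 copy pipeline from the walkthrough.
copy_pipeline = {
    "objects": [
        {"id": "DailySchedule", "type": "Schedule",
         "period": "1 days", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
        {   # Input data node: the source bucket
            "id": "SourceData", "type": "S3DataNode",
            "directoryPath": "s3://example-source-bucket/data/",
            "schedule": {"ref": "DailySchedule"}},
        {   # Output data node: the destination bucket
            "id": "DestinationData", "type": "S3DataNode",
            "directoryPath": "s3://example-destination-bucket/data/",
            "schedule": {"ref": "DailySchedule"}},
        {   # Resource: the EC2 instance that performs the copy
            "id": "CopyInstance", "type": "Ec2Resource",
            "instanceType": "t1.micro", "terminateAfter": "1 Hours",
            "schedule": {"ref": "DailySchedule"}},
        {   # "Others": an SNS alarm sent on a successful copy
            "id": "SuccessAlarm", "type": "SnsAlarm",
            "topicArn": "arn:aws:sns:us-east-1:111122223333:copy-success",
            "subject": "Copy finished",
            "message": "S3 copy completed successfully.",
            "role": "DataPipelineDefaultRole"},
        {   # The activity ties schedule, data nodes, resource and alarm together
            "id": "CopyData", "type": "CopyActivity",
            "input": {"ref": "SourceData"}, "output": {"ref": "DestinationData"},
            "runsOn": {"ref": "CopyInstance"},
            "onSuccess": {"ref": "SuccessAlarm"},
            "schedule": {"ref": "DailySchedule"}}
    ]
}

print(json.dumps(copy_pipeline, indent=2))
```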

For pricing, as we said before, there is no charge for this pipeline configuration itself; what you pay for are the resources you use, for example the S3 data transfer, the S3 data at rest, and the EC2 instance which does the copy for us.

This is a very simple pipeline. Real-world pipelines are usually very big and quite complex to set up, but once the configuration is done, it just works, hopefully, and automates tasks which would otherwise take hours each time they were done, and that's the great benefit. Well, thank you for watching this video. That was the overview I wanted to provide at this time, and see you in the next one. Bye.

About the Author

Fernando has solid experience with infrastructure and application management in heterogeneous environments, and has been working with Cloud-based solutions since the beginning of the Cloud revolution. Currently at Beck et al. Services, Fernando helps enterprises make a safe journey to the Cloud, architecting and migrating workloads from on-premises environments to public Cloud providers.