Orchestrating ETL Workflows

Difficulty
Intermediate
Duration
28m
Description

In this course, we will compare Amazon EMR and AWS Glue and cover ways to make ETL processes more automated and repeatable.

Learning Objectives

  • What AWS Glue is and how it works 
  • How AWS Glue compares to Amazon EMR 
  • How to make ETL processes more automated and repeatable using orchestration services such as AWS Data Pipeline, AWS Glue Workflows, and AWS Step Functions

Intended Audience

  • Those who are implementing and managing ETL on AWS

  • Those who are looking to take an AWS certification — specifically the AWS Certified Solutions Architect – Associate Certification or the AWS Certified Data Analytics - Specialty Certification

Prerequisites 

In this course, I will provide introductory information on AWS Glue. However, to get the most from this course, you should already have an understanding of Amazon EMR and Amazon EC2. For more information on these services, please see our existing content titled: 

Transcript

ETL pipelines can be complex. You may be running multiple ETL jobs at once, at varying intervals of time (maybe hourly, daily, weekly), involving multiple AWS services. It’s important you have a service that not only triggers your pipeline to run but also automates the movement between services, while also handling basic retry logic and error handling. 

To do this, you can use orchestration services. There are three main orchestration services that can be used in combination with ETL services, like Amazon EMR and AWS Glue. These services are:

  • AWS Data Pipeline

  • AWS Step Functions

  • And surprisingly, AWS Glue. AWS Glue has its own orchestration tool called Glue Workflows.

Let’s start with the simplest of the three options: Glue Workflows. Glue Workflows provides a visual editor to create relationships between your Glue components, such as your Triggers, Crawlers, and your Glue ETL jobs. 

For example, let’s say I create a Workflow. This Workflow will first start with a trigger. I can trigger it on a schedule or on an event. I want this Workflow to be triggered daily at 12:00.

Once the workflow is triggered, it will kick off a job to do some light pre-processing of the data. After that is successful, I’ll have a crawler crawl the optimized data set. Once the crawler finishes running, I can then run ETL on that data. 

Glue will run this workflow every day at 12:00 without my intervention, completely automating my pipeline. 
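As a rough sketch of the workflow just described, the three triggers could be expressed as request payloads for the boto3 Glue client's `create_workflow` and `create_trigger` calls. All of the resource names here (`daily-etl-workflow`, `preprocess-job`, and so on) are hypothetical placeholders, and the schedule assumes that 12:00 means noon UTC, since Glue cron schedules are evaluated in UTC.

```python
# Request payloads for a Glue Workflow that mirrors the example above.
# Every resource name below is a hypothetical placeholder.
WORKFLOW = "daily-etl-workflow"
PREPROCESS_JOB = "preprocess-job"
CRAWLER = "optimized-data-crawler"
ETL_JOB = "etl-job"


def build_triggers(workflow=WORKFLOW):
    """Return keyword arguments for three Glue create_trigger calls."""
    return [
        {   # 1. Start the workflow every day at 12:00 UTC (assumed time zone).
            "Name": "start-daily",
            "WorkflowName": workflow,
            "Type": "SCHEDULED",
            "Schedule": "cron(0 12 * * ? *)",
            "Actions": [{"JobName": PREPROCESS_JOB}],
            "StartOnCreation": True,
        },
        {   # 2. Run the crawler once the pre-processing job has succeeded.
            "Name": "crawl-after-preprocess",
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "Predicate": {"Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": PREPROCESS_JOB,
                "State": "SUCCEEDED",
            }]},
            "Actions": [{"CrawlerName": CRAWLER}],
            "StartOnCreation": True,
        },
        {   # 3. Run the ETL job once the crawler has succeeded.
            "Name": "etl-after-crawl",
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "Predicate": {"Conditions": [{
                "LogicalOperator": "EQUALS",
                "CrawlerName": CRAWLER,
                "CrawlState": "SUCCEEDED",
            }]},
            "Actions": [{"JobName": ETL_JOB}],
            "StartOnCreation": True,
        },
    ]


def create_workflow(glue_client):
    """Create the workflow and its triggers using a boto3 Glue client."""
    glue_client.create_workflow(Name=WORKFLOW)
    for trigger in build_triggers():
        glue_client.create_trigger(**trigger)


if __name__ == "__main__":
    for trigger in build_triggers():
        print(trigger["Name"], trigger["Type"])
```

With AWS credentials configured, `create_workflow(boto3.client("glue"))` would submit these requests; the jobs and the crawler they refer to would need to exist already.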

Glue Workflows has only one drawback: it is very simplistic and can only be integrated with Glue tools. If you use other AWS services within your pipeline, and not just Glue, consider using a service that has better service integration, such as Data Pipeline or AWS Step Functions. 

There is no extra cost to Glue Workflows. However, you will pay for the Crawlers, the ETL jobs, and the Data Catalog requests that Workflows triggers on your behalf.

Next, there’s Data Pipeline. Its sole purpose is to coordinate data processing from one service to another without human intervention. 

The service itself is pretty bare-bones, and because of this, it’s very simple in nature. A data pipeline is made up of three core components: 

  1. Data nodes: these are storage locations where you house your input data and output data. Data nodes can be S3, Redshift, DynamoDB, RDS or a JDBC connection. 

  2. Activities: this is the work that you want the pipeline to perform on your data. This could be a CopyActivity, which copies data to another location; a SQL activity, which runs a SQL query on a database; or an EMR activity, such as running an EMR cluster or running a Hive query or Pig script on an EMR cluster.

  3. Preconditions: these are conditional statements that must be true before an activity can run. For example, you can check whether a data node exists, or run a custom shell script before your activity runs. 

Data Pipeline also has retry functionality built into the service. You can configure up to 5 retries per activity. The pipeline won't report failure until it goes through the number of retries you set. The higher the number, the longer it will take for a failure to be reported.
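To make the three components and the retry setting concrete, here is a minimal sketch that assembles a simplified pipeline definition as a Python dictionary. The object types and fields (`S3DataNode`, `CopyActivity`, `S3KeyExists`, `maximumRetries`) follow the Data Pipeline definition format as I understand it, while the object ids and the helper functions are my own hypothetical names. A real definition would also need a schedule and IAM roles, which are left out here.

```python
import json

MAX_RETRIES_LIMIT = 5  # upper bound on retries per activity, as stated above


def build_definition(input_key, output_key, max_retries=3):
    """Build a simplified pipeline definition: nodes, activity, precondition."""
    if not 0 <= max_retries <= MAX_RETRIES_LIMIT:
        raise ValueError("max_retries must be between 0 and 5")
    return {"objects": [
        # Precondition: the input object must exist before the copy runs.
        {"id": "InputExists", "type": "S3KeyExists", "s3Key": input_key},
        # Data nodes: where the input lives and where the output goes.
        {"id": "InputData", "type": "S3DataNode", "filePath": input_key},
        {"id": "OutputData", "type": "S3DataNode", "filePath": output_key},
        # Compute resource the activity runs on (hypothetical settings).
        {"id": "Worker", "type": "Ec2Resource", "terminateAfter": "30 Minutes"},
        # Activity: copy input to output, retrying on failure.
        {"id": "CopyData", "type": "CopyActivity",
         "input": {"ref": "InputData"},
         "output": {"ref": "OutputData"},
         "precondition": {"ref": "InputExists"},
         "runsOn": {"ref": "Worker"},
         "maximumRetries": str(max_retries)},
    ]}


def unresolved_refs(definition):
    """Return ids that are referenced but not defined in the definition."""
    ids = {obj["id"] for obj in definition["objects"]}
    return sorted(
        value["ref"]
        for obj in definition["objects"]
        for value in obj.values()
        if isinstance(value, dict) and "ref" in value and value["ref"] not in ids
    )


if __name__ == "__main__":
    definition = build_definition(
        "s3://example-bucket/input/data.csv",    # hypothetical paths
        "s3://example-bucket/output/data.csv",
    )
    print(json.dumps(definition, indent=2))
```

Keeping the definition as plain data makes it easy to check, for instance with `unresolved_refs`, that every reference points at an object that exists before the definition is uploaded.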

While AWS Data Pipeline is simple to get started with, you may find that there are some limitations. For example, Data Pipeline has limited data sources. While you might be able to hack around this, you may want to consider using a service called AWS Step Functions for further AWS service integration.

This leads us to the last orchestration service: AWS Step Functions. While AWS Step Functions isn’t purpose-built for working with data, it does work well with most general workflows. This generic nature provides more flexibility to the user.

With Step Functions, you can integrate with far more services, such as AWS Lambda, API Gateway, Athena, and more. You can call over 200 AWS services from your Step Functions workflows. It can also support pipelines that use Amazon EMR and AWS Glue, whereas Data Pipeline only supports EMR.

Step Functions coordinates the navigation among services in a serverless workflow and manages retries and errors. It is more robust than Data Pipeline in terms of configuration, providing the ability to not only perform tasks but also embed simple logic for execution in your pipeline. This enables you to make choices between multiple states, pass data between services, use parallel execution, and implement delays in your pipeline.
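To illustrate how retries and error handling are configured, here is a small sketch of a single Task state written as a Python dictionary in the shape of the Amazon States Language. The job name and the names of the neighboring states are hypothetical. The helper `retry_waits` is my own illustration of how the back-off settings translate into wait times; it ignores optional settings such as a maximum delay or jitter.

```python
# A Task state that runs a Glue job, retries on failure, and routes
# any remaining error to a notification state. Names are hypothetical.
GLUE_TASK = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {"JobName": "etl-job"},
    "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 10,   # wait before the first retry
        "MaxAttempts": 3,        # number of retries
        "BackoffRate": 2.0,      # multiplier applied after each retry
    }],
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
    "Next": "NotifySuccess",
}


def retry_waits(retrier):
    """Return the wait in seconds before each retry attempt."""
    interval = retrier.get("IntervalSeconds", 1)
    attempts = retrier.get("MaxAttempts", 3)
    rate = retrier.get("BackoffRate", 2.0)
    return [interval * rate ** attempt for attempt in range(attempts)]


if __name__ == "__main__":
    print(retry_waits(GLUE_TASK["Retry"][0]))  # [10.0, 20.0, 40.0]
```

With these settings the task is retried after 10, 20, and then 40 seconds, and only if the last retry also fails does the `Catch` rule send the execution to the failure state.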

To get a feel for how it works, let’s draw a quick example that uses Step Functions to orchestrate Glue ETL jobs. In this example, I upload my data to Amazon S3, which triggers Step Functions to run. Step Functions first signals to Lambda to validate my data in S3, to ensure that it is the right data type and schema. If the validation is successful, the data is moved to a staging folder. If the validation fails, the data is moved to an error folder and I receive a notification through Amazon SNS.

For the successfully validated data, an AWS Glue Crawler runs to infer the schema. Step Functions then triggers a Glue ETL job to run, transforming the file into a different format. Once the Glue job is complete, it stores the output data in the transformed folder in Amazon S3. I then receive an SNS message stating the ETL job has successfully finished.
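The example above could be sketched as a state machine definition along the following lines. This is an illustration under assumptions, not a production definition: the Lambda function names, the crawler and job names, and the SNS topic ARN are hypothetical, the validation Lambda is assumed to return a `valid` boolean, and the polling loop that would normally wait for the crawler to finish is left out. The `walk` helper is only there to trace which states each outcome passes through.

```python
import json

TOPIC = "arn:aws:sns:us-east-1:123456789012:etl-notifications"  # hypothetical


def lambda_task(function_name, next_state):
    """Task state that invokes a (hypothetical) Lambda function."""
    return {"Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
            "OutputPath": "$.Payload", "Next": next_state}


def sns_task(message):
    """Terminal Task state that publishes a message to the SNS topic."""
    return {"Type": "Task", "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": TOPIC, "Message": message},
            "End": True}


STATE_MACHINE = {
    "Comment": "Validate uploaded data, crawl it, run ETL, then notify.",
    "StartAt": "ValidateData",
    "States": {
        "ValidateData": lambda_task("validate-data", "IsValid"),
        "IsValid": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.valid", "BooleanEquals": True,
                         "Next": "MoveToStaging"}],
            "Default": "MoveToError",
        },
        "MoveToStaging": lambda_task("move-to-staging", "RunCrawler"),
        "RunCrawler": {  # starts the crawler; waiting for it is omitted
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "staging-crawler"},
            "ResultPath": None,
            "Next": "RunEtlJob",
        },
        "RunEtlJob": {  # .sync waits for the Glue job to complete
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "transform-job"},
            "Next": "NotifySuccess",
        },
        "NotifySuccess": sns_task("The ETL job finished successfully."),
        "MoveToError": lambda_task("move-to-error", "NotifyFailure"),
        "NotifyFailure": sns_task("Validation failed; data moved to error."),
    },
}


def walk(definition, valid):
    """Trace the state names visited for a given validation outcome."""
    name, visited = definition["StartAt"], []
    while True:
        state = definition["States"][name]
        visited.append(name)
        if state.get("End"):
            return visited
        if state["Type"] == "Choice":
            name = state["Choices"][0]["Next"] if valid else state["Default"]
        else:
            name = state["Next"]


if __name__ == "__main__":
    print(" -> ".join(walk(STATE_MACHINE, valid=True)))
    print(" -> ".join(walk(STATE_MACHINE, valid=False)))
```

Serializing `STATE_MACHINE` with `json.dumps` gives a definition document that could be used as a starting point for a real state machine, once the placeholder names are replaced.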

This is just an example of what you can do with Step Functions. You can build far more complex ETL processes that include a wide range of AWS services and logic. 

In summary, AWS Glue Workflows is great for creating workflows between different Glue components. However, it’s not the best choice if you need orchestration that includes other AWS services.

While AWS Data Pipeline is a simplistic way of getting started with building data pipelines on AWS, it does have rigid limitations on what services integrate with it. AWS Step Functions has far greater integration with AWS and provides more sophisticated logic when building a pipeline. 

About the Author

Alana Layton is an experienced technical trainer, technical content developer, and cloud engineer living out of Seattle, Washington. Her career has included teaching about AWS all over the world, creating AWS content that is fun, and working in consulting. She currently holds six AWS certifications. Outside of Cloud Academy, you can find her testing her knowledge in bar trivia, reading, or training for a marathon.