Orchestrating ETL Workflows
Difficulty: Beginner
Duration: 5h 1m
Students: 2669
Ratings: 4.5/5
Description

This section provides detail on the AWS management services relevant to the Solutions Architect Associate exam. These services are used to help you audit, monitor, and evaluate your AWS infrastructure and resources. These management services form a core component of running resilient and performant architectures.


Learning Objectives

  • Understand the benefits of using AWS CloudWatch and audit logs to manage your infrastructure
  • Learn how to record and track API requests using AWS CloudTrail
  • Learn what AWS Config is and its components
  • Manage your accounts with AWS Organizations, including single sign-on with AWS SSO
  • Learn how to carry out logging with CloudWatch, CloudTrail, CloudFront, and VPC Flow Logs
  • Understand how to design cost-optimized architectures in AWS
  • Learn about AWS data transformation tools such as AWS Glue, the Amazon Athena query service, and data visualization with Amazon QuickSight
Transcript

ETL pipelines can be complex. You may be running multiple ETL jobs at once, at varying intervals of time (maybe hourly, daily, weekly), involving multiple AWS services. It’s important to have a service that not only triggers your pipeline to run but also automates the movement of data between services, while providing basic retry logic and error handling.

To do this, you can use orchestration services. There are three main orchestration services that can be used in combination with ETL services, like Amazon EMR and AWS Glue. These services are:

  • AWS Data Pipeline

  • AWS Step Functions

  • And, perhaps surprisingly, AWS Glue itself, which has its own orchestration tool called Glue Workflows

Let’s start with the simplest of the three options: Glue Workflows. Glue Workflows provides a visual editor to create relationships between your Glue components, such as your Triggers, Crawlers, and your Glue ETL jobs. 

For example, let’s say I create a Workflow. This Workflow will first start with a trigger, which can fire based on a schedule or an event. I want this Workflow to be triggered daily at 12:00.

Once the workflow is triggered, it will kick off a job to do some light pre-processing of the data. After that is successful, I’ll have a crawler crawl the optimized data set. Once the crawler finishes running, I can then run ETL on that data. 

Glue will run this workflow every day at 12:00 without my intervention, completely automating my pipeline. 
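
If you wanted to script that workflow rather than use the visual editor, a minimal boto3 sketch might look like the following. The workflow, trigger, job, and crawler names are hypothetical, and the pre-processing job, crawler, and ETL job are assumed to already exist.

```python
import boto3

glue = boto3.client("glue")

# Create the workflow container.
glue.create_workflow(Name="daily-etl-workflow")

# Scheduled trigger: start the pre-processing job every day at 12:00 (UTC).
glue.create_trigger(
    Name="daily-noon-trigger",
    WorkflowName="daily-etl-workflow",
    Type="SCHEDULED",
    Schedule="cron(0 12 * * ? *)",
    Actions=[{"JobName": "preprocess-job"}],  # hypothetical, assumed to exist
    StartOnCreation=True,
)

# Conditional trigger: run the crawler once the pre-processing job succeeds.
glue.create_trigger(
    Name="crawl-after-preprocess",
    WorkflowName="daily-etl-workflow",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "preprocess-job",
        "State": "SUCCEEDED",
    }]},
    Actions=[{"CrawlerName": "optimized-data-crawler"}],  # hypothetical crawler
    StartOnCreation=True,
)

# Conditional trigger: run the main ETL job once the crawler completes.
glue.create_trigger(
    Name="etl-after-crawl",
    WorkflowName="daily-etl-workflow",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "CrawlerName": "optimized-data-crawler",
        "CrawlState": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "main-etl-job"}],  # hypothetical ETL job
    StartOnCreation=True,
)
```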

Glue Workflows has one main drawback: it is very simplistic and can only be integrated with Glue components. If you use other AWS services within your pipeline, and not just Glue, consider using a service with broader service integration, such as Data Pipeline or AWS Step Functions.

There is no extra cost for Glue Workflows; however, you will pay for the crawlers, the ETL jobs, and the Data Catalog requests that Workflows triggers on your behalf.

Next, there’s Data Pipeline. Its sole purpose is to coordinate data processing from one service to another without human intervention. 

The service itself is pretty bare-bones, and because of this, it’s very simple in nature. A data pipeline is made up of three core components: 

  1. Data nodes: these are storage locations that house your input data and output data. Data nodes can be S3, Redshift, DynamoDB, RDS, or a JDBC connection. 

  2. Activities: this is the work that you want the pipeline to perform on your data. This could be a CopyActivity, which copies data to another location; a SQL activity, which runs a SQL query on a database; or an EMR activity, such as launching an EMR cluster or running a Hive query or Pig script on a cluster.

  3. Preconditions: these are conditional statements that must be true before an activity can run. For example, you can check whether a data node exists, or run a custom shell script before your activity runs. 

Data Pipeline also has retry functionality built into the service. You can configure up to 5 retries per activity, and the pipeline won't report failure until it has gone through the number of retries you set. The higher the number, the longer it takes for a failure to be reported.
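
To make those three components concrete, here is a rough boto3 sketch of a pipeline definition with an S3 input and output data node, an S3PrefixNotEmpty precondition, and a CopyActivity with retries. The pipeline name, object IDs, bucket paths, IAM roles, and instance type are all hypothetical, and this is only a sketch of the structure, not a production definition.

```python
import boto3

dp = boto3.client("datapipeline")

# Create an empty pipeline, then attach a definition made of pipeline objects.
pipeline_id = dp.create_pipeline(name="daily-copy", uniqueId="daily-copy-001")["pipelineId"]

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        # Default object: schedule reference and IAM roles (role names assumed).
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "cron"},
            {"key": "schedule", "refValue": "DailySchedule"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ]},
        {"id": "DailySchedule", "name": "DailySchedule", "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 day"},
            {"key": "startDateTime", "stringValue": "2024-01-01T12:00:00"},
        ]},
        # Data nodes: where the input lives and where the output should go.
        {"id": "InputNode", "name": "InputNode", "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://example-bucket/raw/"},  # hypothetical bucket
            {"key": "precondition", "refValue": "InputExists"},
        ]},
        {"id": "OutputNode", "name": "OutputNode", "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://example-bucket/staged/"},
        ]},
        # Precondition: only run the activity if the input prefix is not empty.
        {"id": "InputExists", "name": "InputExists", "fields": [
            {"key": "type", "stringValue": "S3PrefixNotEmpty"},
            {"key": "s3Prefix", "stringValue": "s3://example-bucket/raw/"},
        ]},
        # Activity: copy from the input node to the output node, retrying up to 5 times.
        {"id": "CopyRawToStaged", "name": "CopyRawToStaged", "fields": [
            {"key": "type", "stringValue": "CopyActivity"},
            {"key": "input", "refValue": "InputNode"},
            {"key": "output", "refValue": "OutputNode"},
            {"key": "maximumRetries", "stringValue": "5"},
            {"key": "runsOn", "refValue": "CopyResource"},
        ]},
        # Compute resource the activity runs on.
        {"id": "CopyResource", "name": "CopyResource", "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "instanceType", "stringValue": "t3.micro"},
        ]},
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```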

While AWS Data Pipeline is simple to get started with, you may find that there are some limitations. For example, Data Pipeline supports a limited set of data sources. While you might be able to hack around this, you may want to consider a service called AWS Step Functions for further AWS service integration.

This leads us to the last orchestration service: AWS Step Functions. While AWS Step Functions isn’t purpose-built for working with data, it does work well with most general workflows. This generic nature provides more flexibility to the user.

With Step Functions, you can integrate with far more services, such as AWS Lambda, API Gateway, Athena, and more; in fact, you can call over 200 AWS services from your Step Functions workflows. It also supports pipelines that use both Amazon EMR and AWS Glue, whereas Data Pipeline only supports EMR.

Step Functions coordinates the navigation among services in a serverless workflow and manages retries and errors. It is more robust than Data Pipeline in terms of configuration, providing the ability not only to perform tasks but also to embed simple logic in your pipeline. This enables you to make choices between multiple states, pass data between services, use parallel execution, and implement delays in your pipeline.

To get a feel for how it works, let’s walk through a quick example that uses Step Functions to orchestrate Glue ETL jobs. In this example, I upload my data to Amazon S3, which triggers Step Functions to run. Step Functions first signals Lambda to validate my data in S3, to ensure that it is the right data type and schema. If the validation is successful, the data is moved to a staging folder. If the validation fails, the data is moved to an error folder and you are sent a notification using Amazon SNS.

For the successfully validated data, an AWS Glue crawler runs to infer the schema. Step Functions then triggers a Glue ETL job, transforming the file into a different format. Once the Glue job is complete, the output data is stored in the transformed folder in Amazon S3, and I then receive an SNS message stating that the ETL job has successfully finished.
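
Here is a rough sketch of what the state machine for that example might look like, defined in the Amazon States Language and created with boto3. The Lambda function, crawler, Glue job, SNS topic, and IAM role names are all hypothetical, and the S3 upload trigger (typically wired up through EventBridge) is omitted.

```python
import json
import boto3

# State machine mirroring the example above. The validation Lambda is assumed
# to move the file to the staging or error folder and return {"valid": true/false}.
definition = {
    "Comment": "Validate uploaded data, crawl it, transform it with Glue, notify via SNS",
    "StartAt": "ValidateData",
    "States": {
        "ValidateData": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "validate-upload", "Payload.$": "$"},  # hypothetical Lambda
            "ResultPath": "$.validation",
            "Next": "ValidationPassed?",
        },
        "ValidationPassed?": {
            "Type": "Choice",
            "Choices": [{
                "Variable": "$.validation.Payload.valid",
                "BooleanEquals": True,
                "Next": "CrawlStagedData",
            }],
            "Default": "NotifyFailure",
        },
        "CrawlStagedData": {
            "Type": "Task",
            # SDK integration; it returns immediately, so a real pipeline would
            # also poll the crawler state before moving on.
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "staged-data-crawler"},  # hypothetical crawler
            "Next": "RunGlueETL",
        },
        "RunGlueETL": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",  # waits for the job to finish
            "Parameters": {"JobName": "transform-to-parquet"},     # hypothetical Glue job
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 2, "IntervalSeconds": 60}],
            "Next": "NotifySuccess",
        },
        "NotifySuccess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-events",
                           "Message": "ETL job finished successfully"},
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-events",
                           "Message": "Validation failed; file moved to the error folder"},
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="glue-etl-orchestration",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsGlueRole",  # hypothetical role
)
```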

This is just an example of what you can do with Step Functions. You can build far more complex ETL processes that include a wide range of AWS services and logic. 

In summary, AWS Glue Workflows is great for creating workflows between different Glue components. However, it isn’t the best choice if you need orchestration that includes other AWS services.

While AWS Data Pipeline is a simple way to get started with building data pipelines on AWS, it has rigid limitations on which services it integrates with. AWS Step Functions has far greater integration with AWS services and provides more sophisticated logic when building a pipeline.

About the Author
Students: 236921
Labs: 1
Courses: 232
Learning Paths: 187

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to cloud computing, reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016, Stuart was awarded the ‘Expert of the Year Award 2015’ from Experts Exchange for sharing his knowledge of cloud services with the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.