Orchestrating ETL Workflows



This section of the AWS Certified Solutions Architect - Professional learning path introduces the AWS management and governance services relevant to the AWS Certified Solutions Architect - Professional exam. These services are used to help you audit, monitor, and evaluate your AWS infrastructure and resources and form a core component of resilient and performant architectures. 


Learning Objectives

  • Understand the benefits of using Amazon CloudWatch and audit logs to manage your infrastructure
  • Learn how to record and track API requests using AWS CloudTrail
  • Learn what AWS Config is and its components
  • Manage multi-account environments with AWS Organizations and Control Tower
  • Learn how to carry out logging with CloudWatch, CloudTrail, CloudFront, and VPC Flow Logs
  • Learn about AWS data transformation tools such as AWS Glue, and about query and visualization services such as Amazon Athena and Amazon QuickSight
  • Learn how AWS CloudFormation can be used to represent your infrastructure as code (IaC)
  • Understand SLAs in AWS

ETL pipelines can be complex. You may be running multiple ETL jobs at once, at varying intervals (hourly, daily, or weekly), across multiple AWS services. You need a service that not only triggers your pipeline but also automates the movement between services and handles basic retry logic and error handling.

To do this, you can use orchestration services. There are three main orchestration services that can be used in combination with ETL services, like Amazon EMR and AWS Glue. These services are:

  • AWS Data Pipeline

  • AWS Step Functions

  • And, perhaps surprisingly, AWS Glue itself. AWS Glue has its own orchestration tool called Glue Workflows

Let’s start with the simplest of the three options: Glue Workflows. Glue Workflows provides a visual editor to create relationships between your Glue components, such as your Triggers, Crawlers, and your Glue ETL jobs. 

For example, let’s say I create a Workflow. This Workflow starts with a trigger, which can be based on a schedule or an event. I want this Workflow to be triggered daily at 12:00.

Once the workflow is triggered, it will kick off a job to do some light pre-processing of the data. After that is successful, I’ll have a crawler crawl the optimized data set. Once the crawler finishes running, I can then run ETL on that data. 

Glue will run this workflow every day at 12:00 without my intervention, completely automating my pipeline. 
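As a rough sketch of how the workflow just described could be expressed in code, the snippet below builds the keyword arguments for three Glue triggers: one scheduled trigger that starts the pre-processing job, and two conditional triggers that chain the crawler and the ETL job. The workflow, job, and crawler names are hypothetical, and the cron expression assumes that 12:00 means 12:00 UTC. The code only builds dictionaries; the assumption is that each one could then be passed to boto3's Glue `create_trigger` call once the workflow and its components exist.

```python
import json

# Hypothetical resource names, used only for illustration.
WORKFLOW = "daily-etl-workflow"
PREPROCESS_JOB = "preprocess-job"
CRAWLER = "optimized-data-crawler"
ETL_JOB = "etl-job"


def build_triggers(workflow: str) -> list[dict]:
    """Return keyword arguments for three Glue create_trigger calls."""
    return [
        {
            # Scheduled trigger: start pre-processing daily at 12:00 UTC.
            "Name": "start-daily",
            "WorkflowName": workflow,
            "Type": "SCHEDULED",
            "Schedule": "cron(0 12 * * ? *)",
            "Actions": [{"JobName": PREPROCESS_JOB}],
            "StartOnCreation": True,
        },
        {
            # Conditional trigger: crawl once pre-processing has succeeded.
            "Name": "crawl-after-preprocess",
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "Predicate": {
                "Conditions": [
                    {
                        "LogicalOperator": "EQUALS",
                        "JobName": PREPROCESS_JOB,
                        "State": "SUCCEEDED",
                    }
                ]
            },
            "Actions": [{"CrawlerName": CRAWLER}],
            "StartOnCreation": True,
        },
        {
            # Conditional trigger: run ETL once the crawler has succeeded.
            "Name": "etl-after-crawl",
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "Predicate": {
                "Conditions": [
                    {
                        "LogicalOperator": "EQUALS",
                        "CrawlerName": CRAWLER,
                        "CrawlState": "SUCCEEDED",
                    }
                ]
            },
            "Actions": [{"JobName": ETL_JOB}],
            "StartOnCreation": True,
        },
    ]


if __name__ == "__main__":
    print(json.dumps(build_triggers(WORKFLOW), indent=2))
```

Keeping the trigger definitions as plain data makes the chain easy to review before anything is created in an account.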

Glue Workflows has one main drawback: it is very simple and integrates only with Glue components. If your pipeline uses other AWS services in addition to Glue, consider a service with broader service integration, such as Data Pipeline or AWS Step Functions.

There is no extra cost for Glue Workflows. However, you will pay for the Crawlers, the ETL jobs, and the Data Catalog requests that Workflows triggers on your behalf.

Next, there’s Data Pipeline. Its sole purpose is to coordinate data processing from one service to another without human intervention. 

The service itself is fairly bare-bones, which makes it simple to work with. A data pipeline is made up of three core components:

  1. Data nodes: these are storage locations where you house your input data and output data. Data nodes can be S3, Redshift, DynamoDB, RDS or a JDBC connection. 

  2. Activities: this is the work that you want the pipeline to perform on your data. This could be a CopyActivity that copies data to another location, a SqlActivity that runs a SQL query on a database, or an EMR-based activity, such as running an EMR cluster or running a Hive query or Pig script on an EMR cluster.

  3. Preconditions: these are conditional statements that must be true before an activity can run. For example, you can check whether a data node exists, or run a custom shell script before your activity runs. 

Data Pipeline also has retry functionality built into the service. You can configure up to 5 retries per activity. The pipeline won't report a failure until it has gone through the number of retries you set, so the higher the number, the longer it takes for a failure to surface.
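To make the three components and the retry setting more concrete, here is a minimal, hedged sketch of a pipeline definition built as a Python dictionary. It assumes the JSON definition format in which each object has an `id` and a `type` and refers to other objects through `{"ref": ...}`. The bucket paths and object IDs are hypothetical, and the definition is deliberately incomplete: a real pipeline would also need a schedule, IAM roles, and default settings. The retry limit check simply mirrors the limit of 5 stated above.

```python
import json

# Hypothetical S3 locations, used only for illustration.
INPUT_PATH = "s3://example-bucket/input/data.csv"
OUTPUT_PATH = "s3://example-bucket/output/data.csv"


def build_pipeline_definition(max_retries: int = 3) -> dict:
    """Build a partial pipeline definition with nodes, an activity, and a precondition."""
    if not 0 <= max_retries <= 5:
        raise ValueError("max_retries must be between 0 and 5")
    return {
        "objects": [
            # Data nodes: where the input is read from and the output is written to.
            {
                "id": "InputData",
                "type": "S3DataNode",
                "filePath": INPUT_PATH,
                "precondition": {"ref": "InputExists"},
            },
            {"id": "OutputData", "type": "S3DataNode", "filePath": OUTPUT_PATH},
            # Precondition: the input file must exist before the copy runs.
            {"id": "InputExists", "type": "S3KeyExists", "s3Key": INPUT_PATH},
            # Compute resource that performs the work.
            {"id": "CopyWorker", "type": "Ec2Resource", "terminateAfter": "1 Hour"},
            # Activity: copy the input to the output, retrying on failure.
            {
                "id": "CopyInputToOutput",
                "type": "CopyActivity",
                "input": {"ref": "InputData"},
                "output": {"ref": "OutputData"},
                "runsOn": {"ref": "CopyWorker"},
                "maximumRetries": str(max_retries),
            },
        ]
    }


def unresolved_refs(definition: dict) -> set[str]:
    """Return referenced object IDs that are not defined in the pipeline."""
    ids = {obj["id"] for obj in definition["objects"]}
    refs = {
        value["ref"]
        for obj in definition["objects"]
        for value in obj.values()
        if isinstance(value, dict) and "ref" in value
    }
    return refs - ids


if __name__ == "__main__":
    definition = build_pipeline_definition(max_retries=3)
    print(json.dumps(definition, indent=2))
    print("Unresolved references:", unresolved_refs(definition))
```

The `unresolved_refs` helper is not part of the service; it is a small sanity check that every reference in the sketch points at an object that exists.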

While AWS Data Pipeline is simple to get started with, you may find that it has some limitations. For example, Data Pipeline supports a limited set of data sources. While you might be able to hack around this, you may want to consider using AWS Step Functions for broader AWS service integration.

This leads us to the last orchestration service: AWS Step Functions. While AWS Step Functions isn’t purpose-built for working with data, it does work well with most general workflows. This generic nature provides more flexibility to the user.

With Step Functions, you can integrate with far more services, such as AWS Lambda, API Gateway, Athena, and more. You can call over 200 AWS services from your Step Functions workflows. It can also support pipelines that use both Amazon EMR and AWS Glue, whereas Data Pipeline only supports EMR.

Step Functions coordinates the navigation among services in a serverless workflow and manages retries and errors. It is more robust than Data Pipeline in terms of configuration, providing the ability to not only perform tasks but also embed simple logic for execution in your pipeline. This enables you to make choices between multiple states, pass data between services, use parallel execution, and implement delays in your pipeline.
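As one illustration of how retries are managed, a Task state can carry a retrier with the fields `ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, and `BackoffRate`. The wait before each retry grows by the backoff rate, so the wait before the n-th retry is `IntervalSeconds * BackoffRate ** (n - 1)`. The sketch below works through that arithmetic; the specific values are illustrative assumptions, not recommendations.

```python
# An illustrative retrier, written as the Python equivalent of the JSON
# that would sit in a Task state's "Retry" list.
RETRIER = {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 10,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}


def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> list[float]:
    """Return the wait, in seconds, before each retry attempt."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


if __name__ == "__main__":
    delays = retry_delays(
        RETRIER["IntervalSeconds"],
        RETRIER["MaxAttempts"],
        RETRIER["BackoffRate"],
    )
    # With the values above, the waits are 10, 20, and 40 seconds.
    print(delays)
    print("Total time spent waiting:", sum(delays), "seconds")
```

As with Data Pipeline retries, a larger number of attempts means a failure takes longer to surface, here up to 70 seconds of waiting on top of the task run times.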

To get a feel for how it works, let’s draw a quick example that uses Step Functions to orchestrate Glue ETL jobs. In this example, I upload my data to Amazon S3, which triggers Step Functions to run. Step Functions first invokes a Lambda function to validate my data in S3, to ensure that it has the right data type and schema. If the validation is successful, the data is moved to a staging folder. If the validation fails, the data is moved to an error folder and I receive a notification through Amazon SNS.

For the successfully validated data, an AWS Glue Crawler runs to infer the schema. Step Functions then triggers a Glue ETL job to run, transforming the file into a different format. Once the Glue job is complete, it stores the output data in the transformed folder in Amazon S3. I then receive an SNS message stating that the ETL job has finished successfully.

This is just an example of what you can do with Step Functions. You can build far more complex ETL processes that include a wide range of AWS services and logic. 

In summary, AWS Glue Workflows is great for creating workflows between different Glue components. However, it’s not the best choice if you need orchestration that includes other AWS services.

While AWS Data Pipeline is a simple way of getting started with building data pipelines on AWS, it is rigidly limited in which services integrate with it. AWS Step Functions has far greater integration with AWS and supports more sophisticated logic when building a pipeline.

About the Author

Danny has over 20 years of IT experience as a software developer, cloud engineer, and technical trainer. After attending a conference on cloud computing in 2009, he knew he wanted to build his career around what was still a very new, emerging technology at the time — and share this transformational knowledge with others. He has spoken to IT professional audiences at local, regional, and national user groups and conferences. He has delivered in-person classroom and virtual training, interactive webinars, and authored video training courses covering many different technologies, including Amazon Web Services. He currently has six active AWS certifications, including certifications at the Professional and Specialty level.