Amazon Managed Workflows for Apache Airflow

Overview
Difficulty: Intermediate
Duration: 15m
Students: 64
Ratings: 5/5
Description

This course delves into Amazon Managed Workflows for Apache Airflow (MWAA). This is a great service for anyone already using Apache Airflow who wants a better way to handle setting up the service, scheduling, and managing their workflows.

Learning Objectives

  • Understand how Amazon Managed Workflows for Apache Airflow is implemented within AWS
  • Learn about DAGs (Directed Acyclic Graphs), which Apache Airflow uses to run your workflows
  • Understand the key components required to set up your own Managed Airflow environment

Intended Audience

This course is intended for anyone already using Apache Airflow who wants a better way to handle setting up the service, scheduling, and managing their workflows.

Prerequisites

To get the most out of this course, you should have a decent understanding of cloud computing and cloud architectures, specifically with Amazon Web Services. Some background knowledge of Apache Airflow is helpful, but it is not a hard requirement. Basic knowledge of ETL pipelines and state machines would also be beneficial.

Transcript

Data scientists, architects, and DevOps engineers have been using Apache Airflow, a leading open-source orchestration environment, to create and execute workflows that define ETL jobs and machine learning data pipelines, and even handle complicated DevOps tasks.

Amazon Managed Workflows for Apache Airflow is a managed service designed to help you integrate Apache Airflow straight into AWS with minimal setup and the quickest time to execution. Like any managed service, AWS is in charge of scaling, managing, and helping you orchestrate your Apache Airflow environments.

Amazon Managed Workflows for Apache Airflow (MWAA) also handles managing worker fleets, installing dependencies, scaling the system up and down, logging and monitoring, providing authorization through IAM, and even single sign-on.

Having a managed workflow system such as this helps eliminate these hands-on operations and gives your engineers and data scientists more time to focus on the tasks that they are good at.

Before we get started talking about Amazon Managed Workflows for Apache Airflow, I think we should level set a little and talk about just what Apache Airflow is.

Apache Airflow is an open-source platform built around the idea of programmatically authoring, automating, and scheduling workflow systems. These workflows are commonly associated with the collection of data from complex data pipelines; however, there are many other use cases for this technology.

You can use Apache Airflow to help deal with:

  • Backups and generic DevOps tasks
  • Training machine learning models
  • Automatically generating reports
  • Creating robust ETL pipelines that can extract data from multiple sources

If you are unfamiliar with what a workflow is, you might think of it as a simple state machine. A state machine has a set of parameters that it watches for and makes decisions based on those inputs. Your workflows can be as simple or as complex as you desire. Apache Airflow is used by a multitude of large corporations, small businesses, and hobby developers.

Each workflow you build needs to be represented as a Directed Acyclic Graph, also known as a DAG. A DAG helps you visually and programmatically manage and structure the processes within your workflows.

Your DAGs are a collection of all the tasks you want Apache Airflow to run on your behalf. These DAGs are written in Python, a fairly easy programming language to get started with, and they help you define the relationships and dependencies within your workflow. You can have as many or as few DAGs as you desire.

A simple DAG might only consist of a few tasks - and you are in charge of how you want these tasks executed. For example, if our simple DAG has Task A, Task B, and Task C - you might configure your workflow so that Task A has to successfully complete before Task B can run. However, you might have Task C run whenever it can.
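To give you a rough idea of what that looks like in code, here is a minimal sketch of that Task A / Task B / Task C example, assuming a recent Apache Airflow 2.x release. The DAG name and task names here are made up for illustration, and the tasks are placeholders rather than real work:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# A minimal sketch of the Task A / Task B / Task C example.
# EmptyOperator is just a placeholder; real tasks would do actual work.
with DAG(
    dag_id="simple_example_dag",      # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule=None,                    # only runs when triggered manually
    catchup=False,
):
    task_a = EmptyOperator(task_id="task_a")
    task_b = EmptyOperator(task_id="task_b")
    task_c = EmptyOperator(task_id="task_c")

    # Task A must finish successfully before Task B can run;
    # Task C has no upstream dependency, so it runs whenever it can.
    task_a >> task_b

Notice that Task C simply has no dependency arrow pointing at it, which is all it takes for the scheduler to treat it as independent.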

As you can see, the DAG helps to describe how the workflow operates, yet it doesn't actually do the operations themselves. Tasks A, B, and C could be anything - the DAG is not concerned with what each task is, only with the order of operations that will be performed.

The tasks that your DAG can orchestrate are defined by Airflow Operators - here are a few examples.

There is a Bash operator for running a bash script, a Hive operator for working with Hive, a MySQL operator for inserting data into a database, and a Python operator for running some Python code.
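As a quick, hedged illustration (again assuming a recent Apache Airflow 2.x release; the DAG name, task names, and commands are invented for the example), a Bash operator and a Python operator might be used together like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def print_greeting():
    # Placeholder callable; a real task might load or transform data.
    print("Hello from Apache Airflow!")


with DAG(
    dag_id="operator_examples",       # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):
    # BashOperator runs a shell command or script on the worker.
    run_backup = BashOperator(
        task_id="run_backup",
        bash_command="echo 'pretend this is a backup script'",
    )

    # PythonOperator calls an ordinary Python function.
    say_hello = PythonOperator(
        task_id="say_hello",
        python_callable=print_greeting,
    )

    run_backup >> say_hello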

For the extensive list, please take a look over here: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/index.html

About the Author

William Meadows is a passionately curious human currently living in the Bay Area in California. His career has included working with lasers, teaching teenagers how to code, and creating classes about cloud technology that are taught all over the world. His dedication to completing goals and helping others is what brings meaning to his life. In his free time, he enjoys reading Reddit, playing video games, and writing books.