If you have data that needs to be subjected to analytics, then you will likely need to put that data through an extract, transform and load (ETL) process. AWS Glue is a fully managed service designed to do just this. Through a series of simple configurable options, you can select your source data to be processed by AWS Glue, allowing you to turn it into cataloged, searchable and queryable data. This course will take you through the fundamentals of AWS Glue to get you started with this service.
The objectives of this course are to provide you with an understanding of:
- Serverless ETL
- The knowledge and architecture of a typical ETL project
- The prerequisite setup of AWS parts to use AWS Glue for ETL
- Knowledge of how to use AWS Glue to perform serverless ETL
- How to edit ETL processes created from AWS Glue
This course is ideal for:
- Data warehouse engineers that are looking to learn more about serverless ETL and AWS Glue
- Developers that want to learn more about ETL work using AWS Glue
- Developer leads that want to learn more about the serverless ETL process
- Project managers and owners that want to learn about data preparation
As a prerequisite to this course you should have familiarity with:
- One or more of the data storage destinations offered by AWS
- Data warehousing principles
- Serverless computing
- Object-oriented programming (Python)
We welcome all feedback and suggestions - please contact us at firstname.lastname@example.org if you are unsure about where to start or if you would like help getting started.
Hello and welcome to this lecture where I shall provide an overview of AWS Glue. AWS Glue is a fully managed, serverless ETL service and tool that makes it simple and cost effective to categorize your data, clean it, enrich it and move it reliably between various data stores. For developers who have used ETL tools before to move data from source to destination, the AWS Glue UI will look familiar, as it guides you through tasks such as creating a connection and selecting and configuring data sources. The differences are mostly benefits, as you'll see.
The following diagram shows the initial parts of storing metadata, which is the first step before creating an AWS Glue ETL job.
The crawler. AWS Glue crawlers connect to data stores and work through a list of classifiers that help determine the schema of your data, then create metadata for your AWS Glue Data Catalog. While the crawler will discover table schemas, it does not discover relationships between tables. The metadata is stored in the data catalog and used to help the authoring process for your ETL jobs. You can run crawlers on a schedule chosen from a menu of options, on demand, or with a custom cron expression like those used on Linux-based operating systems, or trigger them based on an event such as the delivery of a new data file. By running the crawler, the metadata stored in your data catalog tables is updated with changes such as new columns in the data source.
You can create an AWS Glue crawler in its simplest form by following the wizard and performing the following steps. First, name your crawler. Then choose a data store and include a path to it; here you might also add exclude patterns. Optionally, add another data store. Select an IAM role or create a new one. Create the schedule for this crawler. Finally, configure the crawler's output; in this step, you must add or select an existing database to contain the tables created by this crawler.
Finally, there are other configuration options which are important to highlight, and I'll cover these in the demo following this lecture. You add a crawler within your data catalog to traverse your data stores. The output of the crawler consists of one or more metadata tables that are defined in your data catalog. Note that your crawler uses an AWS Identity and Access Management (IAM) role for permission to access your data stores and the data catalog.
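
The wizard steps above map fairly directly onto the arguments of the CreateCrawler API. The following is a minimal sketch, assuming the boto3 SDK is used: it builds the request as a plain dictionary so it can be inspected without AWS access, and only the optional `create_crawler` helper talks to AWS. The crawler name, role name, database name, bucket path and exclude pattern are all hypothetical placeholders.

```python
from typing import Any, Dict, List, Optional


def build_crawler_request(
    name: str,
    role: str,
    database_name: str,
    s3_path: str,
    exclusions: Optional[List[str]] = None,
    schedule: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for the Glue CreateCrawler API."""
    request: Dict[str, Any] = {
        "Name": name,                    # step 1: name the crawler
        "Role": role,                    # IAM role the crawler assumes
        "DatabaseName": database_name,   # catalog database for the output tables
        "Targets": {
            # data store path plus optional exclude patterns
            "S3Targets": [{"Path": s3_path, "Exclusions": list(exclusions or [])}]
        },
    }
    if schedule is not None:
        # Glue schedules use a six-field cron expression wrapped in cron(...)
        request["Schedule"] = schedule
    return request


def create_crawler(request: Dict[str, Any]) -> None:
    """Send the request to AWS; requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("glue").create_crawler(**request)


# Hypothetical values, for illustration only.
request = build_crawler_request(
    name="my-crawler",
    role="MyGlueCrawlerRole",
    database_name="my_catalog_db",
    s3_path="s3://my-example-bucket/raw/",
    exclusions=["**.tmp"],
    schedule="cron(0 2 * * ? *)",  # daily at 02:00 UTC
)
```

Leaving `schedule` unset produces a crawler that only runs on demand, which is often the easier starting point while the exclude patterns are still being tuned.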
Classifiers. A classifier reads the data in a data store and generates output that includes a string indicating the file's classification or format (for example, JSON) and the schema of the file. AWS Glue provides built-in classifiers for various formats including JSON, CSV, web logs and many database systems. The CSV classifier checks for the following delimiters: comma, pipe, tab and semicolon. You can use include or exclude patterns to manage what the crawler will search for. For example, you can exclude all objects that end with a CSV file extension or exclude specific folders in your S3 bucket. Regular expressions can also be used to exclude patterns. For custom classifiers, you define the logic for creating the schema based on the type of classifier. You might need to define a custom classifier if your data doesn't match any built-in classifiers or if you want to customize the tables that are created by the crawler.
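
To make the idea of checking a file against the four candidate delimiters concrete, here is a small heuristic in plain Python. It is only an illustration and not the logic AWS Glue actually uses: the function name and the "same field count on every line" rule are assumptions, and quoted fields are deliberately ignored.

```python
from typing import Optional, Sequence

# The four delimiters named in the lecture: comma, pipe, tab and semicolon.
CANDIDATE_DELIMITERS = (",", "|", "\t", ";")


def guess_delimiter(lines: Sequence[str]) -> Optional[str]:
    """Pick the candidate that splits every non-blank line into the same
    number of fields (more than one); prefer the one yielding most fields."""
    rows = [line for line in lines if line.strip()]
    if not rows:
        return None
    best, best_fields = None, 1
    for delimiter in CANDIDATE_DELIMITERS:
        counts = {len(row.split(delimiter)) for row in rows}
        if len(counts) == 1:  # every row agrees on the field count
            fields = counts.pop()
            if fields > best_fields:
                best, best_fields = delimiter, fields
    return best


sample = ["id|name|city", "1|Ada|London", "2|Alan|Wilmslow"]
print(repr(guess_delimiter(sample)))
```

When no candidate splits the sample consistently, the function returns `None`, which is roughly the situation in which a custom classifier becomes worth considering.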
Connections. A connection contains the properties needed to connect to your data. Connections are used by crawlers and jobs in AWS Glue to access certain types of data stores. AWS Glue can connect to the following data stores by using the JDBC protocol: Amazon Redshift and Amazon RDS, including Amazon Aurora, MariaDB, Microsoft SQL Server, MySQL, Oracle and PostgreSQL. The crawler infers the schema from the data store you select when creating it and, consequently, determines the metadata that is collected and stored in the data catalog.
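
As a rough picture of what "the properties needed to connect to your data" looks like for a JDBC store, the sketch below assembles the `ConnectionInput` structure that the boto3 `create_connection` call accepts. The connection name, host, database and environment variable name are hypothetical, and network settings such as subnets and security groups are left out for brevity.

```python
import os
from typing import Any, Dict


def build_jdbc_connection_input(
    name: str, jdbc_url: str, username: str, password: str
) -> Dict[str, Any]:
    """Assemble the ConnectionInput argument for Glue's CreateConnection API."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
        },
    }


# Hypothetical values; the password is read from the environment rather
# than being hard-coded in the script.
connection_input = build_jdbc_connection_input(
    name="my-postgres-connection",
    jdbc_url="jdbc:postgresql://db.example.internal:5432/sales",
    username="etl_user",
    password=os.environ.get("EXAMPLE_DB_PASSWORD", ""),
)

# With boto3 and credentials available, the connection could be created with:
#   boto3.client("glue").create_connection(ConnectionInput=connection_input)
```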
The data catalog. The AWS Glue Data Catalog is populated when you run your crawler. It is a persistent metadata store for data assets and contains table definitions, job definitions and other control information to help you manage your AWS Glue environment. A schema version history for target data stores is kept so you can view how your data has changed over time.
AWS Glue automatically generates code for the extract, transform and load steps of an ETL job when you complete the UI guide for creating a job. The code is generated in your choice of Scala or Python and is written for Apache Spark. Development endpoints are provided for you to edit, debug and test the code that AWS Glue generates for your ETL job. You can use your preferred IDE to write custom readers, writers or transformations to include in your AWS Glue ETL jobs as custom libraries.
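
The transform step of a generated script typically describes column renames and type casts as a list of four-field tuples: source column, source type, target column, target type. The sketch below imitates that idea in plain Python on a list of dictionaries so it can run anywhere. It is not the Glue library and not the generated code itself; the function name, the small set of supported types and the sample columns are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# (source column, source type, target column, target type)
Mapping = Tuple[str, str, str, str]

# A deliberately small set of casts for this illustration.
CASTS: Dict[str, Callable[[Any], Any]] = {
    "string": str,
    "int": int,
    "long": int,
    "double": float,
}


def apply_mapping(
    records: Iterable[Dict[str, Any]], mappings: List[Mapping]
) -> List[Dict[str, Any]]:
    """Rename and cast the mapped columns; unmapped columns are dropped."""
    output = []
    for record in records:
        row: Dict[str, Any] = {}
        for source, _source_type, target, target_type in mappings:
            value = record.get(source)
            # Missing or null source values stay null in the output.
            row[target] = None if value is None else CASTS[target_type](value)
        output.append(row)
    return output


# Hypothetical source rows and mappings.
rows = [{"id": "1", "fare": "12.50", "note": "cash"}]
mappings = [
    ("id", "string", "trip_id", "long"),
    ("fare", "string", "fare_amount", "double"),
]
print(apply_mapping(rows, mappings))
```

Editing a generated script often comes down to adjusting a mapping list like this one, for example when a new column appears in the source.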
AWS Glue jobs can be scheduled to run, run on demand, or triggered to run by an event. You can start multiple jobs to run in parallel or specify dependencies across jobs to build more complex ETL pipelines. Logs and notifications are pushed to Amazon CloudWatch so you can monitor, be alerted about and troubleshoot jobs that have run. Here's an example of logs sent to CloudWatch from having run a crawler that I called MyCrawler.
Let's now look at a diagram for how to create an AWS Glue job. First, you choose a data source for your job. The tables that represent your data source must already be defined in your data catalog. If the source requires a connection, the connection is also referenced in your job. Remember that a connection contains the properties needed to connect to your data. If your job requires multiple data sources, you can add them later by editing the script. The script, which is auto-generated, is available immediately on completing the job creation wizard. Next, you choose a data target for your job. The tables that represent the data target can be defined in your data catalog, or your job can create the target tables when it runs. You choose the target location when you author the job, and if the target requires a connection, the connection is also referenced in your job. If your job requires multiple data targets, you can add them later, again by editing the script. Next, you customize the job processing environment by providing arguments for your job and generated script. We will add a job in the demo in the next lecture. Initially, AWS Glue generates a script, but you can also edit the script to add sources, targets and transforms. You specify how your job is invoked: on demand, by a time-based schedule, or by an event. For more information, see the demo next, which includes creating and applying triggers to jobs in AWS Glue.
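
The invocation options and job dependencies described above can be expressed as trigger definitions. The sketch below, which assumes the boto3 SDK, builds the arguments for the `create_trigger` call for two cases: a time-based schedule, and a conditional trigger that starts one job after another succeeds. The trigger and job names are hypothetical.

```python
from typing import Any, Dict


def build_scheduled_trigger(
    name: str, job_name: str, cron_expression: str
) -> Dict[str, Any]:
    """Arguments for a trigger that starts a job on a time-based schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def build_conditional_trigger(
    name: str, upstream_job: str, downstream_job: str
) -> Dict[str, Any]:
    """Arguments for a trigger that starts downstream_job once
    upstream_job has finished successfully."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


# Hypothetical pipeline: load nightly, then aggregate once the load succeeds.
nightly = build_scheduled_trigger("nightly-load", "load-orders", "cron(0 3 * * ? *)")
chained = build_conditional_trigger("after-load", "load-orders", "aggregate-orders")

# With boto3 and credentials available:
#   glue = boto3.client("glue")
#   glue.create_trigger(**nightly)
#   glue.create_trigger(**chained)
#   glue.start_job_run(JobName="load-orders")  # on-demand run
```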
Based on your input, AWS Glue generates a PySpark or Scala script. You can tailor the script based on your business needs. You'll be blocked if you don't have access to the data stores, so you must use one of the following types of identities with the correct permissions: an IAM user or an IAM role. In addition to a user name and password, you can also generate access keys for each user. As a best practice, do not use the root user to access data sources; instead, limit access to only those data sources that are needed and maintain some governance over sensitive data.
So what benefits does AWS Glue offer? Serverless: you pay for resources only while AWS Glue is running. Crawlers detect and infer schemas from data sources with very little configuration, and crawling can be scheduled and can also trigger job runs to perform ETL. Auto code generation gives you what you need in Python or Scala code to run a simple job or to extend with your own code additions. The code integrates with other toolchains via custom endpoints.
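
Returning to the point about identities and limiting access to only the data sources that are needed: one common way to do this is an IAM role that the Glue service can assume, with a policy scoped to a single bucket prefix. The sketch below builds both policy documents as plain dictionaries. The bucket and prefix are hypothetical, and a real role would usually also need Glue's own permissions, for example through the `AWSGlueServiceRole` managed policy.

```python
import json
from typing import Any, Dict

GLUE_SERVICE_PRINCIPAL = "glue.amazonaws.com"


def build_trust_policy() -> Dict[str, Any]:
    """Trust policy that lets the AWS Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": GLUE_SERVICE_PRINCIPAL},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_s3_read_policy(bucket: str, prefix: str) -> Dict[str, Any]:
    """Read-only access limited to one prefix of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
        ],
    }


# Hypothetical bucket and prefix.
trust_policy = build_trust_policy()
read_policy = build_s3_read_policy("my-example-bucket", "raw/")
print(json.dumps(read_policy, indent=2))
```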
With AWS Glue, you pay an hourly rate, billed by the second, for crawlers (which discover data) and for ETL jobs (which process and load data). For the AWS Glue Data Catalog, you pay a simple monthly fee for storing and accessing the metadata. The first million objects stored are free and the first million accesses are free. If you provision a development endpoint to interactively develop your ETL code, you pay an hourly rate billed per second. Each data processing unit (DPU) hour costs $0.44 in us-east-1 (US East, N. Virginia), and a single DPU provides four virtual CPUs and 16 GB of memory.
So when would you choose not to use AWS Glue? Like much of technology, the how of performing ETL is opinionated. There are many other methods for ETL, on premises and in the cloud, that meet the needs of developers and organizations. Your company or team may already have made significant investments in on-premises or other technology for ETL pipelines; most ETL tools are still on premises, and their pipelines are created visually with a UI. In AWS Glue, languages are limited to Python and Scala, and jobs must be edited when schemas are updated. Your team may also have an investment in Java or some other language in which you already do that development. AWS Glue is still a relatively young product, and there are currently no third-party out-of-the-box connectors, such as Salesforce and others, that you can expect in a more mature ETL tool set.
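
As a worked example of the per-second billing described above, the following sketch estimates the cost of a single run from the number of DPUs and the run time, using the $0.44 per DPU-hour figure quoted in this lecture. The function name is hypothetical, and the `minimum_seconds` parameter is an assumption: minimum billing durations can apply, so check the current pricing page before relying on the numbers.

```python
DPU_HOUR_RATE_USD = 0.44  # rate quoted in the lecture for us-east-1


def run_cost_usd(
    dpus: int,
    seconds: float,
    rate: float = DPU_HOUR_RATE_USD,
    minimum_seconds: float = 0.0,
) -> float:
    """Estimate the cost of one crawler or job run billed per second.

    minimum_seconds is an assumed knob for any minimum billing duration;
    the default of 0 means no minimum is applied.
    """
    billed_seconds = max(seconds, minimum_seconds)
    return round(dpus * billed_seconds / 3600 * rate, 4)


# 10 DPUs running for 15 minutes: 10 * 0.25 h * $0.44 = $1.10
print(run_cost_usd(dpus=10, seconds=15 * 60))
```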
About the Author
Move a metric, change products or behaviors... with data: that is what excites me. I am passionate about data and have worked to architect and develop data solutions using cloud and on-premise ETL and visualization tools. I am an evangelist for self-service data transformation, insights, and analytics. I love to be agile.
I extend my understanding to the community by giving presentations at Big Data Conferences, Code Camp, and other venues. I also write useful content in the form of white papers, two books on business intelligence, and blog posts.