AWS Data Pipeline vs. AWS Glue
In this course, we will compare Amazon EMR and AWS Glue and cover ways to make ETL processes more automated and repeatable.
- What AWS Glue is and how it works
- How AWS Glue compares to Amazon EMR
- How to make ETL processes more automated and repeatable using orchestration services such as AWS Data Pipeline, AWS Glue Workflows, and AWS Step Functions
Those who are implementing and managing ETL on AWS
Those who are looking to take an AWS certification — specifically the AWS Certified Solutions Architect – Associate Certification or the AWS Certified Data Analytics - Specialty Certification
In this course, I will provide introductory information on AWS Glue. However, to get the most from this course, you should already have an understanding of Amazon EMR and Amazon EC2. For more information on these services, please see our existing content titled:
AWS Glue historically was only an ETL service. Since then, the service has turned into a suite of data integration tools. Now, AWS Glue is made up of four different services:
Glue Data Catalog
Glue DataBrew, and
Glue Elastic Views. Glue Elastic Views is out of scope for this content, so I won’t be talking about it in this lecture. If you’re interested in Glue Elastic Views, I will link a course specifically for that topic.
In this lecture, I’ll mainly focus on the Glue Data Catalog aspect of this service.
AWS defines the Glue Data Catalog as a central metadata repository. This means that it stores data about your data. This includes information like data format, data location, and schema. Here’s how it works:
You upload your data to storage like Amazon S3, or a database like Amazon DynamoDB, Amazon Redshift, or Amazon RDS. From there, you can use a Glue Crawler to connect to your data source, parse through your data, and then infer the column name and data type for all of your data. The Crawler does this by using Classifiers, which actually read the data from your storage. You can use built-in Classifiers or custom Classifiers you write to identify your schema.
Once it infers the schema, it will create a new catalog table with information about the schema, the metadata, and where the source data is stored. You can have many tables filled with schema data from multiple sources. These tables are housed in what’s called a database.
Note, that your data still lives in the location where you originally uploaded it, but now you also have a representation of the schema and metadata for that data in the catalog tables. This means your code doesn’t necessarily need to know where the data is stored and can reference the Data Catalog for this information instead.
That’s it for this one. See you soon!
Alana Layton is an experienced technical trainer, technical content developer, and cloud engineer living out of Seattle, Washington. Her career has included teaching about AWS all over the world, creating AWS content that is fun, and working in consulting. She currently holds six AWS certifications. Outside of Cloud Academy, you can find her testing her knowledge in bar trivia, reading, or training for a marathon.