AWS Data Pipeline vs. AWS Glue
In this course, we will compare Amazon EMR and AWS Glue and cover ways to make ETL processes more automated and repeatable.
- What AWS Glue is and how it works
- How AWS Glue compares to Amazon EMR
- How to make ETL processes more automated and repeatable using orchestration services such as AWS Data Pipeline, AWS Glue Workflows, and AWS Step Functions
Those who are implementing and managing ETL on AWS
Those who are looking to take an AWS certification, specifically the AWS Certified Solutions Architect – Associate certification or the AWS Certified Data Analytics – Specialty certification
In this course, I will provide introductory information on AWS Glue. However, to get the most from this course, you should already have an understanding of Amazon EMR and Amazon EC2. For more information on these services, please see our existing content titled:
A few years ago, AWS released another transformation tool called Glue DataBrew. On the surface, DataBrew looks very similar to Glue Studio. So, what is Glue DataBrew?
Glue DataBrew is a true no-code service for transforming data. Here’s how it works:
You first upload your data. You can upload it directly to the service, or connect to other data sources like Amazon S3, Amazon Aurora, Amazon Redshift, the Glue Data Catalog, or other databases through JDBC connections. It can additionally connect to Amazon AppFlow, AWS Data Exchange, and Snowflake.
Once you upload your data, you can preview your data in a visual interface. From there you can choose from hundreds of built-in transformations. Some of these transformations include formatting your data, modifying columns, working with duplicate or missing values, encoding data, and more.
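To make those transformation categories a bit more concrete, here is a minimal plain-Python sketch of three of them: formatting a column, filling in missing values, and removing duplicate rows. This is only an illustration of the kind of work these steps do. It does not use DataBrew or its API, and the function names (`format_column`, `fill_missing`, `drop_duplicates`) and the sample rows are hypothetical. In DataBrew itself you would pick the equivalent steps in the visual interface rather than write code.

```python
# Illustration only: plain Python, not the DataBrew API.
# All function names and sample data here are hypothetical.
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def format_column(rows: List[Row], column: str) -> List[Row]:
    """Formatting step: trim whitespace and upper-case one column."""
    return [
        {**row, column: row[column].strip().upper() if row[column] is not None else None}
        for row in rows
    ]


def fill_missing(rows: List[Row], column: str, default: str) -> List[Row]:
    """Missing-value step: replace None in one column with a default."""
    return [
        {**row, column: default if row[column] is None else row[column]}
        for row in rows
    ]


def drop_duplicates(rows: List[Row]) -> List[Row]:
    """Duplicate step: keep the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


raw = [
    {"customer": " alice ", "country": "us"},
    {"customer": "Bob", "country": None},
    {"customer": "ALICE", "country": "us"},
]

# Apply the steps in order, much like an ordered list of transformation steps.
cleaned = format_column(raw, "customer")
cleaned = format_column(cleaned, "country")
cleaned = fill_missing(cleaned, "country", "UNKNOWN")
cleaned = drop_duplicates(cleaned)

for row in cleaned:
    print(row)
```

Notice that the order of the steps matters: the two "alice" rows only become duplicates after the formatting step has normalized them.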
Once you apply your transformation, you can store the output in Amazon S3. Note that Amazon S3 is the only place you can store your transformed data.
So if both of these services provide transformations and function in similar ways, and if Glue Studio also provides some no-code options, which service do you use?
Well, there are four main differences between the two that might help you distinguish when to use each service:
Glue DataBrew is a no-code tool. Unlike Glue Studio, you can’t write your own custom code for transformations even if you wanted to. However, DataBrew makes up for that by providing a lot more options for built-in transformations. DataBrew has more than 250 built-in transformations, while Glue Studio has around 10. These transformations are different as well. Glue Studio's built-in transformations focus mostly on ETL, while DataBrew's transformations mostly prepare data for machine learning.
These services are meant for different audiences. Glue Studio is meant for ETL engineers and is focused on ETL itself, while Glue DataBrew is mostly for business analysts and data scientists who may not have coding experience. You don’t need specialized expertise to transform data with DataBrew.
Both services provide a graphical interface for visualizing your transformations. Glue Studio, however, is the only one of the two that also lets you work with ETL programmatically, through Jupyter notebooks and shell scripts.
DataBrew has a profiling feature, which enables you to get statistics about your data. For example, with profiling, you can get information about how many rows you have in your dataset or how many unique values you have in each column. Glue Studio does not have a data profiling feature.
That’s it for this one - see you next time!
Alana Layton is an experienced technical trainer, technical content developer, and cloud engineer living out of Seattle, Washington. Her career has included teaching about AWS all over the world, creating AWS content that is fun, and working in consulting. She currently holds six AWS certifications. Outside of Cloud Academy, you can find her testing her knowledge in bar trivia, reading, or training for a marathon.