Amazon EMR vs. AWS Glue for ETL

Difficulty: Intermediate
Duration: 28m

Description

In this course, we will compare Amazon EMR and AWS Glue and cover ways to make ETL processes more automated and repeatable.

Learning Objectives

  • What AWS Glue is and how it works 
  • How AWS Glue compares to Amazon EMR 
  • How to make ETL processes more automated and repeatable using orchestration services such as AWS Data Pipeline, AWS Glue Workflows, and AWS Step Functions

Intended Audience

  • Those who are implementing and managing ETL on AWS

  • Those who are looking to take an AWS certification, specifically the AWS Certified Solutions Architect – Associate certification or the AWS Certified Data Analytics – Specialty certification

Prerequisites 

In this course, I will provide introductory information on AWS Glue. However, to get the most from this course, you should already have an understanding of Amazon EMR and Amazon EC2. For more information on these services, please see our existing content.

Transcript

In this video, I’ll be comparing Amazon EMR vs AWS Glue for ETL. Before we get deeper into the two services, it’s important to note that Amazon EMR has multiple deployment options. You can use EMR on EC2, EMR on EKS, or EMR Serverless. In this video, I’ll be focusing mostly on EMR on EC2, with some mentions of EMR Serverless.

With that being said, our next fight for the evening is: AWS Glue taking on Amazon EMR. Who will win? 

In one corner, we have Amazon EMR, a big data platform that’s designed not only for ETL but also for machine learning and data analysis. In the other corner, we have AWS Glue, a data integration service that provides Glue Studio for ETL. It also includes a Data Catalog, Glue DataBrew for no-code transformations, and Glue Elastic Views.

Let’s look at these two services from three different perspectives:

  1. Ease of use 

  2. Pricing

  3. Limitations

Ease of use for any tool in AWS is often inversely related to control. Tools that AWS says are “easy to use” generally provide the user with less control over the service. The same is true the other way around: tools that provide a lot of control are typically more complex to use.

You can see this clearly with EMR and Glue. For example, EMR on EC2 provides maximum control over the service. You can optimize, manage, and scale your cluster and compute nodes. You can take advantage of EC2 instance types, sizes, and pricing options such as Spot, Reserved Instances, and Savings Plans. You can install a wide range of open-source tools, such as Hive, Presto, HBase, Spark, and more to fit your use case. And you can choose how long you run your EMR cluster. It could be a long-running cluster that is available 24/7, or it could be a transient cluster that’s provisioned, runs the submitted jobs, and then terminates soon after. You have control over all of it.
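
To make the transient-cluster option more concrete, here is a minimal sketch of the request you might pass to the `run_job_flow` call in boto3's EMR client. The helper function is hypothetical, and the cluster name, release label, instance types, S3 path, and role names are all placeholder assumptions. The sketch only builds the request dictionary and does not call AWS.

```python
def build_transient_cluster_request(script_s3_uri, use_spot_for_core=True):
    """Build keyword arguments for boto3's EMR run_job_flow (sketch only)."""
    core_market = "SPOT" if use_spot_for_core else "ON_DEMAND"
    return {
        "Name": "nightly-etl",            # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",     # assumed release; pick your own
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": core_market,
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # False makes the cluster transient: it terminates once its steps finish.
            # True would keep a long-running cluster alive (and billing) while idle.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [{
            "Name": "run-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed instance profile name
        "ServiceRole": "EMR_DefaultRole",      # assumed service role name
    }


request = build_transient_cluster_request("s3://example-bucket/jobs/etl.py")
# With credentials configured, this could be submitted as:
#   boto3.client("emr").run_job_flow(**request)
```

Every key in that dictionary is a decision you own, which is both the appeal and the burden of EMR on EC2.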

However, the freedom of choice can make the service more complex to manage. Configuring and maintaining the engine and the cluster can be a full-time job. You may need to dedicate resources in your engineering teams to manage this underlying infrastructure.

With Glue, your choices decrease because it is serverless. You no longer manage the underlying EC2 instances and storage, so all cluster, node, and engine maintenance disappears. From the infrastructure maintenance perspective, it is simpler. However, that also means you no longer get to choose EC2 instance types, sizes, or pricing options. You also no longer get to choose from a range of open-source engines: Glue runs your ETL jobs in an Apache Spark environment. The other factor is that Glue releases its resources as soon as your job finishes executing, so longer-running clusters are not an option.
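
For contrast, here is a rough sketch of what a Spark ETL job definition for boto3's Glue `create_job` call might look like. The helper function, job name, role name, Glue version, and script path are assumptions for illustration only, and nothing here calls AWS. Notice what is missing: there are no instance types, no purchasing options, and no list of engines to install.

```python
def build_glue_job_definition(script_s3_uri, number_of_workers=10):
    """Build keyword arguments for boto3's Glue create_job (sketch only)."""
    return {
        "Name": "nightly-etl",              # hypothetical job name
        "Role": "ExampleGlueServiceRole",   # hypothetical IAM role
        "Command": {
            "Name": "glueetl",              # Glue's Spark ETL job type
            "ScriptLocation": script_s3_uri,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",               # assumed version
        # Capacity is expressed as a worker type and a count, not EC2 instances.
        "WorkerType": "G.1X",
        "NumberOfWorkers": number_of_workers,
    }


job = build_glue_job_definition("s3://example-bucket/jobs/etl.py")
# With credentials configured, this could be registered as:
#   boto3.client("glue").create_job(**job)
```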

The convenience of serverless is helpful, but there is a price for this convenience. Glue is, at face value, more expensive than EMR on EC2. This is a common tradeoff in AWS, where you have to decide if the convenience of not having to configure and manage a cluster is worth it. However, before you go with the cheaper option, you have to factor in additional costs with EMR, such as what it takes to maintain a cluster. You might need to factor the cost of a cluster administrator into your total cost analysis. With all costs considered, you might even find that Glue is cheaper in the long run.

Another factor to consider is that Glue stops running as soon as the job finishes executing. With Glue, you only pay for the time it runs. If you have longer-running EMR clusters, you will pay for idle time where the cluster is sitting there, not performing any work.
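
Here is a small back-of-the-envelope sketch of that comparison. The rates below are made-up round numbers chosen to keep the arithmetic easy to follow, not real AWS prices, and the function names are my own. Plug in current pricing for your Region before drawing any conclusions.

```python
def glue_monthly_cost(dpus, job_hours_per_day, rate_per_dpu_hour, days=30):
    """Glue bills only while the job runs: DPUs x hours x rate."""
    return dpus * job_hours_per_day * rate_per_dpu_hour * days


def emr_always_on_monthly_cost(cluster_rate_per_hour, admin_cost_per_month=0.0, days=30):
    """A long-running cluster bills 24 hours a day, busy or idle."""
    return cluster_rate_per_hour * 24 * days + admin_cost_per_month


def emr_transient_monthly_cost(cluster_rate_per_hour, job_hours_per_day, days=30):
    """A transient cluster bills only for the hours it exists."""
    return cluster_rate_per_hour * job_hours_per_day * days


# Hypothetical workload: one 2-hour job per day.
# Hypothetical rates: 0.50 per DPU-hour, 2.00 per hour for the whole EMR cluster.
glue = glue_monthly_cost(dpus=10, job_hours_per_day=2, rate_per_dpu_hour=0.5)
emr_always_on = emr_always_on_monthly_cost(cluster_rate_per_hour=2.0)
emr_transient = emr_transient_monthly_cost(cluster_rate_per_hour=2.0, job_hours_per_day=2)

print(glue, emr_always_on, emr_transient)  # 300.0 1440.0 120.0
```

With these assumed numbers, the transient EMR cluster is cheapest at face value, Glue sits in the middle, and the always-on cluster costs the most because it pays for 22 idle hours a day. Passing a non-zero `admin_cost_per_month` shifts the picture further toward Glue.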

With Glue, there are three limitations you need to be aware of:

  1. It has default limits on how much CPU and RAM you can use for your jobs. At the time of recording, the biggest worker type you can use has 8 vCPUs and 32 GB of RAM, and the number of workers you can scale up to in Glue by default is 100. So if you need more performance than Glue can provide, EMR is the better choice. 

  2. Glue is a true ETL service. While it can do light machine learning analysis and can be paired with Amazon Athena for data analysis, Amazon EMR outperforms Glue for both machine learning and data analysis, using engines like Presto. 

  3. Ultimately, if your workload requires any engine other than Spark, you should use EMR. 
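
To put the first limitation in numbers, here is a quick sketch that multiplies out the default ceiling using the figures quoted in this video (8 vCPUs and 32 GB of RAM per worker, 100 workers). The function names are hypothetical, the figures may have changed since recording, and the sketch ignores details such as capacity set aside for the Spark driver.

```python
# Figures quoted in the transcript; treat them as assumptions, not current limits.
MAX_VCPUS_PER_WORKER = 8
MAX_RAM_GB_PER_WORKER = 32
DEFAULT_MAX_WORKERS = 100


def glue_default_ceiling():
    """Return (total vCPUs, total RAM in GB) available at the default limits."""
    return (MAX_VCPUS_PER_WORKER * DEFAULT_MAX_WORKERS,
            MAX_RAM_GB_PER_WORKER * DEFAULT_MAX_WORKERS)


def fits_default_glue_limits(required_vcpus, required_ram_gb):
    """True if the requirement fits under the default ceiling."""
    max_vcpus, max_ram_gb = glue_default_ceiling()
    return required_vcpus <= max_vcpus and required_ram_gb <= max_ram_gb


print(glue_default_ceiling())                # (800, 3200)
print(fits_default_glue_limits(500, 2000))   # True
print(fits_default_glue_limits(1000, 2000))  # False
```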

EMR on EC2 vs Glue is a fairly straightforward comparison. However, the differences between EMR Serverless and Glue are a little less obvious. Both EMR Serverless and Glue require no infrastructure maintenance and are more expensive than EMR on EC2. The biggest difference is use case. As with EMR on EC2, you can use EMR Serverless for use cases beyond ETL. Glue, while it does support light machine learning transformations, is mostly considered an ETL tool. Because of this, Glue Studio offers more ETL tooling that EMR Serverless doesn’t natively support, such as a graphical ETL interface, built-in scheduling, and the ability to build pipelines from Glue components. 

If you already use EMR and have pre-existing Spark or Hive jobs, it may be worthwhile to consider running these jobs on EMR Serverless. This lets you use a familiar tool without the overhead of cluster management. 

Ultimately, if you need flexibility with how you manage the engine or the underlying infrastructure, EMR on EC2 is best for you. Otherwise, if you need to run short-lived jobs that will run in an Apache Spark environment, Glue or EMR Serverless will save you time by managing the infrastructure for you.
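
If it helps, the guidance from this video can be condensed into a small decision helper. This is only a sketch of the rules of thumb discussed here, with a hypothetical function name and parameters, and real workloads will have more factors to weigh.

```python
def recommend_service(engine="spark",
                      needs_infrastructure_control=False,
                      needs_long_running_cluster=False,
                      has_existing_emr_jobs=False):
    """Map the video's rules of thumb to a suggestion (illustrative only)."""
    # Control over instances, engines, or cluster lifetime points to EMR on EC2.
    if needs_infrastructure_control or needs_long_running_cluster:
        return "EMR on EC2"
    # Any engine other than Spark rules Glue out.
    if engine.lower() != "spark":
        return "Amazon EMR"
    # Existing EMR jobs can move to EMR Serverless with a familiar tool.
    if has_existing_emr_jobs:
        return "EMR Serverless"
    # Short-lived Spark jobs with no infrastructure to manage.
    return "AWS Glue or EMR Serverless"


print(recommend_service(needs_long_running_cluster=True))  # EMR on EC2
print(recommend_service(engine="presto"))                  # Amazon EMR
print(recommend_service(has_existing_emr_jobs=True))       # EMR Serverless
print(recommend_service())                                 # AWS Glue or EMR Serverless
```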

About the Author

Alana Layton is an experienced technical trainer, technical content developer, and cloud engineer living out of Seattle, Washington. Her career has included teaching about AWS all over the world, creating AWS content that is fun, and working in consulting. She currently holds six AWS certifications. Outside of Cloud Academy, you can find her testing her knowledge in bar trivia, reading, or training for a marathon.