Amazon EMR vs. AWS Glue for ETL

This section provides detail on the AWS management services relevant to the Solutions Architect Associate exam. These services help you audit, monitor, and evaluate your AWS infrastructure and resources. They form a core component of running resilient and performant architectures.

Learning Objectives

  • Understand the benefits of using Amazon CloudWatch and audit logs to manage your infrastructure
  • Learn how to record and track API requests using AWS CloudTrail
  • Learn what AWS Config is and its components
  • Manage your accounts with AWS Organizations, including single sign-on with AWS SSO
  • Learn how to carry out logging with CloudWatch, CloudTrail, CloudFront, and VPC Flow Logs
  • Understand how to design cost-optimized architectures in AWS
  • Learn about AWS data transformation tools such as AWS Glue, and query and visualization services like Amazon Athena and Amazon QuickSight

In this video, I’ll be comparing Amazon EMR vs AWS Glue for ETL. Before we get deeper into the two services, it’s important to note that Amazon EMR has multiple deployment options. You can use EMR on EC2, EMR on EKS, or EMR Serverless. In this video, I’ll be focusing mostly on EMR on EC2, with some mentions of EMR Serverless.

With that being said, our next fight for the evening is: AWS Glue taking on Amazon EMR. Who will win? 

In one corner, we have Amazon EMR, a big data platform that’s designed not only for ETL but also for machine learning and data analysis. In the other corner, we have AWS Glue, a data integration service that provides Glue Studio for ETL. It also includes a Data Catalog, Glue DataBrew for no-code transformations, and Glue Elastic Views.

Let’s look at these two services from three different perspectives:

  1. Ease of use 

  2. Pricing

  3. Limitations

Ease of use for any tool in AWS is often inversely related to control. Tools that AWS says are “easy to use” generally provide the user with less control over the service. The reverse is also true: tools that provide a lot of control are typically more complex to use.

You can see this clearly with EMR and Glue. For example, EMR on EC2 provides maximum control over the service. You can optimize, manage, and scale your cluster and compute nodes. You can take advantage of EC2 instance types, sizes, and pricing options such as Spot Instances, Reserved Instances, and Savings Plans. You can install a wide range of open-source tools, such as Hive, Presto, HBase, Spark, and more to fit your use case. And you can choose how long you run your EMR cluster. It could be a long-running cluster that is available 24/7, or it could be a transient cluster that’s provisioned, runs the submitted jobs, and then terminates soon after. You have control over all of it.
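To make the "transient cluster" idea concrete, here is a minimal sketch of how the request for such a cluster might be assembled in Python. It only builds the keyword arguments for boto3's EMR `run_job_flow` call and does not contact AWS. The function name, cluster name, instance type, release label, and role names are assumptions for illustration and are not taken from the video.

```python
def build_transient_cluster_request(name, core_count, use_spot=True):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All concrete values (release label, instance types, role names) are
    illustrative assumptions; replace them with ones valid in your account.
    """
    core_market = "SPOT" if use_spot else "ON_DEMAND"
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release label
        # Open-source engines to install on the cluster.
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "primary",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "Market": core_market,  # Spot pricing is one of the EC2 options
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": core_count,
                },
            ],
            # False makes the cluster transient: it terminates once its steps finish.
            # True would keep a long-running cluster alive between jobs.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_transient_cluster_request("nightly-etl", core_count=4)
    print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
    # With AWS credentials configured, the request could be submitted with:
    #   import boto3
    #   boto3.client("emr").run_job_flow(**request, Steps=[...])
```

The job steps themselves are left out here. In practice you would pass them through the `Steps` parameter so the cluster has work to run before it terminates.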

However, the freedom of choice can make the service more complex to manage. Configuring and maintaining the engine and the cluster can be a full-time job. You may need to dedicate resources in your engineering teams to manage this underlying infrastructure.

With Glue, your choices decrease because it is serverless. You no longer get to manage the underlying EC2 instances and storage. All cluster, node, and engine maintenance disappears. From the infrastructure maintenance perspective, it is simpler. However, that also means you no longer get to choose EC2 instance types, sizes, or pricing options. And you also no longer get to choose from a range of open-source engines. Glue can only run your ETL jobs in an Apache Spark environment. The other factor is that Glue terminates as soon as your job finishes executing, so support for longer-running clusters is not possible.

The convenience of serverless is helpful, but there is a price for this convenience. Glue is, at face value, more expensive than EMR on EC2. This is a common tradeoff in AWS, where you have to decide if the convenience of not having to configure and manage a cluster is worth it. However, before you go with the cheaper option, you have to factor in additional costs with EMR, such as what it takes to maintain a cluster. You might need to include the cost of a cluster administrator in your total cost analysis. With all costs considered, you might even find that Glue is cheaper in the long run.

Another factor to consider is that Glue terminates as soon as the job finishes. With Glue, you only pay for the time it runs. If you have long-running EMR clusters, you will pay for idle time when the cluster is sitting there, not performing any work.
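The cost tradeoff described above can be sketched as simple arithmetic. The functions below are a rough model, not an AWS pricing calculator. Every rate and the administrator cost in the example are hypothetical placeholders chosen only to show how idle time and maintenance change the comparison.

```python
def glue_job_cost(workers, runtime_hours, price_per_worker_hour):
    """Glue-style billing: pay only while the job is running."""
    return workers * runtime_hours * price_per_worker_hour


def emr_cluster_cost(nodes, uptime_hours, price_per_node_hour, admin_cost=0.0):
    """EMR on EC2-style billing: pay for every hour the cluster is up,
    busy or idle, plus any maintenance cost you attribute to it."""
    return nodes * uptime_hours * price_per_node_hour + admin_cost


if __name__ == "__main__":
    # Hypothetical daily workload: 10 workers/nodes, 2 hours of actual ETL work.
    glue = glue_job_cost(workers=10, runtime_hours=2, price_per_worker_hour=0.44)

    # Transient EMR cluster: up only for the 2 hours of work.
    emr_transient = emr_cluster_cost(nodes=10, uptime_hours=2, price_per_node_hour=0.25)

    # Long-running EMR cluster: up 24 hours, so 22 hours are paid idle time.
    emr_always_on = emr_cluster_cost(nodes=10, uptime_hours=24, price_per_node_hour=0.25)

    # Same transient cluster with a hypothetical daily share of an administrator's time.
    emr_with_admin = emr_cluster_cost(10, 2, 0.25, admin_cost=20.0)

    for label, cost in [
        ("Glue", glue),
        ("EMR transient", emr_transient),
        ("EMR always on", emr_always_on),
        ("EMR transient + admin", emr_with_admin),
    ]:
        print(f"{label}: ${cost:.2f} per day")
```

With these made-up numbers, the transient EMR cluster is cheapest at face value, while idle time or administrator cost pushes EMR above Glue. Your own rates and workload will decide which way the comparison goes.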

With Glue, there are three limitations you need to be aware of:

  1. Glue has default limits on how much CPU and RAM your jobs can use. The biggest worker type you can use currently has 8 vCPU and 32 GB of RAM, and by default you can scale up to 100 workers. So if you need more performance than Glue can provide, EMR is the better choice.

  2. Glue is a true ETL service. While it can do light machine learning analysis and can be paired with Amazon Athena for data analysis, Amazon EMR outperforms Glue for both machine learning and data analysis using engines like Presto.

  3. Ultimately, if your workload requires any engine other than Spark, you should use EMR.
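The first limitation lends itself to a quick sizing check. The sketch below uses the figures quoted in the list above (8 vCPU and 32 GB of RAM per worker, 100 workers by default) and treats them as constants. The function names are hypothetical, and the estimate is deliberately rough: real Spark jobs rarely map this neatly onto raw vCPU and memory totals.

```python
import math

# Figures quoted in the list above; treat them as defaults that may change.
GLUE_WORKER_VCPU = 8
GLUE_WORKER_RAM_GB = 32
GLUE_DEFAULT_MAX_WORKERS = 100


def workers_needed(required_vcpu, required_ram_gb):
    """Smallest number of largest-size workers covering both CPU and RAM."""
    by_cpu = math.ceil(required_vcpu / GLUE_WORKER_VCPU)
    by_ram = math.ceil(required_ram_gb / GLUE_WORKER_RAM_GB)
    return max(by_cpu, by_ram)


def fits_in_glue_defaults(required_vcpu, required_ram_gb):
    """True if the job fits within the default worker limit."""
    return workers_needed(required_vcpu, required_ram_gb) <= GLUE_DEFAULT_MAX_WORKERS


if __name__ == "__main__":
    # A job needing 64 vCPU and 128 GB fits comfortably.
    print(workers_needed(64, 128), fits_in_glue_defaults(64, 128))
    # A job needing 1,000 vCPU exceeds 100 workers x 8 vCPU = 800 vCPU.
    print(workers_needed(1000, 1000), fits_in_glue_defaults(1000, 1000))
```

Under these defaults, the ceiling works out to 800 vCPU and 3,200 GB of RAM in total. Anything above that points toward EMR.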

EMR on EC2 vs Glue is a fairly straightforward comparison. However, the differences between EMR Serverless and Glue are a little less obvious. Both EMR Serverless and Glue require no infrastructure maintenance and are more expensive than EMR on EC2. The biggest difference is use case. Like EMR on EC2, you can use EMR Serverless for use cases beyond ETL. Glue, while it does support light machine learning transformations, is mostly considered an ETL tool. Because of this, Glue Studio offers more ETL tooling that EMR Serverless doesn’t natively support, such as a graphical ETL interface, built-in scheduling, and the ability to build pipelines from Glue components.

If you already use EMR and have pre-existing Spark or Hive jobs, it may be worthwhile to consider running these jobs on EMR Serverless. This lets you use a familiar tool without the overhead of cluster management.

Ultimately, if you need flexibility with how you manage the engine or the underlying infrastructure, EMR on EC2 is best for you. Otherwise, if you need to run short-lived jobs that will run in an Apache Spark environment, Glue or EMR Serverless will save you time by managing the infrastructure for you.
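The guidance in this video can be summarized as a small decision helper. This is only a sketch of the reasoning above, not an official selection rule. The function name, its parameters, and the order of the checks are assumptions made for illustration.

```python
def suggest_etl_service(
    engine="spark",
    needs_infrastructure_control=False,
    long_running_cluster=False,
    has_existing_emr_jobs=False,
):
    """Return a suggestion that follows the comparison in this video."""
    engine = engine.lower()

    # Control over the engine or infrastructure, or a cluster that stays up,
    # points to EMR on EC2.
    if needs_infrastructure_control or long_running_cluster:
        return "EMR on EC2"

    # Existing Spark or Hive jobs can move to EMR Serverless without
    # taking on cluster management.
    if has_existing_emr_jobs and engine in ("spark", "hive"):
        return "EMR Serverless"

    # Any engine other than Spark means EMR rather than Glue.
    if engine != "spark":
        return "Amazon EMR"

    # Short-lived Spark jobs with no infrastructure requirements.
    return "AWS Glue or EMR Serverless"


if __name__ == "__main__":
    print(suggest_etl_service(engine="presto"))
    print(suggest_etl_service(engine="spark", has_existing_emr_jobs=True))
    print(suggest_etl_service(engine="spark"))
```

A real decision would also weigh pricing, job size, and the ETL tooling you need, so treat the output as a starting point for discussion.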

About the Author

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016, Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for sharing his knowledge of cloud services with the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.