This section of the AWS Certified Solutions Architect - Professional learning path introduces the AWS management and governance services relevant to the AWS Certified Solutions Architect - Professional exam. These services are used to help you audit, monitor, and evaluate your AWS infrastructure and resources and form a core component of resilient and performant architectures.
- Understand the benefits of using Amazon CloudWatch and audit logs to manage your infrastructure
- Learn how to record and track API requests using AWS CloudTrail
- Learn what AWS Config is and its components
- Manage multi-account environments with AWS Organizations and Control Tower
- Learn how to carry out logging with CloudWatch, CloudTrail, CloudFront, and VPC Flow Logs
- Learn about AWS data transformation tools such as AWS Glue, query services such as Amazon Athena, and data visualization services such as Amazon QuickSight
- Learn how AWS CloudFormation can be used to represent your infrastructure as code (IaC)
- Understand SLAs in AWS
In this video, I’ll be comparing Amazon EMR and AWS Glue for ETL. Before we get deeper into the two services, it’s important to note that Amazon EMR has multiple deployment options. You can use EMR on EC2, EMR on EKS, or EMR Serverless. In this video, I’ll be focusing mostly on EMR on EC2, with some mentions of EMR Serverless.
With that being said, our next fight for the evening is: AWS Glue taking on Amazon EMR. Who will win?
In one corner, we have Amazon EMR, a big data platform that’s designed not only for ETL but also for machine learning and data analysis. In the other corner, we have AWS Glue, a data integration service that provides Glue Studio for ETL. It also includes a Data Catalog, Glue DataBrew for no-code transformations, and Glue Elastic Views.
Let’s look at these two services from three different perspectives.
Ease of use
Ease of use for any tool in AWS is often inversely related to control. Tools that AWS says are “easy to use” generally provide the user with less control over the service. The same is true the other way around: tools that provide a lot of control are typically more complex to use.
You can see this clearly with EMR and Glue. For example, EMR on EC2 provides maximum control over the service. You can optimize, manage, and scale your cluster and compute nodes. You can take advantage of EC2 instance types, sizes, and pricing options such as Spot, Reservations, and Savings Plans. You can install a wide range of open-source tools, such as Hive, Presto, HBase, Spark, and more to fit your use case. And you can choose how long you run your EMR cluster. It could be a long-running cluster that is available 24/7, or it could be a transient cluster that’s provisioned, runs the submitted jobs, and then terminates soon after. You have control over all of it.
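To make the transient cluster idea concrete, here is a minimal sketch of a request shaped like the arguments to boto3's EMR `run_job_flow` call. The function name, release label, instance types, and role names are assumptions chosen for illustration, and the sketch only builds the dictionary; it does not call AWS.

```python
def build_transient_cluster_request(name, core_nodes=2, task_nodes=2):
    """Build keyword arguments shaped like boto3's EMR run_job_flow request.

    All concrete values (release label, instance types, role names) are
    illustrative assumptions, not recommendations.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; use one you have validated
        # Open-source engines to install on the cluster.
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": core_nodes},
                # Task nodes hold no HDFS data, so they are a common place
                # to use Spot pricing.
                {"Name": "task", "InstanceRole": "TASK",
                 "Market": "SPOT", "InstanceType": "m5.xlarge",
                 "InstanceCount": task_nodes},
            ],
            # False makes the cluster transient: it terminates once all
            # submitted steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request("nightly-etl", core_nodes=4)
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

For a long-running cluster, the same request would set `KeepJobFlowAliveWhenNoSteps` to `True`. In a real account you would add your job steps and pass the dictionary to the EMR client, after checking each value against your own environment.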
However, the freedom of choice can make the service more complex to manage. Configuring and maintaining the engine and the cluster can be a full-time job. You may need to dedicate resources in your engineering teams to manage this underlying infrastructure.
With Glue, your choices decrease because it is serverless. You no longer get to manage the underlying EC2 instances and storage. All cluster, node, and engine maintenance disappears. From the infrastructure maintenance perspective, it is simpler. However, that also means you no longer get to choose EC2 instance types, sizes, or pricing options. And you also no longer get to choose from a range of open-source engines. Glue can only run your ETL jobs in an Apache Spark environment. The other factor is that Glue terminates as soon as your job finishes executing, so long-running clusters are not an option.
The convenience of serverless is helpful, but there is a price for this convenience. Glue is, at face value, more expensive than EMR on EC2. This is a common tradeoff in AWS, where you have to decide if the convenience of not having to configure and manage a cluster is worth it. However, before you go with the cheaper option, you have to factor in additional costs with EMR, such as what it takes to maintain a cluster. You might need to factor the cost of a cluster administrator into your total cost analysis. With all costs considered, you might even find that Glue is cheaper in the long run.
Another factor to consider is that Glue terminates as soon as the job finishes executing. With Glue, you only pay for the time it runs. If you have long-running EMR clusters, you will pay for idle time where the cluster is sitting there, not performing any work.
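The cost tradeoff above comes down to simple arithmetic, sketched here. Every number is a made-up placeholder (the hourly rates, the 10 DPUs, the 5 nodes, the administrator cost), so check current AWS pricing before drawing any conclusion; the function names are hypothetical too.

```python
def glue_cost(dpus, job_hours, rate_per_dpu_hour):
    """Glue bills only for the time the job actually runs."""
    return dpus * job_hours * rate_per_dpu_hour


def emr_cluster_cost(nodes, cluster_hours, rate_per_node_hour, admin_cost=0.0):
    """EMR on EC2 bills for every hour the cluster is up, busy or idle."""
    return nodes * cluster_hours * rate_per_node_hour + admin_cost


# Placeholder rates, NOT real AWS prices.
GLUE_RATE = 0.5       # per DPU-hour
EMR_NODE_RATE = 0.25  # per node-hour (EC2 plus EMR charge combined)

DAYS = 30
JOB_HOURS_PER_DAY = 2

glue_monthly = glue_cost(10, JOB_HOURS_PER_DAY * DAYS, GLUE_RATE)
emr_transient = emr_cluster_cost(5, JOB_HOURS_PER_DAY * DAYS, EMR_NODE_RATE)
emr_always_on = emr_cluster_cost(5, 24 * DAYS, EMR_NODE_RATE)
emr_with_admin = emr_cluster_cost(5, JOB_HOURS_PER_DAY * DAYS, EMR_NODE_RATE,
                                  admin_cost=500.0)

for label, cost in [("Glue", glue_monthly),
                    ("EMR transient", emr_transient),
                    ("EMR always on", emr_always_on),
                    ("EMR transient + admin", emr_with_admin)]:
    print(f"{label}: {cost:.2f}")
```

With these placeholder inputs, the transient EMR cluster is the cheapest on compute alone, while the always-on cluster and the cluster with administration overhead both end up costing more than Glue. Different inputs can flip the result, which is why the full cost analysis matters.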
With Glue, there are three limitations you need to be aware of:
It has a default limit on how much CPU and RAM you can use for your jobs. The biggest worker type you can use currently has 8 vCPU and 32 GB of RAM. The number of workers you can scale up to in Glue by default is 100. So if you need more performance than Glue can provide, EMR is the better choice.
Glue is a true ETL service. While it can do light machine learning analysis and can be paired with Amazon Athena for data analysis, Amazon EMR outperforms Glue for both machine learning and data analysis using engines like Presto.
Ultimately, if your workload requires any engine other than Spark, you should use EMR.
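As a rough illustration of the first limitation, this sketch multiplies out the default ceiling using the figures quoted above (8 vCPU and 32 GB per worker, 100 workers). The class and function names are hypothetical, and the check is deliberately naive: it ignores Spark overhead and any quota increase you might request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlueLimits:
    """Default limits as quoted in this lesson; they may change over time."""
    vcpu_per_worker: int = 8
    ram_gb_per_worker: int = 32
    max_workers: int = 100


def fits_in_glue(required_vcpu, required_ram_gb, limits=GlueLimits()):
    """Return True if the aggregate requirement fits under the default ceiling."""
    total_vcpu = limits.vcpu_per_worker * limits.max_workers
    total_ram_gb = limits.ram_gb_per_worker * limits.max_workers
    return required_vcpu <= total_vcpu and required_ram_gb <= total_ram_gb


print(fits_in_glue(400, 1600))   # well within the default ceiling
print(fits_in_glue(1200, 1600))  # needs more vCPU than the default allows
```

Under these assumptions the ceiling works out to 800 vCPU and 3,200 GB of RAM in total, so a job that needs more than that points toward EMR.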
EMR on EC2 vs Glue is a fairly straightforward comparison. However, the differences between EMR Serverless and Glue are a little less obvious. Both EMR Serverless and Glue require no infrastructure maintenance and are more expensive than EMR on EC2. The biggest difference is use case. Like EMR on EC2, you can use EMR Serverless for use cases beyond ETL. With Glue, while it does support light machine learning transformations, it is mostly considered an ETL tool. Because of this, Glue Studio offers more ETL tooling that EMR Serverless doesn’t natively support, such as a graphical ETL interface, built-in scheduling, and the ability to build pipelines from Glue components.
If you already use EMR and have pre-existing Spark or Hive jobs, it may be worthwhile to consider running these jobs on EMR Serverless. This lets you use a familiar tool without the burden of managing a cluster.
Ultimately, if you need flexibility with how you manage the engine or the underlying infrastructure, EMR on EC2 is best for you. Otherwise, if you need to run short-lived jobs that will run in an Apache Spark environment, Glue or EMR Serverless will save you time by managing the infrastructure for you.
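The guidance in this lesson can be condensed into a small decision helper. This is only a sketch of the rules of thumb stated above; the function name and its boolean parameters are hypothetical, and a real decision would also weigh cost, team skills, and workload size.

```python
def suggest_service(needs_infra_control=False, long_running=False,
                    needs_non_spark_engine=False, has_existing_emr_jobs=False):
    """Map the lesson's rules of thumb to a suggested service."""
    # Control over the engine or infrastructure, or a cluster that stays up,
    # points to EMR on EC2.
    if needs_infra_control or long_running:
        return "EMR on EC2"
    # Any engine other than Spark rules out Glue.
    if needs_non_spark_engine:
        return "EMR"
    # Existing Spark or Hive jobs on EMR can move to EMR Serverless.
    if has_existing_emr_jobs:
        return "EMR Serverless"
    # Short-lived Spark ETL jobs with no infrastructure to manage.
    return "Glue or EMR Serverless"


print(suggest_service(long_running=True))
print(suggest_service(needs_non_spark_engine=True))
print(suggest_service())
```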
Danny has over 20 years of IT experience as a software developer, cloud engineer, and technical trainer. After attending a conference on cloud computing in 2009, he knew he wanted to build his career around what was still a very new, emerging technology at the time — and share this transformational knowledge with others. He has spoken to IT professional audiences at local, regional, and national user groups and conferences. He has delivered in-person classroom and virtual training, interactive webinars, and authored video training courses covering many different technologies, including Amazon Web Services. He currently has six active AWS certifications, including certifications at the Professional and Specialty level.