Amazon EMR: Five Ways to Improve the Way You Use Hadoop

Amazon EMR (Elastic MapReduce) allows developers to avoid some of the burdens of setting up and administrating Hadoop tasks. Learn how to optimize it.

Apache Hadoop is an open source framework designed to distribute the storage and processing of massive data sets across virtually limitless servers. Amazon EMR (Elastic MapReduce) is a particularly popular service from Amazon that is used by developers trying to avoid the burden of set up and administration, and concentrate on working with their data.

Over the years, Amazon EMR has undergone many transformations. AWS is constantly working to improve their product, pushing new updates for EMR with every new Hadoop release. Amazon EMR now integrates with versatile Hadoop Ecosystem applications, offering an improved core architecture and an even more simplified interface.

For those who have stumbled upon this blog having little or no knowledge of Amazon EMR, but are familiar with Hadoop, here is a very quick overview:

  • EMR is Hadoop-as-a-Service from Amazon Web Services (AWS).
  • EMR supports Hadoop 2.6.0, Hive 1.0.0, Mahout 0.10.0, Pig 0.14.0, Hue 3.7.1, and Spark 1.4.1
  • The MapR distributions supported by EMR are – MapR 4.0.2 (MapR3/MapR5/MapR7 with Hadoop 2.4.0, Hive 0.13.1, and Pig 0.12.0).
  • Cost effective and integrated with other AWS Services.
  • Flexible resource utilization model.
  • No capacity planning, Hardware-on-Demand.
  • Easy to use with a flexible hourly usage model for clusters.
  • Integrated with other AWS services like S3, CloudFormation, Redshift, SQS, DynamoDB, and Cloudwatch.

The current EMR release (EMR-4.1.0) is based on the Apache Bigtop project. Describing Bigtop is well beyond the scope of this post, but you can read about it here. Instead, we are going to talk about five exciting – and unique – Amazon EMR features.

Amazon EMR Components

The current Amazon EMR release adds elements necessary to bring EMR up to date. The components are either community contributed editions or developed in-house at AWS. For example, Hadoop itself is a community edition, while the Amazon DynamoDB connector (emr-ddb-3.0.0) comes exclusively with EMR. Here is Amazon’s components guide.

You can also install recommended third-party software packages on your cluster using Bootstrap Actions. Third party libraries can be packaged directly into your Mapper or Reducer executable. Alternatively, you could upload statically compiled executables using the Hadoop distributed cache mechanism.

EMR Hadoop Nodes

Unlike standard distributions, there are three types of EMR Hadoop Nodes.

  • Master Node: Master Node runs NameNode, Resource Manager in YARN.
  • Slave Node-Core: Slave Node Core runs HDFS and Node Manager.
  • Slave Node-Task: Slave Node Task runs a Node Manager, but not HDFS.

Amazon EMR - nodes
Slave nodes in EMR are of particular interest. In EMR, Core nodes and Task nodes constitute a cluster’s slave nodes. Core nodes include Task Trackers and Data Nodes, with Data Nodes running the HDFS distributed file system. Since they store HDFS data, Data Nodes cannot be removed from a running cluster. Task Nodes, on the other hand, only act as Task Trackers and have no HDFS restrictions. Task nodes can, therefore, be scaled up and down according to the changing processing needs of a specific job. That’s how EMR supports dynamic clustering.

With HDFS out of picture within Task Slave Nodes, node failures or the addition of new nodes are far simpler to deal with, as there is no need for HDFS rebalancing.

EMR File System (EMRFS)

EMRFS is an extension of HDFS, which allows an Amazon EMR cluster to store and access data from Amazon S3. Amazon S3 is a great place to store huge data because of its low cost, durability, and availability. But one potential problem with S3 is its eventual consistency model. With eventual consistency, you might not get the updated objects as soon as they are added to your bucket. This might be a concern during certain multi-step ETL processing.

To address the issue, EMR provides something called consistent-view. By creating a DynamoDB database to track the data in S3, the consistent view provides read-after-write consistency and improved performance. The consistent view can be added to and enabled in an EMR cluster. Be aware that there is a small cost overhead for consistent view’s DynamoDB usage.

Transient and Long Running Clusters

With EMR, you can choose between running a non-committed cluster, called a transient cluster, or a long-running cluster for larger workloads. With a transient cluster, after the processing job is done, the cluster will be automatically terminated. That ensures your AWS bill properly reflects your actual use. Transient clusters are particularly suitable for periodic jobs.

A long-running cluster, on the other hand, is meant for persistent job execution. Imagine that you need to upload a huge amount of data for EMR processing. It can sometimes be inefficient to load it in smaller packages. With long-running clusters, you can query the cluster continuously or even periodically as it will be running even if there are no jobs in the queue.

Making a long-running cluster is easy. As the administrator, you will need to choose NO for auto-termination in Advanced Options -> Steps. That’s it!

Using S3Distcp to Move data between HDFS and S3

S3DistCp is an extension of the DistCp tool that lets you move large amounts of data between HDFS and S3 in a distributed manner. S3DistCp is more scalable and efficient for parallel copying large numbers of objects across buckets and between AWS accounts. S3DistCp copies data using distributed map-reduce jobs. However, the main benefit S3DistCp provides over DistCp, is by having a reducer run multiple HTTP upload threads to upload the files in parallel.

You can add S3DistCp as a step to EMR job in the AWS CLI:

aws emr add-steps --cluster-id j-1234MYCLUSTERXXXXX --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--s3Endpoint,s3-eu-west-1.amazonaws.com","--src,s3://mybucket/logs/j-j-1234MYCLUSTERXXXXX/node/","--dest,hdfs:///output","--srcPattern,.*[a-zA-Z,]+"]

Or using EMR console like this:
Amazon EMR - s3distcp
To copy files from S3 to HDFS, you can run this command in the AWS CLI:

aws emr add-steps --cluster-id j-1234MYCLUSTERXXXXX --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--src,s3://mybucket/logs/j-1234MYCLUSTERXXXXX/node/","--dest,hdfs:///output","--srcPattern,.*daemons.*-hadoop-.*"]

Conclusion

Amazon EMR allows organizations to launch Hadoop clusters and jobs almost instantaneously. With the backing of AWS’s tested infrastructure and services and their seamless integration between EMR and services like DynamoDB, Redshift, SQS, and Kinesis, users have many opportunities to explore.

Would you like to learn more? See how AOL significantly optimized its Amazon EMR infrastructure. Also, Cloud Academy offers a hands-on lab guiding you through the process of deploying S3-based data to an Amazon EMR cluster.

Do you have your own Hadoop or EMR experiences you’d like to share? Feel free to comment below.

Avatar

Written by

Chandan Patra

Cloud Computing and Big Data professional with 10 years of experience in pre-sales, architecture, design, build and troubleshooting with best engineering practices. Specialities: Cloud Computing - AWS, DevOps(Chef), Hadoop Ecosystem, Storm & Kafka, ELK Stack, NoSQL, Java, Spring, Hibernate, Web Service


Related Posts

Joe Nemer
Joe Nemer
— April 3, 2020

Breaking News: All AWS Certification Exams Now Available Online

Remote proctoring for all AWS certifications Cloud Academy is an Advanced AWS Technology Partner, and we are happy to announce all AWS certification exams are available online!  What does this mean for you? You can stay focused on your certification goal. Or you can start a certifica...

Read more
  • AWS
  • AWS certification
  • AWS Certifications
Connie Benton
Connie Benton
— April 1, 2020

How To Build a Career with AWS Certifications

From Iaas and PaaS solutions to digital marketing, cloud computing reshapes the world of technology. As the influence of this technology grows, so does investment. Tens of billions of dollars are being spent on cloud computing-related services each year. This influx is continuing to inc...

Read more
  • AWS
  • Certifications
Vijayakumar Athithan
Vijayakumar Athithan
— March 27, 2020

What is Cognito in AWS?

Web applications usually allow a valid username and password combination for successful sign in to the application. Modern authentication flows incorporate more approaches to ensure user authentication. When using AWS, this is no exception, thanks to the abilities and features offered b...

Read more
  • AWS
  • AWS Cognito
  • Solutions Architect
Avatar
Andrew Larkin
— March 20, 2020

The 12 AWS Certifications: Which is Right for You and Your Team?

As companies increasingly shift workloads to the public cloud, cloud computing has moved from a nice-to-have to a core competency in the enterprise. This shift requires a new set of skills to design, deploy, and manage applications in cloud computing. As the market leader and most ma...

Read more
  • AWS
  • AWS Certifications
Alisha Reyes
Alisha Reyes
— March 17, 2020

Cloud Academy’s Blog Digest: How Do AWS Certifications Increase Your Employability, How to Become a Microsoft Certified Azure Data Engineer, and more

With everything going on right now, it's likely that the only thing you've been reading lately is related to the coronavirus pandemic. It's important to stay informed during these times, but it's also good to jump into something that can take your mind off of the current situation for j...

Read more
  • AWS
  • Azure
  • blog digest
  • Certifications
  • Cloud Academy
  • programming
  • Security
Avatar
Cloud Academy Team
— March 13, 2020

Which Certifications Should I Get?

As we mentioned in an earlier post, the old AWS slogan, “Cloud is the new normal” is indeed a reality today. Really, cloud has been the new normal for a while now and getting credentials has become an increasingly effective way to quickly showcase your abilities to recruiters and compan...

Read more
  • AWS
  • Azure
  • Certifications
  • Cloud Computing
  • Google Cloud Platform
Alisha Reyes
Alisha Reyes
— March 7, 2020

New on Cloud Academy: Intro to GitOps; AWS Courses; Java, Python, Amazon Linux 2, Ubuntu, & Docker Playgrounds; and much more

New Lab Playgrounds This month, our Content Team released six new "playground labs." Our playground labs provide a safe and secure sandbox environment for you to explore your own ideas, follow along with Cloud Academy courses, or answer your own questions — all without having to instal...

Read more
  • AWS
  • Azure
  • gitops
  • Google Cloud Platform
  • lab playground
  • programming
Alisha Reyes
Alisha Reyes
— March 6, 2020

New on Cloud Academy: Intro to GitOps; AWS Courses; Java, Python, Amazon Linux 2, Ubuntu, & Docker Playgrounds; and much more

New Lab Playgrounds This month, our Content Team released six new "playground labs." Our playground labs provide a safe and secure sandbox environment for you to explore your own ideas, follow along with Cloud Academy courses, or answer your own questions — all without having to instal...

Read more
  • AWS
  • Azure
  • gitops
  • Google Cloud Platform
  • lab playground
  • programming
Patrick Navarro
Patrick Navarro
— March 4, 2020

AWS Certifications: How Do They Increase Your Employability and Progress Your Career?

AWS certifications are no walk in the park. They’re designed to validate in-depth, specialist knowledge and comprehensive experience, often requiring months of dedicated studying to earn even for those already working with the cloud platform. But the rewards that AWS professionals ca...

Read more
  • AWS
  • AWS certification
  • certification
Avatar
Chandan Patra
— February 21, 2020

Elasticsearch vs. CloudSearch: AWS Cloud Search Choices

Elasticsearch vs. CloudSearch: What's the main difference? Let's compare AWS-based cloud tools: Elasticsearch vs. CloudSearch. While both services use proven technologies, Elasticsearch is more popular, open source, and has a flexible API to use for customization; in comparison, CloudS...

Read more
  • AWS
  • Azure
  • cloudsearch
  • elasticsearch
Avatar
Andrew Larkin
— February 13, 2020

Cloud Academy Content Roadmap Updates

Welcome to our Q1 2020 roadmap. This is the content we plan to build over the next three months, between February 1 - and April 30, 2020. Let's look at some of our roadmap highlights. Atlassian Bamboo for CI/CD We had a lot of requests for practical guides on how to apply DevOps tool...

Read more
  • Artificial Intelligence
  • AWS
  • Azure
  • Docker
  • Google Cloud Platform
  • Kubernetes
  • Machine Learning
Alisha Reyes
Alisha Reyes
— February 7, 2020

New on Cloud Academy: Git Labs, CKA and CKAD Lab Challenges, AWS and Azure Learning Paths, AGILE, and Much More

We just kicked off our first Free Weekend of 2020. This means we've unlocked our Training Library for just 72 hours. Until Sunday at 11:59 pm (PST), you can get unlimited access to our industry-leading learning paths, courses, certification prep exams, and our most popular hands-on labs...

Read more
  • agile
  • AWS
  • Azure
  • Google Cloud Platform
  • Linux
  • OWASP
  • programming
  • red hat
  • scrum