Amazon EMR: Five Ways to Improve the Way You Use Hadoop

Amazon EMR (Elastic MapReduce) allows developers to avoid some of the burdens of setting up and administrating Hadoop tasks. Learn how to optimize it.

Apache Hadoop is an open source framework designed to distribute the storage and processing of massive data sets across virtually limitless servers. Amazon EMR (Elastic MapReduce) is a particularly popular service from Amazon that is used by developers trying to avoid the burden of set up and administration, and concentrate on working with their data.

Over the years, Amazon EMR has undergone many transformations. AWS is constantly working to improve their product, pushing new updates for EMR with every new Hadoop release. Amazon EMR now integrates with versatile Hadoop Ecosystem applications, offering an improved core architecture and an even more simplified interface.

For those who have stumbled upon this blog having little or no knowledge of Amazon EMR, but are familiar with Hadoop, here is a very quick overview:

  • EMR is Hadoop-as-a-Service from Amazon Web Services (AWS).
  • EMR supports Hadoop 2.6.0, Hive 1.0.0, Mahout 0.10.0, Pig 0.14.0, Hue 3.7.1, and Spark 1.4.1
  • The MapR distributions supported by EMR are – MapR 4.0.2 (MapR3/MapR5/MapR7 with Hadoop 2.4.0, Hive 0.13.1, and Pig 0.12.0).
  • Cost effective and integrated with other AWS Services.
  • Flexible resource utilization model.
  • No capacity planning, Hardware-on-Demand.
  • Easy to use with a flexible hourly usage model for clusters.
  • Integrated with other AWS services like S3, CloudFormation, Redshift, SQS, DynamoDB, and Cloudwatch.

The current EMR release (EMR-4.1.0) is based on the Apache Bigtop project. Describing Bigtop is well beyond the scope of this post, but you can read about it here. Instead, we are going to talk about five exciting – and unique – Amazon EMR features.

Amazon EMR Components

The current Amazon EMR release adds elements necessary to bring EMR up to date. The components are either community contributed editions or developed in-house at AWS. For example, Hadoop itself is a community edition, while the Amazon DynamoDB connector (emr-ddb-3.0.0) comes exclusively with EMR. Here is Amazon’s components guide.

You can also install recommended third-party software packages on your cluster using Bootstrap Actions. Third party libraries can be packaged directly into your Mapper or Reducer executable. Alternatively, you could upload statically compiled executables using the Hadoop distributed cache mechanism.

EMR Hadoop Nodes

Unlike standard distributions, there are three types of EMR Hadoop Nodes.

  • Master Node: Master Node runs NameNode, Resource Manager in YARN.
  • Slave Node-Core: Slave Node Core runs HDFS and Node Manager.
  • Slave Node-Task: Slave Node Task runs a Node Manager, but not HDFS.

Amazon EMR - nodes
Slave nodes in EMR are of particular interest. In EMR, Core nodes and Task nodes constitute a cluster’s slave nodes. Core nodes include Task Trackers and Data Nodes, with Data Nodes running the HDFS distributed file system. Since they store HDFS data, Data Nodes cannot be removed from a running cluster. Task Nodes, on the other hand, only act as Task Trackers and have no HDFS restrictions. Task nodes can, therefore, be scaled up and down according to the changing processing needs of a specific job. That’s how EMR supports dynamic clustering.

With HDFS out of picture within Task Slave Nodes, node failures or the addition of new nodes are far simpler to deal with, as there is no need for HDFS rebalancing.

EMR File System (EMRFS)

EMRFS is an extension of HDFS, which allows an Amazon EMR cluster to store and access data from Amazon S3. Amazon S3 is a great place to store huge data because of its low cost, durability, and availability. But one potential problem with S3 is its eventual consistency model. With eventual consistency, you might not get the updated objects as soon as they are added to your bucket. This might be a concern during certain multi-step ETL processing.

To address the issue, EMR provides something called consistent-view. By creating a DynamoDB database to track the data in S3, the consistent view provides read-after-write consistency and improved performance. The consistent view can be added to and enabled in an EMR cluster. Be aware that there is a small cost overhead for consistent view’s DynamoDB usage.

Transient and Long Running Clusters

With EMR, you can choose between running a non-committed cluster, called a transient cluster, or a long-running cluster for larger workloads. With a transient cluster, after the processing job is done, the cluster will be automatically terminated. That ensures your AWS bill properly reflects your actual use. Transient clusters are particularly suitable for periodic jobs.

A long-running cluster, on the other hand, is meant for persistent job execution. Imagine that you need to upload a huge amount of data for EMR processing. It can sometimes be inefficient to load it in smaller packages. With long-running clusters, you can query the cluster continuously or even periodically as it will be running even if there are no jobs in the queue.

Making a long-running cluster is easy. As the administrator, you will need to choose NO for auto-termination in Advanced Options -> Steps. That’s it!

Using S3Distcp to Move data between HDFS and S3

S3DistCp is an extension of the DistCp tool that lets you move large amounts of data between HDFS and S3 in a distributed manner. S3DistCp is more scalable and efficient for parallel copying large numbers of objects across buckets and between AWS accounts. S3DistCp copies data using distributed map-reduce jobs. However, the main benefit S3DistCp provides over DistCp, is by having a reducer run multiple HTTP upload threads to upload the files in parallel.

You can add S3DistCp as a step to EMR job in the AWS CLI:

aws emr add-steps --cluster-id j-1234MYCLUSTERXXXXX --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--s3Endpoint,s3-eu-west-1.amazonaws.com","--src,s3://mybucket/logs/j-j-1234MYCLUSTERXXXXX/node/","--dest,hdfs:///output","--srcPattern,.*[a-zA-Z,]+"]

Or using EMR console like this:
Amazon EMR - s3distcp
To copy files from S3 to HDFS, you can run this command in the AWS CLI:

aws emr add-steps --cluster-id j-1234MYCLUSTERXXXXX --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--src,s3://mybucket/logs/j-1234MYCLUSTERXXXXX/node/","--dest,hdfs:///output","--srcPattern,.*daemons.*-hadoop-.*"]

Conclusion

Amazon EMR allows organizations to launch Hadoop clusters and jobs almost instantaneously. With the backing of AWS’s tested infrastructure and services and their seamless integration between EMR and services like DynamoDB, Redshift, SQS, and Kinesis, users have many opportunities to explore.

Would you like to learn more? See how AOL significantly optimized its Amazon EMR infrastructure. Also, Cloud Academy offers a hands-on lab guiding you through the process of deploying S3-based data to an Amazon EMR cluster.

Do you have your own Hadoop or EMR experiences you’d like to share? Feel free to comment below.

Avatar

Written by

Chandan Patra

Cloud Computing and Big Data professional with 10 years of experience in pre-sales, architecture, design, build and troubleshooting with best engineering practices. Specialities: Cloud Computing - AWS, DevOps(Chef), Hadoop Ecosystem, Storm & Kafka, ELK Stack, NoSQL, Java, Spring, Hibernate, Web Service

Related Posts

Avatar
Jeremy Cook
— September 17, 2019

Cloud Migration Risks & Benefits

If you’re like most businesses, you already have at least one workload running in the cloud. However, that doesn’t mean that cloud migration is right for everyone. While cloud environments are generally scalable, reliable, and highly available, those won’t be the only considerations dri...

Read more
  • AWS
  • Azure
  • Cloud Migration
Joe Nemer
Joe Nemer
— September 12, 2019

Real-Time Application Monitoring with Amazon Kinesis

Amazon Kinesis is a real-time data streaming service that makes it easy to collect, process, and analyze data so you can get quick insights and react as fast as possible to new information.  With Amazon Kinesis you can ingest real-time data such as application logs, website clickstre...

Read more
  • amazon kinesis
  • AWS
  • Stream Analytics
  • Streaming data
Joe Nemer
Joe Nemer
— September 6, 2019

Google Cloud Functions vs. AWS Lambda: The Fight for Serverless Cloud Domination

Serverless computing: What is it and why is it important? A quick background The general concept of serverless computing was introduced to the market by Amazon Web Services (AWS) around 2014 with the release of AWS Lambda. As we know, cloud computing has made it possible for users to ...

Read more
  • AWS
  • Azure
  • Google Cloud Platform
Joe Nemer
Joe Nemer
— September 3, 2019

Google Vision vs. Amazon Rekognition: A Vendor-Neutral Comparison

Google Cloud Vision and Amazon Rekognition offer a broad spectrum of solutions, some of which are comparable in terms of functional details, quality, performance, and costs. This post is a fact-based comparative analysis on Google Vision vs. Amazon Rekognition and will focus on the tech...

Read more
  • Amazon Rekognition
  • AWS
  • Google Cloud Platform
  • Google Vision
Alisha Reyes
Alisha Reyes
— August 30, 2019

New on Cloud Academy: CISSP, AWS, Azure, & DevOps Labs, Python for Beginners, and more…

As Hurricane Dorian intensifies, it looks like Floridians across the entire state might have to hunker down for another big one. If you've gone through a hurricane, you know that preparing for one is no joke. You'll need a survival kit with plenty of water, flashlights, batteries, and n...

Read more
  • AWS
  • Azure
  • Google Cloud Platform
  • New content
  • Product Feature
  • Python programming
Joe Nemer
Joe Nemer
— August 27, 2019

Amazon Route 53: Why You Should Consider DNS Migration

What Amazon Route 53 brings to the DNS table Amazon Route 53 is a highly available and scalable Domain Name System (DNS) service offered by AWS. It is named by the TCP or UDP port 53, which is where DNS server requests are addressed. Like any DNS service, Route 53 handles domain regist...

Read more
  • Amazon
  • AWS
  • Cloud Migration
  • DNS
  • Route 53
Alisha Reyes
Alisha Reyes
— August 22, 2019

How to Unlock Complimentary Access to Cloud Academy

Are you looking to get trained or certified on AWS, Azure, Google Cloud Platform, DevOps, Cloud Security, Python, Java, or another technical skill? Then you'll want to mark your calendars for August 23, 2019. Starting Friday at 12:00 a.m. PDT (3:00 a.m. EDT), Cloud Academy is offering c...

Read more
  • AWS
  • Azure
  • cloud academy content
  • complimentary access
  • GCP
  • on the house
Avatar
Michael Sheehy
— August 19, 2019

What Exactly Is a Cloud Architect and How Do You Become One?

One of the buzzwords surrounding the cloud that I'm sure you've heard is "Cloud Architect." In this article, I will outline my understanding of what a cloud architect does and I'll analyze the skills and certifications necessary to become one. I will also list some of the types of jobs ...

Read more
  • AWS
  • Cloud Computing
Avatar
Nitheesh Poojary
— August 19, 2019

Boto: Using Python to Automate AWS Services

Boto allows you to write scripts to automate things like starting AWS EC2 instances Boto is a Python package that provides programmatic connectivity to Amazon Web Services (AWS). AWS offers a range of services for dynamically scaling servers including the core compute service, Elastic...

Read more
  • Automated AWS Services
  • AWS
  • Boto
  • Python
Avatar
Andrew Larkin
— August 13, 2019

Content Roadmap: AZ-500, ITIL 4, MS-100, Google Cloud Associate Engineer, and More

Last month, Cloud Academy joined forces with QA, the UK’s largest B2B skills provider, and it put us in an excellent position to solve a massive skills gap problem. As a result of this collaboration, you will see our training library grow with additions from QA’s massive catalog of 500+...

Read more
  • AWS
  • Azure
  • content roadmap
  • Google Cloud Platform
Avatar
Adam Hawkins
— August 9, 2019

DevSecOps: How to Secure DevOps Environments

Security has been a friction point when discussing DevOps. This stems from the assumption that DevOps teams move too fast to handle security concerns. This makes sense if Information Security (InfoSec) is separate from the DevOps value stream, or if development velocity exceeds the band...

Read more
  • AWS
  • cloud security
  • DevOps
  • DevSecOps
  • Security
Avatar
Stefano Giacone
— August 8, 2019

Test Your Cloud Knowledge on AWS, Azure, or Google Cloud Platform

Cloud skills are in demand | In today's digital era, employers are constantly seeking skilled professionals with working knowledge of AWS, Azure, and Google Cloud Platform. According to the 2019 Trends in Cloud Transformation report by 451 Research: Business and IT transformations re...

Read more
  • AWS
  • Cloud skills
  • Google Cloud
  • Microsoft Azure