Amazon EMR (Elastic MapReduce) allows developers to avoid some of the burdens of setting up and administrating Hadoop tasks. Learn how to optimize it.
Apache Hadoop is an open source framework designed to distribute the storage and processing of massive data sets across large clusters of servers. Amazon EMR (Elastic MapReduce) is a popular AWS service for developers who want to avoid the burden of setup and administration and concentrate on working with their data.
Over the years, Amazon EMR has undergone many transformations. AWS is constantly working to improve their product, pushing new updates for EMR with every new Hadoop release. Amazon EMR now integrates with versatile Hadoop Ecosystem applications, offering an improved core architecture and an even more simplified interface.
For those who have stumbled upon this blog having little or no knowledge of Amazon EMR, but are familiar with Hadoop, here is a very quick overview:
- EMR is Hadoop-as-a-Service from Amazon Web Services (AWS).
- EMR supports Hadoop 2.6.0, Hive 1.0.0, Mahout 0.10.0, Pig 0.14.0, Hue 3.7.1, and Spark 1.4.1.
- The MapR distribution supported by EMR is MapR 4.0.2 (M3, M5, and M7 editions, with Hadoop 2.4.0, Hive 0.13.1, and Pig 0.12.0).
- Cost effective.
- Flexible resource utilization model.
- No capacity planning: hardware is available on demand.
- Easy to use with a flexible hourly usage model for clusters.
- Integrated with other AWS services like S3, CloudFormation, Redshift, SQS, DynamoDB, and CloudWatch.
The current EMR release (EMR-4.1.0) is based on the Apache Bigtop project. Describing Bigtop is well beyond the scope of this post, but you can read about it here. Instead, we are going to talk about five exciting – and unique – Amazon EMR features.
Amazon EMR Components
The current Amazon EMR release adds elements necessary to bring EMR up to date. The components are either community contributed editions or developed in-house at AWS. For example, Hadoop itself is a community edition, while the Amazon DynamoDB connector (emr-ddb-3.0.0) comes exclusively with EMR. Here is Amazon’s components guide.
You can also install recommended third-party software packages on your cluster using Bootstrap Actions. Third-party libraries can be packaged directly into your Mapper or Reducer executable. Alternatively, you can upload statically compiled executables using the Hadoop distributed cache mechanism.
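As a rough sketch of the bootstrap route, the AWS CLI accepts a `--bootstrap-actions` option when a cluster is created. The bucket, script path, instance type, and node count below are placeholders, not values from this post, and the `create_cluster_with_bootstrap` wrapper is a hypothetical helper. It prints the command by default; set `AWS=aws` to actually call the CLI.

```shell
# Dry run by default: prints the command. Set AWS=aws to actually call the CLI.
AWS="${AWS:-echo aws}"

# Launch a cluster that runs a bootstrap script on every node.
# $1 = S3 path of the bootstrap script (hypothetical), $2 = number of instances.
create_cluster_with_bootstrap() {
  $AWS emr create-cluster \
    --release-label emr-4.1.0 \
    --applications Name=Hadoop \
    --instance-type m3.xlarge \
    --instance-count "$2" \
    --use-default-roles \
    --bootstrap-actions "Path=$1,Name=InstallLibs"
}

create_cluster_with_bootstrap s3://mybucket/bootstrap/install-libs.sh 3
```

The script itself would hold whatever install commands your libraries need; it is uploaded to S3 before the cluster is launched.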
EMR Hadoop Nodes
Unlike standard Hadoop distributions, EMR has three types of nodes.
- Master Node: runs the HDFS NameNode and the YARN ResourceManager.
- Slave Node-Core: runs HDFS (a DataNode) and a YARN NodeManager.
- Slave Node-Task: runs a NodeManager, but not HDFS.
Slave nodes in EMR are of particular interest. In EMR, Core nodes and Task nodes constitute a cluster’s slave nodes. Core nodes run both a NodeManager (the YARN successor to the Hadoop 1 TaskTracker) and a DataNode, which stores HDFS data. Since they store HDFS data, Core nodes cannot be removed from a running cluster. Task nodes, on the other hand, run only a NodeManager and have no HDFS restrictions. Task nodes can, therefore, be scaled up and down according to the changing processing needs of a specific job. That’s how EMR supports dynamic clustering.
With HDFS out of the picture on Task nodes, node failures and the addition of new nodes are far simpler to deal with, as there is no need for HDFS rebalancing.
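Scaling Task nodes can be sketched with two AWS CLI subcommands, `add-instance-groups` and `modify-instance-groups`. The cluster ID, instance group ID, instance type, and counts are placeholders, and the two wrapper functions are hypothetical helpers. The commands are only printed unless you set `AWS=aws`.

```shell
# Dry run by default: prints the command. Set AWS=aws to actually call the CLI.
AWS="${AWS:-echo aws}"

# Add a new TASK instance group to a running cluster.
# $1 = cluster ID, $2 = number of task nodes, $3 = instance type.
add_task_nodes() {
  $AWS emr add-instance-groups \
    --cluster-id "$1" \
    --instance-groups "InstanceCount=$2,InstanceGroupType=TASK,InstanceType=$3"
}

# Resize an existing instance group, for example to shrink the task group
# once a heavy job has finished.
# $1 = instance group ID, $2 = new node count.
resize_instance_group() {
  $AWS emr modify-instance-groups \
    --instance-groups "InstanceGroupId=$1,InstanceCount=$2"
}

add_task_nodes j-1234MYCLUSTERXXXXX 4 m3.xlarge
resize_instance_group ig-HYPOTHETICALID 2
```

Because Task nodes hold no HDFS blocks, shrinking the group this way does not trigger any HDFS rebalancing.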
EMR File System (EMRFS)
EMRFS is an implementation of the Hadoop file system interface that allows an Amazon EMR cluster to store and access data directly in Amazon S3. Amazon S3 is a great place to store large data sets because of its low cost, durability, and availability. But one potential problem with S3 is its eventual consistency model. With eventual consistency, you might not see new or updated objects as soon as they are written to your bucket. This can be a concern during multi-step ETL processing, where one step reads what the previous step wrote.
To address the issue, EMR provides a feature called consistent view. By using a DynamoDB table to track the data in S3, consistent view provides read-after-write consistency and improved performance. Consistent view is enabled on a per-cluster basis. Be aware that there is a small cost overhead for consistent view’s DynamoDB usage.
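A minimal sketch of turning consistent view on from the AWS CLI, using the `--emrfs` option of `create-cluster`. The instance type and count are placeholders, and `create_cluster_emrfs` is a hypothetical wrapper that prints the command unless `AWS=aws` is set.

```shell
# Dry run by default: prints the command. Set AWS=aws to actually call the CLI.
AWS="${AWS:-echo aws}"

# Create a cluster with EMRFS consistent view switched on or off.
# $1 = "true" to enable consistent view, "false" to leave it disabled.
create_cluster_emrfs() {
  $AWS emr create-cluster \
    --release-label emr-4.1.0 \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --emrfs "Consistent=$1"
}

create_cluster_emrfs true
```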
Transient and Long Running Clusters
With EMR, you can choose between running a temporary cluster, called a transient cluster, or a long-running cluster for larger workloads. With a transient cluster, the cluster is automatically terminated once the processing job is done, so you pay only for the time you actually use. Transient clusters are particularly suitable for periodic jobs.
A long-running cluster, on the other hand, is meant for persistent job execution. Imagine that you need to load a huge amount of data for EMR processing: reloading it in smaller batches for every job can be inefficient. With a long-running cluster, you can query the cluster continuously or periodically, because it keeps running even when there are no jobs in the queue.
Making a long-running cluster is easy. As the administrator, you will need to choose NO for auto-termination in Advanced Options -> Steps. That’s it!
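The same choice can be made from the AWS CLI with the `--auto-terminate` and `--no-auto-terminate` flags of `create-cluster`. The sketch below assumes placeholder instance settings, and `create_cluster` is a hypothetical wrapper; it prints the command unless `AWS=aws` is set.

```shell
# Dry run by default: prints the command. Set AWS=aws to actually call the CLI.
AWS="${AWS:-echo aws}"

# $1 = "transient" or "long-running"; any further arguments (for example
# --steps ...) are passed through to the CLI unchanged.
create_cluster() {
  case "$1" in
    transient)    flag="--auto-terminate" ;;
    long-running) flag="--no-auto-terminate" ;;
    *) echo "usage: create_cluster transient|long-running [extra args]" >&2
       return 1 ;;
  esac
  shift
  $AWS emr create-cluster \
    --release-label emr-4.1.0 \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles \
    "$flag" "$@"
}

create_cluster transient
create_cluster long-running
```

A transient cluster is normally launched together with its work, so in practice you would append a `--steps` argument to the first call.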
Using S3DistCp to Move Data between HDFS and S3
S3DistCp is an extension of the DistCp tool that lets you move large amounts of data between HDFS and S3 in a distributed manner. It is more scalable and efficient than DistCp at copying large numbers of objects in parallel across buckets and between AWS accounts. S3DistCp copies data using distributed MapReduce jobs. Its main benefit over DistCp is that each reducer runs multiple HTTP upload threads, so files are uploaded in parallel.
You can add S3DistCp as a step to an EMR cluster using the AWS CLI. The first example copies log files from S3 to HDFS through the eu-west-1 endpoint, using --srcPattern as a regular expression filter on the source files:
aws emr add-steps --cluster-id j-1234MYCLUSTERXXXXX --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--s3Endpoint,s3-eu-west-1.amazonaws.com","--src,s3://mybucket/logs/j-1234MYCLUSTERXXXXX/node/","--dest,hdfs:///output","--srcPattern,.*[a-zA-Z,]+"]
The second example copies only the Hadoop daemon logs:
aws emr add-steps --cluster-id j-1234MYCLUSTERXXXXX --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--src,s3://mybucket/logs/j-1234MYCLUSTERXXXXX/node/","--dest,hdfs:///output","--srcPattern,.*daemons.*-hadoop-.*"]
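The commands above copy from S3 into HDFS. Going the other way only requires swapping the source and destination, as in this sketch. The jar path is copied from the commands above and may differ between EMR releases; the step name, bucket, and the `add_s3distcp_step` wrapper are placeholders introduced here. The command is printed rather than run unless `AWS=aws` is set.

```shell
# Dry run by default: prints the command. Set AWS=aws to actually call the CLI.
AWS="${AWS:-echo aws}"

# Add an S3DistCp step that copies everything under a source URI to a destination URI.
# $1 = cluster ID, $2 = source URI, $3 = destination URI.
add_s3distcp_step() {
  $AWS emr add-steps \
    --cluster-id "$1" \
    --steps "Type=CUSTOM_JAR,Name=S3DistCpStep,Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,Args=[--src,$2,--dest,$3]"
}

# Push job output from HDFS back to S3 before the cluster is shut down.
add_s3distcp_step j-1234MYCLUSTERXXXXX hdfs:///output s3://mybucket/output/
```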
Amazon EMR allows organizations to launch Hadoop clusters and jobs almost instantaneously. With the backing of AWS’s tested infrastructure and the integration between EMR and services like DynamoDB, Redshift, SQS, and Kinesis, users have many opportunities to explore.
Would you like to learn more? See how AOL significantly optimized its Amazon EMR infrastructure. Also, Cloud Academy offers a hands-on lab guiding you through the process of deploying S3-based data to an Amazon EMR cluster.
Do you have your own Hadoop or EMR experiences you’d like to share? Feel free to comment below.