Getting Started with Amazon Elastic MapReduce

Intermediate

47 students completed the lab in ~1h:15m

Total available time: 1h:30m


Get started with Amazon Elastic MapReduce (EMR) and learn its fundamentals

Lab Overview

Amazon Elastic MapReduce (Amazon EMR) makes it easy to process vast amounts of data in a variety of applications, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. Amazon EMR uses Hadoop, an open source framework, to distribute raw data and processing across a resizable cluster of Amazon EC2 instances.

Hadoop uses a distributed processing architecture called MapReduce, in which a task is mapped to a set of servers for processing. The results of the computation performed by those servers are then reduced to a single output set.
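
To make the map and reduce phases concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code and is not part of the lab's sample applications; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. In a real cluster, each input record could be mapped on a different server.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    # Here the "servers" are just loop iterations; Hadoop would spread
    # the map calls across the cluster instead.
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The end"]))
```

The shuffle step is what lets each reducer see every value for its key, which is why the final output is a single combined set rather than one result per server.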

A high-level view of the EMR workflow is as follows:

  1. Load the input dataset
  2. Execute a MapReduce job
  3. Store the job results in HDFS
  4. View the job results from HDFS

The focus of this lab is configuring and launching an EMR cluster. You will be provided with sample input data sets and sample applications to process them. Treating the application and data set as a "black box" removes unneeded complexity and lets you concentrate on configuration.

Note that Amazon EMR does much of the heavy lifting for you. In addition to providing security, reliability, monitoring, scalability, integration with other AWS services, and the potential for cost savings, Amazon EMR handles deployment as well. For example, it configures the instances in your cluster with the software, at the required versions, needed to process the tasks you submit.
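
The lab itself uses the console, but the same cluster configuration can be expressed programmatically. The sketch below builds the keyword arguments for the run_job_flow call of the boto3 EMR client. Everything specific in it is an assumption for illustration, not a value from the lab: the release label, instance types, instance count, region, and the default role names. The KeepJobFlowAliveWhenNoSteps flag controls whether the cluster stays running after its steps finish or terminates automatically, which may correspond to the two launch modes covered in this lab.

```python
def build_cluster_request(name, log_uri, keep_alive, steps=None):
    """Return keyword arguments for the boto3 EMR client's run_job_flow call."""
    return {
        "Name": name,
        "LogUri": log_uri,                      # S3 location for cluster logs
        "ReleaseLabel": "emr-6.15.0",           # assumed release, not from the lab
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,                 # assumed cluster size
            # True: cluster keeps running and waits for more steps.
            # False: cluster terminates once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
            "TerminationProtected": False,
        },
        "Steps": steps or [],
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request, region="us-west-2"):
    """Submit the request; needs boto3 and AWS credentials, and incurs cost."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    # "my-emr-lab-bucket" is a hypothetical bucket name.
    request = build_cluster_request(
        "emr-lab", "s3://my-emr-lab-bucket/logs/", keep_alive=True
    )
    print(request["Instances"])
```

Keeping the request builder separate from the launch call means you can inspect or reuse a configuration, for example when recreating a cluster with changed settings, without touching AWS.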

Lab Objectives

Upon completion of this lab you will be able to:

  • Explain the key features and benefits of Amazon EMR
  • Configure and launch a cluster in two different launch modes
  • Submit tasks for your cluster to process
  • Check the status of your cluster and the tasks it processes
  • Terminate, clone, reconfigure and launch a cluster
  • Clone a job for your cluster to process
  • View logs and results

Lab Prerequisites

You should be familiar with:

  • AWS Management Console
  • Amazon Simple Storage Service (S3)
  • Amazon Elastic Compute Cloud (EC2)
  • Big Data concepts

Lab Environment

After completing the lab instructions, the environment should look similar to:

[Lab environment diagram]

Follow these steps to learn by building helpful cloud resources

Log In to the Amazon Web Services Console

Your first step to start the Lab experience

Creating an S3 Bucket for EMR

Create an S3 Bucket for Elastic MapReduce

Creating an EMR Cluster

Configure, create and launch an EMR cluster

Adding a Step to Your Running Cluster

Add a Step for the EMR Cluster to process 

Viewing the EMR Cluster and Step Results

Learn multiple ways to view processing results

Terminating and Cloning a Cluster

Terminate, clone and start a new EMR Cluster

Adding a New Step for a Cloned EMR Cluster to Process

Add a new Step to process a Streaming program
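
For readers who want to see what a streaming step looks like outside the console, here is a hedged sketch that builds a Hadoop streaming step definition and submits it with the add_job_flow_steps call of the boto3 EMR client. The bucket, script name, cluster ID, and region are hypothetical placeholders, and the lab's actual streaming program may be configured differently. The "aggregate" reducer is Hadoop's built-in aggregation reducer.

```python
def build_streaming_step(name, mapper_uri, reducer, input_uri, output_uri):
    """Return one step definition for a Hadoop streaming job on EMR."""
    # Hadoop copies the mapper script to each node; refer to it by file name.
    mapper_file = mapper_uri.rsplit("/", 1)[-1]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", mapper_uri,
                "-mapper", mapper_file,
                "-reducer", reducer,
                "-input", input_uri,
                "-output", output_uri,  # must not already exist
            ],
        },
    }


def submit(cluster_id, step, region="us-west-2"):
    """Add the step to a running cluster; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])["StepIds"][0]


if __name__ == "__main__":
    # All S3 paths below are hypothetical placeholders.
    step = build_streaming_step(
        "word-count-streaming",
        "s3://my-emr-lab-bucket/code/wordSplitter.py",
        "aggregate",
        "s3://my-emr-lab-bucket/input/",
        "s3://my-emr-lab-bucket/output/",
    )
    print(step["HadoopJarStep"]["Args"])
```

Because the step is a plain dictionary, reusing it with a different output location is a matter of calling build_streaming_step again with a new output_uri.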