Creating an EMR Cluster

Introduction

Ultimately, an EMR cluster is composed of EC2 instances. Amazon deploys and manages the instances in an EMR cluster, offloading much of the complexity for you. The number and size of the instances depend on the size of the data set you need to process and how quickly you need the results. A cluster has three types of nodes:

  1. Primary - Manages the cluster and all software components needed to distribute both data and tasks to other nodes. The primary also tracks the status of tasks and the overall health of the cluster. Every other node is either a Core or a Task node.
  2. Core - A node that includes software to run tasks and store data. (Data is stored in a Hadoop Distributed File System (HDFS).)
  3. Task - A node that includes software to run tasks but does not store data.

A cluster typically contains a single primary node and one or more core nodes, although a cluster can also consist of a single primary node alone. Optionally, task nodes can be added to scale the cluster for more processing power.

Because the sample data set is small and processing the logs requires little computing power, this lab uses one primary and one core node. In this lab step, you will create, configure, and launch your EMR cluster. When you are finished, you will have a running cluster that can then be configured to process tasks. Amazon offers many different ways to configure your cluster and its workloads.
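For readers who script cluster creation instead of using the console, the node layout described above maps onto the instance groups structure accepted by the EMR API. The following is a minimal sketch in Python, not part of the lab itself. The helper name build_instance_groups and the group names are hypothetical; the role values are the ones the API uses, and note that the API still calls the primary node MASTER.

```python
def build_instance_groups(instance_type="m5.xlarge", core_count=1, task_count=0):
    """Return an InstanceGroups list in the shape the EMR RunJobFlow API expects.

    Hypothetical helper for illustration. The defaults mirror this lab:
    one primary node, one core node, and no task nodes.
    """
    groups = [
        # The API's role name for the primary node is still "MASTER".
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        }
    ]
    if core_count > 0:
        # Core nodes run tasks and store data in HDFS.
        groups.append(
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": instance_type,
                "InstanceCount": core_count,
            }
        )
    if task_count > 0:
        # Task nodes run tasks but do not store data.
        groups.append(
            {
                "Name": "Task",
                "InstanceRole": "TASK",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups
```

Calling build_instance_groups() with no arguments yields the two-group layout used in this lab; passing a task_count above zero adds an optional task group.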

 

Instructions

1. In the AWS Management Console search bar, enter emr, and click the EMR result under Services:


 

2. Click Create cluster.

The Create cluster form opens.

 

3. For your first EMR cluster, it's best to keep it fairly basic. Fill out the following for each section:

Name and applications

  • Cluster name: CA Cluster
  • Release: emr-5.21.0 (Many of the latest releases are supported.)
  • Application bundle: Core Hadoop (Notice the software and versions that are included with the application package.)


Cluster configuration

  • Select Instance groups
  • Primary and Core instance type: m5.xlarge (A larger instance type is not needed. An instance that is too small can run into memory issues while it is being bootstrapped.)
  • Click Remove instance group in the Task 1 of 1 section (One primary and one core node will suffice for this lab.)
  • Note: If you changed the instance type to a smaller compute-optimized instance (c1.medium), the hardware specifications would be too limited for it to act as a primary node and process the task. The cluster may run but not have enough memory to process any sizable workloads. Determining the ideal instance size for your environment is beyond the scope of this lab; it is a skill that some argue is part science, part art.

WARNING: Make sure to remove the task instance group so that only Primary and Core are listed, and select m5.xlarge as the instance type. Other values are not allowed and you WILL be BANNED.


Cluster scaling and provisioning option

  • Select Set cluster size manually and enter 1 under Size


Cluster logs

  • Select Publish cluster-specific logs to Amazon S3
  • Amazon S3 location: Click Browse S3 and navigate to the logs folder in the S3 bucket you created earlier. The resulting field value will resemble s3://calabs-emr-#/logs (Where # is a number used to guarantee the uniqueness of your S3 bucket name.)
    • Click Choose


Identity and Access Management (IAM) roles

  • Select Choose an existing service role
  • Select RoleForEMR from the Service role drop-down
  • Select Choose an existing instance profile 
  • Select RoleForEC2 from the Instance profile drop-down


 

 

6. Review the Summary section and click Create cluster when ready to proceed.
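As a point of comparison, the settings entered in the form can also be expressed as a request to the EMR API. The following is a rough sketch using Python and boto3's run_job_flow call, under several assumptions that are not part of the lab: the function names are hypothetical, the region is a placeholder, only Hadoop is listed even though the console's Core Hadoop bundle includes more software, and the instance profile is assumed to share the name RoleForEC2 shown in the console. The bucket name is passed in because it is unique to your lab session.

```python
def build_cluster_request(log_bucket):
    """Build keyword arguments for boto3's EMR client run_job_flow call.

    log_bucket is the name of the S3 bucket created earlier in the lab.
    """
    return {
        "Name": "CA Cluster",
        "ReleaseLabel": "emr-5.21.0",
        # Assumption: the console's "Core Hadoop" bundle installs more than
        # Hadoop itself; only Hadoop is listed to keep the sketch short.
        "Applications": [{"Name": "Hadoop"}],
        "LogUri": f"s3://{log_bucket}/logs",
        "ServiceRole": "RoleForEMR",
        # Assumption: the instance profile has the same name as in the console.
        "JobFlowRole": "RoleForEC2",
        "Instances": {
            "InstanceGroups": [
                # The API's role name for the primary node is "MASTER".
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
            ],
            # Keep the cluster running (Waiting) when it has no steps to process.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    }


def create_cluster(log_bucket, region_name="us-west-2"):
    """Submit the request and return the new cluster ID.

    Requires boto3 and configured AWS credentials. The default region is a
    placeholder assumption; use the region your lab runs in.
    """
    import boto3  # imported lazily so the request builder works without boto3

    emr = boto3.client("emr", region_name=region_name)
    response = emr.run_job_flow(**build_cluster_request(log_bucket))
    return response["JobFlowId"]
```

Keeping the request builder separate from the API call makes it possible to inspect the parameters, much like reviewing the Summary section, before anything is created.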


 

7. Wait until the Status of the cluster is Waiting:


 

The amount of time before your cluster is ready for use varies. Several factors can influence timing, including the size and number of instances and whether debugging is turned on. Expect your cluster to take about 10 minutes to become ready.
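The same wait can be automated by polling the cluster state through the API. The following is a minimal sketch in Python, assuming a boto3 EMR client is passed in; the function name wait_until_waiting and the polling defaults are hypothetical choices. It relies on the describe_cluster call, which reports the state as an uppercase string such as STARTING, BOOTSTRAPPING, RUNNING, or WAITING.

```python
import time

# States from which a cluster will never reach WAITING.
TERMINAL_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_waiting(emr_client, cluster_id, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_cluster until the cluster state is WAITING.

    emr_client is expected to behave like boto3.client("emr").
    Raises RuntimeError if the cluster terminates, TimeoutError if it takes
    longer than timeout_seconds.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state == "WAITING":
            return state
        if state in TERMINAL_STATES:
            raise RuntimeError(f"Cluster {cluster_id} ended in state {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Cluster {cluster_id} is still {state}")
        time.sleep(poll_seconds)
```

boto3 also provides a built-in waiter, obtained with get_waiter("cluster_running"), which may be simpler if a custom timeout or error handling is not needed.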

 

8. (Optional) If waiting for your cluster to be ready for use, there are two excellent resources worth looking into while you wait:

 

Summary 

Congratulations! You learned how to configure, create, launch and monitor basic status transitions for your cluster. You were also exposed to the many different configuration options available to you when setting up your cluster. 

You have a running Amazon EMR cluster and are ready to give it work to do! 

Validation checks

The EMR Cluster Has Been Created: Check that at least one EMR cluster has been created.