Introduction

An Amazon EMR cluster is composed of Amazon EC2 instances. AWS deploys and manages the instances in an EMR cluster, handling much of the complexity of server management for you. The number and size of the instances you need depend on the size of the data set you want to process and how quickly you need the results.

A cluster has three types of nodes:

Primary - Manages the cluster and all software components needed to distribute both data and tasks to other nodes. The primary also tracks the status of tasks and the overall health of the cluster.
Core - A node that includes software to run tasks and store data in a Hadoop Distributed File System (HDFS).
Task - A node that includes software to run tasks but does not store data.

A cluster typically contains a single primary node and one or more core nodes. However, a cluster can also consist of only a single primary node. Optionally, task nodes can be added to scale the cluster for more processing power.

In this lab, you will use a small sample data set. Because the processing power required is not large, this lab's cluster will comprise two nodes: one primary and one core.

In this lab step, you will configure and create an Amazon EMR cluster that is ready to process tasks.

Instructions

1. In the AWS Management Console search bar, enter EMR, and click the EMR result under Services:

2. To begin creating a cluster, click Create cluster:

A form page titled Create cluster will load.

Note: When configuring your cluster, leave any options not specified in the instructions at their defaults.

3. In the Name and applications section, enter and select the following:

Name: ca-labs-cluster
Amazon EMR release: Select the latest release beginning with emr-6.
Application bundle: Select Core Hadoop

The Amazon EMR release option allows you to select a version of Amazon EMR to use. AWS periodically releases new versions that include updated frameworks and software libraries. This lab has been developed for major version 6 of Amazon EMR. By targeting a major version, you can be confident that your processing tasks will continue to be supported as new releases within that major version become available.

The Application bundle option allows you to configure the frameworks to be installed on an Amazon EMR cluster. Most of these frameworks are open-source and have different use cases. Core Hadoop is the most basic and can be used for the general processing of large data sets. It includes HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), and MapReduce.
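The "latest release beginning with emr-6" rule can be sketched in code. The following is a minimal illustration, not the console's own logic: the helper names parse_release and latest_emr6 are hypothetical, the sample labels are illustrative, and the dictionary keys assume the request shape of the boto3 EMR run_job_flow call. Comparing versions numerically matters because a plain string sort would rank emr-6.9.0 above emr-6.15.0.

```python
def parse_release(label):
    """Turn a label such as 'emr-6.15.0' into a comparable tuple (6, 15, 0)."""
    return tuple(int(part) for part in label[len("emr-"):].split("."))


def latest_emr6(labels):
    """Return the highest release label whose major version is 6."""
    candidates = [label for label in labels if label.startswith("emr-6.")]
    if not candidates:
        raise ValueError("no emr-6 release labels available")
    return max(candidates, key=parse_release)


# Illustrative labels only; the console lists the releases actually available.
sample_labels = ["emr-5.36.0", "emr-6.9.0", "emr-6.15.0", "emr-7.0.0"]

name_and_applications = {
    "Name": "ca-labs-cluster",
    "ReleaseLabel": latest_emr6(sample_labels),
    # Assumption: requesting Hadoop alone provides HDFS, YARN, and MapReduce;
    # the console's Core Hadoop bundle may select additional applications.
    "Applications": [{"Name": "Hadoop"}],
}

print(name_and_applications["ReleaseLabel"])  # emr-6.15.0
```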

4. In the Cluster configuration section, enter and select the following:

Ensure Instance groups is selected at the top
Instance groups:
- Primary and Core:
  - Choose EC2 instance type: Enter and select m3.xlarge
- Task 1 of 1:
  - Click Remove instance group

Warning: Ensure that the Task 1 of 1 instance group has been removed and that the instance types of the primary and core nodes are m3.xlarge. Using different values may result in your account being temporarily or permanently banned from Cloud Academy labs. To learn more about Cloud Academy's lab restrictions, see the What Are the Restrictions on Labs? page.
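For reference, the instance group settings above can be written as the list of instance groups that the EMR API accepts. This is a sketch under the assumption that you use the boto3 run_job_flow request shape, where the API role value for the primary node is MASTER. The helper name build_instance_groups is hypothetical, and nothing here contacts AWS.

```python
def build_instance_groups(instance_type="m3.xlarge"):
    """Describe one primary and one core node, with no task group."""
    return [
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",  # API spelling for the primary node
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
    ]


groups = build_instance_groups()
for group in groups:
    print(group["InstanceRole"], group["InstanceType"], group["InstanceCount"])
```

Leaving the task group out of the list mirrors clicking Remove instance group in the console.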

Choosing the correct instance sizes for your workload is important because failing to do so can result in unnecessary expense. Larger instance types have more CPUs, memory, and storage. To view the latest pricing for Amazon EMR-compatible Amazon EC2 instances, visit the Amazon EMR pricing page.

Larger instances produce results more quickly than smaller instances but cost more per hour. Where you strike this balance depends on your workload and your organization's requirements.
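The trade-off can be made concrete with simple arithmetic: the total cost is roughly the hourly rate multiplied by the number of nodes and the hours the cluster runs. The sketch below uses made-up placeholder rates and run times in abstract cost units, not AWS prices or measured results, and the function name cluster_cost is hypothetical. Consult the Amazon EMR pricing page for real figures.

```python
def cluster_cost(hourly_rate_per_node, node_count, hours):
    """Approximate cost as rate x nodes x hours, ignoring billing granularity."""
    return hourly_rate_per_node * node_count * hours


# Placeholder numbers for illustration only.
smaller = cluster_cost(hourly_rate_per_node=0.25, node_count=2, hours=4.0)
larger = cluster_cost(hourly_rate_per_node=0.5, node_count=2, hours=1.5)

print(smaller)  # 2.0
print(larger)   # 1.5
```

With these placeholder inputs, the larger instances cost twice as much per hour yet less in total because they finish sooner. With different inputs, the result can just as easily go the other way.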

5. In the Cluster scaling and provisioning section, ensure that the following is entered and selected:

Choose an option: Set cluster size manually
Instance(s) size: 1

This section allows you to configure automatic scaling. Amazon EMR supports managed scaling where AWS will monitor the cluster's resource usage and automatically scale the cluster in and out in response. Custom automatic scaling allows you to specify a scaling policy that scales based on specific metrics of your choice.

In this lab, manual scaling is appropriate.
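For contrast with the manual setting used here, a managed scaling configuration might look like the following. This is a sketch assuming the ManagedScalingPolicy structure accepted by the boto3 run_job_flow call. The helper name managed_scaling_policy and the limits of 2 and 10 instances are hypothetical and are not used in this lab.

```python
def managed_scaling_policy(min_instances, max_instances):
    """Build compute limits that bound how far the cluster may scale."""
    if not 1 <= min_instances <= max_instances:
        raise ValueError("expected 1 <= min_instances <= max_instances")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_instances,
            "MaximumCapacityUnits": max_instances,
        }
    }


# Hypothetical limits: never fewer than 2 instances, never more than 10.
print(managed_scaling_policy(2, 10))
```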

Here is an overview of some other fields you can configure on this page:

Networking

In this section, you can configure the Amazon Virtual Private Cloud (VPC) and subnet that your cluster is deployed into. This gives you control over the network configuration for your cluster.
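When a cluster is created through the API instead of the console, the subnet choice is passed alongside the instance settings. The sketch below assumes the Ec2SubnetId key of the Instances structure in the boto3 run_job_flow request. The helper name with_subnet and the subnet ID are hypothetical placeholders.

```python
def with_subnet(instances, subnet_id):
    """Return a copy of the instance settings that targets a specific subnet."""
    if not subnet_id.startswith("subnet-"):
        raise ValueError("subnet IDs are expected to start with 'subnet-'")
    updated = dict(instances)
    updated["Ec2SubnetId"] = subnet_id
    return updated


# Hypothetical subnet ID; use one from your own VPC.
settings = with_subnet({"InstanceGroups": []}, "subnet-0123456789abcdef0")
print(settings["Ec2SubnetId"])  # subnet-0123456789abcdef0
```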

Steps

Steps are tasks that can be run on the cluster. As well as configuring them at cluster creation time, you can also submit new steps to a running cluster. You will do this later in the lab.
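To give a sense of what a step looks like outside the console, here is a sketch of a step definition in the shape that the boto3 EMR calls run_job_flow and add_job_flow_steps accept. The helper name build_step, the step name, and the example command are hypothetical; command-runner.jar is the EMR-provided runner for executing commands on the cluster.

```python
VALID_FAILURE_ACTIONS = {
    "TERMINATE_JOB_FLOW",
    "TERMINATE_CLUSTER",
    "CANCEL_AND_WAIT",
    "CONTINUE",
}


def build_step(name, args, action_on_failure="CONTINUE"):
    """Describe a step that runs a command through command-runner.jar."""
    if action_on_failure not in VALID_FAILURE_ACTIONS:
        raise ValueError(f"unsupported ActionOnFailure: {action_on_failure}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": list(args),
        },
    }


# Hypothetical step that lists the root of HDFS.
step = build_step("List HDFS root", ["hadoop", "fs", "-ls", "/"])
print(step["HadoopJarStep"]["Args"])  # ['hadoop', 'fs', '-ls', '/']
```

The same dictionary could be supplied at creation time or submitted later to a running cluster, which matches the two options described above.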

Cluster termination

Amazon EMR clusters support a feature called termination protection. When it is turned on, you will be asked to turn it off before you can delete a cluster. You will see a similar message if you try to delete a protected cluster using the AWS command-line interface (CLI) or application programming interface (API).

An Amazon EMR cluster can be either short-lived or long-lived. To keep costs low, an organization may opt to create Amazon EMR clusters on demand and delete them after use. In this scenario, you may want to disable termination protection to help automate cluster deletion. When a cluster is long-lived, there may be data on the cluster that should be recovered before termination. Termination protection can help prevent accidental or mistaken cluster terminations.
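The two lifecycles described above map onto a pair of flags when a cluster is created through the API. The following sketch assumes the TerminationProtected and KeepJobFlowAliveWhenNoSteps keys of the Instances structure in the boto3 run_job_flow request. The helper name lifecycle_flags and its one-switch simplification are hypothetical; real deployments may set the two flags independently.

```python
def lifecycle_flags(long_lived):
    """Pick cluster lifecycle flags for a long-lived or short-lived cluster."""
    return {
        # Long-lived clusters stay up after their steps finish.
        "KeepJobFlowAliveWhenNoSteps": long_lived,
        # Protection guards long-lived clusters against accidental deletion;
        # short-lived clusters leave it off so automation can delete them.
        "TerminationProtected": long_lived,
    }


print(lifecycle_flags(long_lived=True))
print(lifecycle_flags(long_lived=False))
```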

Bootstrap actions

This section allows you to add custom tasks to be run on the cluster during creation. The purpose is to enable you to install custom frameworks, libraries, and applications. The code that performs these actions is loaded from an Amazon S3 bucket, much like the code for a data processing step.
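A bootstrap action is essentially a name plus the S3 location of a script. The sketch below assumes the BootstrapActions structure of the boto3 run_job_flow request. The helper name build_bootstrap_action, the bucket, and the script path are hypothetical.

```python
def build_bootstrap_action(name, script_s3_path, args=()):
    """Describe a script in S3 to run on the nodes while the cluster is created."""
    if not script_s3_path.startswith("s3://"):
        raise ValueError("this sketch expects the script to live in Amazon S3")
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,
            "Args": list(args),
        },
    }


# Hypothetical bucket and script name.
action = build_bootstrap_action(
    "Install custom libraries",
    "s3://example-lab-bucket/bootstrap/install-libs.sh",
)
print(action["ScriptBootstrapAction"]["Path"])
```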

6. Scroll down to the Cluster logs section and under Amazon S3 location, click Browse S3:

A dialog box titled Choose Amazon S3 location will appear.

7. Select the bucket you created earlier and click Choose:

8. To specify a prefix to store logs in, append /logs to the end of the bucket location in the Amazon S3 location textbox:

In a non-lab environment, you may have a persistent bucket set up to store logs so that they are kept separate from the bucket you use to store scripts, code, and data.
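Appending the prefix by hand is easy to get slightly wrong, for example by doubling a slash. A small sketch of the same operation follows; the helper name log_uri and the bucket name are hypothetical, and the result corresponds to the value you would enter in the Amazon S3 location textbox.

```python
def log_uri(bucket_location, prefix="logs"):
    """Join a bucket location and a log prefix with exactly one slash."""
    return bucket_location.rstrip("/") + "/" + prefix.strip("/")


# Hypothetical bucket name; both spellings produce the same location.
print(log_uri("s3://example-lab-bucket"))          # s3://example-lab-bucket/logs
print(log_uri("s3://example-lab-bucket/", "/logs"))  # s3://example-lab-bucket/logs
```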

9. Scroll down to the Identity and Access Management (IAM) roles section and enter and select the following:

Amazon EMR service role:
- Ensure Choose an existing service role is selected
- Service role: Select the role named AmazonEMR-ServiceRole-Lab
EC2 instance profile for Amazon EMR:
- Ensure Choose an existing instance profile is selected
- Instance profile: Select the role named AmazonEMR-InstanceProfile-Lab

You also have the option of configuring a role for custom automatic scaling. In this lab, you are using manual scaling and this role is unnecessary.

The service role is used by the Amazon EMR service to provision Amazon EC2 instances and other required resources. To see an example of the permissions required by the service role, visit the Service role for Amazon EMR (EMR role) page of the Amazon EMR Management Guide.

The EC2 instance profile is the role used by the EC2 instances in the cluster. You can think of this role as a job role. To follow the best practice of the principle of least privilege, this role should have only the permissions required by a task running in the cluster.
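The role selections above correspond to two separate request fields when a cluster is created through the API. This sketch assumes the boto3 run_job_flow parameter names, in which the EC2 instance profile is passed as JobFlowRole and the optional custom automatic scaling role as AutoScalingRole. The helper name build_roles is hypothetical; the role names are the ones used in this lab.

```python
def build_roles(service_role, instance_profile, auto_scaling_role=None):
    """Map the console's role choices onto API request fields."""
    roles = {
        "ServiceRole": service_role,      # used by the EMR service itself
        "JobFlowRole": instance_profile,  # used by the EC2 instances
    }
    # Only needed when custom automatic scaling is configured.
    if auto_scaling_role is not None:
        roles["AutoScalingRole"] = auto_scaling_role
    return roles


print(build_roles("AmazonEMR-ServiceRole-Lab", "AmazonEMR-InstanceProfile-Lab"))
```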

10. To finish configuring your cluster, on the right-hand side of the page, review the Summary section and click Create cluster:

A page will load showing a Summary and other details of your cluster.

Note: At the top, you will see notifications about your cluster, and you may see one about the Amazon EMR service. You can dismiss these by clicking X on the right-hand side of each notification.

Initially, the Status field in the Status and time section will report Starting:

It typically takes ten minutes for an Amazon EMR cluster to become ready and available for use.

While waiting for the cluster to become available to do work, feel free to review the Overview of Amazon EMR page of the Amazon EMR Management Guide.

Once you see the Status field report Waiting, your cluster is ready:
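Watching the Status field change from Starting to Waiting can also be expressed as a polling loop. The sketch below is a hypothetical helper, not an AWS API: wait_until_ready takes any function that returns the current state as a string, so in practice it could wrap a status lookup such as the boto3 describe_cluster call. Here a fake sequence of states stands in so the example runs offline. The state names assume the uppercase spellings the EMR API reports.

```python
import time
from typing import Callable

FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_ready(
    get_state: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the cluster reports WAITING, fails, or the polls run out."""
    for _ in range(max_polls):
        state = get_state()
        if state == "WAITING":
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"cluster stopped in state {state}")
        sleep(poll_seconds)
    raise TimeoutError("cluster did not reach WAITING in time")


# Stand-in for real status lookups so the sketch runs without AWS access.
fake_states = iter(["STARTING", "BOOTSTRAPPING", "WAITING"])
print(wait_until_ready(lambda: next(fake_states), sleep=lambda seconds: None))  # WAITING
```

The default of 40 polls at 30-second intervals allows about twenty minutes, which leaves headroom over the typical start-up time mentioned above.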

Summary

In this lab step, you learned how to configure and create a new Amazon EMR cluster.