Adding a Step to your running Cluster


Introduction

In this lab step, you will use an example Hive script that processes Amazon CloudFront (CF) log files. The script reads the CF logs, identifies the operating systems (OS) that made requests through CF, and counts the requests from each OS. The Hive script and the sample CF logs are available in a public S3 bucket.
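To make the aggregation concrete, here is a rough Python analogue of "count requests per OS". It is only a sketch: the user-agent strings, the `OS_MARKERS` rules, and the function names are made up for illustration, and the real example does this work in HiveQL on the cluster.

```python
from collections import Counter

# Hypothetical user-agent strings, standing in for the User-Agent
# field of CloudFront log lines.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9)",
    "Mozilla/5.0 (Linux; Android 4.4)",
    "Mozilla/5.0 (Windows NT 6.3)",
]

# Simplified detection rules (an assumption), checked in order so that
# "Android" wins over the "Linux" token that Android agents also carry.
OS_MARKERS = [
    ("Android", "Android"),
    ("Windows", "Windows"),
    ("Mac OS X", "MacOS"),
    ("Linux", "Linux"),
]


def detect_os(user_agent):
    """Return a coarse OS name for one user-agent string."""
    for marker, name in OS_MARKERS:
        if marker in user_agent:
            return name
    return "Other"


def count_requests_by_os(user_agents):
    """Tally how many requests came from each detected OS."""
    return Counter(detect_os(ua) for ua in user_agents)


os_counts = count_requests_by_os(SAMPLE_USER_AGENTS)
for os_name, total in sorted(os_counts.items()):
    print(os_name, total)
```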

Instructions

1. Click the name of your running cluster (CA cluster) in the Clusters section of the EMR console. A summary of the cluster is shown, along with several action buttons.

2. Click Steps > Add step to submit the example script for processing by your EMR cluster.

3. Select Hive program in the Step type drop-down menu. The fields shown in the dialog change based on the Step type.

4. Fill out the remainder of the Add Step dialog as follows:

  • Name: AWS Hive example to process CF logs
  • Script S3 location: s3://us-west-2.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q
  • Input S3 location: s3://us-west-2.elasticmapreduce.samples
  • Output S3 location: s3://calabs-emr#/output/ (Replace # with the number that makes your S3 bucket name unique, and keep the trailing "output/" folder.)
  • Arguments: -hiveconf hive.support.sql11.reserved.keywords=false
  • Action on failure: Continue (If processing fails, continuing allows basic debugging, reconfiguration, and additional attempts.)
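For readers who prefer to see the fields as data, the dialog values can be sketched as a step definition in the shape boto3's EMR client accepts. This is a minimal sketch under assumptions: it assumes an EMR release that runs Hive scripts through `command-runner.jar` with `hive-script` arguments, the helper names `check_s3_uri` and `build_hive_step` are hypothetical, and the bucket name `calabs-emr0000` is a placeholder for your own. The helper also rejects a few path mistakes that are easy to make, such as a doubled `s3://` prefix.

```python
SCRIPT_URI = "s3://us-west-2.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q"
INPUT_URI = "s3://us-west-2.elasticmapreduce.samples"


def check_s3_uri(uri, must_end_with_slash=False):
    """Return uri unchanged, or raise ValueError for common path mistakes."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    if uri.count("s3://") > 1:
        raise ValueError(f"s3:// prefix appears more than once: {uri}")
    if must_end_with_slash and not uri.endswith("/"):
        raise ValueError(f"expected a folder path ending in '/': {uri}")
    return uri


def build_hive_step(output_uri):
    """Build a step definition mirroring the Add step dialog fields."""
    return {
        "Name": "AWS Hive example to process CF logs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # Assumption: the cluster's release provides command-runner.jar.
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script", "--args",
                "-f", check_s3_uri(SCRIPT_URI),
                "-d", "INPUT=" + check_s3_uri(INPUT_URI),
                "-d", "OUTPUT=" + check_s3_uri(output_uri, must_end_with_slash=True),
                "-hiveconf", "hive.support.sql11.reserved.keywords=false",
            ],
        },
    }


# Hypothetical bucket name; substitute the bucket you created earlier.
step = build_hive_step("s3://calabs-emr0000/output/")
print(step["HadoopJarStep"]["Args"])
```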

Tip: Take your time filling out the fields, because it is easy to misconfigure the settings above. Common mistakes include an incorrect path to the S3 bucket, forgetting to create or include the output folder, and doubling up on the s3:// protocol in the path.

5. Click Add when ready to proceed.
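Outside the console, a step can also be submitted through the EMR API. The sketch below assumes boto3's EMR client and its `add_job_flow_steps` call. The cluster ID `j-EXAMPLE`, the trimmed step definition, and the `StubEmrClient` class are hypothetical; the stub stands in for `boto3.client("emr")` so the snippet runs without AWS credentials.

```python
def submit_step(emr_client, cluster_id, step):
    """Submit one step to a running cluster and return the new step ID."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


class StubEmrClient:
    """Hypothetical stand-in for boto3.client("emr"), for demonstration only."""

    def __init__(self):
        self.submitted = []

    def add_job_flow_steps(self, JobFlowId, Steps):
        # Record the call and return a response shaped like the real one.
        self.submitted.append((JobFlowId, Steps))
        return {"StepIds": [f"s-EXAMPLE{i}" for i in range(len(Steps))]}


# A trimmed, hypothetical step definition; a real one also lists the
# script, input, and output arguments.
example_step = {
    "Name": "AWS Hive example to process CF logs",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {"Jar": "command-runner.jar", "Args": ["hive-script"]},
}

client = StubEmrClient()
step_id = submit_step(client, "j-EXAMPLE", example_step)
print(step_id)
```

With real credentials, passing `boto3.client("emr")` in place of the stub would send the request to your cluster.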

 

6. Return to Clusters. The Status changes from Waiting - Cluster ready to Running.

The Status change may take a little while (about 10 seconds).

7. Select the running cluster name (CA Cluster). The summary information shows the cluster status as Running, along with the running step.

8. Return to the Cluster list, then expand the running cluster. Take note of both the Cluster and Step Status.

Refresh and re-expand the running cluster while observing the Step Status. Although the timing can vary, you should see it transition from Pending to Running and finally to Completed.

The Step should complete within about 1 minute, and the Status of the cluster should return to Waiting - Cluster ready.
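The refresh-and-observe loop can also be sketched in code. This minimal example assumes boto3's EMR `describe_step` call, which reports the step state under `Step.Status.State`. The IDs and the `StubEmrClient` class are hypothetical; the stub replays a fixed sequence of states so the snippet runs without a cluster.

```python
import time

# Step states after which polling can stop.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def wait_for_step(emr_client, cluster_id, step_id,
                  poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Poll a step until it reaches a terminal state.

    Returns the distinct states observed, in order.
    """
    seen = []
    for _ in range(max_polls):
        response = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
        state = response["Step"]["Status"]["State"]
        if not seen or seen[-1] != state:
            seen.append(state)
        if state in TERMINAL_STATES:
            return seen
        sleep(poll_seconds)
    raise TimeoutError(f"step {step_id} did not finish after {max_polls} polls")


class StubEmrClient:
    """Hypothetical stand-in for boto3.client("emr") that replays fixed states."""

    def __init__(self, states):
        self._states = iter(states)

    def describe_step(self, ClusterId, StepId):
        return {"Step": {"Status": {"State": next(self._states)}}}


stub = StubEmrClient(["PENDING", "RUNNING", "RUNNING", "COMPLETED"])
# sleep is replaced with a no-op so the demo finishes immediately.
observed = wait_for_step(stub, "j-EXAMPLE", "s-EXAMPLE0", sleep=lambda _: None)
print(observed)
```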

 

Summary

In this lab step, you submitted a Step for processing by your running EMR cluster. The code was a Hive script provided by Amazon, along with example CloudFront logs. Processing log files is a very common Big Data use case. In this lab step the example logs came from CloudFront, but they could have come from another AWS service such as CloudTrail, or been Apache HTTP access logs, system uptime logs, or your own application logs. Similarly, the example Hive script could have been a Pig program, a streaming Python program, or a custom Java application (JAR file).