Adding a Step to your running Cluster
Lab Steps
Introduction
In this lab step, you will use an example Hive script that processes Amazon CloudFront (CF) log files. The script reads the CF logs, identifies the different operating systems (OS) that make requests through CF, and tabulates how many requests came from each OS. The Hive script and the sample CF logs are available in a public S3 bucket.
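The tabulation described above amounts to a group-and-count over the OS field of each log record. As a rough illustration only (this is not the Hive script itself, and the sample records and the function name count_requests_by_os are hypothetical rather than taken from the real CloudFront data), the same idea can be sketched in Python:

```python
from collections import Counter

# Hypothetical, already-parsed log records. Real CloudFront logs carry many
# more fields and would need to be parsed before counting.
sample_requests = [
    {"uri": "/index.html", "os": "Linux"},
    {"uri": "/logo.png", "os": "Windows"},
    {"uri": "/index.html", "os": "Linux"},
    {"uri": "/about.html", "os": "MacOS"},
]


def count_requests_by_os(records):
    """Return a mapping of operating system name to number of requests."""
    return Counter(record["os"] for record in records)


if __name__ == "__main__":
    # Print one line per operating system, sorted by name.
    for os_name, total in sorted(count_requests_by_os(sample_requests).items()):
        print(os_name, total)
```

Hive expresses this kind of aggregation declaratively and runs it across the cluster, which is what makes it practical for log volumes far larger than a single machine can hold.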
Instructions
1. Click on your running cluster name (CA cluster) from the Clusters section of the EMR console. A summary of the cluster is shown, along with several possible action buttons:
2. Click Steps > Add step in order to submit the example script for processing by your EMR cluster:
3. Select Hive program in the Step type drop-down menu. The context changes based on the Step type:
4. Fill out the remainder of the Add Step dialog as follows:
- Name: Enter AWS Hive example to process CF logs
- Script S3 location: s3://us-west-2.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q
- Input S3 location: s3://us-west-2.elasticmapreduce.samples
- Output S3 location: s3://calabs-emr#/output/ (Replace # with the number that makes your S3 bucket name unique, and keep the trailing "output/" folder of the bucket.)
- Arguments: -hiveconf hive.support.sql11.reserved.keywords=false
- Action on failure: Select Continue (If processing fails, continuing allows basic debugging, reconfiguration, and additional attempts.)
Tip: Take your time filling out the fields; it is easy to misconfigure the settings above. Common mistakes include an incorrect path to the S3 bucket, forgetting to create or include the output folder, and doubling up on the s3:// protocol in the path. Before continuing, check that your Add step dialog matches the values listed above.
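The mistakes mentioned in the tip can be caught with a quick check before the values are pasted into the dialog. The following is a minimal sketch, not part of the lab: the function name check_s3_uri is hypothetical, the bucket-name rule is a simplified assumption, and calabs-emr1 stands in for whatever your own bucket is called.

```python
import re

# Assumed shape: s3://<bucket>/<optional key>. Bucket naming rules are
# simplified here (lowercase letters, digits, dots, hyphens, 3-63 chars).
_S3_URI = re.compile(r"^s3://([a-z0-9][a-z0-9.-]{1,61}[a-z0-9])(/.*)?$")


def check_s3_uri(uri, require_trailing_slash=False):
    """Return a list of problems found in an S3 URI (empty list means OK)."""
    problems = []
    uri = uri.strip()
    if uri.count("s3://") > 1:
        problems.append("protocol 's3://' appears more than once")
    elif not _S3_URI.match(uri):
        problems.append("not of the form s3://bucket/key")
    if "#" in uri:
        problems.append("'#' placeholder was not replaced with your bucket number")
    if require_trailing_slash and not uri.endswith("/"):
        problems.append("folder path should end with '/'")
    return problems


if __name__ == "__main__":
    # calabs-emr1 is a hypothetical bucket name used only for illustration.
    print(check_s3_uri("s3://s3://calabs-emr1/output/", require_trailing_slash=True))
    print(check_s3_uri("s3://calabs-emr1/output/", require_trailing_slash=True))
```

A check like this only validates the shape of the path. It does not confirm that the bucket or the output folder actually exists.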
5. Click Add when ready to proceed.
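For reference, the same step can also be submitted programmatically instead of through the console. The sketch below uses boto3's add_job_flow_steps call; the helper names build_hive_step and submit_step are hypothetical, the us-west-2 region default is an assumption, and the command-runner.jar argument layout reflects my understanding of what the console generates for a Hive program step, so treat it as an assumption rather than a confirmed equivalent.

```python
def build_hive_step(name, script_uri, input_uri, output_uri, extra_args=()):
    """Build a step definition in the shape boto3's add_job_flow_steps expects."""
    # Assumed argument layout for a Hive script run through command-runner.jar.
    args = [
        "hive-script", "--run-hive-script", "--args",
        "-f", script_uri,
        "-d", f"INPUT={input_uri}",
        "-d", f"OUTPUT={output_uri}",
    ]
    args.extend(extra_args)
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # matches "Action on failure: Continue"
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


def submit_step(cluster_id, step, region="us-west-2"):
    """Submit one step to a running cluster and return its step ID."""
    import boto3  # imported lazily so the builder works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


step = build_hive_step(
    name="AWS Hive example to process CF logs",
    script_uri="s3://us-west-2.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
    input_uri="s3://us-west-2.elasticmapreduce.samples",
    output_uri="s3://calabs-emr#/output/",  # replace # with your bucket's number
    extra_args=["-hiveconf", "hive.support.sql11.reserved.keywords=false"],
)
```

Calling submit_step with your cluster ID and this step would require AWS credentials with permission to add steps to the cluster. Building the definition separately from submitting it makes it easy to inspect the arguments before anything is sent.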
6. Return to Clusters. The Status changes from Waiting - Cluster ready to Running:
The Status change may take a little while (~10 seconds).
7. Select the running cluster name (CA Cluster). The summary information shows the cluster status Running and the Running step as well:
8. Return to the Cluster list, then expand the running cluster. Take note of both the Cluster and Step Status:
Refresh and re-expand the running cluster while observing the Step Status. Although it could vary depending on timing, you should see it transition from Pending, to Running, and finally to Completed.
The Step should complete within about 1 minute, after which the Status of the cluster returns to Waiting - Cluster ready.
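The refresh-and-observe loop described above can also be sketched as a small polling routine. This is an illustrative sketch under stated assumptions: wait_for_step and emr_step_state_getter are hypothetical helper names, the region default is an assumption, and the state lookup relies on boto3's describe_step call, which reports states such as PENDING, RUNNING, and COMPLETED.

```python
import time

# Terminal step states reported by the EMR API.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def wait_for_step(get_state, poll_seconds=10, max_polls=30, sleep=time.sleep):
    """Poll get_state() until a terminal state; return the distinct states seen."""
    seen = []
    for _ in range(max_polls):
        state = get_state()
        if not seen or seen[-1] != state:
            seen.append(state)  # record each transition once
        if state in TERMINAL_STATES:
            return seen
        sleep(poll_seconds)
    last = seen[-1] if seen else "UNKNOWN"
    raise TimeoutError(f"step still {last} after {max_polls} polls")


def emr_step_state_getter(cluster_id, step_id, region="us-west-2"):
    """Return a zero-argument callable that reads the step state via boto3."""
    import boto3  # imported lazily so wait_for_step is usable without boto3

    emr = boto3.client("emr", region_name=region)

    def get_state():
        response = emr.describe_step(ClusterId=cluster_id, StepId=step_id)
        return response["Step"]["Status"]["State"]

    return get_state
```

With real IDs, wait_for_step(emr_step_state_getter(cluster_id, step_id)) would return the sequence of states observed. Passing the state lookup in as a callable keeps the polling logic testable without AWS access. boto3 also provides a step_complete waiter on the EMR client, which may be simpler when the intermediate states are not of interest.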
Summary
In this lab step, you submitted a Step for processing by your running EMR cluster. The code was a Hive script provided by Amazon, along with example CloudFront logs. Processing log files is a very common Big Data use case. In this lab step the example logs came from CloudFront, but they could just as well have come from another AWS service such as CloudTrail, or been Apache HTTP access logs, system uptime logs, or your own custom application logs. Similarly, the example Hive script could have been a Pig program, a streaming Python program, or a custom Java application (JAR file).