Big Data: Amazon EMR, Apache Spark, and Apache Zeppelin – Part 1 of 2

Amazon EMR (Elastic MapReduce) provides a platform to provision and manage Amazon EC2-based data processing clusters.

Amazon EMR clusters come installed with a number of supported projects from the Apache Hadoop and Apache Spark ecosystems. You can either choose to install from a predefined list of software, or pick and choose the ones that make the most sense for your project.

In this article, the first in a two-part series, we will learn to set up Apache Spark and Apache Zeppelin on Amazon EMR using AWS CLI (Command Line Interface). We will also run Spark’s interactive shells to test if they work properly.

What is Apache Spark?

Apache Spark is the first non-Hadoop-based engine supported on EMR. Often more efficient than Hadoop MapReduce, Spark can run complex computations in memory. It also supports different types of workloads, including batch processing and near real-time streaming.

What is Apache Zeppelin?

Apache Zeppelin is a web-based notebook for data analysis, visualisation and reporting. Zeppelin lets you perform data analysis interactively and view the outcome of your analysis visually. It supports the Scala functional programming language with Spark by default. If you have used Jupyter Notebook (previously known as IPython Notebook) or Databricks Cloud before, you will find Zeppelin familiar.

Our assumptions

  • We will assume that the AWS CLI tools have been installed.
  • We will also assume that an IAM (Identity and Access Management) user has been created with AmazonElasticMapReduceFullAccess managed policy attached to it, and that CLI has been configured to use its access key ID and secret access key. This policy gives CLI full access to EMR.
  • Make sure that CLI is configured to use the us-east-1 (N. Virginia) region by default, as the dataset that we will use in our next article is hosted on Amazon S3 in that region.
  • And finally, we will assume that a key pair has been created so that we can SSH into the master node, if necessary.
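As a quick sanity check, you can view or set the CLI configuration described above with aws configure. The credential values shown here are placeholders, not real keys:

```shell
# Configure (or re-check) the IAM user's credentials and default region.
# The access key values below are placeholders.
aws configure
# AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
# AWS Secret Access Key [None]: wJalrXUtnFbEM.../K7MDENG...
# Default region name [None]: us-east-1
# Default output format [None]: json
```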

Creating an EMR cluster

We can easily set up an EMR cluster by using the aws emr create-cluster command.

We will use the latest EMR release 4.3.0. We will install both Spark 1.6.0 and Zeppelin-Sandbox 0.5.5. Using --ec2-attributes KeyName= lets us specify the key pair we want to use to SSH into the master node.

Let’s use one master node and two core nodes of m3.xlarge EC2 instance types. Our data analysis work will be distributed to these core nodes.
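Putting these options together, the command looks something like the following sketch. The cluster name and the key pair name (MyKeyPair) are placeholders to replace with your own; an instance count of three gives us one master node and two core nodes:

```shell
aws emr create-cluster \
  --name "Spark and Zeppelin cluster" \
  --release-label emr-4.3.0 \
  --applications Name=Spark Name=Zeppelin-Sandbox \
  --ec2-attributes KeyName=MyKeyPair \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles
```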

There are many other options available, and I suggest you take a look at them using aws emr create-cluster help.

Waiting for the cluster to start

After issuing the aws emr create-cluster command, it returns the cluster ID. This cluster ID will be used in all our subsequent aws emr commands.

You can view the details of the cluster using the aws emr describe-cluster command.

We are more interested in the state of the cluster and its nodes. It will take some time for the cluster to be provisioned.
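For example, assuming the cluster ID returned earlier was j-1ABCDEFGHIJKL (a placeholder), we can filter the output down to just the cluster state with a --query expression:

```shell
# Show the full cluster details.
aws emr describe-cluster --cluster-id j-1ABCDEFGHIJKL

# Or extract just the cluster state.
aws emr describe-cluster \
  --cluster-id j-1ABCDEFGHIJKL \
  --query 'Cluster.Status.State'
```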

When the provisioning is completed, the Spark cluster should be WAITING for steps to run, and the master and core nodes should indicate that they are RUNNING.

SSH to the master node

Now we can connect to the master node remotely. Instead of running ssh directly, we can issue the aws emr ssh command, which automatically retrieves the master node’s hostname.
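Continuing with the placeholder cluster ID, the key pair file must correspond to the key pair specified when the cluster was created:

```shell
aws emr ssh \
  --cluster-id j-1ABCDEFGHIJKL \
  --key-pair-file ~/MyKeyPair.pem
```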

Spark’s Scala shell

We will not cover the Spark programming model in this article, but we will learn just enough to start an interpreter on the command line and to make sure it works.

Spark supports Scala, Python and R. We can choose to write standalone Spark applications in any of these languages, or work within an interactive interpreter.

For Scala, we can use the spark-shell interpreter.

To make sure that everything works, issuing both sc and sqlContext should return references to the respective objects.
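On the master node, a session might look like the following sketch; the exact object addresses will differ:

```shell
spark-shell
# scala> sc
# res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@...
# scala> sqlContext
# res1: org.apache.spark.sql.SQLContext = ...
```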

Spark’s Python shell

For fellow Pythonistas, we can use pyspark instead. The Spark APIs for all the supported languages will be similar.
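Again, issuing sc and sqlContext at the prompt confirms the contexts are available; the memory addresses shown are illustrative:

```shell
pyspark
# >>> sc
# <pyspark.context.SparkContext object at 0x...>
# >>> sqlContext
# <pyspark.sql.context... object at 0x...>
```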

Spark’s R shell

And for R developers, you can use sparkR.
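As with the other shells, printing sc and sqlContext at the R prompt should show references to the Spark and SQL contexts:

```shell
sparkR
# > sc
# > sqlContext
# Both should print Java references to the respective context objects.
```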

Terminating the EMR cluster

Always remember to terminate your EMR cluster after you have completed your work!
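Using the same placeholder cluster ID as before, termination is a single command, and you can re-run describe-cluster to confirm the state:

```shell
aws emr terminate-clusters --cluster-ids j-1ABCDEFGHIJKL

# Confirm the state has moved to TERMINATING or TERMINATED.
aws emr describe-cluster \
  --cluster-id j-1ABCDEFGHIJKL \
  --query 'Cluster.Status.State'
```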

What’s next?

We have learned to install Spark and Zeppelin on EMR. I also showed you some of the options for using different interactive shells for Scala, Python, and R. These development shells are a quick way to test whether your setup is working properly. Anyone who is new to Spark, or would like to experiment with small snippets of code, can use these shells to test code interactively. If you have programmed in any of these three languages before, it is very likely that you have used an interactive shell, and the experience should be the same.

Of course, this is not the only way to develop for Spark. In our next article, we will learn to use Zeppelin to develop code interactively in the web browser. We will look at a simple data analysis example using Scala. I welcome your comments and questions, and will do my best to integrate them into the next article if you post in time. Chandan Patra published a related post back in November, Amazon EMR: five ways to improve the way you use Hadoop, which you will find useful and interesting.

Eugene Teo


Eugene Teo is a cybersecurity professional at a technology company. He is also a blogger at Cloud Academy, an adjunct lecturer at a university, and a co-organiser of PyData Singapore. He hopes to apply his cybersecurity domain knowledge in data science and engineering to build something interesting. He occasionally writes about his learning journey at his personal blog.
