Launch EMR Cluster with Spark and Zeppelin



In part 3 we will launch a new EMR cluster complete with Apache Spark and Apache Zeppelin installed. We configure the EMR cluster to have access to the AWS Glue Data Catalog. The EMR cluster is launched with a Master node and a single Core node. We then show you how to set up SSH port forwarding to gain access to the Zeppelin web console from your desktop.


- [Instructor] Welcome back to part three of our demonstration. In this part, we'll launch an Elastic MapReduce Cluster. We'll have the EMR servers automatically install Apache Spark and Apache Zeppelin.

Let's begin with a quick review of what we accomplished in part two of our demonstration. In part two, we worked with the Glue Crawler to crawl our dataset stored in the S3 bucket. We used Athena to create a secondary table and register this into the data catalog. Finally, we created an ETL job to perform ETL on our dataset to prepare it for ingestion into our machine learning algorithm.

Let's now move on and begin the configuration of our EMR cluster. Within the EMR service console, we begin launching our cluster by clicking the Create Cluster button. We're taken onto the Quick Options screen. From here, we'll need to navigate into the advanced options. We do so by clicking the Advanced Options link. Next, under Software Configuration, we pick the release of the EMR software that we want.

In our case, we'll go with the latest. We'll also check the tick boxes for Zeppelin and Spark to ensure that these applications are installed. Under the Glue Data Catalog settings, we'll enable both to give our Spark cluster access to the data catalog. We leave the remainder of the settings as defaults and click the Next button to go to the next screen. Under Hardware Configuration, ensure that the instance group configuration is set to uniform.

Within network, we'll pick one of our VPCs to deploy in. In our case, we'll deploy into a public subnet. We'll increase the size of the root device EBS volume to 50 gig to ensure that we have room to install Apache Spark and Zeppelin. We'll remove the task node and drop the number of instances for the core nodes from two to one. We'll leave the purchasing option as on-demand for both. Again, we click the next button at the bottom of the screen.

This takes us to the general cluster settings. Here we'll set the cluster name to be EMICluster-Spark. We leave the rest of the settings on the screen as defaults. Under bootstrap actions, if we wanted to install some custom scripts, we would do so here. But for now, just leave this as is and click Next at the bottom of the screen.

On the final screen, we can set the security options. Here, we choose an SSH key pair. This will give us SSH access onto any of the nodes in our cluster. Under Permissions we configure the IAM roles that will be used within our environment. The first role is used by the EMR service itself. The second is assigned to the nodes within our cluster.

Under the EC2 Security Group section, we can see the security groups assigned to the two different node types. There is a security group explicitly for the master node and a security group explicitly for the core and task nodes. Finally, click the Create Cluster button to begin launching our cluster. We can now see that our EMR cluster has begun launching, as per the Starting status at the top of the screen. The EMR cluster will take approximately five to 10 minutes to finish launching.
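For anyone who prefers to script this rather than click through the console, the choices made above could be sketched roughly as a request for boto3's EMR `run_job_flow` call. This is only a sketch under assumptions: the release label, subnet ID, key pair name and instance type are placeholders you would supply, the role names are the usual EMR defaults rather than anything shown in the demonstration, and only the Spark side of the Glue Data Catalog setting is included.

```python
def instance_group(name, role, instance_type):
    """One on-demand instance group containing a single node."""
    return {
        "Name": name,
        "InstanceRole": role,          # "MASTER" or "CORE"
        "Market": "ON_DEMAND",
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }


def build_cluster_request(release_label, subnet_id, key_name,
                          instance_type="m5.xlarge"):
    """Return keyword arguments for the EMR run_job_flow call.

    release_label, subnet_id, key_name and instance_type are placeholders
    (assumptions); the transcript does not state their values.
    """
    glue_factory = ("com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory")
    return {
        "Name": "EMICluster-Spark",    # cluster name as given in the transcript
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}, {"Name": "Zeppelin"}],
        # Point Spark's metastore client at the Glue Data Catalog.
        "Configurations": [{
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": glue_factory,
            },
        }],
        "EbsRootVolumeSize": 50,       # root volume size in GiB
        "Instances": {
            "InstanceGroups": [
                instance_group("Master", "MASTER", instance_type),
                instance_group("Core", "CORE", instance_type),
            ],
            "Ec2KeyName": key_name,
            "Ec2SubnetId": subnet_id,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default role names (assumption): one for the EMR service itself,
        # one for the EC2 nodes in the cluster.
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }


def launch_cluster(request):
    """Submit the request and return the new cluster ID (needs boto3)."""
    import boto3  # third-party; imported lazily so the sketch loads without it
    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]
```

Calling `launch_cluster(build_cluster_request(...))` with real values would start the cluster; building the request separately makes it easy to print and review before anything is launched.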

We'll now jump ahead in the demonstration to the point where the EMR cluster is ready, as can be seen here by the status. Next, we'll take a copy of the public DNS record assigned to the master node. We'll use this to SSH onto the master node itself, but before we can do this, we need to go to the EC2 security group assigned to the master node and allow inbound access on port 22 for SSH.
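The same check can be done programmatically. The sketch below assumes the response shape of boto3's EMR `describe_cluster` call and treats the `WAITING` and `RUNNING` states as "ready"; the function names are made up for illustration.

```python
READY_STATES = ("WAITING", "RUNNING")


def master_dns_if_ready(describe_cluster_response):
    """Return the master public DNS name once the cluster is usable, else None."""
    cluster = describe_cluster_response["Cluster"]
    if cluster["Status"]["State"] in READY_STATES:
        return cluster.get("MasterPublicDnsName")
    return None


def fetch_master_dns(cluster_id):
    """Query EMR for the cluster and apply the check above (needs boto3)."""
    import boto3  # third-party; imported lazily so the sketch loads without it
    response = boto3.client("emr").describe_cluster(ClusterId=cluster_id)
    return master_dns_if_ready(response)
```

A caller could poll `fetch_master_dns` every 30 seconds or so until it returns a value, which mirrors watching the status in the console.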

Let's head over to the EC2 console where we'll see the two instances that have been launched as part of our cluster, one instance for the master node and one instance for the core node. We'll need to determine which one is which. We can do so by looking at the tags. The EMR service will tag both instances based on the role they play within the cluster. Here we can see under Tags that the first instance has the tag of Master.

This is the master node within the EMR cluster. For the second instance, if we take a look at the tags, we can see that it has the tag Core. This is the core node within the EMR cluster. Going back to the master node, under Description, we'll want to click on its security group, in this case the ElasticMapReduce-master security group. Let's open this security group up in a new browser tab and ensure that there is an inbound rule for port 22 traffic.
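If there were more than a couple of instances, reading tags by eye would get tedious. As a sketch, the grouping could be done over the response of boto3's EC2 `describe_instances` call. It assumes EMR's system tag key `aws:elasticmapreduce:instance-group-role` carries the role (values such as `MASTER` and `CORE`); check the tags on your own instances before relying on it.

```python
# System tag that EMR applies to cluster instances (assumption, see lead-in).
ROLE_TAG = "aws:elasticmapreduce:instance-group-role"


def instances_by_role(describe_instances_response):
    """Group instance IDs by the EMR role found in their tags."""
    roles = {}
    for reservation in describe_instances_response["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            role = tags.get(ROLE_TAG)
            if role is not None:  # skip instances that are not part of a cluster
                roles.setdefault(role, []).append(instance["InstanceId"])
    return roles
```

The input would come from `boto3.client("ec2").describe_instances()`, optionally filtered down to the cluster's instances.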

Here we'll update the port 22 rule to allow inbound traffic from my current public IP address. We'll save this rule into our security group and head back into the EMR service console for our cluster. We'll take a copy of the master public DNS record assigned to our master node and swap over into our terminal session where we'll attempt to form an SSH connection to the EMR master node. We'll start a fresh terminal session and then type the SSH command to give us access onto the master node.
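The inbound rule described here can also be expressed in code. This sketch builds the permission structure that boto3's EC2 `authorize_security_group_ingress` call accepts, restricted to a single IPv4 address; the function names and the rule description text are illustrative only.

```python
import ipaddress


def ssh_ingress_permission(my_public_ip):
    """Build an inbound rule allowing TCP port 22 from exactly one IPv4 address."""
    address = ipaddress.ip_address(my_public_ip)  # raises ValueError if malformed
    if address.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        # /32 limits the rule to the single address rather than a range.
        "IpRanges": [{"CidrIp": f"{address}/32",
                      "Description": "SSH from my workstation"}],
    }


def allow_ssh(group_id, my_public_ip):
    """Apply the rule to a security group (needs boto3 and EC2 permissions)."""
    import boto3  # third-party; imported lazily so the sketch loads without it
    boto3.client("ec2").authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[ssh_ingress_permission(my_public_ip)],
    )
```

Limiting the source to a `/32` keeps port 22 closed to everyone except the one workstation, which matches the "my current public IP address" choice made in the console.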

So, we'll type ssh, then -v for verbose, then -i pointing to our SSH key, then hadoop as the user, an at sign, and then the public DNS record that was assigned to the master node. If our network connectivity and security group configuration are all in order, we should gain SSH access onto the master node. Here we can see that that is indeed the case. We can see that we have successfully logged in as the hadoop user and that the private IP address for the master node is
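Written out, the command spoken above has the following shape. The sketch assembles it as an argument list so it can be printed for review or handed to `subprocess.run`; the key file name and DNS name are placeholders, not values from the demonstration.

```python
import shlex


def ssh_command(key_file, master_dns, user="hadoop"):
    """Argument list for a verbose SSH login to the EMR master node."""
    return ["ssh", "-v", "-i", key_file, f"{user}@{master_dns}"]


# Placeholder values (assumptions): substitute your own key file and the
# public DNS name copied from the EMR console.
cmd = ssh_command("emr-key.pem", "ec2-203-0-113-10.compute-1.amazonaws.com")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd)` would open the session; printing it first is a cheap way to confirm the user and host are what you expect.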

Let's now have a look at the running processes on the master node and ensure that the Zeppelin process is up and running. So, we'll search for Zeppelin, and there we can see that the Zeppelin process is actually up and running. This is the process which serves up the interactive notebook application. Let's confirm the port that Zeppelin is configured to listen on. We know that by default this port will be 8890, so let's go ahead and confirm this.

And again, there we can confirm that the Zeppelin application is actually listening on 8890. Let's do a final test on this box. We'll curl localhost on port 8890. We expect to get back some HTML that represents the home page for the Zeppelin application. Again, this has worked, so we definitely know that Zeppelin is listening on port 8890 and that the process is up and running. At this stage we're ready to set up SSH port forwarding for the Zeppelin port 8890 from the desktop to the master node in our EMR cluster.
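The port check and the curl test could be approximated with Python's standard library, run on the master node itself. This is a sketch, not what the demonstration ran; the function names are made up, and port 8890 is the default mentioned above.

```python
import socket
import urllib.request


def port_is_open(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False


def fetch_home_page(url="http://localhost:8890/", timeout=5.0):
    """Fetch a page and return (HTTP status, body text), like a basic curl."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body
```

On the master node, `port_is_open("localhost", 8890)` answers the "is it listening" question, and `fetch_home_page()` should come back with a 200 status and HTML if Zeppelin is serving its home page.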

Let's go ahead and do this now. In our terminal, we'll start a new terminal shell and then we'll set up our SSH port forwarding rule. The requirement for SSH port forwarding is to allow us to pull up a browser on our local machine and be able to browse to localhost port 8890 and have that traffic forwarded to the Zeppelin application running on the master node. Again, we'll need to take a copy of the public DNS record assigned to the master node in our EMR cluster. This is where we'll do the port forwarding to.

The connection uses the -f parameter which puts the SSH connection into the background and keeps it running. Okay, that looks like it's successfully connected. We'll now have a look at the processes running on the desktop to ensure that the SSH port forwarding connection has indeed been launched successfully. As you can see, the SSH command that we just executed is indeed running in the background as highlighted here.
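The transcript does not spell out the full port forwarding command, so the following is a reconstruction under assumptions rather than the exact command used. It combines the `-f` flag mentioned above with OpenSSH's `-L` (local forward) and `-N` (no remote command, which `-f` needs when no command is given); the key file and DNS name are placeholders.

```python
def port_forward_command(key_file, master_dns, local_port=8890,
                         remote_port=8890, user="hadoop"):
    """Argument list for a backgrounded SSH local port forward.

    Traffic sent to localhost:local_port on the desktop is forwarded to
    localhost:remote_port as seen from the master node.
    """
    return [
        "ssh", "-i", key_file,
        "-f",   # go to the background after authentication
        "-N",   # do not run a remote command, only forward (assumption)
        "-L", f"{local_port}:localhost:{remote_port}",
        f"{user}@{master_dns}",
    ]


# Placeholder values (assumptions), printed for review rather than executed.
forward = port_forward_command("emr-key.pem",
                               "ec2-203-0-113-10.compute-1.amazonaws.com")
print(" ".join(forward))
```

Running the resulting command and then listing local processes for `ssh` would show the backgrounded connection, as in the demonstration.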

Let's now swap over to the browser running on our desktop and see whether we can pull up the Zeppelin application by browsing to http://localhost:8890. And there we go, this is the Zeppelin Notebook application running on the master node in our EMR cluster.

Okay, that concludes the third part of this demonstration. Go ahead and close this lecture and we'll see you shortly in the next one.

About the Author

Jeremy is a Content Lead Architect and DevOps SME here at Cloud Academy where he specializes in developing DevOps technical training documentation.

He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 25+ years. In recent times, Jeremy has been focused on DevOps, Cloud (AWS, Azure, GCP), Security, Kubernetes, and Machine Learning.

Jeremy holds professional certifications for AWS, Azure, GCP, Terraform, and Kubernetes (CKA, CKAD, CKS).