Skip to main content

Big Data: Amazon EMR, Apache Spark and Apache Zeppelin – Part 2 of 2

Amazon EMRIn the first article about Amazon EMR, in our two-part series, we learned to install Apache Spark and Apache Zeppelin on Amazon EMR. We also learned ways of using different interactive shells for Scala, Python, and R, to program for Spark.
Amazon EMR

Let’s continue with the final part of this series. We’ll learn to perform simple data analysis using Scala with Zeppelin.

Access the Zeppelin Notebook

Before we can access the Zeppelin Notebook, we need to forward all requests from localhost:8890 to the master node. This is because port 8890 is bound on the master node, and not on our local machine.

$ ssh -i cloudacademy-keypair.pem -L 8890:ec2-[redacted].compute-1.amazonaws.com:8890 hadoop@ec2-[redacted].compute-1.amazonaws.com -Nv
[...]
Authenticated to ec2-[redacted].compute-1.amazonaws.com ([redacted]:22).
debug1: Local connections to LOCALHOST:8890 forwarded to remote address ec2-[redacted].compute-1.amazonaws.com:8890
debug1: Local forwarding listening on ::1 port 8890.
debug1: channel 0: new [port listener]
debug1: Local forwarding listening on 127.0.0.1 port 8890.

Having done that, we can now access http://localhost:8890/.
Amazon EMR

Analyse the data!

Zeppelin has a clean and intuitive web interface that does not need much explanation to get started. We can start by creating a new note.
We will use a dataset that is hosted on Amazon S3 as an example. The URL to the S3 public bucket is s3://us-east-1.elasticmapreduce.samples/flightdata/input/. This dataset is fairly large. It is around 4GB when it is compressed, and 79GB after uncompression. It is pulled from Amazon’s official blog post New – Apache Spark on Amazon EMR. The dataset originally came from the US’s Department of Transportation and is a good size to play with.
To add text in the notebook, we begin the text with %md.
Amazon EMR
We will read the dataset from a public, read-only S3 bucket to a DataFrame. There are a total of 162,212,419 rows.
Zeppelin 2
We will display the first three records of the dataset. While we are at it, let’s also register the DataFrame as a table so that we can query them with SQL statements.
Zeppelin 3
We can query for the top 10 airports with the most departures since 2000. The top three airports are: Hartsfield–Jackson Atlanta International Airport (ATL), O’Hare International Airport (ORD), and Dallas/Fort Worth International Airport (DFW). Is anyone surprised by these three? I was a little.
Amazon EMR
Next we will query for the top 10 airports with the most flight delays over 15 minutes since 2000, and the top three are: O’Hare International Airport (ORD), Hartsfield–Jackson Atlanta International Airport (ATL), and Dallas/Fort Worth International Airport (DFW). 
Amazon EMR
How about we look at flight delays over 60 minutes instead? We see the same top three airports in the same order again.
Amazon EMR
Let’s look at the top 10 airports with the most flight cancellations. Again, the same top three airports are O’Hare International Airport (ORD), Dallas/Fort Worth International Airport (DFW), and Hartsfield–Jackson Atlanta International Airport (ATL). Maybe it is wise to avoid these airports if we can!
Amazon EMR
And finally, the top 10 most popular flight routes. The top three routes were Los Angeles International Airport (LAX) to McCarran International Airport (LAS), Los Angeles International Airport (LAX) to San Francisco International Airport (SFO), and Los Angeles International Airport (LAX) to San Diego International Airport (SAN).
Amazon EMR

Terminating the EMR cluster

Always remember to terminate your EMR cluster after you have completed your work. As we are running a cluster of machines, we will be billed for using the EMR box per hour as well as the on-demand Linux instances per hour. These charges can add up very quickly especially if you run a large cluster. So to avoid spending more than you should, do terminate your EMR cluster if you do not need to use it.

$ aws emr terminate-clusters --cluster-id j-ABCDEFGHIJKLM
$ aws emr describe-cluster --cluster-id j-ABCDEFGHIJKLM | grep State\"\:
            "State": "TERMINATING",
                    "State": "TERMINATING",
                    "State": "TERMINATING",

What’s next?

In this article, we have learned to read in a large dataset from a S3 public bucket. We have also performed SQL queries on the dataset to answer a few interesting questions (if you live in the US, or have to travel to the US frequently). If you have followed along the examples here, you will soon realise that there is a limitation to this setup. The changes we have made on Zeppelin is only persistent as long as the EMR cluster is running. If we were to terminate EMR, we will also lose the changes on Zeppelin. Zeppelin itself does not support exporting or saving of its notebooks (yet, I hope). Obviously this is not ideal. If you have a suggestion on how we can avoid this problem, I would love to hear from you.
We are only scratching the surface on this topic. I hope it gives you a good starting point to learn more about Amazon EMR. If you are interested to learn more about the other supported projects in EMR, give me your suggestions on what you would like to read in my future blog posts. I am more than happy to learn and share my knowledge with you.

Written by

Eugene Teo is a director of security at a US-based technology company. He is interested in applying machine learning techniques to solve problems in the security domain.

Related Posts

— October 18, 2018

Microservices Architecture: Advantages and Drawbacks

Microservices are a way of breaking large software projects into loosely coupled modules, which communicate with each other through simple Application Programming Interfaces (APIs).Microservices have become increasingly popular over the past few years. The modular architectural style,...

Read more
  • AWS
  • Microservices
— October 2, 2018

What Are Best Practices for Tagging AWS Resources?

There are many use cases for tags, but what are the best practices for tagging AWS resources? In order for your organization to effectively manage resources (and your monthly AWS bill), you need to implement and adopt a thoughtful tagging strategy that makes sense for your business. The...

Read more
  • AWS
  • cost optimization
— September 26, 2018

How to Optimize Amazon S3 Performance

Amazon S3 is the most common storage options for many organizations, being object storage it is used for a wide variety of data types, from the smallest objects to huge datasets. All in all, Amazon S3 is a great service to store a wide scope of data types in a highly available and resil...

Read more
  • Amazon S3
  • AWS
— September 18, 2018

How to Optimize Cloud Costs with Spot Instances: New on Cloud Academy

One of the main promises of cloud computing is access to nearly endless capacity. However, it doesn’t come cheap. With the introduction of Spot Instances for Amazon Web Services’ Elastic Compute Cloud (AWS EC2) in 2009, spot instances have been a way for major cloud providers to sell sp...

Read more
  • AWS
  • Azure
  • Google Cloud
— August 23, 2018

What are the Benefits of Machine Learning in the Cloud?

A Comparison of Machine Learning Services on AWS, Azure, and Google CloudArtificial intelligence and machine learning are steadily making their way into enterprise applications in areas such as customer support, fraud detection, and business intelligence. There is every reason to beli...

Read more
  • AWS
  • Azure
  • Google Cloud
  • Machine Learning
— August 17, 2018

How to Use AWS CLI

The AWS Command Line Interface (CLI) is for managing your AWS services from a terminal session on your own client, allowing you to control and configure multiple AWS services.So you’ve been using AWS for awhile and finally feel comfortable clicking your way through all the services....

Read more
  • AWS
Albert Qian
— August 9, 2018

AWS Summit Chicago: New AWS Features Announced

Thousands of cloud practitioners descended on Chicago’s McCormick Place West last week to hear the latest updates around Amazon Web Services (AWS). While a typical hot and humid summer made its presence known outside, attendees inside basked in the comfort of air conditioning to hone th...

Read more
  • AWS
  • AWS Summits
— August 8, 2018

From Monolith to Serverless – The Evolving Cloudscape of Compute

Containers can help fragment monoliths into logical, easier to use workloads. The AWS Summit New York was held on July 17 and Cloud Academy sponsored my trip to the event. As someone who covers enterprise cloud technologies and services, the recent Amazon Web Services event was an insig...

Read more
  • AWS
  • AWS Summits
  • Containers
  • DevOps
  • serverless
— July 11, 2018

AWS Certification Practice Exam: What to Expect from Test Questions

If you’re building applications on the AWS cloud or looking to get started in cloud computing, certification is a way to build deep knowledge in key services unique to the AWS platform. AWS currently offers nine certifications that cover the major cloud roles including Solutions Archite...

Read more
  • AWS
— June 26, 2018

Disadvantages of Cloud Computing

If you want to deliver digital services of any kind, you’ll need to compute resources including CPU, memory, storage, and network connectivity. Which resources you choose for your delivery, cloud-based or local, is up to you. But you’ll definitely want to do your homework first.Cloud ...

Read more
  • AWS
  • Azure
  • Cloud Computing
  • Google Cloud
— March 13, 2018

Choosing the Right AWS Certification for You and Your Team

As companies increasingly shift workloads to the public cloud, cloud computing has moved from a nice-to-have to a core competency in the enterprise. This shift requires a new set of skills to design, deploy, and manage applications in the cloud.As the market leader and most mature pro...

Read more
  • AWS
  • AWS certifications
— March 7, 2018

How to Encrypt an EBS Volume

Keeping data and applications safe in the cloud is one the most visible challenges facing cloud teams in 2018. Cloud storage services where data resides are frequently a target for hackers, not because the services are inherently weak, but because they are often improperly configured....

Read more
  • AWS
  • encryption