Big Data: Amazon EMR, Apache Spark and Apache Zeppelin – Part 2 of 2

In the first article of this two-part series on Amazon EMR, we learned to install Apache Spark and Apache Zeppelin on Amazon EMR. We also learned how to use the interactive shells for Scala, Python, and R to program for Spark.

Let’s continue with the final part of this series. We’ll learn to perform simple data analysis using Scala with Zeppelin.

Access the Zeppelin Notebook

Before we can access the Zeppelin Notebook, we need to forward all requests from localhost:8890 to the master node. This is because port 8890 is bound on the master node, and not on our local machine.

$ ssh -i cloudacademy-keypair.pem -L 8890:ec2-[redacted].compute-1.amazonaws.com:8890 hadoop@ec2-[redacted].compute-1.amazonaws.com -Nv
[...]
Authenticated to ec2-[redacted].compute-1.amazonaws.com ([redacted]:22).
debug1: Local connections to LOCALHOST:8890 forwarded to remote address ec2-[redacted].compute-1.amazonaws.com:8890
debug1: Local forwarding listening on ::1 port 8890.
debug1: channel 0: new [port listener]
debug1: Local forwarding listening on 127.0.0.1 port 8890.

Having done that, we can now access http://localhost:8890/.

Analyze the data!

Zeppelin has a clean and intuitive web interface that does not need much explanation to get started. We can start by creating a new note.

We will use a dataset hosted on Amazon S3 as an example. The URL of the public S3 bucket is s3://us-east-1.elasticmapreduce.samples/flightdata/input/. This dataset is fairly large: around 4 GB compressed and 79 GB uncompressed. It is taken from Amazon’s official blog post, New – Apache Spark on Amazon EMR. The dataset originally came from the US Department of Transportation and is a good size to play with.

To add text to the notebook, we begin the paragraph with %md, which tells Zeppelin to render it as Markdown.

We will read the dataset from a public, read-only S3 bucket to a DataFrame. There are a total of 162,212,419 rows.
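
As a minimal sketch, the paragraph for this step could look like the following. It assumes Zeppelin’s Scala Spark interpreter, which provides `sqlContext` for us, and it assumes the files under the S3 prefix are stored as Parquet. The names `inputPath`, `flights`, and `totalRows` are our own.

```scala
// Zeppelin's Spark interpreter provides `sqlContext`, so no setup is needed here.
val inputPath = "s3://us-east-1.elasticmapreduce.samples/flightdata/input/"

// Assumption: the data is stored as Parquet. Use a different reader if it is not.
// Reading is lazy: no rows are scanned until an action such as count() runs.
val flights = sqlContext.read.parquet(inputPath)

// count() scans the whole dataset, so it can take a while on a small cluster.
val totalRows = flights.count()
println(s"Total rows: $totalRows")
```

If the row count differs from the figure above, the public dataset may have changed since this article was written.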

We will display the first three records of the dataset. While we are at it, let’s also register the DataFrame as a table so that we can query it with SQL statements.
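
A sketch of this step, under the same assumptions as before (Zeppelin provides `sqlContext`, and the data is Parquet). The table name `flights` is our own choice, not something the dataset dictates. The read is declared again so that the paragraph runs on its own; it is lazy, so repeating it is cheap.

```scala
// Assumption: Parquet data, read through Zeppelin's predefined `sqlContext`.
val flights = sqlContext.read.parquet(
  "s3://us-east-1.elasticmapreduce.samples/flightdata/input/")

// Hypothetical table name used by the SQL queries in the rest of the notebook.
val tableName = "flights"

flights.printSchema() // list the column names and types before writing queries
flights.show(3)       // print the first three records

// Register the DataFrame so SQL statements can refer to it by name.
// On Spark 2.0 and later, createOrReplaceTempView(tableName) is the preferred call.
flights.registerTempTable(tableName)
```

Checking the output of `printSchema()` first is worthwhile, because the queries that follow depend on the exact column names.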

We can query for the top 10 airports with the most departures since 2000. The top three airports are Hartsfield–Jackson Atlanta International Airport (ATL), O’Hare International Airport (ORD), and Dallas/Fort Worth International Airport (DFW). Is anyone surprised by these three? I was a little.
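
One way to write this query, as a sketch: it assumes the DataFrame is registered as a table named `flights` and that it has `origin` and `year` columns. Those names are assumptions on our part, so check them against the schema first.

```scala
// Assumptions: table `flights` exists, with columns `origin` and `year`.
val topDeparturesSql =
  """SELECT origin, COUNT(*) AS total_departures
    |FROM flights
    |WHERE year >= 2000
    |GROUP BY origin
    |ORDER BY total_departures DESC
    |LIMIT 10""".stripMargin

// `sqlContext` is provided by Zeppelin's Spark interpreter.
sqlContext.sql(topDeparturesSql).show()
```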

Next, we will query for the top 10 airports with the most flight delays over 15 minutes since 2000, and the top three are: O’Hare International Airport (ORD), Hartsfield–Jackson Atlanta International Airport (ATL), and Dallas/Fort Worth International Airport (DFW).

How about we look at flight delays over 60 minutes instead? We see the same top three airports in the same order again.
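
Both delay queries differ only in the threshold, so a small helper can build them. This is a sketch under assumptions: the table is registered as `flights`, and a column named `depdelay` holds the departure delay in minutes. The column name, and the choice of departure rather than arrival delay, are our guesses; `delayedDeparturesSql` is a hypothetical helper.

```scala
// Hypothetical helper: builds the "most delayed airports" query for a threshold.
// Assumptions: table `flights` with columns `origin`, `year`, and `depdelay` (minutes).
def delayedDeparturesSql(minDelayMinutes: Int): String =
  s"""SELECT origin, COUNT(*) AS delayed_departures
     |FROM flights
     |WHERE depdelay > $minDelayMinutes AND year >= 2000
     |GROUP BY origin
     |ORDER BY delayed_departures DESC
     |LIMIT 10""".stripMargin

// `sqlContext` is provided by Zeppelin's Spark interpreter.
sqlContext.sql(delayedDeparturesSql(15)).show() // delays over 15 minutes
sqlContext.sql(delayedDeparturesSql(60)).show() // delays over 60 minutes
```

Keeping the threshold as a parameter makes it easy to try other cut-offs without copying the query.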

Let’s look at the top 10 airports with the most flight cancellations. Again, the same three airports lead the list, this time in a different order: O’Hare International Airport (ORD), Dallas/Fort Worth International Airport (DFW), and Hartsfield–Jackson Atlanta International Airport (ATL). Maybe it is wise to avoid these airports if we can!
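
A sketch of the cancellation query. It assumes the table is registered as `flights` and that a column named `cancelled` marks cancelled flights with the value 1. Both the column name and its encoding are assumptions, and no year filter is applied because none is stated for this query.

```scala
// Assumptions: table `flights` with columns `origin` and `cancelled`,
// where cancelled = 1 marks a cancelled flight. Adjust to the real schema.
val topCancellationsSql =
  """SELECT origin, COUNT(*) AS cancellations
    |FROM flights
    |WHERE cancelled = 1
    |GROUP BY origin
    |ORDER BY cancellations DESC
    |LIMIT 10""".stripMargin

// `sqlContext` is provided by Zeppelin's Spark interpreter.
sqlContext.sql(topCancellationsSql).show()
```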

And finally, the top 10 most popular flight routes. The top three routes were Los Angeles International Airport (LAX) to McCarran International Airport (LAS), Los Angeles International Airport (LAX) to San Francisco International Airport (SFO), and Los Angeles International Airport (LAX) to San Diego International Airport (SAN).
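
A route is an origin and destination pair, so this query groups by both columns. As before, this is a sketch: the table name `flights` and the column names `origin` and `dest` are assumptions to verify against the schema.

```scala
// Assumptions: table `flights` with columns `origin` and `dest`.
val topRoutesSql =
  """SELECT origin, dest, COUNT(*) AS total_flights
    |FROM flights
    |GROUP BY origin, dest
    |ORDER BY total_flights DESC
    |LIMIT 10""".stripMargin

// `sqlContext` is provided by Zeppelin's Spark interpreter.
sqlContext.sql(topRoutesSql).show()
```

Note that grouping by the pair treats LAX to LAS and LAS to LAX as two separate routes.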

Terminating the EMR cluster

Always remember to terminate your EMR cluster once you have completed your work. Because we are running a cluster of machines, we are billed per hour for EMR itself as well as for the underlying on-demand Linux instances. These charges add up very quickly, especially on a large cluster, so terminate the cluster as soon as you no longer need it.

$ aws emr terminate-clusters --cluster-ids j-ABCDEFGHIJKLM
$ aws emr describe-cluster --cluster-id j-ABCDEFGHIJKLM | grep State\"\:
            "State": "TERMINATING",
                    "State": "TERMINATING",
                    "State": "TERMINATING",

What’s next?

In this article, we learned to read a large dataset from a public S3 bucket. We also ran SQL queries on the dataset to answer a few questions that are interesting if you live in the US or travel there frequently. If you have followed along with the examples here, you will soon notice a limitation of this setup: the changes we make in Zeppelin persist only as long as the EMR cluster is running. If we terminate the cluster, we also lose our notebooks. At the time of writing, Zeppelin itself does not support exporting or saving its notebooks (yet, I hope). Obviously, this is not ideal. If you have a suggestion on how we can avoid this problem, I would love to hear from you.

We are only scratching the surface of this topic. I hope it gives you a good starting point for learning more about Amazon EMR. If you are interested in the other projects that EMR supports, let me know what you would like to read about in my future blog posts. I am more than happy to learn and share my knowledge with you.

Written by

Eugene Teo

Eugene Teo is a director of security at a US-based technology company. He is interested in applying machine learning techniques to solve problems in the security domain.

