Big Data: Getting Started with Hadoop, Sqoop & Hive

(Update) We’ve recently uploaded new Big Data training material covering services on Amazon Web Services, Microsoft Azure, and Google Cloud Platform to the Cloud Academy Training Library. On top of that, we’ve been busy adding new content to the Cloud Academy blog on how best to train yourself and your team on Big Data.


This is a guest post by 47Line Technologies

We live in the Data Age! The web has grown rapidly in both size and scale over the last 10 years and shows no signs of slowing down. Every passing year, more data is generated than in all the previous years combined. Moore’s law holds true not only for hardware but for the data being generated as well! Rather than coining a new phrase for such vast amounts of data, the computing industry decided to call it, plain and simple, Big Data.

Apache Hadoop is a framework that allows for the distributed processing of such large data sets across clusters of machines. At its core, it consists of two sub-projects: Hadoop MapReduce and the Hadoop Distributed File System (HDFS). Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. HDFS is the primary storage system used by Hadoop applications; it creates multiple replicas of data blocks and distributes them across compute nodes throughout a cluster to enable reliable, extremely rapid computations. The logical question arises: how do we set up a Hadoop cluster?

Figure 1: Map Reduce Architecture

 

Installation of Apache Hadoop 1.x

We will install Hadoop on three machines. One machine, the master, runs the NameNode and the JobTracker; the other two, the slaves, each run a DataNode and a TaskTracker.
Prerequisites

  1. Linux as the development and production platform (Note: Windows is supported only as a development platform and is not recommended for production use)
  2. Java 1.6 or higher, preferably from Sun, must be installed
  3. ssh must be installed and sshd must be running
  4. From a networking standpoint, all three machines must be reachable (pingable) from one another

Before proceeding with the installation of Hadoop, ensure that the prerequisites are in place on all three machines. Update /etc/hosts on all machines so that they can refer to one another by the hostnames master, slave1, and slave2.
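A typical /etc/hosts might contain entries like the following (the IP addresses are placeholders; substitute the actual addresses of your machines):

192.168.1.10    master
192.168.1.11    slave1
192.168.1.12    slave2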

Download and Installation
Download Hadoop 1.2.1. Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster.
Configuration Files
The following files need to be updated:
conf/hadoop-env.sh
On all machines, edit the file conf/hadoop-env.sh to define JAVA_HOME to be the root of your Java installation. The root of the Hadoop distribution is referred to as HADOOP_HOME. All machines in the cluster usually have the same HADOOP_HOME path.
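For example, if Java were installed under /usr/lib/jvm/java-6-sun (a placeholder path; point this at your own installation), the relevant line in conf/hadoop-env.sh would be:

export JAVA_HOME=/usr/lib/jvm/java-6-sun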
conf/masters
Update this file on master machine alone with the following line:
master
conf/slaves
Update this file on master machine alone with the following lines:
slave1
slave2
conf/core-site.xml
Update this file on all machines:

<property>
	<name>fs.default.name</name>
	<value>hdfs://master:54310</value>
</property>
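Note that in the actual configuration files these property elements sit inside the top-level <configuration> element, so a minimal core-site.xml would look like this:

<?xml version="1.0"?>
<configuration>
	<property>
		<name>fs.default.name</name>
		<value>hdfs://master:54310</value>
	</property>
</configuration>

The same applies to mapred-site.xml and hdfs-site.xml below.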

conf/mapred-site.xml
Update this file on all machines:

<property>
	<name>mapred.job.tracker</name>
	<value>master:54311</value>
</property>

conf/hdfs-site.xml
The default value of dfs.replication is 3. Since there are only two DataNodes in our Hadoop cluster, we set this value to 2. Update this file on all machines:

<property>
	<name>dfs.replication</name>
	<value>2</value>
</property>

After changing these configuration parameters, we have to format HDFS via the NameNode.

bin/hadoop namenode -format

We then start the Hadoop daemons to bring up the cluster.
On the master machine, run:

bin/start-dfs.sh
bin/start-mapred.sh

Your Hadoop cluster is up and running!
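To check that everything started correctly, you can run jps (part of the JDK) on each machine and the dfsadmin report on the master. Exact output varies with your environment, but the master should typically show NameNode, SecondaryNameNode, and JobTracker processes, while each slave should show a DataNode and a TaskTracker:

# List the running Hadoop daemons on each machine
jps

# On the master, confirm that both DataNodes have registered with the NameNode
bin/hadoop dfsadmin -report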

Loading Data into HDFS

Data stored in databases and data warehouses within a corporate data center has to be efficiently transferred into HDFS. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table.

Figure 2: Sqoop architecture

Prerequisites

  1. A working Hadoop cluster

Download and Installation
Download Sqoop 1.4.4. Installing Sqoop typically involves unpacking the software on the NameNode machine. Set SQOOP_HOME and add it to PATH.
Let’s assume that MySQL is the corporate database. For Sqoop to connect to MySQL, we need to copy mysql-connector-java-<version>.jar into the SQOOP_HOME/lib directory.
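A minimal setup might look like the following (the installation path is a placeholder; adjust it to wherever you unpacked Sqoop, and use the connector version that matches your MySQL server):

export SQOOP_HOME=/usr/local/sqoop-1.4.4
export PATH=$PATH:$SQOOP_HOME/bin
cp mysql-connector-java-<version>.jar $SQOOP_HOME/lib/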
Import data into HDFS
As an example, here is a basic import of a table named CUSTOMERS from the cust database:

sqoop import --connect jdbc:mysql://db.foo.com/cust --table CUSTOMERS

On successful completion, a set of files containing a copy of the imported table is present in HDFS.
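In practice the database will usually require credentials, and you may want to control how many parallel map tasks Sqoop launches. A variant of the same import might look like this (dbuser is a placeholder username; -P prompts for the password and -m sets the number of mappers):

sqoop import --connect jdbc:mysql://db.foo.com/cust --table CUSTOMERS --username dbuser -P -m 4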

Analysis of HDFS Data

Now that data is in HDFS, it’s time to perform analysis on the data and gain valuable insights.
In the early days, end users had to write MapReduce programs even for simple tasks like computing raw counts or averages. Hadoop lacks the expressiveness of popular query languages like SQL, and as a result users ended up spending hours (if not days) writing programs for typical analyses.

Enter Hive!

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive was created to make it possible for analysts with strong SQL skills (but meager Java programming skills) to run queries on huge volumes of data and extract patterns and meaningful information. It provides an SQL-like language called HiveQL while maintaining full support for MapReduce: every HiveQL query is translated into MapReduce jobs that run on the underlying Hadoop cluster.

Figure 3: Hive Architecture

Prerequisites

  1. A working Hadoop cluster

Download and Installation
Download Hive 0.11. Installing Hive typically involves unpacking the software on the NameNode machine. Set HIVE_HOME and add it to PATH.
In addition, before you can create a table in Hive, you must create the /tmp and /user/hive/warehouse directories (the latter is hive.metastore.warehouse.dir) in HDFS and make them group-writable (chmod g+w). The commands are listed below:

$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

Another important feature of Sqoop is that it can import data directly into Hive.

sqoop import --connect jdbc:mysql://db.foo.com/cust --table CUSTOMERS --hive-import

The above command creates a new Hive table named CUSTOMERS and loads it with the data from the corporate database. It’s time to gain business insights!

SELECT COUNT(*) FROM CUSTOMERS;
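Since HiveQL supports most familiar SQL constructs, richer analysis reads just as naturally. As a quick sketch, assuming the imported table had a CITY column (a hypothetical column used purely for illustration), the top customer locations could be found with:

SELECT city, COUNT(*) AS num_customers
FROM CUSTOMERS
GROUP BY city
ORDER BY num_customers DESC
LIMIT 10;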

Apache Hadoop, along with its ecosystem, enables us to deal with Big Data in an efficient, fault-tolerant and easy manner!

Verticals such as airlines have been collecting data (e.g. flight schedules, ticketing inventory, weather information, online booking logs) that resides on disparate machines. Many of these companies do not yet have systems in place to analyze these data points at scale. Hadoop and its vast ecosystem of tools can be used to gain valuable insights, from understanding customer buying patterns to upselling certain privileges to increase ancillary revenue, helping such companies stay competitive in today’s dynamic business landscape.
References

  • http://hadoop.apache.org/
  • http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
  • http://sqoop.apache.org/
  • http://hive.apache.org/
This is the first in a series of Hadoop articles. In subsequent posts, we will dive deeper into Hadoop and its related technologies, talk about real-world use cases, and show how we used Hadoop to solve these problems. Stay tuned!

Written by

47Line Technologies

47Line builds solutions to critical business problems using the “cloud as the backbone”. The team has been working in the cloud computing domain for the last 6 years and has proven thought leadership in cloud and Big Data technologies.
