This is a guest post by 47Line Technologies

We live in the Data Age! The Web has been growing rapidly in both size and scale over the last 10 years and shows no signs of slowing down. Every passing year, more data is generated than in all the previous years combined. Moore’s law seems to hold not only for hardware but for the data being generated too! Rather than coin a new phrase for such vast amounts of data, the computing industry decided to just call it, plain and simple, Big Data.

Apache Hadoop is a framework that allows for the distributed processing of such large data sets across clusters of machines. At its core, it consists of 2 sub-projects – Hadoop MapReduce and Hadoop Distributed File System (HDFS). Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. HDFS is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

Figure 1: MapReduce Architecture

The logical question arises – How do we set up a Hadoop cluster?

Installation of Apache Hadoop 1.x

We will proceed to install Hadoop on three machines. One machine, the master, runs the NameNode and JobTracker; the other two, the slaves, each run a DataNode and TaskTracker.


Prerequisites

  1. Linux as development and production platform (note: Windows is supported only as a development platform and is not recommended for production)
  2. Java 1.6 or higher, preferably from Sun, must be installed
  3. ssh must be installed and sshd must be running
  4. From a networking standpoint, all three machines must be able to ping one another

Before proceeding with the installation of Hadoop, ensure that the prerequisites are in place on all three machines. Update /etc/hosts on every machine so that they can refer to one another as master, slave1 and slave2.
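A minimal sketch of the /etc/hosts entries, with placeholder IP addresses that should be replaced with the actual addresses of your machines:

    192.168.1.10    master
    192.168.1.11    slave1
    192.168.1.12    slave2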

Download and Installation

Download Hadoop 1.2.1. Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster.
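A sketch of the download and unpacking steps, run on each machine (the URL points at the Apache archive; any mirror works):

    wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
    tar -xzf hadoop-1.2.1.tar.gz
    cd hadoop-1.2.1    # this directory becomes HADOOP_HOME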

Configuration Files

The following files need to be updated:


On all machines, edit the Hadoop environment file under conf/ to define JAVA_HOME as the root of your Java installation. The root of the Hadoop distribution is referred to as HADOOP_HOME. All machines in the cluster usually have the same HADOOP_HOME path.
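In Hadoop 1.x this environment file is conf/hadoop-env.sh; a minimal sketch, assuming Java is installed under /usr/lib/jvm/java-6-sun (adjust the path to your own installation):

    # conf/hadoop-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-6-sun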


Update this file on the master machine alone with the following line:
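This presumably refers to conf/masters; a sketch of its single line on the master machine:

    master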



Update this file on the master machine alone with the following lines:
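This presumably refers to conf/slaves, which lists the machines that run the DataNode and TaskTracker daemons; a sketch:

    slave1
    slave2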



Update this file on all machines:
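This presumably refers to conf/core-site.xml, where fs.default.name points at the NameNode; a sketch (port 9000 is a common choice, not a requirement):

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
      </property>
    </configuration>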


Update this file on all machines:
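This presumably refers to conf/mapred-site.xml, where mapred.job.tracker points at the JobTracker; a sketch (port 9001 is illustrative):

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
      </property>
    </configuration>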


The default value of dfs.replication is 3. Since there are only two DataNodes in our Hadoop cluster, we set this value to 2. Update this file on all machines:
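The dfs.replication property lives in conf/hdfs-site.xml; a sketch:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
    </configuration>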

After changing these configuration parameters, we have to format HDFS via the NameNode.
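A sketch of the format command, run once on the master from HADOOP_HOME:

    bin/hadoop namenode -format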

We start the Hadoop daemons to run our cluster.

On the master machine, run the start scripts bundled with Hadoop.
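A sketch using the standard Hadoop 1.x start scripts, run from HADOOP_HOME on the master:

    bin/start-dfs.sh      # NameNode on master, DataNodes on the slaves
    bin/start-mapred.sh   # JobTracker on master, TaskTrackers on the slaves

You can verify the running daemons on each machine with the jps command.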

Your Hadoop cluster is up and running!

Loading Data into HDFS

Data stored in databases and data warehouses within a corporate datacenter has to be efficiently transferred into HDFS. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table.

Figure 2: Sqoop Architecture


Prerequisites

  1. A working Hadoop cluster

Download and Installation

Download Sqoop 1.4.4. Installing Sqoop typically involves unpacking the software on the NameNode machine. Set SQOOP_HOME and add it to PATH.

Let’s assume that MySQL is the corporate database. In order for Sqoop to work with it, we need to copy mysql-connector-java-<version>.jar into the SQOOP_HOME/lib directory.
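A sketch of the setup, assuming Sqoop is unpacked under /opt/sqoop-1.4.4 and using an illustrative connector version:

    export SQOOP_HOME=/opt/sqoop-1.4.4
    export PATH=$PATH:$SQOOP_HOME/bin
    cp mysql-connector-java-5.1.28-bin.jar $SQOOP_HOME/lib/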

Import data into HDFS

As an example, a basic import of a table named CUSTOMERS in the cust database:
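A sketch of the import command, with an illustrative database host and credentials:

    sqoop import \
      --connect jdbc:mysql://dbserver/cust \
      --table CUSTOMERS \
      --username dbuser -P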

On successful completion, a set of files containing a copy of the imported table is present in HDFS.

Analysis on HDFS Data

Now that data is in HDFS, it’s time to perform analysis on the data and gain valuable insights.

In the early days, end users had to write MapReduce programs even for simple tasks like computing raw counts or averages. Hadoop lacks the expressiveness of popular query languages like SQL, and as a result users ended up spending hours (if not days) writing programs for typical analyses.

Enter Hive!

Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying and analysis. Hive was created to make it possible for analysts with strong SQL skills (but meager Java programming skills) to run queries on huge volumes of data and extract patterns and meaningful information. It provides an SQL-like language called HiveQL while maintaining full support for MapReduce. Every HiveQL query is translated into MapReduce jobs that run on the underlying Hadoop framework.

Figure 3: Hive Architecture


Prerequisites

  1. A working Hadoop cluster

Download and Installation

Download Hive 0.11. Installing Hive typically involves unpacking the software on the NameNode machine. Set HIVE_HOME and add it to PATH.
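A sketch of the setup, assuming Hive is unpacked under /opt/hive-0.11.0:

    export HIVE_HOME=/opt/hive-0.11.0
    export PATH=$PATH:$HIVE_HOME/bin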

In addition, you must create /tmp and /user/hive/warehouse (a.k.a. hive.metastore.warehouse.dir) in HDFS and make them group-writable (chmod g+w) before you can create a table in Hive. The commands are listed below:
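A sketch of these commands, run from HADOOP_HOME:

    bin/hadoop fs -mkdir /tmp
    bin/hadoop fs -mkdir /user/hive/warehouse
    bin/hadoop fs -chmod g+w /tmp
    bin/hadoop fs -chmod g+w /user/hive/warehouse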

Another important feature of Sqoop is that it can import data directly into Hive.
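A sketch of such an import, reusing the illustrative connection details from above and adding the --hive-import flag:

    sqoop import \
      --connect jdbc:mysql://dbserver/cust \
      --table CUSTOMERS \
      --username dbuser -P \
      --hive-import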

The above command creates a new Hive table CUSTOMERS and loads it with the data from the corporate database. It’s time to gain business insights!

select count(*) from CUSTOMERS;

Apache Hadoop, along with its ecosystem, enables us to deal with Big Data in an efficient, fault-tolerant and easy manner!

Verticals such as airlines have been collecting data (e.g. flight schedules, ticketing inventory, weather information, online booking logs) that resides on disparate systems. Many of these companies do not yet have systems in place to analyze these data points at a large scale. Hadoop and its vast ecosystem of tools can be used to gain valuable insights, from understanding customer buying patterns to upselling certain privileges and increasing ancillary revenue, helping them stay competitive in today’s dynamic business landscape.

This is the first in the series of Hadoop articles. In subsequent posts, we will deep dive into Hadoop and its related technologies. We will talk about real-world use cases and how we used Hadoop to solve these problems! Stay tuned.