Big Data: Getting Started with Hadoop, Sqoop & Hive
This is a guest article by 47Line Technologies.
Last week we introduced the Big Data ecosystem and showed a glimpse of its possibilities. This week we take one industry use case (Retail) and illustrate how the various tools can be orchestrated to provide insights.
The last couple of decades have seen a tectonic shift in the retail industry. Hawkers and mom-and-pop stores are being sidelined by heavyweight retail hypermarkets that operate in a complex landscape of franchisees, JVs, and multi-partner vendors. In this kind of environment, try visualizing the inventory, sales, and supplier information for thousands of SKUs (Stock Keeping Units) per store; multiply that by several thousand stores across cities, states, and even countries; and track it all over days, months, and years. You will quickly realize the volume of data these retailers collect.
One such retail hypermarket, let’s say BigRetail, had 5 years’ worth of a vast, semi-structured dataset that they wanted analyzed for trends and patterns.
This is where Hadoop shines in all its glory!
The problem is 2-fold:
Problem 1: Moving the logs into HDFS periodically
Problem 2: Performing analysis on this HDFS dataset
As we saw in the previous post, Apache Sqoop is used to move structured datasets into HDFS. Alas! How do we move semi-structured data? Fret not. Apache Flume is designed specifically for collecting, aggregating, and moving large amounts of log data into HDFS. Once the dataset was inside HDFS, Hive was used to perform the analysis.
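For the structured side, a typical Sqoop import looks something like the sketch below; the JDBC connection string, table name, and target directory are illustrative placeholders, not BigRetail’s actual setup:

sqoop import \
  --connect jdbc:mysql://dbhost/retail \
  --username sqoop_user -P \
  --table suppliers \
  --target-dir /data/retail/suppliers

Under the hood, Sqoop runs a MapReduce job that reads the table in parallel slices and writes the rows out as files under the target directory in HDFS.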
Let’s dig deep. Mind you – The devil is in the details.
Problem 1: How Flume solved the data transfer problem
The primary use case for Flume is as a logging system that gathers a set of log files on every machine in a cluster and aggregates them to a centralized persistent HDFS store.
A Flume Agent is installed on each node of the cluster that produces log messages. The streams of log messages from every node are then sent to a Flume Collector. The collectors aggregate the streams into larger streams, which can then be written efficiently to a storage tier such as HDFS.
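As a rough sketch, in the Flume 0.9.x generation that these settings belong to, an agent/collector pipe like this could be wired up from the flume shell; the node names, log path, and HDFS path below are hypothetical. The agent node tails the local log file and forwards events to the collector’s port (35853 is Flume’s default collector port):

exec config agent1 'tail("/var/log/bigretail/pos.log")' 'agentSink("collector1", 35853)'

The collector listens on that port and writes the aggregated stream into date-bucketed HDFS directories:

exec config collector1 'collectorSource(35853)' 'collectorSink("hdfs://namenode/user/flume/logs/%Y-%m-%d/", "pos-")'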
Problem 2: Analysis using Hive
Hive organizes data into tables, which are stored under its warehouse directory in HDFS (by default, hdfs://user/hive/warehouse); tables can further be divided into partitions and buckets to enable additional query optimizations.
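For instance, a fact table partitioned by date lets Hive prune down to just the matching partitions at query time. This is only an illustrative sketch; the sale_date and amount columns are hypothetical, not BigRetail’s actual schema:

CREATE TABLE facts (payment_id STRING, customer_id STRING, location_id STRING, amount DOUBLE)
PARTITIONED BY (sale_date STRING);

-- Only the partition for this date is scanned, not the full dataset:
SELECT COUNT(*) FROM facts WHERE sale_date = '2013-01-15';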
While reading log files, the simplest recommended approach during Hive table creation is to use a RegexSerDe. It uses regular expressions (regex) to serialize/deserialize: it deserializes the data using a regex and extracts the groups as columns, and it can also serialize a row object using a format string. With RegexSerDe, all columns have to be strings; use “CAST (a AS INT)” to convert a column to another type.
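A minimal sketch of such a table definition, assuming the contrib RegexSerDe that shipped with Hive at the time and a simple space-delimited log layout (the table name, columns, and regex here are illustrative):

-- The contrib jar may need to be added first, e.g.:
-- ADD JAR /usr/lib/hive/lib/hive-contrib.jar;
CREATE TABLE raw_logs (host STRING, log_date STRING, sku STRING, quantity STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)"
)
STORED AS TEXTFILE;

-- Every column comes out as a string, so convert on the way out:
SELECT sku, SUM(CAST(quantity AS INT)) FROM raw_logs GROUP BY sku;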
As mentioned earlier, the dataset to be analyzed was 13 TB. With the Hadoop default replication factor of 3, it would require 13 TB * 3 = 39 TB of storage capacity. After a couple of iterations on a smaller sample dataset and subsequent performance tuning, we finalized the cluster configuration and capacities.
The following Flume parameters were configured (sample):
- flume.event.max.size.bytes uses the default value of 32 KB.
- flume.agent.logdir was changed to point to an appropriate HDFS directory.
- flume.master.servers: 3 Flume Masters.
- flume.master.store uses the default value, zookeeper.
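Taken together, those overrides would live in flume-site.xml, roughly as follows; the master hostnames and the HDFS path are placeholders rather than the actual cluster’s values:

<configuration>
  <!-- The 3 Flume Masters (placeholder hostnames) -->
  <property>
    <name>flume.master.servers</name>
    <value>flume-master1,flume-master2,flume-master3</value>
  </property>
  <!-- Agent logs redirected to an HDFS directory (placeholder path) -->
  <property>
    <name>flume.agent.logdir</name>
    <value>hdfs://namenode/user/flume/agent-logs</value>
  </property>
</configuration>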
The following Hive parameters were configured (sample):
- javax.jdo.option.ConnectionDriverName: set the value to “com.mysql.jdbc.Driver”. By default, Hive metadata is stored in an embedded Derby database, which allows only one user at a time to issue queries. This is not ideal for production purposes, so Hive was configured to use MySQL as its metastore database in this case.
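In hive-site.xml, a MySQL-backed metastore configuration looks roughly like this; the host, database name, and credentials below are placeholders:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>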
Using the Hadoop system, the log transfer time was reduced to ~3 hours weekly, and querying time also improved significantly.
Some of the schema tables present in the final design were facts, customers, products, categories, locations, and payments. Some sample Hive queries that were executed as part of the analysis are as follows:
SELECT COUNT(*) FROM facts;
SELECT gender, COUNT(DISTINCT customer_id) FROM customers GROUP BY gender;
Only equality joins are supported in Hive: inner and outer joins, left semi joins, and map joins. Hive does not support non-equality join conditions, as they are very difficult to express as a MapReduce job. More than two tables can, however, be joined in a single query.
SELECT products.product_name, products.product_id, categories.category_name FROM products JOIN categories ON (products.product_category_id = categories.category_id);
SELECT locations.location_name, COUNT(DISTINCT facts.payment_id) FROM facts JOIN locations ON (facts.location_id = locations.location_id) GROUP BY locations.location_name;
A number of interesting trends were observed from this dataset using Hive.
This is just one illustration of the possibilities of Big Data analysis and we will try to cover more in the coming articles. Stay tuned!