(Update) We’ve recently uploaded new training material on Big Data using services on Amazon Web Services, Microsoft Azure, and Google Cloud Platform to the Cloud Academy Training Library. On top of that, we’ve been busy adding new content to the Cloud Academy blog on how best to train yourself and your team on Big Data.
Big Data is becoming an integral part of enterprise business intelligence, and Hadoop is the framework and technology behind it. Hadoop offers various tools to ingest, process, and analyze large data sets that typically run to a few terabytes in size. Though Hadoop has matured, it is still best suited for batch processing: querying and analyzing real-time data with Hadoop is hard and expensive. At the heart of Hadoop is the MapReduce framework, which is not suitable for interactive queries. More recently, technologies like Cloudera’s Impala and Apache Spark have started to complement MapReduce for dealing with data in real time.
Though Google contributed heavily to the MapReduce paradigm, it was also among the first to identify its drawbacks: Google engineers realized that MapReduce is not ideal for querying large, distributed data sets in real time. To solve this problem, Google built an internal tool called Dremel, which enabled engineers to run SQL queries on large data sets in real time. Dremel was designed to deliver blazingly fast query performance on distributed data sets stored across thousands of servers, and it supports a subset of SQL for querying and retrieving data.
At Google I/O 2012, Google announced BigQuery, which exposed Dremel to the outside world as a cloud service. Since then, BigQuery has evolved into a high-performance, scalable query engine on the cloud.
The strength of BigQuery lies in its ability to handle large data sets. For example, querying tens of thousands of records might take only a few seconds, and even with twice as many records, BigQuery takes roughly the same time to process the query. Since it is based on the familiar SQL query language, it is fairly easy to construct complex queries to retrieve data. BigQuery is a structured data store on the cloud that follows the paradigm of tables, fields, and records. However, unlike an RDBMS, BigQuery supports repeated fields that can contain more than one value, making it easy to query nested data. Google replicates BigQuery data across multiple data centers to make it highly available and durable.
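To make the repeated-field idea concrete, here is a minimal sketch. The record shape and the table, dataset, and field names (`mydataset.customers`, `phone`) are purely illustrative, not from BigQuery's documentation or the original article; the query string uses BigQuery's legacy SQL `FLATTEN` operator, which expands each value of a repeated field into its own row.

```python
import json

# A hypothetical BigQuery record with a repeated field: one customer,
# many phone numbers. In a traditional RDBMS this would typically
# require a separate, joined table.
customer = {
    "name": "Alice",
    "phone": ["555-0100", "555-0199"],  # repeated STRING field
}

# An illustrative legacy-SQL query over such a table: FLATTEN turns
# each repeated phone value into its own result row.
query = (
    "SELECT name, phone "
    "FROM FLATTEN([mydataset.customers], phone)"
)

print(json.dumps(customer))
print(query)
```

The point is that the nesting lives inside a single table: no join is needed to ask "which customers have which phone numbers."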
When to use Google BigQuery?
So, when do you use BigQuery? Is it a replacement for a traditional RDBMS? Is it an OLAP service? Is it a replacement for Apache Hadoop?
BigQuery typically comes at the end of the Big Data pipeline. It is not a replacement for existing technologies, but it complements them very well. After processing the data with Apache Hadoop, the resulting data set can be ingested into BigQuery for analysis. Real-time streams representing sensor data, web server logs, or social media graphs can be ingested into BigQuery to be queried in real time. After running ETL jobs on a traditional RDBMS, the resulting data set can be stored in BigQuery. Data can be ingested from data sets stored in Google Cloud Storage, through direct file import, or through streaming inserts. So, if Apache Hadoop is the means to Big Data, BigQuery is the end.
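As a sketch of the streaming path, the snippet below builds the JSON payload shape used by BigQuery's REST streaming endpoint (`tabledata.insertAll`). The field names and `insertId` values are hypothetical examples, assumed here only for illustration; `insertId` is what lets BigQuery de-duplicate rows when a request is retried.

```python
import json

# Illustrative rows for a streaming insert: each entry carries a
# client-chosen insertId (for de-duplication on retry) and the row
# data itself under "json". Sensor names and readings are made up.
rows = [
    {"insertId": "row-1", "json": {"sensor": "s1", "reading": 21.5}},
    {"insertId": "row-2", "json": {"sensor": "s2", "reading": 19.8}},
]

# The request body shape for the tabledata.insertAll REST method.
payload = {"kind": "bigquery#tableDataInsertAllRequest", "rows": rows}

print(json.dumps(payload, indent=2))
```

In practice this payload would be POSTed to the table's `insertAll` endpoint (or sent via a client library); once the rows land, they are queryable within seconds, which is what makes the real-time scenarios above work.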
With the recent announcement of Google Cloud Pub/Sub and Google Cloud Dataflow, BigQuery will play an important role in Google’s cloud strategy. In future articles, we will explore how Cloud Dataflow and BigQuery can be combined to efficiently query real-time data streams.