When to Use Google BigQuery? Big Data in the Cloud

(Update) We’ve recently uploaded new training material on Big Data using services on Amazon Web Services, Microsoft Azure, and Google Cloud Platform on the Cloud Academy Training Library. On top of that, we’ve been busy adding new content on the Cloud Academy blog on how to best train yourself and your team on Big Data.


Big Data is becoming an integral part of enterprise business intelligence. Hadoop is the framework most closely associated with Big Data. It offers various tools to ingest, process, and analyze large data sets that typically run into a few terabytes in size. Although Hadoop has matured, it remains primarily a batch-processing system: querying and analyzing real-time data with Hadoop is hard and expensive. At the heart of Hadoop is the MapReduce framework, which is not suited to interactive queries. More recently, technologies such as Cloudera’s Impala and Apache Spark have started to complement MapReduce for working with data in real time.

Although Google contributed heavily to the MapReduce paradigm, it was also one of the first to identify its drawbacks. Google engineers realized that MapReduce is not ideal for querying large, distributed data sets in real time. To solve this problem, Google built an internal tool called Dremel, which let its engineers run SQL queries on large data sets in real time. Dremel was designed to deliver fast query performance on data sets distributed across thousands of servers. It supports a subset of SQL for querying and retrieving data.
At Google I/O 2012, Google announced BigQuery, which exposed Dremel to the outside world as a cloud service. Since then, BigQuery has evolved into a high-performance, scalable query engine in the cloud.

The strength of BigQuery lies in its ability to handle large data sets. For example, a query over tens of thousands of records might take only a few seconds, and BigQuery would take about the same time if the number of records doubled. Because queries are written in a SQL dialect, it is fairly easy to construct complex queries to retrieve data. BigQuery is a structured data store in the cloud that follows the paradigm of tables, fields, and records. However, unlike a traditional RDBMS, BigQuery supports repeated fields, which can contain more than one value, making it easy to query nested data. Google replicates BigQuery data across multiple data centers to make it highly available and durable.
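To make the idea of a repeated field concrete, here is a minimal sketch in plain Python rather than BigQuery itself. The sample rows, the field names (page, tags), and the flatten helper are all hypothetical and chosen only for illustration. The helper produces one output row per value of the repeated field, which is roughly what a query does when it expands a repeated field (for example with UNNEST in BigQuery's standard SQL).

```python
from typing import Any, Dict, Iterator, List

# Hypothetical rows: "tags" holds zero or more values per record, similar in
# shape to a BigQuery column declared as a repeated field.
ROWS: List[Dict[str, Any]] = [
    {"page": "/home", "tags": ["landing", "promo"]},
    {"page": "/docs", "tags": ["reference"]},
    {"page": "/404", "tags": []},
]


def flatten(rows: List[Dict[str, Any]], repeated_field: str) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per value of the repeated field.

    Rows whose repeated field is empty yield nothing, which mirrors joining
    each record against its own list of values.
    """
    for row in rows:
        for value in row[repeated_field]:
            # Copy the scalar fields, then attach a single repeated value.
            flat = {k: v for k, v in row.items() if k != repeated_field}
            flat[repeated_field] = value
            yield flat


if __name__ == "__main__":
    for flat_row in flatten(ROWS, "tags"):
        print(flat_row)
```

In a relational schema the tags would normally live in a separate table joined by key; keeping them inside the record is what lets a single-table query reach nested values.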

When to use Google BigQuery?

So, when do you use BigQuery? Is it a replacement for a traditional RDBMS? Is it an OLAP service? Is it a replacement for Apache Hadoop?

BigQuery typically comes at the end of the Big Data pipeline. It is not a replacement for existing technologies, but it complements them very well. After the data has been processed with Apache Hadoop, the resulting data set can be ingested into BigQuery for analysis. Real-time streams representing sensor data, web server logs, or social media graphs can be ingested into BigQuery and queried in real time. After ETL jobs have run on a traditional RDBMS, the resulting data set can be stored in BigQuery. Data can be ingested from data sets stored in Google Cloud Storage, through direct file import, or by streaming. So, if Apache Hadoop is the means to Big Data, BigQuery is the end.
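The streaming path mentioned above could be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: stream_rows and its batch_size default are hypothetical, and the client argument is assumed to behave like google.cloud.bigquery.Client from the google-cloud-bigquery Python library, whose insert_rows_json(table, json_rows) method returns a list of per-row errors (empty when every row was accepted).

```python
from typing import Any, Dict, Sequence


def stream_rows(
    client: Any,
    table_id: str,
    rows: Sequence[Dict[str, Any]],
    batch_size: int = 500,  # hypothetical default, tune for your workload
) -> int:
    """Send rows to a table in batches and return how many were sent.

    `client` is assumed to expose insert_rows_json(table, json_rows), as
    google.cloud.bigquery.Client does; any object with that method works,
    which keeps this sketch runnable without cloud credentials.
    """
    sent = 0
    for start in range(0, len(rows), batch_size):
        batch = list(rows[start:start + batch_size])
        errors = client.insert_rows_json(table_id, batch)
        if errors:
            # Stop at the first failing batch; earlier batches were already sent.
            raise RuntimeError(f"BigQuery rejected rows: {errors}")
        sent += len(batch)
    return sent
```

With the real library installed and credentials configured, the call would look something like stream_rows(bigquery.Client(), "my-project.logs.requests", rows), where the table name is made up for this example. Batching keeps each request small, and the batch size here is a guess rather than a recommended value.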

With the recent announcement of Google Cloud Pub/Sub and Google Cloud Dataflow, BigQuery will play an important role in Google’s cloud strategy. In future articles, we will explore how Cloud Dataflow and BigQuery can be combined to efficiently query real-time data streams.


Written by

Janakiram MSV

Janakiram MSV heads Cloud Infrastructure Services at Aditi Technologies. He contributes cloud-related articles to YourStory.com. A former employee of Microsoft and Amazon, Janakiram built a cloud consulting company that was recently acquired by Aditi Technologies. He is an analyst with Gigaom Research, contributing to cloud-related market research and analysis. He can be reached at jani@janakiram.com.

