Reducing the Amount of Data Processed

Contents

  • Introduction (preview, 1m 5s)
  • Security
  • Summary
  • Reducing the Amount of Data Processed

Difficulty: Intermediate
Duration: 37m
Students: 3374
Ratings: 4.8/5

Description

BigQuery is Google's incredibly fast, secure, and surprisingly inexpensive data warehouse, but there are ways to make it even faster, cheaper, and more secure. 

Here are some examples of what you will learn in this course:

  • BigQuery can process billions of rows in seconds, but only if you break the rules of relational database design.
  • If you are analyzing relatively small amounts of data, then your queries won’t cost very much, but if you regularly analyze huge datasets, then your costs can add up quickly. However, with a few adjustments to how you store your data in BigQuery, you can run queries for a fraction of the cost.
  • To give you the flexibility to implement fine-grained security, BigQuery has several layers of access control capabilities, but they can be confusing, so I’ll show you which ones to use to meet your organization’s requirements.

This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.

Learning Objectives

  • Reduce your BigQuery costs by reducing the amount of data processed by your queries
  • Create, load, and query partitioned tables for daily time-series data
  • Speed up your queries by using denormalized data structures, with or without nested repeated fields
  • Implement fine-grained access control using roles and authorized views

Intended Audience

  • Database administrators
  • Anyone who wants to learn how to get the most out of Google BigQuery

Prerequisites

To get the most out of this course, you should already have some experience with BigQuery. If you don’t, then please take Introduction to Google BigQuery first.

Resources

The GitHub repository for this course can be found at https://github.com/cloudacademy/optimizing-bigquery/.

Transcript

BigQuery is relatively inexpensive to use. But if you process large amounts of data, your costs can add up quickly. At first, it looks like streaming and storage would be your highest costs. That would be true if you didn’t run many queries, but most organizations use BigQuery to run lots of queries. After all, query is even part of the name, so that must be what people use it for, right?

Although the streaming and storage costs are higher than the query costs on a per gigabyte basis, you only get charged for streaming once and you only get charged for storage once a month. With queries, on the other hand, you get charged every time you run a query, which can be hundreds, or even thousands, of times a month, so reducing how much data is processed by your queries is usually the best way to reduce your costs. Another benefit is that it often speeds up your queries too.

To see how to optimize our queries, let’s use some stock exchange data that Google makes available on Cloud Storage. First, create a dataset where you can put the tables. Click the down arrow next to your project name and select “Create new dataset”. Let’s call it “examples”.
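
The video uses the classic BigQuery web UI for this step. If you’d rather script it, a rough Standard SQL equivalent (not shown in the course, so treat it as a sketch) is:

  -- Create the dataset that will hold the example tables
  CREATE SCHEMA IF NOT EXISTS examples;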

Now create a table in the examples dataset. Change the Location to Google Cloud Storage. Here’s the path to the first stock exchange file.

There will be quite a bit of text to enter in this course, especially when we start writing queries, so if you don’t want to type everything yourself, you can copy and paste from a readme file I put on Github. It’s at this URL.

Now call the Destination Table “gbpusd_201401”. For the Schema, click “Edit as Text” and put this in. Make sure you get rid of the text that was in this box already. Click “Create Table”...and it takes a little while to upload, so I’ll fast forward. I’m going to fast forward quite a few times throughout the course, so if some operations seem to take longer when you run them, it’s probably because of that.
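
If you’d prefer to script the load instead of using the web UI, BigQuery’s LOAD DATA statement is a rough equivalent. The Cloud Storage URI and the column list below are placeholders, not the ones used in the course; the real path and schema are the ones shown in the video and in the readme.

  -- Sketch only: load the first month of data from Cloud Storage.
  -- The gs:// URI and the column names/types are hypothetical stand-ins
  -- for the path and schema given in the course readme.
  LOAD DATA INTO examples.gbpusd_201401 (
    venue STRING,
    currencies STRING,
    time TIMESTAMP,
    bid FLOAT64,
    ask FLOAT64
  )
  FROM FILES (
    format = 'CSV',
    uris = ['gs://<path-to-first-stock-exchange-file>']
  );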

Then go through the same process for the second file.

OK, now we can run a simple query to get a baseline before we try to optimize it. Click “Compose Query”. Then, before you do anything else, click on “Show Options” and uncheck the “Use Legacy SQL” box. I don’t recommend using Legacy SQL because you have to learn some non-standard syntax to use it. Now click “Hide Options” so it doesn’t clutter up the page. OK, click on the first table and then click “Query Table”. Let’s select star so we see how much data gets processed when we read all of the data. Let’s also get rid of the “LIMIT 1000” to make sure it’s reading the whole table.
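
The resulting baseline query is simply:

  -- Baseline: read every column of every row in the January table
  SELECT *
  FROM `examples.gbpusd_201401`;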

Before you run it, see how much data it says it’s going to process. Click “Open Validator”, which is the green checkmark. It says it will process 46.4 meg. Now click “Run Query”. It warns us that we’ll be billed for all of the data in the table, even if we use a LIMIT clause. We actually want to read all the data in the table this time, so ignore the warning. Click “Run query”...and after a few seconds, it comes back and confirms that it processed 46.4 meg.

There are two basic approaches to reducing how much data is processed by a query. You can reduce the number of rows that are scanned and you can reduce the number of columns that are scanned.

One obvious way to try to reduce the number of rows processed is by using a LIMIT clause, but that warning message we just got said that we’d be billed for all the data in the table even if we use a LIMIT clause. Let’s try it to see if that’s true.

Add a “LIMIT 10” clause to the end of the query.
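
The query now looks like this:

  -- Does the LIMIT clause reduce the data scanned? Let's find out.
  SELECT *
  FROM `examples.gbpusd_201401`
  LIMIT 10;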

It still says this query will process 46.4 meg, but let’s run it anyway. And the warning message comes up again. Click “Run query”...it takes a few seconds, and it says that it processed the whole 46.4 meg again. It did run slightly faster, but that’s probably not because of the LIMIT clause. The processing time varies, even for the exact same query.

The bottom line is that we can’t reduce the number of rows scanned by using a LIMIT clause, so we’ll have to look at other approaches. I’ll show you how to do that later, but first let’s try limiting the number of columns scanned.

The obvious way to do that is to only select specific columns instead of doing a SELECT star. For example, on this table, instead of selecting all 5 columns, let’s just select 2 of them. Replace the star with “time, bid”.
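
The query becomes:

  -- Scan only two of the five columns
  SELECT time, bid
  FROM `examples.gbpusd_201401`;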

That only processed 19.5 meg, so it actually worked. You probably also noticed that it didn’t give us a warning message this time. That’s because it doesn’t give you a warning when you select specific columns instead of selecting star.

Let’s go back to trying to reduce the number of rows processed. Have a look at the data by going to the Preview tab. Do you see a pattern? The data is sorted in ascending order by the time column. Maybe we could try to use the BETWEEN operator to only select rows that are between two dates. You can copy and paste this query from the Github file.
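
The exact query is in the readme; roughly, it looks like this (the date range here is only illustrative, and it assumes the time column is a TIMESTAMP):

  -- Filter rows to a time range with BETWEEN
  SELECT time, bid
  FROM `examples.gbpusd_201401`
  WHERE time BETWEEN TIMESTAMP '2014-01-06' AND TIMESTAMP '2014-01-10';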

Since the validator accurately predicts how much data will be processed by a query, we don’t even have to run this one to know that it won’t reduce the number of rows scanned: it says the query will process 19.5 meg, which is exactly how much was processed without the BETWEEN operator. That makes sense when you think about it, because there’s no way for BigQuery to know that the data is properly sorted, so it still has to read the entire table.

Is there anything we can do to reduce the number of rows processed? Yes, there are a couple of ways, but they’re not easy. The first way is to put your data in separate tables. That’s kind of a brute force way of limiting your queries. It still scans the whole table, but since you only put some of your data in the table, that automatically limits how much data is processed.

The two tables we uploaded are actually split that way because there’s one for each month. The first is for January 2014 and the second is for February 2014. So all of the queries we’ve run so far have only been on the January data. If, instead, all of the data for 2014 were in one table, then we would have processed about 12 times as much data on every query.

In a normal relational database, breaking your data into tables like this is not recommended. Instead, you would use an index. But BigQuery doesn’t support indexes, so you can’t do that.

If you break your data into multiple tables, then how can you run queries across the tables when you need to? For example, what if you needed to run a query across all of the 2014 data? Would you have to write a long query with a UNION of all 12 tables? No, because fortunately BigQuery supports using wildcards in your table references.

Here’s how you would run a query across the two stock market tables. Let’s select the earliest time and the latest time from the tables. To query both of them, you just make the last character of the table name a star, which will match both the 1 and the 2. Keep in mind that the wildcard has to be the last character of the table name.
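
Written out, the query looks roughly like this (wildcard table references have to be enclosed in backticks in standard SQL):

  -- The trailing * matches both gbpusd_201401 and gbpusd_201402
  SELECT MIN(time) AS mintime, MAX(time) AS maxtime
  FROM `examples.gbpusd_20140*`;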

When you run the query...you’ll see that it took the mintime from the first table because it’s in January and it took the maxtime from the second table because it’s in February.

You’ll recall that I mentioned that there are two ways to reduce the number of rows processed. I’ll tell you how to use the other method in the next lesson.

About the Author
Students: 191814
Courses: 98
Learning Paths: 169

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).