BigQuery is Google's incredibly fast, secure, and surprisingly inexpensive data warehouse, but there are ways to make it even faster, cheaper, and more secure.
Here are some examples of what you will learn in this course:
- BigQuery can process billions of rows in seconds, but only if you break the rules of relational database design.
- If you are analyzing relatively small amounts of data, then your queries won’t cost very much, but if you regularly analyze huge datasets, then your costs can add up quickly. However, with a few adjustments to how you store your data in BigQuery, you can run queries for a fraction of the cost.
- To give you the flexibility to implement fine-grained security, BigQuery has several layers of access control capabilities, but they can be confusing, so I’ll show you which ones to use to meet your organization’s requirements.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account. If you don’t already have experience with BigQuery, then please take Introduction to Google BigQuery first.
- Reduce your BigQuery costs by reducing the amount of data processed by your queries
- Create, load, and query partitioned tables for daily time series data
- Speed up your queries by using denormalized data structures, with or without nested and repeated fields
- Implement fine-grained access control using roles and authorized views
In the last lesson, I showed you how you could break your data into tables to limit how much data your queries process. But that seems kind of awkward, doesn’t it? There are, in fact, quite a few disadvantages to this approach. First, you have to get your data into all of these tables somehow, probably by writing a program. Second, the more tables there are in your query, the worse your query performance is. And third, you can’t reference more than 1,000 tables in a single query. You might think you’re not likely to ever hit that limit, but it’s not as unlikely as it sounds. If you collect a lot of data every day, then you may want to have a separate table for every day. Then if you had three years’ worth of data, that would be more than 1,000 tables.
BigQuery has a solution for this called partitioned tables. The way it works is you put all of your data in one table, but in a separate partition for every day. Then you can limit your queries to particular days and it won’t scan the rest of the table.
Of course, you still have the challenge of how to divide your data into the right partitions, just like you would with dividing your data into separate files with the other approach. If you want to put your existing data into a partitioned table, then you will likely have to write a program to divide up your data. On the other hand, if you’re starting fresh and want to stream new data into a table every day, then it’s very easy. When you stream or upload data to a partitioned table, BigQuery will automatically put it in the partition for that day.
Unfortunately, it wouldn’t be very useful to show you the easy way because all of the data would go into today’s partition, so you wouldn’t be able to see how to query across partitions. That means we’ll have to take some existing data and get it into a partitioned table the hard way.
I’ve written a bash script that will take the data from the gbpusd_201401 table and put it into a new table in 31 partitions, one for each day of January 2014. You can get the script from the GitHub repository. It’s called “partition”.
This is a pretty simple script. First, it creates the partitioned table with the same name as the original table except with a ‘p’ for ‘partition’ at the end. To make this a partitioned table, it uses the time_partitioning_type flag and sets it to DAY. You might be wondering if you could set that to something like MONTH or YEAR instead, so you wouldn’t need so many partitions in some cases. Nope. DAY is the only value it will accept.
Then it goes through a loop from 1 to 31 because there are 31 days in January. The next part just puts a zero in front of the number if it’s less than 10 because partition names are dates and the day part of the date is always two digits. The most important line of the script is this bq command, which looks ridiculously complicated, but it’s actually not too bad.
It runs a query with the use_legacy_sql flag set to false, plus the replace flag, which isn’t strictly necessary, but it means that if you need to run the script over again, it will overwrite each partition instead of appending to it. Then it sets the destination_table flag, and this is where you can see how to refer to partitions. You start with the full name of the table, then a dollar sign, and then the date of the partition: the year, the month, and the day.
In this case, there’s a $n at the end, which is the two-digit day that was set in the code above. So the first time through the loop, $n will be “01”. The rest of the line is the SQL command that BigQuery runs to get the data to put into the partition. It selects the data for one day.
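To make the structure concrete, here is a hedged reconstruction of a script like the one described. The dataset name “forex” and the exact day-filter SQL are my assumptions, not the author’s actual script; the flags match the ones named in the lesson. It’s written as a dry run (the commands are echoed, not executed), so you can inspect it without a Google Cloud account — clear DRYRUN to run it for real.

```shell
#!/bin/bash
# Sketch of a partitioning script like the one described in the lesson.
# Dataset name and the day-filter SQL are assumptions.
DRYRUN=echo                     # set DRYRUN= to actually execute
SRC='forex.gbpusd_201401'       # original table
DST='forex.gbpusd_201401p'      # partitioned copy ('p' for partition)

# Create the destination table with daily partitioning.
# DAY is the only value --time_partitioning_type accepts.
$DRYRUN bq mk --time_partitioning_type=DAY "$DST"

for d in $(seq 1 31); do
  n=$(printf '%02d' "$d")       # zero-pad: 1 -> 01, for the 2-digit day
  # Partition decorator: table name + '$' + YYYYMMDD partition date.
  $DRYRUN bq query --use_legacy_sql=false --replace \
    --destination_table="${DST}\$201401${n}" \
    "SELECT * FROM \`${SRC}\` WHERE EXTRACT(DAY FROM time) = ${d}"
done
```

The `--replace` flag combined with the partition decorator means a rerun overwrites each day’s partition rather than appending duplicate rows.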
There’s something you should know before you run this script. It scans the entire original table 31 times, because when it extracts the data for one day, it has to read the whole table. So how much will it cost to run this script? Well, it processes 46.4 meg 31 times, which works out to about 1.4 gig. Google gives you the first terabyte per month for free, but even after that, it’s still only half a cent per gig, so it would cost less than a penny. That shouldn’t be a problem, but if you were doing this with a much larger table, you might want to do something different.
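The cost arithmetic above is easy to verify. A quick check, using the on-demand rate quoted in the lesson ($5 per terabyte after the first free terabyte each month):

```shell
# 31 full scans of the 46.4 MB source table, at roughly half a cent per GB.
awk 'BEGIN {
  mb   = 46.4 * 31        # total data processed, in MB
  gb   = mb / 1024        # about 1.4 GB
  cost = gb / 1024 * 5    # dollars, at $5 per TB
  printf "%.1f MB scanned = %.2f GB, worst-case cost = $%.4f\n", mb, gb, cost
}'
```

That works out to well under a penny, even if you have already used up the free tier.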
If you’re using a Mac or a Linux desktop, then you can just download the file and run it (assuming you have the Google Cloud SDK installed).
If you’re using Windows, then you can run the script from Google Cloud Shell, which I’ll show you in a minute. Even if you use Cloud Shell, you’ll still need to download the script to your local desktop first, so let’s do that.
There are a few different ways to download the file from GitHub, but probably the easiest is to download a zip file of the whole repository. On the main page of the repository, click the “Clone or download” button. If you have git installed on your desktop, then you can clone the repository. Otherwise, click “Download ZIP”. Then extract the files and go into that directory from the command line.
Before we run it, I’ll show you how to do it from Cloud Shell if you don’t have bash installed on your desktop. Go to https://console.cloud.google.com and click the “Activate Google Cloud Shell” button. It’s actually provisioning a VM instance, so it will take a little while to start.
Once Cloud Shell is running, click on the Files icon and select “Upload file”. Then choose the partition script file that you just downloaded.
Regardless of whether you’re on a Mac, Linux, or Cloud Shell, you should now be at a command line sitting in the directory where the “partition” script resides, so type “bash partition” to run it.
It may ask you to select a default project. Then it will go through the loop 31 times. It should take about 3 minutes.
Alright, it’s done. To see the new table, you may need to refresh your browser. Now have a look at the table. It’s exactly the same as the original table except for one thing. It has a column called _PARTITIONTIME. However, that column doesn’t show up in the preview because it’s a pseudo-column. It’s only there to make it easier to access records by their timestamp.
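Although _PARTITIONTIME is hidden from the preview and from SELECT *, you can select it explicitly if you alias it. Something along these lines (dataset and table names are assumptions) shows which partition each row landed in:

```sql
-- _PARTITIONTIME is a pseudo-column, so alias it to make it visible.
SELECT _PARTITIONTIME AS pt, *
FROM `forex.gbpusd_201401p`
LIMIT 10;
```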
Let’s say you wanted to retrieve records only from January 9th and 10th. Here’s the SQL to do that.
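The on-screen SQL isn’t reproduced in this transcript, but a query matching the description would look something like this (table name assumed):

```sql
-- Scans only the Jan 9 and Jan 10 partitions, not the whole table.
SELECT *
FROM `forex.gbpusd_201401p`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2014-01-09')
                         AND TIMESTAMP('2014-01-10');
```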
It’s the same as a query we tried to run earlier except that this one looks at the _PARTITIONTIME pseudo-column instead of the time column. What a difference that makes. It says it will only process 1.89 meg. When we tried to do it with the time column, it said it would process 19.5 meg. Let’s run it to make sure it works. Yes, it only processed 1.89 meg.
And that’s it for this lesson.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).