The course is part of these learning paths
BigQuery is Google's incredibly fast, secure, and surprisingly inexpensive data warehouse, but there are ways to make it even faster, cheaper, and more secure.
Here are some examples of what you will learn in this course:
- BigQuery can process billions of rows in seconds, but only if you break the rules of relational database design.
- If you are analyzing relatively small amounts of data, then your queries won’t cost very much, but if you regularly analyze huge datasets, then your costs can add up quickly. However, with a few adjustments to how you store your data in BigQuery, you can run queries for a fraction of the cost.
- To give you the flexibility to implement fine-grained security, BigQuery has several layers of access control capabilities, but they can be confusing, so I’ll show you which ones to use to meet your organization’s requirements.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
- Reduce your BigQuery costs by reducing the amount of data processed by your queries
- Create, load, and query partitioned tables for daily time-series data
- Speed up your queries by using denormalized data structures, with or without nested repeated fields
- Implement fine-grained access control using roles and authorized views
- Database administrators
- Anyone who wants to learn how to get the most out of Google BigQuery
To get the most out of this course, you should already have some experience with BigQuery. If you don’t, then please take Introduction to Google BigQuery first.
The GitHub repository for this course can be found at https://github.com/cloudacademy/optimizing-bigquery/.
In the last lesson, I showed you how you could break your data into tables to limit how much data your queries process. But that seems kind of awkward, doesn’t it? There are, in fact, quite a few disadvantages to this approach. First, you have to get your data into all of these tables somehow, probably by writing a program. Second, the more tables there are in your query, the worse your query performance is. And third, you can’t reference more than 1,000 tables in a single query. You might think you’re not likely to ever hit that limit, but it’s not as unlikely as it sounds. If you collect a lot of data every day, then you may want to have a separate table for every day. Then if you had 3 years worth of data, that would be more than 1,000 tables.
BigQuery has a solution for this called partitioned tables. The way it works is you put all of your data in one table, but in a separate partition for every day. Then you can limit your queries to particular days and it won’t scan the rest of the table.
Of course, you still have the challenge of how to divide your data into the right partitions, just like you would with dividing your data into separate files with the other approach. If you’re starting fresh and want to stream new data into a table every day, then it’s very easy. You can create an ingestion-time partitioned table, and when you stream or upload data to it, BigQuery will automatically put it in the partition for that day. It also creates a pseudo-column called _PARTITIONTIME that contains the date for each data record.
It wouldn’t be very useful to upload our existing data to an ingestion-time partitioned table, though, because all of the data would go into today’s partition regardless of what date was in each record.
The solution is to create a time-unit column-partitioned table. You just need to tell BigQuery which column contains the date, and it will put each data record into the right partition.
We can copy the data from our existing table and put it into a new partitioned table in a single command. You can copy it from the GitHub repository.
We use “bq query” because we’re going to query the existing table to get the data. We set the “use_legacy_sql” flag to “false” so it’ll use standard SQL. Then we use the “replace” flag, which isn’t strictly necessary, but if you need to run the command again, then it will overwrite the destination table instead of appending to it. Then we tell it what to call the destination table. I used the same name as the original table except with a ‘p’ for ‘partition’ at the end. To make this a partitioned table, we use the time_partitioning_field flag and set it to the column that contains the date, which is called “time” in the original table. Finally, we run a “SELECT *” to get all of the data from the original table.
By the way, if you wanted to partition the data based on a different unit of time than a day, you’d need to add the “time_partitioning_type”, which can be set to day, hour, month, or year. If you don’t set it, then it defaults to “day”.
I’m going to run the command from Cloud Shell, which is right here. I need to authorize it.
Alright, it’s done. You’ll notice that I recorded this demo after Google changed BigQuery’s user interface, so that’s why it looks different.Now let’s have a look at the partitioned table it created. It looks exactly the same as the original table. The only indication that there’s anything different about it is this note up here saying that it’s a partitioned table.
Let’s say you wanted to retrieve records only from January 9th and 10th. Here’s the SQL to do that.
SELECT time, bid
ORDER BY time ASC
It’s basically the same as a query we tried to run earlier except that it’s querying the partitioned table instead of the original one. What a difference that makes. It says it will only process 1.3 meg. When we tried to do it with the original table, it said it would process 19.5 meg. Let’s run it to make sure it works. Yes, it only processed 1.3 meg.
Before we move on, I should mention another really useful feature of partitioned tables. If you have more data being added to a table every day, you may want to delete the oldest data at some point. You can configure a partitioned table to do this automatically by setting a partition expiration time. For example, you could say that each partition will be deleted after it’s 12 months old. This would ensure that you only have the latest 12 months worth of data in your table. You can also set an expiration time on an entire table if you want.
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).