The course is part of these learning paths
BigQuery is Google's incredibly fast, secure, and surprisingly inexpensive data warehouse, but there are ways to make it even faster, cheaper, and more secure.
Here are some examples of what you will learn in this course:
- BigQuery can process billions of rows in seconds, but only if you break the rules of relational database design.
- If you are analyzing relatively small amounts of data, then your queries won’t cost very much, but if you regularly analyze huge datasets, then your costs can add up quickly. However, with a few adjustments to how you store your data in BigQuery, you can run queries for a fraction of the cost.
- To give you the flexibility to implement fine-grained security, BigQuery has several layers of access control capabilities, but they can be confusing, so I’ll show you which ones to use to meet your organization’s requirements.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
- Reduce your BigQuery costs by reducing the amount of data processed by your queries
- Create, load, and query partitioned tables for daily time-series data
- Speed up your queries by using denormalized data structures, with or without nested repeated fields
- Implement fine-grained access control using roles and authorized views
- Database administrators
- Anyone who wants to learn how to get the most out of Google BigQuery
To get the most out of this course, you should already have some experience with BigQuery. If you don’t, then please take Introduction to Google BigQuery first.
The GitHub repository for this course can be found at https://github.com/cloudacademy/optimizing-bigquery/.
Now you know how to reduce the amount of data BigQuery processes, which will reduce your costs, but it won’t necessarily reduce how long it takes your queries to run. There are different strategies for doing that, but we might have to break some database structure rules.
One of the core principles of good schema design is normalization, which is basically to remove redundancy from data. Some of you are probably saying, “That’s a dramatic oversimplification of the concept, and removing redundancy is more of a result of normalizing data.” You’re right, of course, but for our purposes here, the lack of redundancy is what matters. I should mention, though, that not only does normalization reduce the amount of space taken up by the data, but it also improves data integrity, because if the same information is stored in multiple places, you could end up with inconsistencies between the copies.
Having said all that, in BigQuery, you may want to do the unthinkable and denormalize your data and add redundancy into it. Why would you do that? Well, because it can make your queries much faster, and because the downsides of having redundant data are not a problem for what we’re doing. First, data integrity is not an issue because BigQuery isn’t a transaction processing system. We’re just using it for reporting, not for recording transactions, so we’re not going to create inconsistencies in the data, since we’re not going to modify the data at all.
Second, the extra cost associated with storing and querying extra copies of data is often outweighed by the much faster query times. Why does denormalization make queries so much faster? Because when you query a normalized database, you will often have to join multiple tables together, but joining tables is a very time-consuming operation. It’s not very noticeable when the tables are small, but as the tables get larger, the time required to join them increases exponentially.
This performance comparison done by Google shows just how much of a difference it can make. The slope of this line gets so steep that it would take a ridiculously long time to join tables that have billions of rows.
Let’s go through an example of how to denormalize a data structure. We’ll use some MusicBrainz data, which is a freely available encyclopedia of music. Here are three data files to import into tables.
Create a new table in the examples dataset. Change the Location to Google Cloud Storage and paste the Data File URL. Then change the File format to JSON. Call the table “artist”. Under the Schema, click “Edit as Text”, then paste in the contents of the schema file...and create the table.
Now do the same for the other two tables. I’ll show the process, but I’ll speed it up so it all happens in 15 seconds, so hang on.
OK, here’s how these three tables are organized. They’re normalized, so there isn’t any redundant data. Now we’re going to take columns from these three tables and combine them into one table. We’re going to join these tables and save the results so that when we need to run queries in the future, they don’t have to go through the join step first, which will save a lot of time. We’re doing an inner join on the three tables using the artist. The resulting table will have lots of duplicate information, such as multiple copies of an artist’s name and multiple copies of a recording’s name.
Click Show Options and set the Destination Table to “recording_by_artist”. While you’re here, make sure the “Use Legacy SQL” box is not checked. Alright, now run the query, which should take about 30 or 40 seconds.
Since this query took a long time, it says, “We can notify you when long-running queries complete.” That could be helpful in the future, so click “Enable notifications”. Then you have to click Allow.
Let’s see how much extra space it takes up. The recording table takes up about 1.5 gig...and the other two tables...bring the total up to about 1.7 gig. The denormalized table takes up about 2.3GB, so it’s about 35% bigger, but the speed gains would be worth it if we were dealing with larger tables.
Let’s get rid of the Destination Table option so we don’t accidentally overwrite it. This table only has about 18 million rows, so the speed gain won’t be very noticeable. One benefit is much simpler queries, though. Compare this...with this.
Both queries do exactly the same thing.
You can see why this table takes up more space. The name “Red Elvises” is repeated many times. In the original tables, it only had the artist’s ID for each recording, and the artist’s name was only listed once in the artist table.
By the way, you probably wouldn’t use the web interface to denormalize data unless it’s a one-time conversion. If you need to denormalize data on a regular basis, then you would want to automate the process using Cloud Dataflow or something similar.
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).