The course is part of these learning paths
BigQuery is Google's incredibly fast, secure, and surprisingly inexpensive data warehouse, but there are ways to make it even faster, cheaper, and more secure.
Here are some examples of what you will learn in this course:
- BigQuery can process billions of rows in seconds, but only if you break the rules of relational database design.
- If you are analyzing relatively small amounts of data, then your queries won’t cost very much, but if you regularly analyze huge datasets, then your costs can add up quickly. However, with a few adjustments to how you store your data in BigQuery, you can run queries for a fraction of the cost.
- To give you the flexibility to implement fine-grained security, BigQuery has several layers of access control capabilities, but they can be confusing, so I’ll show you which ones to use to meet your organization’s requirements.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
- Reduce your BigQuery costs by reducing the amount of data processed by your queries
- Create, load, and query partitioned tables for daily time-series data
- Speed up your queries by using denormalized data structures, with or without nested repeated fields
- Implement fine-grained access control using roles and authorized views
- Database administrators
- Anyone who wants to learn how to get the most out of Google BigQuery
To get the most out of this course, you should already have some experience with BigQuery. If you don’t, then please take Introduction to Google BigQuery first.
The GitHub repository for this course can be found at https://github.com/cloudacademy/optimizing-bigquery/.
Denormalizing data might seem like a pretty clunky way to increase query performance. Is there a way to improve performance without having to add lots of redundant data? Yes. BigQuery supports something called “nested repeated fields”. This is a way to combine data into one table without redundancy.
Google has provided a very small data file to demonstrate this concept. First, open it in your browser and then right-click and do a Save As.
Then upload it into a new table. Change the file format to JSON. Call it “persons_data”. Now open the schema file...and copy the contents here. It’s pretty hard to understand the schema by looking at this, but it will be easier to see once you’ve created the table.
Now have a look at the schema. Some of the fields have a dark gray background, like phoneNumber. And under phoneNumber, you’ll see that areaCode and number look like subfields.
It gets more interesting with the children and citiesLived fields. Not only do they have nested fields under them, but they are also repeated fields, so each person can have multiple children and citiesLived.
Let’s go to the Preview to see what the data looks like. This is pretty different from a normal record because it takes up two rows and there are blanks at the beginning of the second row. One thing that can be confusing is that it looks like the child Jane lived in Seattle in 1995 and the child John lived in Stockholm in 2005 because in a normal record, all of the pieces of data on one row make up the entire record. But with repeated records, that’s not the case.
It’s easier to see if you look at the next record. First of all, either Mike Jones has a really hard time deciding where he wants to live or there’s something wrong with this data. Anyway, here it’s obvious that the children fields are not directly related to the citiesLived fields. Another way to see this is to look at the JSON representation of this record. Click on the JSON button. It’s not as easy to read as the table view, but it makes it very obvious how the record is organized.
To refer to a nested field, you can use the dot notation that it shows in the schema. For example, to get a list of all of the people in the table and their phone numbers, you would run this.
You refer to the phone number as phoneNumber.number. You don’t need the “LIMIT 1000” because there are only 3 records in this table. It doesn’t hurt to leave it there, but I prefer to keep unnecessary clutter out of my queries.
If you want to refer to a nested repeated field, though, then it takes a bit more work. Let’s say you want to see who has lived in Austin. It’s giving us an error because it doesn’t recognize the “place” field. That’s because it’s a nested repeated field, so you have to use the UNNEST function. First, put in a comma after the table name. Then unnest citiesLived. You’ll notice that the error went away because now that citiesLived is unnested, we can refer to the field as just “place” instead of citiesLived.place. Let’s also select place, so we can verify that the query is working properly.
It came back with the correct results because Anna Karenina and Mike Jones both lived in Austin, but John Doe didn’t.
What happens if you don’t specify the WHERE clause? Will it print all the places for each person?
Yes, it does, although it repeats the person’s name beside every city, unlike what we saw in the Preview tab. The records also come up in random order, because we didn’t specify an ORDER BY clause.
What if you do a SELECT *? Will it look like what we saw in the Preview tab?
Yes, it looks like the Preview tab...or does it? There are 9 rows in the result, but there are only 3 in the table. What’s going on? It’s because we left UNNEST(citiesLived) in the query. Now it’s very clear what the UNNEST does. It creates two new columns with the contents of the nested fields. And since our query says to select from the table and from UNNEST(citiesLived), it’s actually doing a JOIN. That’s what the comma between the two means. In fact, you could replace the comma with JOIN and it would work the same. There are 9 citiesLived records in total, one for each city that each person has lived in, which is why there are now 9 records in the result.
OK, so if we remove UNNEST(citiesLived), will it look like the Preview tab?
Ignore the warning. Yes, it does, although the records are in random order, of course.
In this example, we uploaded a JSON file that already contained nested repeated data. If you want to do this with your own data and it’s not already in that format, then you’ll have to create it yourself. There are many ways of doing this, such as using Cloud Dataflow, but that’s outside the scope of this course.
If you find that you frequently need to unnest the same repeated fields in your queries, then you might want to create a view.
Here’s a really simple example. Open the query where you selected the person’s name and place, but didn’t use a WHERE clause.
Then click Save View. For the table name, use cities_by_person. You’ll notice that the icon to the left of the name is different from the other ones. That’s because a view isn’t a table, even though it asked us to put in a table name when we created it. If you click on the view, you’ll see that the schema looks like a table’s schema, but if you click on Details, you’ll see that it’s quite a bit different.
In fact, it doesn’t store any data. It just stores the query. So when you use a view in your queries, it actually runs the query again. So why would you want to use a view if it just runs the query again anyway? Well, it can make your queries less complicated. Here, I’ll show you.
First, replace the table name (persons_data) with the name of the view (cities_by_person). Now you can remove UNNEST(citiesLived). And remove the comma too. Let’s put in a WHERE clause to make it a useful query. Let’s search for all of the people who have lived in Stockholm.
OK, I realize that it probably wasn’t worth the effort of creating the view, and changing our queries to use the view, all to make the queries a little simpler, but if you were regularly running queries with multiple levels of UNNESTs, then it might be handy to create a view.
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).