Getting Data In
Getting Data Out
The course is part of these learning paths
BigQuery is Google’s managed data warehouse in the cloud. BigQuery is incredibly fast. It can scan billions of rows in seconds. It’s also surprisingly inexpensive and easy to use. Querying terabytes of data costs only pennies and you only pay for what you use since there are no up-front costs.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account. You do not need any prior knowledge of Google Cloud Platform and the only pre-requisite is having some experience with databases.
- Load data into BigQuery using files or by streaming one record at a time
- Run a query using standard SQL and save your results to a table
- Export data from BigQuery using Google Cloud Storage
To open BigQuery, if you're already in the Google Cloud console, then go down to the Big Data section of the menu, and select BigQuery. If you're not in the console, then go to bigquery.cloud.google.com.
Suppose you wanted to see which U.S. state had the most babies of the same name in one year. There's a public dataset with baby name data available on BigQuery. If you look under Public Datasets on the left, you'll see one called usa_names. To see the table name, you have to click on the arrow at the left. There are two tables that are almost the same. We'll use the first one.
If you click on it, you'll see the schema. If you click on the Details tab, it will show you a description of the data in the table. If you click on the Preview tab, it will give you a sample of the data. If you click the Query Table button, a New Query text box will appear, and it will even give you the skeleton of a SQL query.
But before you run a query, there's something you should do. You see, originally, BigQuery had its own version of SQL that was slightly different from standard SQL. Then in September, 2016, BigQuery 2.0 was released, and it supported the SQL 2011 standard. This is now the preferred query language, but unfortunately by default, BigQuery still assumes that you're using its legacy query language. So you should always make sure you select standard SQL before you type in any queries. To do that, click the Show Options button, and remove the check mark from the Use Legacy SQL box. Then click the Hide Options button, so the screen isn't so cluttered. Now we're ready to create a SQL query.
Let's select star from the table, which if you're not familiar with SQL, means select everything. You can see whether you syntax is correct, by looking at the validator circle at the bottom right. It's a red exclamation point, which means there's a problem. To see why, click on it, and open the validator. It doesn't like the square bracket, and it says if this is a table identifier, escape the name with a back quote. We're getting this error because legacy SQL uses square brackets around table names, but standard SQL uses back quotes. The easiest way to fix this is to click the Query Table button again, so it will put in a skeleton query that conforms to standard SQL. The last time we clicked the Query Table button, the legacy SQL option was checked, so it gave us a skeleton query that conforms to legacy SQL. Okay, now click the button and watch what happens. It's just warning that the existing query will be erased. Okay, did you see what changed? It replaced the square brackets with back quotes. There also used to be a colon after bigquery-public-data, and now it's a period.
Okay, let's get back to typing in our query. First, put the asterisk back in, because it got erased when we clicked the Query Table button. Then, we should sort the results with the biggest number at the top. So use ORDER BY number DESC, for descending. And then, since we only need to see the top results, let's put in a limit of 10. This line is quite long now, so if you want to make it easier to read, click the Format Query button. That's better.
Now the exclamation point has turned into a green check mark. Now click the Run Query button. It only takes a few seconds to run.
The top result is Robert in New York in 1947, with 10,025 occurrences. You might be wondering if we did something wrong with this query, because all of the top ten names are boys names. Let's look only for girls names and see what happens. Add WHERE gender = 'F'. Remember to put quotes around the F.
Now it makes sense why we only saw boys names before. The highest number of occurrences for a girl's name was Mary in Pennsylvania in 1918, with 8,184 occurrences. Although that's a lot of Marys, there were 9,054 Roberts in New York in 1951, and that was the tenth highest number of occurrences. So no girls names showed up in the top 10.
Before we run any more queries, let's see how much this is costing us. I'll cover that in the next lesson.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).