Getting Data In
BigQuery is Google’s managed data warehouse in the cloud. BigQuery is incredibly fast. It can scan billions of rows in seconds. It’s also surprisingly inexpensive and easy to use. Querying terabytes of data costs only pennies and you only pay for what you use since there are no up-front costs.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account. You do not need any prior knowledge of Google Cloud Platform, and the only prerequisite is having some experience with databases.
- Load data into BigQuery using files or by streaming one record at a time
- Run a query using standard SQL and save your results to a table
- Export data from BigQuery using Google Cloud Storage
It's great having public datasets in BigQuery that you can use. But what about analyzing your own datasets? How do you get them into BigQuery? There are many ways to do this. If your data is already in another Google service, then there's usually a way to get it into BigQuery, although sometimes it requires an intermediate step. Some of the most commonly used sources are Cloud Storage, Google Drive, Cloud Datastore, Stackdriver, Cloud Bigtable, and Google Analytics.
If your data isn't in a Google service, then you can upload it to BigQuery through the web interface, the command line, or the API. Uploading through the web interface is the simplest, although it does have limitations.
Suppose you want to upload a dataset of the number of occurrences of all baby names across the US for a particular year. Since you don't already have a copy of this dataset, you'll need to download it from the Social Security website. Here's the URL (http://www.ssa.gov/OACT/babynames/names.zip). Then unzip the file. You can see that it contains one text file for each year, from 1880 to 2016. Let's have a look at the 2016 file.
It's a very simple file, with a name, a gender, and the number of occurrences of that name for babies born in 2016. Although the file name extension is TXT, it's actually a comma-separated values file, or CSV. BigQuery can upload files in three formats: CSV, JSON, and Avro.
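To make the row layout concrete, here is a minimal sketch using two made-up sample rows in the same name, gender, count format as the yearly files (the names and counts are illustrative, not taken from the real file):

```shell
# Write two sample rows in the yob2016.txt layout: name,gender,count.
printf 'Emma,F,19414\nNoah,M,19015\n' > sample_names.csv

# Pull out the third column (the occurrence count) with cut:
cut -d, -f3 sample_names.csv
```

Note that, like the real file, there is no header line listing the field names, which matters for schema detection later.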
Before you can upload a file, you need to create a dataset and table in BigQuery to hold it. Let's call it babynames.
Now you need to create a table in that dataset. It isn't obvious how to do that, but if you put your mouse pointer over the dataset name, a drop-down arrow will show up to the right of it. Click on it and select Create new table. You can also click the plus sign to the left of the down arrow.
The location should already be set to File upload. If you click on the menu, you'll see that you could also choose Cloud Storage, Google Drive, or Cloud Bigtable. If your data were in one of those three services, then you could load it into BigQuery from here, or you could even leave the data where it is, and make it an external data source. However, the performance is usually slower when you query an external data source than if the data resides in BigQuery Storage. So it's often better to copy the data into BigQuery instead.
Okay, back to the task at hand. Leave the location on File upload and click the Choose file button. Then select the YOB2016.txt file. The file format should already be set to CSV, so just leave it at that. Now you need to give your table a name. Let's call it names_2016.
You'll notice there's an option to automatically detect the schema. That'll save you from having to enter the schema manually, so check that box. Now click Create Table, and after a few seconds, it says it uploaded the file.
Now, let's see if it detected the schema properly. Considering how simple the schema is for this file, it's unlikely that BigQuery would get it wrong, but let's look anyway. Click on the table name. It says there are three fields: a string, a boolean, and an integer. That's correct, because the file has a name, then either male or female, then the number of occurrences. The only problem is that the field names are not very descriptive. For example, the name field is called string_field_0. That's because the first line of the file didn't list the field names, so BigQuery didn't know what they should be called. If you have a data file like that, you could either add a header line with the field names, or specify the schema before you upload the file.
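As a hedged sketch, here is what specifying the schema yourself might look like with the `bq` command-line tool (introduced later in this lesson), so the columns get descriptive names instead of the auto-generated ones. The field names `name`, `gender`, and `occurrences` are my own choices, not from the course, and the gender column is declared as a plain STRING here:

```shell
# Load the file with an explicit schema instead of autodetection.
# The final argument is a comma-separated list of field:TYPE pairs.
bq load \
  --source_format=CSV \
  babynames.names_2016 \
  ./YOB2016.txt \
  name:STRING,gender:STRING,occurrences:INTEGER
```

This requires an authenticated Google Cloud project, so it is shown for illustration only.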
Okay, if you click on the Preview tab, you'll see a sample of the data. That looks right. There's a message above saying that if the schema is wrong, then you can rerun the load job with an adjusted schema. Although the schema's correct, you might want to use this opportunity to specify the schema and upload the file again, but we're just going to leave it as is.
At this point, you could run queries on this table, just like you did with the table in the public dataset.
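For example, a standard SQL query run with the `bq` tool might look like the sketch below. The auto-generated column names are assumptions based on the `string_field_0` pattern mentioned above, so check the table's schema for the actual names before running it:

```shell
# Top ten most common baby names of 2016, assuming autodetected column names.
bq query --use_legacy_sql=false \
  'SELECT string_field_0 AS name, int64_field_2 AS occurrences
   FROM babynames.names_2016
   ORDER BY occurrences DESC
   LIMIT 10'
```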
Okay, that was all pretty easy, but remember when I said that the web interface has limitations? One big limitation is that you can only upload files that are 10 megabytes or less in size. There are lots of data files that are bigger than that. Here's an example. It's a 35-megabyte file that contains over 200 thousand questions and answers from the game show Jeopardy. You can download this file from here. If you try to load it using the web interface, it'll give you an error.
In cases like this, you need to use the command line. Alternatively, you could upload it to Cloud Storage first, then move it to BigQuery, but that would be a bit of a hassle, so it's better to use the command line.
If you haven't used any of Google Cloud's command line tools yet, that is, gcloud, gsutil, kubectl, or bq, then you'll have to install the Google Cloud SDK. The installation instructions are at https://cloud.google.com/sdk/docs. If you need to install the SDK, then pause this video while you do that.
Okay, first you need to create a table to put data in, which we could do through the web interface again, but let's use the command line for that too. Before we create the table, we need to have a dataset to put the table in. If we had used a more generic name for the previous dataset instead of babynames, then we could have created a table for the Jeopardy data in that dataset too. Let's not make that mistake again. What should we call it? Maybe we should put all of the data that we downloaded from the internet in it. We could call it downloads, or something like that. But maybe we should just call it public.
The command for all BigQuery operations is bq. To create a dataset, type bq and then mk, for make, and then the name of the dataset, which is public, in this case.
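Put together, the command described above is:

```shell
# Create a new dataset named "public" in the current project.
bq mk public
```

This needs an authenticated project, so run it only against your own account.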
Okay, the dataset was created. Now we need to create a table. Let's call the table jeopardy. There are a couple of ways to create the table. You could create an empty table and then upload the data into it, or you could do it all in one step, which is usually easier. To upload a file, type bq load, then the autodetect flag, which tells it to automatically detect the schema, so you don't have to specify the schema yourself. Then type the name of the table you want to load it into. If you haven't set a default dataset, then you also need to specify the dataset name, so in this case, you'd type public.jeopardy. Then type the file name you want to upload, which is jeopardy_CSV.csv. If you're not in the directory where the CSV file resides, then you'll have to put in the path to that file. It will take a little while to upload the file.
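The full load command described above, in one line:

```shell
# Create the jeopardy table and load the CSV in one step,
# letting BigQuery detect the schema from the file.
bq load --autodetect public.jeopardy jeopardy_CSV.csv
```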
By the way, another reason to use the command line instead of the web interface, is if you need to upload lots of files at the same time. With the bq command, you can put an asterisk in the file name, which will act as a wild card and upload all matching files.
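As a hedged sketch of that wildcard behavior, loading all of the yearly baby-names files into one table might look like this (the table name is my own choice, and the lowercase file pattern assumes how the unzipped files are named on your machine):

```shell
# Load every yearly file matching the pattern into a single table,
# using the wildcard behavior described above.
bq load --autodetect public.babynames ./yob*.txt
```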
Let's have a look in the web interface again, to make sure it uploaded properly. You have to refresh the page first so you can see the updates. Great, there's the public dataset and there's the jeopardy table. Click on the jeopardy table and then go to the Preview tab. It looks good. Did you notice that the column names are actually descriptive? They have names like Show Number and Category instead of generic names like string_field_0 in the babynames table. And that's because the first line of this CSV file listed the field names.
On a different topic, you might be wondering why we didn't just rename the babynames dataset to public instead of creating a new dataset. Well, that's because you can't rename a dataset in BigQuery. The only way to do it is to create a new dataset, copy all of the tables from the old dataset to the new dataset, and then delete the old dataset and its tables. So choose your dataset names carefully.
Let's move the names_2016 table from the babynames dataset to the public dataset. In the menu next to the names_2016 table, select Copy table, then change the Destination dataset to public, and change the Destination table to babynames_2016. Click OK, and it takes a few seconds to do the copy. Now go into the menu next to the babynames dataset and select Delete dataset. This will delete the dataset and all of the tables in it, so you have to type the name of the dataset before it will delete it. There, it's done.
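The same copy-and-delete steps can also be done from the command line; a hedged sketch, assuming the dataset and table names used in this lesson:

```shell
# Copy the table into the public dataset under its new name.
bq cp babynames.names_2016 public.babynames_2016

# Remove the old dataset and everything in it.
# -r deletes the dataset recursively; -f skips the confirmation prompt.
bq rm -r -f babynames
```

The `-f` flag bypasses the "type the dataset name to confirm" safeguard you saw in the web interface, so use it carefully.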
And that's it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).