Getting Data In
Getting Data Out
The course is part of these learning paths
BigQuery is Google’s managed data warehouse in the cloud. BigQuery is incredibly fast. It can scan billions of rows in seconds. It’s also surprisingly inexpensive and easy to use. Querying terabytes of data costs only pennies and you only pay for what you use since there are no up-front costs.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account. You do not need any prior knowledge of Google Cloud Platform and the only prerequisite is having some experience with databases.
- Load data into BigQuery using files or by streaming one record at a time
- Run a query using standard SQL and save your results to a table
- Export data from BigQuery using Google Cloud Storage
- Anyone who is interested in analyzing data on Google Cloud Platform
- Experience with databases
- Familiarity with writing queries using SQL is recommended
- A Google Cloud Platform account is recommended (sign up for a free trial at https://cloud.google.com/free/ if you don’t have an account)
The GitHub repository for this course is at https://github.com/cloudacademy/bigquery-intro.
It’s great having public datasets in BigQuery that you can use, but what about analyzing your own datasets? How do you get them into BigQuery? There are many ways to do this. If your data is already in another Google service, then there’s usually a way to get it into BigQuery, although sometimes it requires an intermediate step. Some of the most commonly used sources are Cloud Storage, Google Drive, Firestore, Cloud Operations, Bigtable, and Google Analytics.
If your data isn’t in a Google service, then you can upload it to BigQuery through the web interface, the command line, or the API. Uploading through the web interface is the simplest, although it does have limitations.
Suppose you want to upload a dataset of the number of occurrences of all baby names across the US for a particular year. Since you don’t already have a copy of this dataset, you’ll need to download it from the Social Security website. The URL is in the GitHub repository for this course. Then unzip the file. You can see that it contains one text file for each year from 1880 to 2019. Let’s have a look at the 2019 file.
It’s a very simple file, with a name, a gender, and the number of occurrences of that name for babies born in 2019. Although the filename extension is “txt”, it’s actually a comma-separated values file (or CSV). BigQuery can upload files in 5 formats: CSV, JSON, Avro, ORC, or Parquet.
Before you can upload a file, you need to create a dataset and table in BigQuery to hold it. Let’s call it “babynames”.
Now you need to create a table in that dataset, so click the “Create Table” button. Change the Source to “Upload”. If you click on the menu, you’ll see that you could also choose Cloud Storage, Google Drive, or Cloud Bigtable. If your data were in one of those three services, then you could load it into BigQuery from here or you could even leave the data where it is and make it an external data source However, the performance is usually slower when you query an external data source than if the data resides in BigQuery storage, so it’s often better to copy the data into BigQuery instead.
Okay, back to the task at hand. Set the Location to “Upload” and click the “Browse” button. Then select the “yob2019.txt” file. Change the file format to CSV. Now you need to give your table a name. Let’s call it “names_2019”.
You’ll notice there’s an option to automatically detect the schema and input parameters. That’s often a very handy feature because it saves you from having to enter the schema manually. Let’s see if it works with this data file.
It got an error. The message is a little bit cryptic, but here’s what happened. When BigQuery tries to detect the schema, it only looks at the first 100 records. In this file, the first 100 records all have ‘F’ in the second column. BigQuery assumes this means that the second column can be either an ‘F’ for “False” or a ‘T’ for “True”, so it sets this column to Boolean. When it tries to upload records with ‘M’ in that column, it gets an error because it’s expecting an ‘F’ or a ‘T’.
In situations like this, you have to specify the schema manually. Fortunately, it’s pretty easy in this case. First, I’ll redo everything except the schema.
Okay, now instead of asking it to auto-detect the schema, click “Add field”. The first field is the name, so type “name”. It’s a string, so the type is set correctly. The mode is set to “Nullable” by default, which means that this field can be empty for some records. If we wanted to make sure that no records were missing the name, then we would set the mode to “Required”. This file isn’t missing any names, but let’s leave it as Nullable anyway.
Now we’ll add the second field. Call it “gender”, and leave it as a string. The third field is the number of people who have this name, so call it “count”. It’s a whole number, so set the Type to “Integer”. Okay, now click “Create table”. It only takes a couple of seconds to upload all of the data into the table.
All right, go to the table and click on the Preview tab to see a sample of the data. That looks right. At this point, you could run queries on this table just like you did with the table in the public dataset.
Okay, that was all pretty easy, but remember when I said that the web interface has limitations? One big limitation is that you can only upload files that are 10 megabytes or less in size. There are lots of data files that are bigger than that. Here’s an example. It’s a 35 meg file that contains over 200,000 questions and answers from the game show, Jeopardy. You can find the URL for this file in the GitHub repository. If you try to load it using the web interface, it’ll give you an error.
In cases like this, you need to use the command line. Alternatively, you could upload it to Cloud Storage first and then move it to BigQuery, but that would be a bit of a hassle, so it’s better to use the command line.
If you haven’t used any of Google Cloud’s command line tools yet (that is, gcloud, gsutil, kubectl, or bq), then you’ll have to install the Google Cloud SDK. The installation instructions are at this URL. If you need to install the SDK, then pause this video while you do that. .
Okay, first you need to create a table to put data in, which we could do through the web interface again, but let’s use the command line for that too. Before we create the table, we need to have a dataset to put the table in. If we had used a more generic name for the previous dataset, instead of “babynames”, then we could have created a table for the Jeopardy data in that dataset too. Let’s not make that mistake again. What should we call it? Maybe we should put all of the data that we downloaded from the internet in it. We could call it “downloads” or something like that, but maybe we should just call it “public”.
The command for all BigQuery operations is “bq”. To create a dataset, type “bq”, and then “mk” for make, and then the name of the dataset, which is “public” in this case.
Okay, the dataset was created. Now we need to create a table. Let’s call the table “jeopardy”. There are a couple of ways to create the table. You could create an empty table and then upload the data into it or you could do it all in one step, which is usually easier.
To upload a file, type “bq load”, then the autodetect flag, which tells it to automatically detect the schema so you don’t have to specify the schema yourself. Then type the name of the table you want to load it into. If you haven’t set a default dataset, then you also need to specify the dataset name. In this case, you would type “public.jeopardy”. Then type the filename you want to upload, which is JEOPARDY_CSV.csv. If you’re not in the directory where the csv file resides, then you’ll have to put in the pathname to that file. It’ll take a little while to upload the file.
By the way, another reason to use the command line instead of the web interface is if you need to upload lots of files at the same time. With the bq command, you can put an asterisk in the filename, which will act as a wildcard and upload all matching files.
Let’s have a look in the web interface again to make sure the file uploaded properly. You have to refresh the page first so you can see the updates. Great, there’s the “public” dataset and there’s the “jeopardy” table. Click on the jeopardy table, and then go to the Preview tab. It looks good. Did you notice that the column names are actually descriptive? They have names like “Show Number” and “Category” instead of generic names like “string_field_0”. That’s because the first line of the csv file listed the field names.
On a different topic, you might be wondering why we didn’t just rename the “babynames” dataset to “public” instead of creating a new dataset. Well, that’s because you can’t rename a dataset in BigQuery. The only way to do it is to create a new dataset, copy all of the tables from the old dataset to the new dataset, and then delete the old dataset and its tables. So choose your dataset names carefully.
Let’s move the names_2019 table from the babynames dataset to the public dataset. Click on it, then select “Copy Table”. Now change the destination dataset to “public” and call the destination table “babynames_2019”. Click Copy. It takes a few seconds. Now click on the “babynames” dataset, and click “Delete dataset”. This will delete the dataset and all of the tables in it, so you have to type the name of the dataset before it will delete it. There, it’s done.
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).