Getting Data In
Getting Data Out
BigQuery is Google’s managed data warehouse in the cloud. It’s incredibly fast, scanning billions of rows in seconds, and it’s also surprisingly inexpensive and easy to use. Querying terabytes of data costs only pennies, there are no up-front costs, and you pay only for what you use.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account. You do not need any prior knowledge of Google Cloud Platform, and the only prerequisite is some experience with databases.
- Load data into BigQuery using files or by streaming one record at a time
- Run a query using standard SQL and save your results to a table
- Export data from BigQuery using Google Cloud Storage
- Anyone who is interested in analyzing data on Google Cloud Platform
- Experience with databases
- Familiarity with writing queries using SQL is recommended
- A Google Cloud Platform account is recommended (sign up for a free trial at https://cloud.google.com/free/ if you don’t have an account)
The GitHub repository for this course is at https://github.com/cloudacademy/bigquery-intro.
Sometimes you need to export data from BigQuery, such as when you want to use third-party tools on the data. Exporting is pretty easy, but there is only one place you can export the data to, and that’s Cloud Storage. So if you want to export data anywhere else, you have to export it to Cloud Storage first, and then download it from there.
Let’s do that with the babynames data. Click the table name, then select “Export to GCS” from the “Export” menu. It supports CSV, JSON, and Avro formats. Just for something different, let’s select JSON. Then you can choose whether to compress it or not. GZIP is the only compression option. This is a pretty small file, so let’s not bother compressing it.
Now you need to specify the Cloud Storage location where you want to save the file. I’ll put it in my ca-example bucket, but you’ll have to put it somewhere else, of course. Make sure you have write access to whatever bucket you specify. You also have to put in the filename. I’ll call it babynames_2019.json. Then click the Select button...and the Export button.
It tells you that it started an export job. It doesn’t take long to finish. Now if we go to the Cloud Storage bucket, we can see that the file was created. If you click on the filename, you can download it to your computer.
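If you’d rather stay on the command line, you can download the exported file with gsutil instead of clicking through the console. This is a sketch using the bucket and filename from the demo; substitute your own:

```shell
# Download the exported file from Cloud Storage to the current directory.
# The bucket (ca-example) and filename are the ones used in the demo.
gsutil cp gs://ca-example/babynames_2019.json .
```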
Now if you open it up, you’ll see the data in newline-delimited JSON format, with one JSON object per line, which looks quite different from the CSV file we originally uploaded into the table.
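To make the difference concrete, here’s what the same record looks like in each format. The column names match the babynames data, but the values below are just illustrative:

```shell
# One record as CSV, the way it was originally loaded (header row plus data row):
printf 'name,gender,count\nEmma,F,17102\n'

# The same record in BigQuery's JSON export: one complete JSON object per line,
# with the column names repeated in every record.
printf '{"name":"Emma","gender":"F","count":"17102"}\n'
```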
If you want to see a history of the exports you’ve done, click on Job History. If you click on one of the jobs in the list, it’ll give you more detail. Notice that it also lists other types of jobs, such as when we loaded the data originally. You can even re-run a load job from here if you want.
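The same job history is available from the command line, in case you’d rather not use the console:

```shell
# List the ten most recent jobs (loads, queries, and exports) in the current project.
bq ls -j -n 10

# Show the details of one job; replace <job_id> with an ID from the list above.
bq show -j <job_id>
```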
You can also use the bq command to run an export job, but the option is called “extract” rather than “export”. That is, you use the “bq extract” command.
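For example, the export we just did in the console would look something like this with bq extract. The dataset and table names here (babynames.names_2019) are assumptions based on the earlier demos, so substitute your own:

```shell
# Export a table to Cloud Storage as newline-delimited JSON.
# "babynames.names_2019" and the ca-example bucket are assumptions from the demo.
# Add --compression=GZIP if you want the output compressed.
bq extract \
  --destination_format=NEWLINE_DELIMITED_JSON \
  babynames.names_2019 \
  gs://ca-example/babynames_2019.json
```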
As you’ve seen, exports are quite straightforward, but they get a little more complicated if you need to export more than a gig of data. Here, I’ll show you what happens. Take a look at the games_wide table in the baseball dataset. It’s 1.76 gig. I’ll try to export it, and I’ll even compress it, so hopefully the exported file will be less than one gig.
It gives us an error. It says that it’s “too large to be exported to a single file. Specify a uri including a * to shard export.” That second sentence is a little cryptic, isn’t it? What it means is you have to include a wildcard so it knows to export the data in multiple files.
Here’s how to do that. I’ll export it again, and this time, I’ll put an asterisk after “games”. You can put the wildcard anywhere in the path except the bucket name, but just before the file extension is usually the best spot, as you’ll see in a second.
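With bq extract, the same wildcard trick looks something like this. The dataset name (public) is the one from earlier in the course; the bucket is illustrative:

```shell
# The * in the destination URI tells BigQuery to shard the export across
# multiple files. Quote the URI so your shell doesn't expand the wildcard.
bq extract \
  --destination_format=NEWLINE_DELIMITED_JSON \
  --compression=GZIP \
  public.games_wide \
  'gs://ca-example/games*.json.gz'
```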
It’s running this time, which is a good sign. It’ll take a lot longer, so I’ll fast forward. Now it’s done, so if I go back to Cloud Storage, you’ll see two new files with long numbers in them. BigQuery numbers the files sequentially starting from 0, but it zero-pads the numbers to 12 digits, just in case it ever needs to split the data into a very large number of files.
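You can reproduce that naming pattern yourself: each shard gets a sequential counter, zero-padded to 12 digits, substituted where the asterisk was.

```shell
# Simulate the filenames BigQuery generates for the first three shards.
# %012d zero-pads the counter to 12 digits, matching BigQuery's pattern.
for i in 0 1 2; do
  printf 'games%012d.json.gz\n' "$i"
done
# prints games000000000000.json.gz, games000000000001.json.gz, games000000000002.json.gz
```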
Also notice that the combined size of those two files is nowhere near one gig. It’s less than a hundred meg. That’s because, when BigQuery decides whether you have to split an export into multiple files, it looks at the size of the source data rather than estimating the size of the destination file.
Okay, if you don’t want to incur ongoing storage charges for the data you’ve saved in this course, then you should remove it. First, delete the files you exported to Cloud Storage. Select all three files, and then click “Delete”. Now go back to BigQuery, and delete the public dataset you created. You have to type “public” to confirm the deletion. Then delete the streaming dataset. There, now everything you loaded or exported in this course should be gone.
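The same cleanup can be done from the command line. The bucket and the “streaming” dataset name are assumptions based on the demos; “public” is the dataset deleted above:

```shell
# Delete the exported files from Cloud Storage (the wildcard matches the shards).
gsutil rm gs://ca-example/babynames_2019.json 'gs://ca-example/games*.json.gz'

# Delete the datasets; -r removes their tables too, -f skips the confirmation prompt.
bq rm -r -f public
bq rm -r -f streaming
```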
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).