Getting Data In
Getting Data Out
The course is part of these learning paths
BigQuery is Google’s managed data warehouse in the cloud. BigQuery is incredibly fast. It can scan billions of rows in seconds. It’s also surprisingly inexpensive and easy to use. Querying terabytes of data costs only pennies and you only pay for what you use since there are no up-front costs.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account. You do not need any prior knowledge of Google Cloud Platform and the only pre-requisite is having some experience with databases.
- Load data into BigQuery using files or by streaming one record at a time
- Run a query using standard SQL and save your results to a table
- Export data from BigQuery using Google Cloud Storage
Sometimes you need to export data from BigQuery, such as when you want to use third-party tools on the data. Exporting is pretty easy, but there's only one place you can export the data to, and that's Cloud Storage. So if you want to export data anywhere else, you have to export it to Cloud Storage first and then download it from there.
Let's do that with the baby names data. Put your mouse pointer over the table name. Click on the down arrow and select Export table. It supports the same formats as when you're uploading data: CSV, JSON, and Avro. Just for something different, let's select JSON. Then you can choose whether to compress it or not. GZIP is the only compression option. We might as well select that, too. Now, you need to specify the Cloud Storage location where you want to save the file. I'll put it in my ca-example bucket, but you'll have to put it somewhere else, of course. Make sure you have right access to whatever bucket you specify. You also have to put in the file name. I'll call it babynames_2016.json. You'll also want to add a .gz extension to the file name, because BigQuery doesn't do that for you, so it won't be obvious that this is a compressed file unless you add the extension. Then click the OK button.
It tells you that it's extracting the data from the table. It takes a little while, so I'll fast-forward. Now, if I go to the Cloud Storage bucket, I can see that the file was created. If you click on the file name, it'll download it to your computer. Now if you open it up, assuming that you have something that will open a GZIP file, you'll see the data in JSON format, which looks far different from the CSV file we originally uploaded into the table.
If you want to see a history of the exports you've done, click on Job History. If you click on one of the jobs in the list, it will give you more detail. Notice that it also lists other types of jobs, such as when we loaded the data originally. You can even re-run a load job from here if you want.
You can also use the bq command to run an export job, but the option is called extract rather than export. That is, you use the bq extract command.
As you've seen, exports are quite straightforward, but they get a little more complicated if you need to export more than a gig of data. Here, I'll show you what happens. Take a look at the github_nested table. It's 1.58 gig. I'll try to export it, and I'll even compress it, so hopefully the exported file will be less than one gig.
It says one error next to Job History. Click on Job History, and then click on the error. It says that it's too large to be exported to a single file. Specify a URI including a * to shard export. That second sentence is a little cryptic, isn't it? What it means is you have to include a wildcard so it knows to export the data into multiple files.
Here's how to do that. I'll export it again. This time, I'll put an asterisk after the github. You can put the wildcard anywhere in the path except for the bucket name, but putting it just before the first file extension is usually the best place to put it, as you'll see in a second.
It's running this time, which is a good sign. It'll take longer, so I'll fast-forward again. Now it's done, so if I go back to Cloud Storage and refresh the page, you'll see six new files with long numbers in them. BigQuery simply appends numbers staring from zero and going up by one for every file, but it puts in 12 digits with all zeroes at the beginning, just in case it needs to split the data into a lot of files.
Also, notice that the sum of those six files is nowhere near one gig. It's only about half a gig. That's because BigQuery looks at the size of the source data rather than estimating the size of the destination file when it decides whether you have to split it into multiple files or not.
Okay, if you don't want to incur ongoing storage charges for the data you've saved in this course, then you should remove it. First, delete the files you exported to Cloud Storage. Select all seven files and then click Delete. Now go back to BigQuery and delete the public data set you created. You have to type public to confirm the deletion. Then delete the streaming data set and type streaming. There, now everything you loaded or exported in this course should be gone.
And that's it for this lesson.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).