BigQuery is Google’s managed data warehouse in the cloud. BigQuery is incredibly fast. It can scan billions of rows in seconds. It’s also surprisingly inexpensive and easy to use. Querying terabytes of data costs only pennies and you only pay for what you use since there are no up-front costs.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account. You do not need any prior knowledge of Google Cloud Platform and the only prerequisite is having some experience with databases.
Learning Objectives
- Load data into BigQuery using files or by streaming one record at a time
- Run a query using standard SQL and save your results to a table
- Export data from BigQuery using Google Cloud Storage
Intended Audience
- Anyone who is interested in analyzing data on Google Cloud Platform
Prerequisites
- Experience with databases
- Familiarity with writing queries using SQL is recommended
- A Google Cloud Platform account is recommended (sign up for a free trial at https://cloud.google.com/free/ if you don’t have an account)
Resources
The GitHub repository for this course is at https://github.com/cloudacademy/bigquery-intro.
Now that you know how inexpensive it is to store data on BigQuery, you may want to save some of your query results there too.
It’s very easy to do that. First, though, since you don’t have a dataset to put tables in, you’ll need to create one. Click your project name and then click “Create Dataset”. You need to give it a name, so call it “babynames”. That’s the only thing you have to specify in this dialog box, but if you want to, you can set the location, which you might want to do for compliance reasons. You can also set an expiration time so that a certain number of days after a table has been created, it will be automatically deleted. You would do this if it’s temporary data that you don’t want to have to remember to delete later. Now click “Create dataset”, and you should see your new dataset show up under your project name.
Great, now we’re ready to run a query. Let’s open one of our previous queries, so we don’t have to type a new one in. If the list of your previous queries isn’t already showing, then click “Query History”. Click the button to the right of your first query to bring it up again.
Now before you re-run this query, select “Query settings” from the menu...and then select “Set a destination table for query results”. It has already set the right project and dataset name. Now we just need to tell it which table to use. If the table doesn’t already exist, it’ll create it, so you can type in whatever table name you want. Let’s call it “babynames_top10”. Click Save and then click “Run”.
Now if you click on the babynames dataset, you’ll see that it created the “babynames_top10” table. Then click the “Preview” tab and you’ll see the results from the query.
What if you decide to save the results to a table after you’ve run the query? That’s easy too. Click “Query History” and click on the first query. We didn’t specify that the results should be saved to a table when we ran this query the first time, but if you scroll down, you’ll see that it saved the results to a temporary table. Click that to see the table. Now you can click “Copy Table” and specify where you want to save it. You have to select the dataset first and then type in a new table name.
One more thing to be aware of is that if you don’t specify a destination table and it puts the results in a temporary table, the temporary table stays in cache for about a day. So if you run the query again within 24 hours, it’ll retrieve the cached copy and you won’t be charged for the query.
However, if you run a query again and specify a destination table, like we just did, then it won’t read the data from cache. So let’s run it again without the destination table option.
If you go to the “Job information” tab, you can see that it didn’t process any bytes because the results were cached. Also, the duration of the query was zero seconds because it just retrieved the results from cache rather than running the actual query. Of course, since we could have just looked at the results in the temporary table, there may not seem to be a lot of point in re-running the query. That’s probably true if you’re using the web interface, but if you’re using the bq command or the BigQuery API, then it might be useful in some cases.
Before we go, let’s get rid of the table we created. Click on the dataset name, and then click “Delete Dataset”. This will delete the dataset and all of the tables in it, so you have to type the name of the dataset before it’ll delete it.
Okay, that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).