Getting Data In
BigQuery is Google’s managed data warehouse in the cloud. BigQuery is incredibly fast. It can scan billions of rows in seconds. It’s also surprisingly inexpensive and easy to use. Querying terabytes of data costs only pennies, and since there are no up-front costs, you pay only for what you use.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account. You do not need any prior knowledge of Google Cloud Platform and the only prerequisite is having some experience with databases.
- Load data into BigQuery using files or by streaming one record at a time
- Run a query using standard SQL and save your results to a table
- Export data from BigQuery using Google Cloud Storage
- Anyone who is interested in analyzing data on Google Cloud Platform
- Experience with databases
- Familiarity with writing queries using SQL is recommended
- A Google Cloud Platform account is recommended (sign up for a free trial at https://cloud.google.com/free/ if you don’t have an account)
The GitHub repository for this course is at https://github.com/cloudacademy/bigquery-intro.
So far, we’ve run queries on public datasets and on pre-existing data that we loaded into BigQuery. But there’s another way to get data into BigQuery: streaming, where you add data one record at a time instead of a whole table at a time. For example, streaming is useful if you need a real-time dashboard that gives you an up-to-the-minute (or even up-to-the-second) view of your data.
One common way to stream data into BigQuery is to use Cloud Dataflow, but that’s the subject of another course. You can stream data directly into BigQuery by calling the API in your software. I’ll show you an example using the API Explorer, which lets you make an API call from a nice interface in your browser.
Before we do that, we need to create a table to stream the data into. We should also create a new dataset, since this table won’t hold public data and therefore doesn’t belong in the dataset we’ve been using for that. Let’s call the new dataset “streaming”. Now create the table. This time, we’ll leave it as “Empty table” because we’re not loading any data into it right now. Let’s call the table “stream”.
For the schema, we could put in pretty much anything, since we’re just going to stream some test data, but let’s have at least two fields. Create a string field called “greeting” and a string field called “target”. And click “Create table”.
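If you’re curious what the “Create table” form is building behind the scenes, the schema can be expressed as the JSON that BigQuery’s tables API uses. This is a sketch: the field names come from the demo, and “STRING” and “NULLABLE” are BigQuery’s standard type and mode values.

```python
import json

# The two-field schema from the demo, in BigQuery's JSON schema format.
schema = {
    "fields": [
        {"name": "greeting", "type": "STRING", "mode": "NULLABLE"},
        {"name": "target", "type": "STRING", "mode": "NULLABLE"},
    ]
}

print(json.dumps(schema, indent=2))
```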
Okay, now we’ll go to the BigQuery API documentation. The URL is in the GitHub repository for this course. We’re going to use the “tabledata.insertAll” method to stream data into the table.
On the right-hand side is a form you can fill out to call the API. First, put in your project ID (which will be different from mine, of course). You can get your project ID from BigQuery. Then put in “streaming” for the datasetId, and “stream” for the tableId.
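Under the hood, those three form fields just fill in the path of the REST endpoint that the API Explorer calls. Here’s a sketch of how the URL is assembled; “example-project” is a placeholder, so substitute your own project ID.

```python
# tabledata.insertAll is an HTTP POST to this REST path.
project_id = "example-project"  # placeholder; use your own project ID
dataset_id = "streaming"
table_id = "stream"

url = (
    "https://bigquery.googleapis.com/bigquery/v2/"
    f"projects/{project_id}/datasets/{dataset_id}/tables/{table_id}/insertAll"
)
print(url)
```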
Now, in the “Request body” text box, click the plus sign and select “rows”. This saves you from having to type in everything yourself. It’s especially helpful with the brackets because we’re going to have quite a few nested brackets in a minute. Now click the plus sign under “rows” and select “[Add Item]”. That puts in a couple more brackets. Then click the plus sign between those and select “json”.
Now we finally need to type something. This is where we say what data we want to put in the row we’re adding to the table. If you’ll recall, the first field is called “greeting”, so type that between quotes after the curly bracket, then a colon, then “Hello” in quotes, followed by a comma. Hit Enter. We called the second field “target”, so type “target” in quotes, a colon, and “world” in quotes.
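When you’re done clicking and typing, the request body you’ve assembled looks like this. Written as a Python dict for clarity: “json” is the key name the insertAll API uses for each row’s data, and the field names match the table schema we just created.

```python
import json

# The complete insertAll request body from the demo.
body = {
    "rows": [
        {
            "json": {
                "greeting": "Hello",
                "target": "world",
            }
        }
    ]
}

print(json.dumps(body, indent=2))
```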
All right, that’s a complete record, so the request body is done. The editor added some extra lines at the bottom for some reason, so just delete those. Good, the errors went away.
Now click the “Execute” button. Okay, the response code is 200 and it’s colored green, which means the API call was successful. It also didn’t return an error in the body, so it looks like it worked. Let’s see.
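One thing worth knowing about that response: insertAll can return HTTP 200 even when individual rows are rejected, because per-row problems come back in an “insertErrors” list in the body rather than as an HTTP error. A minimal checker, using sample responses shaped like the API’s (the “bad” one is hypothetical):

```python
def failed_rows(response: dict) -> list:
    """Return the insertErrors entries, or an empty list if every row landed."""
    return response.get("insertErrors", [])

# A clean response, like the one we just got:
ok = {"kind": "bigquery#tableDataInsertAllResponse"}

# A hypothetical response where row 0 was rejected:
bad = {
    "kind": "bigquery#tableDataInsertAllResponse",
    "insertErrors": [
        {"index": 0, "errors": [{"reason": "invalid", "message": "no such field"}]}
    ],
}

print(failed_rows(ok))   # []
print(failed_rows(bad))  # one entry, for row 0
```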
Go back to the BigQuery page and click on the “stream” table. Click the “Preview” tab. Great, there’s our “Hello world” message that we just sent.
Although that was a quick way to show you how streaming works, you likely won’t be calling the API directly when you write your own streaming code. Instead, you should use the BigQuery client libraries.
Google provides client libraries for C#, Go, Java, Node.js, PHP, Python, and Ruby. There’s a different way of doing a streaming insert for each language, but it’s easier than writing the API call yourself. If you need to write code to stream data into BigQuery, have a look at the documentation for your language.
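As a taste of how much simpler the client library makes this, here’s a sketch in Python using the google-cloud-bigquery library’s `insert_rows_json` method. It assumes you’ve installed the library and set up application default credentials, so it won’t run as-is without a Google Cloud account.

```python
def stream_rows(table_id: str, rows: list) -> list:
    """Stream rows into BigQuery with the Python client library.

    Assumes `pip install google-cloud-bigquery` and application default
    credentials; table_id looks like "my-project.streaming.stream".
    """
    from google.cloud import bigquery  # imported lazily so the sketch parses without the library

    client = bigquery.Client()
    # insert_rows_json takes plain dicts, one per row, and returns a list
    # of per-row errors (empty on success).
    return client.insert_rows_json(table_id, rows)

# The same "Hello world" record we streamed through the API Explorer:
rows = [{"greeting": "Hello", "target": "world"}]
```

Notice there’s no request body to assemble by hand: the library handles the nesting and the authentication for you.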
Oh, and one more thing. Remember in the pricing lesson when I said that there’s a separate charge for streaming? Well, it’s 5 cents per GB to do streaming inserts. That actually makes it the most expensive BigQuery operation. Loading data in any other way is free, querying data costs half a cent per gig, and storing data costs 2 cents per gig, at most. But if you need to do up-to-the-minute analysis of streaming data, then 5 cents a gig is still pretty cheap.
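To put those numbers together, here’s a back-of-the-envelope cost calculation using the per-gigabyte prices quoted in this lesson (the workload figures are made up for illustration):

```python
# Per-GB prices from the lesson: streaming inserts 5 cents, queries half
# a cent per GB scanned, storage at most 2 cents.
STREAMING_PER_GB = 0.05
QUERY_PER_GB = 0.005
STORAGE_PER_GB = 0.02  # at most; long-term storage is cheaper

def monthly_cost(streamed_gb: float, scanned_gb: float, stored_gb: float) -> float:
    return (streamed_gb * STREAMING_PER_GB
            + scanned_gb * QUERY_PER_GB
            + stored_gb * STORAGE_PER_GB)

# Streaming 10 GB, scanning 100 GB in queries, and storing 50 GB:
print(monthly_cost(10, 100, 50))  # 2.0 dollars
```

Even though streaming is the priciest operation per gigabyte, the totals stay small for modest workloads.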
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).