image
Connecting to BigQuery

Contents

Introduction
1
Introduction
PREVIEW1m 33s
Using Data Studio
2
Creating a Report
PREVIEW10m 36s
Summary
5

The course is part of these learning paths

Start course
Difficulty
Intermediate
Duration
37m
Students
2310
Ratings
4.9/5
starstarstarstarstar-half
Description

Google Data Studio is a web-based application for creating reports and dashboards. It’s an easy-to-use tool for displaying your data visually. It was designed to help Google Analytics users create custom reports, but it can now read data from many sources, including BigQuery, Cloud SQL, and Cloud Storage.

In this course, you will learn how to connect a Data Studio report to a BigQuery dataset, visualize it with charts and graphs, and share it with your co-workers to make data-driven decisions.

Learning Objectives

  • Create a report in Data Studio
  • Connect a Data Studio report to a BigQuery dataset
  • Share a Data Studio report with appropriate levels of access

Intended Audience

  • Data professionals, especially those who work with big data
  • People studying for the Google Professional Data Engineer exam

Prerequisites

Transcript

In the last lesson, we used one of Data Studio’s sample data sources, but Data Studio has lots of connectors for bringing data in from other sources. To see what’s available, start a new report, and then click the “Create New Data Source” button.

A few of these connectors are for GCP services, including Cloud Storage, Cloud SQL, and BigQuery. To use the BigQuery connector, click on it. It asks you to authorize Data Studio to access your BigQuery data. Go ahead and do that. Then, you have to tell it where the BigQuery data resides. In most cases, you’ll probably need to select your own GCP project, but for this demo, we’re going to use a public dataset, so select that.

The one we want is the “hacker_news” dataset. If you’re not familiar with Hacker News, it’s a very popular news site hosted by Y Combinator. People upvote articles they like, which makes them appear higher in the list, although it’s not sorted in strict descending order, as you can see. Many factors determine the score of a news story, including how recently it was posted.

The hacker_news dataset contains a list of all of the stories, along with their scores, when they were posted, and a number of other details. In this lesson, we’re going to create a report that shows the highest-scoring stories that were posted yesterday.

Select the “hacker_news” dataset, then the table called “full”, and then your billing project. Now click the “Connect” button. It takes almost a minute, so I’ll fast forward.

It comes back with a list of all of the fields in the table. It has also taken a guess at which ones are dimensions and which ones are metrics. The dimensions are colored green and the metrics are colored blue. If you look closely, you can see that it just assumes that all numbers are metrics and everything else is a dimension. That’s a reasonable assumption, but it’s not completely accurate. For example, an id isn’t what we would normally think of as a metric because it isn’t a measurement, it’s just an arbitrary number.

Also notice the Aggregation column. It has set all of the metrics to “Sum” and all of the other ones to “None”. Sum means that when this metric is applied to a dimension, if there are multiple records with the same value for that dimension, then it adds them together. For example, in the last lesson, we displayed the Sessions metric against the Page Title dimension. The report aggregated all of the sessions for a particular page and displayed a single number for that page.

There are lots of other aggregation options, such as Average, Count, and Max, but we don’t have a reason to change our metrics to anything other than Sum, so leave them at that.

OK, now click the “Add to Report” button. Then click the table icon and draw a box. This doesn’t look very useful, does it. That’s because it chose “dead” for the dimension. To change it, click on “dead”. What we want to see is the title of each story, so select that. It takes a while because it has to query the table so it can update the report with the new dimension. From now on, I’ll fast forward whenever it does a query.

For the metric, it chose “parent”. Let’s change that too. What we want is the score for each story, so select that.

OK, that’s done. We should make the font a bit bigger too, so it’s easier to read.

Alright, now we have we want, which is the stories sorted by score, but we just want the stories that were posted yesterday, not all of them. Before we fix that, though, there’s something else you might have noticed. The top row has a blank title. To exclude records with no title, we can apply a filter.

Go back to the Data tab and then scroll down to the bottom and click “Add a Filter”. Then click “Create a Filter”. You can name this filter if you want, but you don’t have to. In the first dropdown, select “Exclude”. In the second dropdown, select “title”. Another dropdown appears so you can select a condition. It might seem like we should select “Is Null” because we want to exclude empty titles, but in this table, the blank titles are empty strings rather than nulls. So select “Equal to”. Then leave the value field blank. This will exclude titles with empty strings.

Now click the “Save” button. It has to run the query again. OK, the blank title is gone. Now let’s show only the stories that were posted yesterday. Could we do that with a filter? Let’s see. I’ll add another filter and I’ll include records where the time field is equal to… Well, I could put in yesterday’s date, but then I’d have to update this filter every day, which isn’t very user-friendly.

Another way would be to add a date range control. Let’s try it. Switch to view mode. You can select “Yesterday”, which looks promising, but unfortunately, it changes that to a specific date, so if we were to refresh this browser tab tomorrow, it would still show the same date.

By the way, at the bottom of the page, it says when the data was last updated. This is more interesting than it looks. For example, change the date range to last year. It changed the time when the data was last updated to just a few seconds ago, which is what you would expect. Now change the date range back to yesterday. It came back very quickly and the time at the bottom is now an earlier time. In fact, it’s the time we saw before. That’s because it retrieved the data from cache.

Data Studio has two levels of caching. The first level is called the query cache. When you request the exact same query (that is, the same dimensions, metrics, filters, and date range) as a previous query, then it will retrieve the data from the query cache. That’s what happened when we requested yesterday’s data again. It was the exact same query as we had requested before.

There are two benefits to retrieving from the query cache. First, it comes back much more quickly, and second, it doesn’t incur any additional charges in BigQuery.

If you don’t request a query that you’ve made before, then it tries the next level of cache, which is called the prefetch cache. Data Studio predicts what data you might request from the report and retrieves as much of it as possible into the prefetch cache. If it guesses correctly, then your requests come back quickly. If you make a request that it didn’t anticipate, then it goes back to BigQuery and runs the new query.

The benefit of the prefetch cache is purely performance. Unlike the query cache, it doesn’t save you money. In fact, it could cost you more. If it retrieves data that you never request, then you will incur BigQuery charges that you wouldn’t have otherwise. So Data Studio gives you the option to turn off the prefetch cache if you want.

Go back to edit mode and then select “Report settings” from the File menu. You’ll see that the “Enable cache” box is checked. To disable the prefetch cache for this report, just uncheck it.

The query cache, on the other hand, can’t be disabled. So what can you do if the data in BigQuery has changed, but your report is reading from the query cache? You can click “Refresh data” here. That refreshes both caches. Note that a user will only see the Refresh button if they have edit access to the report. Even if you don’t do a refresh, both caches will eventually expire, but it could take 12 hours or more before that happens.

OK, back to the task at hand. As you saw, you can’t use the date range control to automatically show yesterday’s data in the report.

In situations like this, there’s another solution. You can create a custom query. Click on the table again, the pencil to edit the data source. Then click “Edit Connection”. Now select “Custom Query” instead of “Public Datasets”. Then select your GCP project. Now it gives us a blank box where we can enter a query.

Of course, you’ll usually want to test your query in BigQuery itself to make sure it works before you enter it here, so we need to go to the BigQuery console. You could bring up the console and navigate to the full table in the hacker_news dataset, but there’s a shortcut that’ll take us right there. Click on “Public Datasets” and select “hacker_news”. Now put your mouse pointer over the “full” table, but this time click on the arrow symbol. It takes you to the schema of the full table in the BigQuery console.

Click the “Compose Query” button. To save you some typing, I put the SQL query you need at the bottom of the “Video Transcript” tab below this video, so you can copy it from there.

Paste it in the box. It’s written in standard SQL, so remember to uncheck the “Use Legacy SQL” box in the options.

This query selects the title, the score, and the timestamp for each story. The WHERE clause is how we only include stories that were posted yesterday. It compares each story’s timestamp to the current date minus one day. It also excludes empty titles, so we won’t have to create a filter for that in Data Studio. Then it sorts the records in descending order by score.

Click the “Run Query” button. It comes back with the data we need, so now let’s go back to Data Studio and click “Custom Query” again. Paste the SQL statement in the box. Make sure you uncheck the “Use Legacy SQL” box. Then click the “Reconnect” button. Click “Apply”.

This time it only shows the three fields we selected: score, title, and timestamp. Click “Done”. It says “Invalid dimension” because we removed the time field, which is what it was using for the Date Range Dimension. It’s not a problem because our report isn’t using the Date Range Dimension, so you can remove it if you want.

Now click on the table and change the data source by clicking on “full”. Select “BigQuery”. It updates the table with yesterday’s top stories.

Let’s make it look a bit nicer. Go back to the properties pane and click on the Style tab. Scroll down to the Columns section. In the first dropdown, select “Bar”. Now it’s sort of a sideways bar graph of the scores. We can also delete the date range control because we don’t need it anymore. You can either hit the Delete key or right-click and select “Delete”.

You could make this report more useful by including the URL for each story so you could go to the story by clicking on it. You’ll have to change the custom query to do that, so it’s a good exercise to get some practice. I’ll give you a hint, though. Data Studio thinks the url field in the Hacker News full table is just text, rather than a URL, so you’ll have to change the field type to make the links clickable.

One last thing before we go. If you want to see what queries the report has run against the BigQuery table, then go to the Query History in the BigQuery console.

And that’s it for this lesson.

SELECT title, score, timestamp
FROM `bigquery-public-data.hacker_news.full`
WHERE EXTRACT(DATE FROM timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
AND title != ""
ORDER BY score DESC

About the Author
Students
201126
Courses
97
Learning Paths
167

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).