BigQuery is Google's incredibly fast, secure, and surprisingly inexpensive data warehouse, but there are ways to make it even faster, cheaper, and more secure.
Here are some examples of what you will learn in this course:
- BigQuery can process billions of rows in seconds, but only if you break the rules of relational database design.
- If you are analyzing relatively small amounts of data, then your queries won’t cost very much, but if you regularly analyze huge datasets, then your costs can add up quickly. However, with a few adjustments to how you store your data in BigQuery, you can run queries for a fraction of the cost.
- To give you the flexibility to implement fine-grained security, BigQuery has several layers of access control capabilities, but they can be confusing, so I’ll show you which ones to use to meet your organization’s requirements.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
- Reduce your BigQuery costs by reducing the amount of data processed by your queries
- Create, load, and query partitioned tables for daily time-series data
- Speed up your queries by using denormalized data structures, with or without nested and repeated fields
- Implement fine-grained access control using roles and authorized views
- Database administrators
- Anyone who wants to learn how to get the most out of Google BigQuery
To get the most out of this course, you should already have some experience with BigQuery. If you don’t, then please take Introduction to Google BigQuery first.
The GitHub repository for this course can be found at https://github.com/cloudacademy/optimizing-bigquery/.
Data can be one of the most precious resources an organization owns. So it’s important to keep tight control of not only who’s allowed to read your data, but also who’s allowed to modify or delete it. That might seem pretty basic, but doing this with BigQuery is more complicated than it sounds.
BigQuery provides several layers of access control. The top layer is primitive roles, which act at the project level. They’re administered through IAM, the Identity and Access Management system.
There are three primitive roles: owner, editor, and viewer. When you add a member to a project, you can assign them one of these roles, which will apply to all Google Cloud Platform services, not just BigQuery. A viewer can view all datasets and run jobs (such as queries). Editors have viewer permissions, but can also modify or delete all tables. They can’t modify datasets, but they can create new datasets. An owner has editor permissions, but can also delete all datasets and see all jobs for all users in the project.
Primitive roles are fine as long as you have these very simple access control requirements:
- First, each user can have the same permissions for all GCP resources in a project, not just BigQuery. For example, if a user has the Editor role, then they’ll be an Editor not only for BigQuery, but also for Google Compute Engine instances or any other resources in the project. If you only have BigQuery enabled in this project, then this isn’t an issue.
- Second, each user can have the same permissions for all datasets in a project. For example, if each team has owner permissions for its own datasets, then it will also have owner permissions for other teams’ datasets in this project.
- Third, you don’t need to separate data access permissions from job-running permissions. For example, if a user has permission to view data in a project, then they can also run queries on that data.
If all of these requirements are true for a project, then primitive roles are a good solution, but otherwise, you’ll need to use predefined roles.
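As a quick sketch, here’s how you could grant a primitive role at the project level with the gcloud command-line tool. The project ID and email address are placeholders; substitute your own.

```shell
# Grant the project-level Viewer primitive role to a user.
# "my-project" and "alice@example.com" are hypothetical placeholders.
gcloud projects add-iam-policy-binding my-project \
    --member="user:alice@example.com" \
    --role="roles/viewer"
```

The same command with roles/editor or roles/owner grants the other two primitive roles.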
There are six predefined roles. The dataViewer, dataEditor, and dataOwner roles are essentially the same as the primitive roles, except for two things: first, you can assign these roles to users for individual datasets, and second, they don’t give users permission to run jobs or queries. Those permissions can be granted through the user and jobUser roles. A jobUser can only start and cancel jobs. A user, on the other hand, can perform a variety of other tasks, such as creating datasets. The admin role grants all permissions.
Here’s how you can use predefined roles to give more fine-grained access than primitive roles in each of these situations:
- To give a different level of access to BigQuery than to other GCP resources in a project, use BigQuery roles, such as BigQuery Data Editor. Then also use predefined roles for any other GCP resources they need to access, such as App Engine Admin.
- To give a user or group a different level of permissions for a dataset in a project, click the down arrow next to the dataset and select “Share dataset”. Then add a user or group by putting in their email address. Then select one of the three roles (owner, editor, or viewer).
- To give job-running permissions without giving data access permissions, select either BigQuery User or BigQuery Job User. In most cases, you should give BigQuery User because it lets the user list datasets and tables as well as create new datasets. Job Users can only run jobs.
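To illustrate the first and third cases, here are hedged examples of assigning predefined BigQuery roles at the project level with gcloud. The project ID and member addresses are placeholders.

```shell
# Grant the BigQuery Data Editor predefined role to a group,
# without touching their access to other GCP resources.
gcloud projects add-iam-policy-binding my-project \
    --member="group:team1@example.com" \
    --role="roles/bigquery.dataEditor"

# Grant only job-running permission (no data access) to a user.
gcloud projects add-iam-policy-binding my-project \
    --member="user:bob@example.com" \
    --role="roles/bigquery.jobUser"
```

Dataset-level sharing (the second case) is done per dataset through the “Share dataset” dialog rather than through project-level IAM bindings.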
You might be wondering why you would want to separate data access permissions from job-running permissions, since most users would seem to need both. That’s true in many cases, so remember that you have to assign both types of roles for users to be able to look at data and run queries (unless you’re using primitive roles). But there are situations where you might want to grant only one or the other. For example, if you have an application that monitors the size of your tables, you might assign only the BigQuery Data Viewer role to its service account. That way, even if the application developers accidentally introduce a change that causes a query to run, it will be disallowed. Why does that matter? Because queries incur a cost, and a buggy program could potentially run up some hefty charges.
So far, I’ve only shown you how to set permissions at the project and dataset levels. There’s a way to set permissions at the table level and even to particular data within a table, but it’s a much more complicated process. The only way to do it is to use something called an authorized view.
We already created a view in the last lesson. What’s different about an authorized view is that it allows users to access the results of a query without giving them access to the tables that were queried. So, for example, if you didn’t want a particular group of users to have access to certain columns in a table, you could run a query that didn’t include those columns, and then save the results as an authorized view for that group.
To make this work, you need to perform four steps:
- First, create a separate dataset to store the view. You need to do this because if you were to put the view in the same dataset as the original tables, then the group would be able to access the tables too and not just the view.
- Second, create the view in the new dataset.
- Third, give the group read access to the dataset containing the view.
- Fourth, authorize the view to access the source dataset.
This assumes that you’ve already given the group permission to run queries in the project.
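As a rough sketch, those four steps can also be scripted with the bq command-line tool. The dataset, table, and column names below match the demo that follows; the group address and project ID are placeholders. There’s no single bq flag for authorizing a view, so steps 3 and 4 edit the datasets’ access lists through a JSON round-trip.

```shell
# 1. Create a separate dataset to hold the view.
bq mk shared_views

# 2. Create the view in the new dataset (columns assumed from the demo).
bq mk --use_legacy_sql=false \
    --view 'SELECT name, age, gender FROM examples.persons_data' \
    shared_views.persons_view

# 3. Give the group read access to the dataset containing the view.
bq show --format=prettyjson shared_views > shared_views.json
#    Manually add {"groupByEmail": "team1@example.com", "role": "READER"}
#    to the "access" array in shared_views.json, then apply the change:
bq update --source shared_views.json shared_views

# 4. Authorize the view to read the source dataset.
bq show --format=prettyjson examples > examples.json
#    Add {"view": {"projectId": "my-project", "datasetId": "shared_views",
#    "tableId": "persons_view"}} to the "access" array, then:
bq update --source examples.json examples
```

The demo below performs the same steps through the web console.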
Let’s say you wanted to give a group called “team1” access to only the name, age, and gender fields in the persons_data table. Before I start, I should mention that if you don’t currently have permission to assign roles to other users, then you won’t be able to do all of the steps I’m about to show you.
OK, first create the new dataset. Call it “shared_views”.
Then, run a query to select the name, age, and gender fields from the table.
Now click “Save View”. Change the dataset to “shared_views” and call the table “persons_view”.
Then click the down arrow to the right of shared_views and select “Share dataset”. In the menu on the left, select “Group by e-mail” since we’re giving permission to a group, not a user. Then fill in the email address of the team1 group. If you have a group you can assign, then use that one. Leave the permission on “Can view” and click Add. You might want to uncheck “Notify people via email” if you’re testing this with an actual group of people. Then click Save Changes.
Now that team1 has read access to the view, the only thing left to do is to give the view read access to the persons_data table. We need to do this because the view takes on the permissions of the person using it, and since team1 doesn’t have access to the persons_data table, they’d get an error if they tried to use this view.
Now we’re finally at the point where we’re going to turn this into an authorized view. In the menu next to the examples dataset, select “Share dataset”. In the menu at the left, select “Authorized View”. Then click “Select View”. Change the dataset to “shared_views” and put “persons_view” for the table. Click OK, click Add, and save the changes.
Now users in team1 will be able to run queries on this view, even though they don’t themselves have access to the persons_data table.
Before we go, you’ll probably want to delete everything you’ve created in this course. Fortunately, that will be very easy. Click the menu next to examples and select “Delete dataset”. Type in “examples” to confirm that you want to delete the dataset and all of the tables in it. Now do the same thing for the shared_views dataset.
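If you prefer the command line, the same cleanup can be done with bq (dataset names from this course’s examples):

```shell
# Delete the demo datasets and everything in them.
# -r removes all tables in the dataset; -f skips the confirmation prompt.
bq rm -r -f examples
bq rm -r -f shared_views
```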
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).