Bigtable is an internal Google database system that’s so revolutionary that it kickstarted the NoSQL industry. In the mid-2000s, Google had a problem: the web indexes behind its search engine had become massive, and rebuilding them took a long time. The company wanted to build a database that could deliver real-time access to petabytes of data. The result was Bigtable.
Google went on to use Bigtable to power many of its other core services, such as Gmail and Google Maps. Finally, in 2015, it made Cloud Bigtable available as a service that its customers could use for their own applications.
In this course, you will learn which of your applications could make use of Bigtable and how to take advantage of its high performance.
Learning Objectives
- Identify the best use cases for Bigtable
- Describe Bigtable’s architecture and storage model
- Optimize query performance through good schema design
- Configure and monitor a Bigtable cluster
- Send commands to Bigtable
Intended Audience
- Data professionals
- People studying for the Google Professional Data Engineer exam
Prerequisites
- Database experience
- Google Cloud Platform account (sign up for free trial at https://cloud.google.com/free if you don’t have an account)
The example code is at https://github.com/cloudacademy/cloud-bigtable-examples/tree/master/java/dataproc-wordcount.
To use Cloud Bigtable, you will almost certainly need to connect to it from another product or service. Bigtable integrates with Cloud Dataflow (Google’s big data processing system), Cloud Dataproc (Google’s service for running Hadoop and Spark jobs), and BigQuery (Google’s data warehouse). Bigtable also integrates with Hadoop even if it’s not running on Cloud Dataproc. That’s because Bigtable supports the Apache HBase API.
Bigtable also integrates with a number of other products, including JanusGraph, Terraform, OpenTSDB, and Heroic.
Since one common way to use Bigtable is as a datastore for Hadoop jobs, I’ll show you how to do that. If you want to follow along as I go through it and you don’t already have a GCP account, you can create a free trial account.
I’m going to go through Google’s example at this address. That’s a pretty long URL, so if you don’t want to type it in, you can find it near the bottom of the “About this course” tab below.
This example spins up a Dataproc cluster and runs a Hadoop MapReduce job on it. The job counts the number of occurrences of each word in The Iliad of Homer. So, for example, the MapReduce job determines that the word “Troy” appears 92 times in The Iliad. It stores all of the word counts in Bigtable.
To run this job, we need to perform quite a few steps, so here’s a summary of what we need to do. First, we need to make sure all of the necessary APIs are enabled. Then, we need to create a Bigtable cluster. Next, we’ll create a Cloud Storage bucket that Cloud Dataproc can use. After that, we’ll create the jar file for the Hadoop MapReduce job. Then we’ll create the Dataproc cluster, and finally run the job.
OK, first we need to make sure the appropriate APIs are enabled. In the Cloud Console menu, select “APIs & services”. Check the list for the Cloud Bigtable API, the Cloud Bigtable Admin API, and the Cloud Dataproc API. If one of them is not enabled, then you’ll need to click on “Enable APIs and Services”. In my case, the Cloud Bigtable API and the Cloud Dataproc API are enabled, but the Cloud Bigtable Admin API isn’t enabled. I’ll type “bigtable” in the search box...and then click on Cloud Bigtable Admin API. Now, I’ll click “Enable”.
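By the way, if you prefer the command line, you can enable the same three APIs from Cloud Shell instead. Here’s a quick sketch, assuming the standard service names for these APIs (you can confirm them with “gcloud services list --available” if anything has changed):

gcloud services enable bigtable.googleapis.com bigtableadmin.googleapis.com dataproc.googleapis.com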
OK, now that all of the APIs are enabled, we need to create a Bigtable cluster. Select “Bigtable” from the menu. Now click the “Create instance” button. I’ll set the instance name to “test”. The instance ID has to be at least 6 characters long, though, so I’ll make it “example”. Even though this’ll be a small workload, I’m going to create a production instance because that will give us more monitoring options later on.
It fills in the Cluster ID automatically. I’m going to set the zone to us-central1-b. Even though HDDs would be sufficient for this job, I’m still going to use SSDs because, as I mentioned earlier, you can’t change from HDDs to SSDs later if there’s a performance problem.
Alright, now click the “Create” button. It creates the cluster incredibly quickly.
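If you’d rather script this step, a roughly equivalent gcloud command is sketched below. The cluster ID “example-c1” is just a stand-in for whatever the console auto-fills, nodes=3 is the minimum for a production instance, and the flag syntax has changed across SDK releases, so check “gcloud bigtable instances create --help” before relying on it:

gcloud bigtable instances create example --display-name="test" --cluster-config=id=example-c1,zone=us-central1-b,nodes=3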
Now we need to create a Cloud Storage bucket that will be used by Cloud Dataproc. You can do this from the command line, but I’m going to do it from the console. Click “Create Bucket”. You have to choose a name that’s unique across all of Google’s Cloud customers, so the easiest way is to include the project ID, which is guaranteed to be unique. You can copy it from the home page of the console. Now I’ll paste it and add “-dataproc” since I’ll be using it with the Dataproc service.
We could leave this as Multi-Regional, but I’m going to run everything in one region, which will be cheaper and should help with performance. Now click “Create”...and that’s done.
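The command-line equivalent is a one-liner with gsutil, where [PROJECT_ID] is a placeholder for your own project ID and the -l flag puts the bucket in a single region:

gsutil mb -l us-central1 gs://[PROJECT_ID]-dataproc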
Now we need to create the jar file for the Hadoop MapReduce job. You can do this on your own desktop, but you’d have to install the Google Cloud SDK first. To make this as easy as possible, I’m going to use Cloud Shell instead because it already has the Cloud SDK installed on it. Just click the “Activate Google Cloud Shell” icon. It’s also a good idea to click “Open in new window” so you’ll be able to see more.
Now we need to download this repository to the Cloud Shell VM. Click on “cloud-bigtable-examples”. Then click the “Clone or download” button and copy the URL. Now, at the Cloud Shell prompt, type “git clone” and paste the github URL.
Then cd to cloud-bigtable-examples/java/dataproc-wordcount (you can hit the Tab key after typing the first few characters of each directory name and it’ll fill in the rest). To get the next command, open the editor, go to the File menu, and select “Refresh”. Then click the arrow next to cloud-bigtable-examples, then java, and then click on dataproc-wordcount. If you scroll down a bit, you should see a Maven command. Copy and paste the first part of it. Then fill in your project ID. Then add a space and copy the next argument. Now type “example” or whatever Bigtable instance ID you used. It’ll take a couple of minutes, so I’ll fast forward.
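Putting those steps together, the Cloud Shell session looks roughly like this. The Maven property names (bigtable.projectID and bigtable.instanceID) come from the example’s README at the time of recording, so verify them against the repository if the build fails, and replace [PROJECT_ID] with your own project ID:

git clone https://github.com/cloudacademy/cloud-bigtable-examples.git
cd cloud-bigtable-examples/java/dataproc-wordcount
mvn clean package -Dbigtable.projectID=[PROJECT_ID] -Dbigtable.instanceID=example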
OK, it’s done. Now we have to create the Dataproc cluster using the cluster.sh script. You don’t actually need to type this chmod command. Copy and paste the first part of the command, then copy your bucket name from Cloud Storage, then put in “dp” for the cluster name, and then the zone where you created your Bigtable cluster. I created mine in “us-central1-b”. Now it’s spinning up one Dataproc master node and four worker nodes. I’ll fast forward again.
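For reference, the full command I ran looks something like the line below, with [BUCKET_NAME] standing in for the bucket you created earlier. The argument order comes from the repository’s cluster.sh script, so check its usage message if yours differs:

./cluster.sh create [BUCKET_NAME] dp us-central1-b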
Once it’s done, go to the Dataproc console and you should see the dp cluster you created.
Now go back to Cloud Shell and copy and paste the first part of this command and then the cluster name, which is “dp”. This starts the MapReduce job. It should take about a minute to finish.
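In full, that command is essentially this (again, check the script’s usage message if the subcommand name has changed in the repository):

./cluster.sh start dp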
While it’s running, we can take a look at what’s happening with Bigtable. Go back to the Bigtable console and click on the instance. It shows the CPU utilization graph, but it’s still at zero percent. That’s because the monitoring data takes a little while to catch up with what we’ve been doing.
In this dropdown, you can see lots of different options for the kinds of graphs you can bring up. Have a look at “Write requests”. It’s showing some data now, but the write rate is already falling, so the job must be almost finished. Let’s see what else is available.
If this had been a development instance of Bigtable, then these CPU utilization options wouldn’t be available. Let’s try “CPU utilization of hottest node”. Well this job obviously didn’t put much stress on the Bigtable cluster.
If you see a significantly higher CPU usage on your hottest node than the average on the CPU utilization graph, then that means you have a hotspot due to reads and/or writes not being evenly distributed between the nodes in your cluster. In that situation, Bigtable will attempt to redistribute some of the data to other nodes in the cluster.
If you don’t want to check up on Bigtable’s performance manually, then you can use Stackdriver Monitoring to watch the CPU usage on your cluster and automatically add more nodes if the CPU utilization gets too high. You can also get it to remove nodes if the CPU utilization falls.
Google provides some code examples to do this in Java and Python. They’re at this URL. If you decide to use automatic scaling like this, then bear in mind that it can take up to 20 minutes after adding more nodes before you’ll see a significant improvement in performance. If your cluster only experiences short bursts of high utilization, then adjusting the number of nodes will just cost more money without giving any additional performance.
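Google’s autoscaling samples at that URL are, under the hood, just resizing the cluster. As a point of reference, here’s what a manual resize looks like from the command line, using the demo’s instance and the hypothetical cluster ID from earlier (flag names can vary by SDK version, so check the help text):

gcloud bigtable clusters update example-c1 --instance=example --num-nodes=5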
Speaking of money, how much does Bigtable cost? Its pricing is based on three components: nodes, storage, and network. Each node costs 65 cents per hour. Storage costs 17 cents per gigabyte per month for SSDs and only 2.6 cents per gigabyte per month for HDDs. Network ingress is free. Network egress between regions costs 1 cent per gigabyte within the US and much more across continents.
Here’s an example. Suppose you run a 5-node cluster for one month and have 10 terabytes of data on SSD storage. Also suppose that all of your network traffic is within the same region. Then the node cost for one month would be 5 nodes times 65 cents per hour times 24 hours per day times 30 days equals $2,340. The storage cost would be 10 terabytes times 1,024 gigabytes per terabyte times 17 cents per gig equals $1,740.80. Network traffic would be free since it’s all in the same region. So the total cost for the month would be $4,080.80. Bigtable is not cheap. So I guess we’d better check the results of the job and then shut down our Bigtable instance.
If you go back to Cloud Shell, you’ll see that the job has finished. To take a look at what it wrote to Bigtable, there are two options: the HBase shell and the cbt command. They both take a bit of work to set up, but I’m going to show you the HBase shell, since HBase is an open standard.
Unfortunately, to make it work with Bigtable, you have to download a Maven project and build it. You’ll find the command to download it in the course.md file in the github repository. Here it is. Now unzip the file it downloaded. To build it and start the shell, copy this command. Whoops, I needed to go into the quickstart directory first.
It’ll ask you for your Bigtable instance ID. Type “example”. OK, we’re finally at the HBase shell. Now, to get a list of the tables in the instance, type “list”. There should only be one table, and its name will be “WordCount-” followed by a number.
Let’s get a sample of the rows in the table. Type “scan” and then copy and paste the table name, including the quotes. Then copy this. Make sure you include the comma. This limits the scan to 30 rows. If you don’t put in this limit, then it’ll print every row in the table, which would be thousands of them. It’s a list of words and the number of times each one occurs.
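Put together, the shell session looks something like this. The numeric suffix on the table name is a made-up placeholder, so use whatever name “list” actually prints:

# show the tables in the instance
list
# sample 30 rows from the word-count table (replace the name with the one "list" printed)
scan 'WordCount-12345', {LIMIT => 30}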
To see what other HBase commands are available, have a look at this web page.
Alright, since it’s pretty expensive to run a Bigtable cluster and a Dataproc cluster, we should shut them both down. To remove the Dataproc cluster, first type “exit” to get out of the HBase shell. Then go back to the directory we were in before...and type “./cluster.sh delete dp”. It’ll take a little while to delete all of the VMs in the Dataproc cluster.
To remove the Bigtable instance, go back to the Bigtable console...and click “Delete Instance”. To verify that you really do want to delete it, you have to type in the instance name, “example”. Then click “Delete”.
OK, now the only thing left is the Cloud Storage bucket. The amount of data in this bucket is pretty small, so the monthly cost will be negligible, but if you won’t be using Cloud Dataproc again, you should probably delete it.
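For reference, all of this cleanup can also be done from Cloud Shell. The bucket name below assumes the [PROJECT_ID]-dataproc naming used earlier, and the cluster.sh command has to be run from the dataproc-wordcount directory:

./cluster.sh delete dp
gcloud bigtable instances delete example
gsutil rm -r gs://[PROJECT_ID]-dataproc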
And that’s it for this demo.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).