Bigtable is an internal Google database system that was so revolutionary that it kickstarted the NoSQL industry. In the mid-2000s, Google had a problem. The web indexes behind its search engine had become massive, and rebuilding them took a long time. The company wanted to build a database that could deliver real-time access to petabytes of data. The result was Bigtable.
Google went on to use Bigtable to power many of its other core services, such as Gmail and Google Maps. Finally, in 2015, it made Cloud Bigtable available as a service that its customers could use for their own applications.
In this course, you will learn which of your applications could make use of Bigtable and how to take advantage of its high performance.
- Identify the best use cases for Bigtable
- Describe Bigtable’s architecture and storage model
- Optimize query performance through good schema design
- Configure and monitor a Bigtable cluster
- Send commands to Bigtable
- Data professionals
- People studying for the Google Professional Data Engineer exam
- Database experience
- Google Cloud Platform account (sign up for a free trial at https://cloud.google.com/free if you don’t have an account)
The example code is at https://github.com/cloudacademy/cloud-bigtable-examples/tree/master/java/dataproc-wordcount.
To use Bigtable effectively, it’s important to understand its architecture and storage model.
You can tell that Bigtable is designed for large amounts of data when you look at this architecture diagram.
When a client sends a request to Bigtable, it first goes through the front-end server pool, which is basically a security layer. Then the request gets sent to a node in a Bigtable cluster. The node then reads data from or writes data to Colossus, which is Google’s filesystem.
This architecture has a couple of interesting characteristics. First, you can easily scale the total throughput and number of simultaneous requests handled by adding more nodes to the cluster. Second, since the nodes don’t hold any of the data, if a node goes down, there’s no data loss.
OK, so what do the nodes actually do if they don’t hold data? Well, they hold pointers to specific subsets of the table. Bigtable divides (or shards) a table into groups of rows. Each group of rows is called a tablet. Tablets are stored on Colossus in SSTable format. Each tablet is associated with a particular node.
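To make the idea of tablets and pointers a bit more concrete, here is a minimal Python sketch. It is a toy model, not how Bigtable is actually implemented: the split points, the node names, and the helper functions `tablet_for` and `node_for` are all made up for illustration. It assumes each tablet covers a contiguous, sorted range of row keys and that each node holds only a list of tablet ids.

```python
from bisect import bisect_right

# Toy model: each tablet covers a contiguous, sorted range of row keys.
# TABLET_STARTS[i] is the first row key of tablet i (hypothetical split points).
TABLET_STARTS = ["", "g", "n", "t"]

# Nodes hold only pointers (tablet ids), never the rows themselves.
NODE_TABLETS = {"node-a": [0, 1], "node-b": [2, 3]}


def tablet_for(row_key: str) -> int:
    """Return the index of the tablet whose key range contains row_key."""
    return bisect_right(TABLET_STARTS, row_key) - 1


def node_for(row_key: str) -> str:
    """Return the node that currently points at the tablet for row_key."""
    tablet = tablet_for(row_key)
    for node, tablets in NODE_TABLETS.items():
        if tablet in tablets:
            return node
    raise KeyError(f"no node serves tablet {tablet}")
```

In this sketch, `node_for("gwashington")` resolves to the node that points at the second tablet. The rows themselves never appear in `NODE_TABLETS`, which mirrors the point that nodes don't hold the data.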
The beauty of using pointers to the data is that Bigtable can quickly rebalance the cluster load by moving pointers from one node to another. For example, if a majority of the requests are going through one particular node, then Bigtable can move the pointers for some of the tablets from that node to the other nodes. The overall performance will improve because the nodes are now sharing the load fairly equally.
It can even do this if an individual tablet is getting a disproportionate number of requests. In this case, it will split the tablet in two, and then move the pointer for one of the new tablets to another node. There are limits to its ability to rebalance, though, based on how well you design your schema, which we’ll cover in the next lesson.
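Here is a small Python sketch of the rebalancing idea described above. It is a simplified illustration, not Bigtable's actual algorithm: the node names, tablet names, request counts, and the functions `split_tablet` and `move_pointer` are all hypothetical. It also assumes that requests are spread evenly across a hot tablet's rows, so splitting it divides the load in half. That assumption is exactly what a poorly designed schema can break.

```python
# Toy model: node -> {tablet name: recent request count}.
# All names and numbers are made up for illustration.
cluster = {
    "node-a": {"t1": 900, "t2": 50},
    "node-b": {"t3": 40},
}


def load(node: str) -> int:
    """Total recent requests served through a node."""
    return sum(cluster[node].values())


def split_tablet(node: str, tablet: str) -> None:
    """Split a hot tablet into two halves that stay on the same node."""
    requests = cluster[node].pop(tablet)
    cluster[node][tablet + "-lo"] = requests // 2
    cluster[node][tablet + "-hi"] = requests - requests // 2


def move_pointer(tablet: str, src: str, dst: str) -> None:
    """Reassign a tablet to another node; only the pointer moves."""
    cluster[dst][tablet] = cluster[src].pop(tablet)


# Tablet t1 is hot, so split it and hand one half to the quieter node.
split_tablet("node-a", "t1")
move_pointer("t1-hi", "node-a", "node-b")
```

Before the split, node-a carries almost all of the load. Afterward, the two nodes are roughly even, and no row data had to be copied between them in this model.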
Alright, now that you’ve seen the architecture, let’s move on to the data model. I’ll take you through an example. Suppose you have a table of stock data after a stock exchange has closed for the day. Each stock has a stock symbol, a last sale price, the number of shares sold in that transaction (called “last size”), and when the trade occurred.
In Bigtable, you could represent this data with a structure that looks like this. Each record takes up a row in the table. The row key is the stock symbol for a company. Each record has 3 columns with data related to the row key: last sale, last size, and trade time. So far, this should look very familiar because relational database tables look like this too.
One difference is that Bigtable has what it calls “column families”. These are groups of related columns. This example only has one column family, called “TRADE”.
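If it helps to see the stock table as a data structure, here is a rough Python sketch using nested dictionaries. This is only a mental model, not the Bigtable client API. The stock symbols, the values, the qualifier spellings, and the `read_cell` helper are invented for illustration.

```python
# Row key -> column family -> column qualifier -> value.
# Symbols and values below are invented for illustration.
stocks = {
    "AAPL": {
        "TRADE": {"last_sale": 150.25, "last_size": 100, "trade_time": "16:00:00"},
    },
    "GOOG": {
        "TRADE": {"last_sale": 2750.10, "last_size": 25, "trade_time": "15:59:58"},
    },
}


def read_cell(table, row_key, family, qualifier):
    """Look up one cell by row key, column family, and column qualifier."""
    return table[row_key][family][qualifier]
```

Every lookup starts from the row key, then narrows down by column family and column qualifier, for example `read_cell(stocks, "AAPL", "TRADE", "last_size")`.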
To illustrate some other differences, let’s look at another example. Suppose you have a table that tracks which users are following which other users on a social network like Twitter. This table structure is radically different from what you’d use in a relational database. There’s a row for each user and a column for each user. This is known as a wide table, because of the potentially large number of columns. In contrast, the stock market table was a tall table, because each row is quite narrow (with only 3 data columns in this case), but there are a large number of rows.
OK, back to the wide table example. If a particular user, such as “gwashington”, follows another user, such as “jadams”, then there will be a 1 in that cell. If gwashington doesn’t follow a user, then the cell is empty.
Here’s one of the big differences compared to relational databases. If a cell is empty, then it doesn’t take up any space in the database. That’s because tables in Bigtable are sparse. This means, for example, that if a row with a thousand columns only has data in one column, then that row will only take up the space required for that one piece of data. That’s why you can get away with having seemingly crazy table structures like this one, where you could potentially have millions of columns.
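A quick way to picture sparseness is a dictionary that only stores the cells that have a value. The Python sketch below is a toy model under that assumption, not Bigtable's storage format. The user "tjefferson" and the helper functions are hypothetical additions for the example.

```python
# Sparse "follows" table: only non-empty cells are stored.
follows = {}


def follow(user: str, other: str) -> None:
    """Record that `user` follows `other` by writing a 1 into that cell."""
    follows.setdefault(user, {})[other] = 1


def is_following(user: str, other: str) -> bool:
    """An absent cell means 'not following' and takes up no space."""
    return other in follows.get(user, {})


def stored_cells() -> int:
    """Count the cells that actually occupy space."""
    return sum(len(columns) for columns in follows.values())


follow("gwashington", "jadams")
follow("jadams", "tjefferson")
```

Even though this table could conceptually have a column for every user, only the two cells that were written exist in the sketch. Asking whether gwashington follows tjefferson simply finds nothing there.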
Bigtable makes this work by storing both the column qualifier (that is, the column name) and the column value in a row. So each row is simply a list of key/value pairs, where a key is the column family plus the column qualifier. There’s actually one other piece of information in the key as well, a timestamp.
That might seem like a weird thing to include in a key, but this gives Bigtable the ability to store multiple copies of each cell. Every time the value in a cell changes, it stores the new value along with the timestamp of when it changed. The old versions don’t stay around forever, but multiple versions do exist for a while.
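Here is a minimal Python sketch of a row as a list of key/value pairs with timestamps in the keys. It is an illustration only. In particular, the `MAX_VERSIONS` limit is an assumption made for this example to show old versions eventually going away, and the integer timestamps and function names are hypothetical.

```python
# One row as key/value pairs, where each key is
# (column family, column qualifier, timestamp).
row = {}

MAX_VERSIONS = 2  # hypothetical retention limit for this sketch


def write_cell(family, qualifier, timestamp, value):
    """Store a new version, then drop the oldest versions beyond the limit."""
    row[(family, qualifier, timestamp)] = value
    versions = sorted(
        (key for key in row if key[:2] == (family, qualifier)),
        key=lambda key: key[2],
        reverse=True,
    )
    for stale in versions[MAX_VERSIONS:]:
        del row[stale]


def read_latest(family, qualifier):
    """Return the value with the newest timestamp for a column."""
    newest = max(
        (key for key in row if key[:2] == (family, qualifier)),
        key=lambda key: key[2],
    )
    return row[newest]


# Three writes to the same cell; only the two newest versions survive.
write_cell("TRADE", "last_sale", 1, 10.0)
write_cell("TRADE", "last_sale", 2, 10.5)
write_cell("TRADE", "last_sale", 3, 11.0)
```

Because the timestamp is part of the key, each write adds a new entry instead of overwriting the old one. Reading the cell means picking the entry with the newest timestamp.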
That’s it for the data model. In the next lesson, I’ll show you how the architecture and data model affect your schema design.
Guy launched his first training website in 1995 and he’s been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).