The course is part of these learning paths
Bigtable is an internal Google database system that’s so revolutionary that it kickstarted the NoSQL industry. In the mid 2000s, Google had a problem. The web indexes behind its search engine had become massive and it took a long time to keep rebuilding them. The company wanted to build a database that could deliver real-time access to petabytes of data. The result was Bigtable.
Google went on to use Bigtable to power many of its other core services, such as Gmail and Google Maps. Finally, in 2015, it made Cloud Bigtable available as a service that its customers could use for their own applications.
In this course, you will learn which of your applications could make use of Bigtable and how to take advantage of its high performance.
- Identify the best use cases for Bigtable
- Describe Bigtable’s architecture and storage model
- Optimize query performance through good schema design
- Configure and monitor a Bigtable cluster
- Send commands to Bigtable
- Data professionals
- People studying for the Google Professional Data Engineer exam
- Database experience
- Google Cloud Platform account (sign up for free trial at https://cloud.google.com/free if you don’t have an account)
The example code is at https://github.com/cloudacademy/cloud-bigtable-examples/tree/master/java/dataproc-wordcount.
Since Bigtable’s greatest strength is handling large volumes of data at low latency, you need to carefully design your schema so you don’t accidentally erase its speed advantage.
The most important decision to make is what to use for a row key. There are many reasons for this. First, a table has only one index and that’s the row key. Rows are sorted by the row key. If you try to query a table using anything else, then it will have to do a full table scan, which would be very time consuming.
Second, as you saw in the last lesson, the rows of a table are grouped into tablets. Ideally, you want reads and writes to be distributed fairly evenly among the tablets, so an individual node doesn’t get swamped by requests.
Third, if you have identical chunks of data next to each other on adjacent rows, then Bigtable can compress the data.
Fourth, short row keys take up less memory and storage and result in faster response times.
Let’s see how this applies to the stock exchange example. Suppose instead of just containing the last trade of the day for each stock, the table contains all of the trades throughout the day for the past year.
The first thing to consider is what queries will typically be used on this table. Let’s say that most of the queries ask for all of the trades for a given stock for the past day. Since the row key is the only index, we should definitely use the stock symbol in the row key. But if we only include the stock symbol, then Bigtable would have to do a full table scan to retrieve all of the stock’s trades for the day because the dates wouldn’t be sorted.
The simple solution is to include the timestamp as part of the row key, like this. In fact, it’s very common to include multiple pieces of data in a row key. But you’d almost never do this with a relational database.
The next thing to check is that reads and writes would be evenly distributed across nodes. If we were to put the timestamp first in the row key, then we’d have a hotspot, because as the trades came in from the stock exchange, all of the rows would be written sequentially through the same node, since the table would be sorted by date. We’d have the same problem for reads because most of our queries would be for the previous day, which again, would be linked to the same node.
Putting the stock symbol first, on the other hand, would distribute reads and writes evenly between nodes. For example, the data for stock symbol ACME would be on the first node and the data for ZIPP would be on the last node. As the trades came in from the stock exchange, they would be written on the node where the stock symbol resides, like this.
Next, let’s look at compression efficiency. With this row key, we’d have identical patterns of data at the front of the key for lots of adjacent rows, so that’s a plus.
Finally, how about the size of the row key? Well, it may look kind of long because we’ve concatenated two pieces of data together, but really, it’s still relatively short, especially because the stock symbol is limited to four characters.
Now that we have a row key, what do we do with the columns? In this example, not much. The main decision to make is how to group them in column families. Generally speaking, the columns in a column family are stored together, accessed together, and cached together, so you can make queries more efficient by grouping related columns into column families.
In this case, though, we’ll usually be retrieving all of the columns when we perform a query, so we can put them all in the same column family. It doesn’t matter too much what we call it, but it’s a good idea to keep it short, so let’s stick with “TRADE”.
Now let’s look at a slightly more complicated example. Suppose you need to collect about 100 server metrics, such as CPU, memory, and disk usage, every minute, from a large number of servers in your datacenter. Also suppose that there will be two common types of queries on this data. The first is a request for the metrics for a given machine over a certain period of time. The second is a request for the latest metrics for all of the machines.
So what should we use for a row key? Since the first type of query requests metrics from a particular server, the row key should start with the server name. And since it requests metrics from a particular period of time, we should include the timestamp.
However, if we use those two pieces of data for the row key, then it will be difficult to run the second query efficiently. That’s because it would need to retrieve one row of data (the latest one) for every single server, which would be spread out over the entire table.
In situations like this, the best solution is usually to create a separate table with a different row key, and possibly only a subset of the data. In this case, we can create a new table called “CURRENT_METRICS” that contains a single row (the latest one) for each server. Then the row key could simply be the server name.
There’s a slight problem with this table, though. Since each row is supposed to hold the latest data for a server, you’d have to overwrite the data in every row every minute because that’s how often we collect new metrics.
This isn’t a real problem, though, because Bigtable supports single-row transactions. That is, if you update multiple columns in a row at the same time, Bigtable will guarantee that all of the column updates will either succeed or fail together. This is also known as “atomicity”, which is what the ‘A’ stands for in the “ACID” acronym. Bigtable doesn’t guarantee atomicity for multi-row transactions, though, which is one of the reasons why you can’t use it as a relational database.
OK, so we can reliably update each row every minute. Now, will these two row keys balance the load between the nodes in the cluster? Yes. In the case of the first table, which we’ll call “METRICS”, the records should be more or less randomly distributed because the key starts with the server name rather than the timestamp.
Similarly, since the CURRENT_METRICS table uses only the server name as the row key, writes should be randomly distributed as the latest metrics come in for the various servers. Reads will be evenly distributed too, but in a different way, because the query this table is designed for is to read all of the server metrics. That will result in a full table scan, which we would normally try to avoid, but there’s no way around it in this case.
Now, how about the columns? Since many of the queries are likely to request only a subset of the 100 or so metrics for each server, it would probably be a good idea to group them into column families. For example, the various CPU metrics should be in a column family called “CPU” and the various memory metrics should be in a column family called “MEMORY”.
On the other hand, if your queries will almost always request all of the metrics instead of a subset of them, then you should use one column family for all of the columns, because that will improve performance.
Alright, now I’ll go over a few more things to keep in mind when you’re designing a row key. The most common issue for time series tables, (that is, where each record has a timestamp), is hotspotting. The technique we used to resolve hotspotting on the stock market and server metrics examples was to move one of the data fields to the beginning of the row key, such as moving the stock symbol or the server name into the row key. This is called “field promotion” and it’s almost always the best solution to hotspotting in time series tables.
There are two more techniques that are occasionally used, but they have major disadvantages. The first is salting. The idea is that you add a bit of salt to each key, which just means to add a number. For example, you could add either a 1, 2, or 3 to each key. One way to ensure randomness would be to use a hash function of the timestamp that only resulted in one of those three numbers.
By putting one of these numbers before the timestamp, you would randomly spread the data into three different piles. If you had 3 nodes in your Bigtable cluster, then this would work well. But if you had to scale up to more nodes, then this salting design might not work as well. Of course, you could use more than 3 numbers and hope that it works well in most cases, but there’s another, potentially bigger issue.
If a query requested multiple records over a time period, then your code would have to run separate scans for each salt value and combine the results, which would be a real pain.
Another technique is to randomize the keys by applying a hash function to the timestamp and using that, instead of the timestamp itself, as part of the key. This would distribute your data nicely across the cluster, but it would be even more problematic than using salting. That’s because there would be no way to query a time range. You could only query a single row or do a full table scan.
As you can see, field promotion is by far the best solution to hotspotting for time series tables, so only use salting or randomization if field promotion doesn’t work for some reason.
I don’t want to give you the impression that using hash functions is always a bad idea, though. They can be very useful in many cases, just not with timestamps. For example, if you want to start a row key with a username, followed by another field, such as a follower of that user, then you may want to put a hash of the username into the key instead of the username itself, and also do the same with the username of the follower. That way, each portion of the key would always be the same length, which could be helpful.
This wouldn’t cause the problem we saw with hashing a timestamp because you’d likely be querying records for individual usernames instead of for a range of usernames.
One type of row key to avoid is domain names. That’s because most organizations have a hierarchy of domain names rather than just one, but they won’t be grouped together in the table when they’re sorted. The solution is to reverse the domain names. That way, all of the domains for an organization will be grouped together, like this.
OK, although Bigtable is massively scalable, it does still have some limits. First, here are the recommended limits. Row keys should be less than 4K in size. You shouldn’t have more than about 100 column families per table. Column qualifiers should be less than 16K. And the next two are important. Individual values should be less than 10 meg per cell and all values in a single row should be less than 100 meg.
Bear in mind that these are recommended limits only. Here are the hard limits. Individual values cannot be larger than 100 meg per cell. All values in a single row cannot be larger than 256 meg. And you cannot have more than 1,000 tables per cluster.
And that’s it for Bigtable schema design.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).