Bigtable is an internal Google database system that was so revolutionary that it kickstarted the NoSQL industry. In the mid-2000s, Google had a problem. The web indexes behind its search engine had become massive, and rebuilding them took a long time. The company wanted to build a database that could deliver real-time access to petabytes of data. The result was Bigtable.
Google went on to use Bigtable to power many of its other core services, such as Gmail and Google Maps. Finally, in 2015, it made Cloud Bigtable available as a service that its customers could use for their own applications.
In this course, you will learn which of your applications could make use of Bigtable and how to take advantage of its high performance.
Learning Objectives
- Identify the best use cases for Bigtable
- Describe Bigtable’s architecture and storage model
- Optimize query performance through good schema design
- Configure and monitor a Bigtable cluster
- Send commands to Bigtable
Intended Audience
- Data professionals
- People studying for the Google Professional Data Engineer exam
Prerequisites
- Database experience
- Google Cloud Platform account (sign up for free trial at https://cloud.google.com/free if you don’t have an account)
The example code is at https://github.com/cloudacademy/cloud-bigtable-examples/tree/master/java/dataproc-wordcount.
There are a few things to consider when configuring a Bigtable cluster that will affect its performance. When everything’s running smoothly and you’re using SSDs, you can expect to see performance numbers like these: 10,000 queries per second and 6-millisecond latency for both reads and writes, and 220 MB per second of throughput on scans. These numbers are per node, and Bigtable scales linearly, so if you were running 10 nodes, then your queries per second and scan throughput would typically be 10 times these numbers, which is pretty incredible.
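To make the linear-scaling claim concrete, here is a minimal sketch that multiplies the per-node SSD figures quoted above by a node count. The function and constant names are made up for illustration, and the result is a rough planning estimate under the assumption of perfectly linear scaling, not a guarantee.

```python
# Per-node SSD figures quoted in the lecture (approximate, best-case numbers).
SSD_QPS_PER_NODE = 10_000      # reads or writes per second, per node
SSD_SCAN_MB_PER_NODE = 220     # MB per second of scan throughput, per node


def estimate_ssd_cluster(nodes: int) -> dict:
    """Estimate cluster-wide throughput, assuming perfectly linear scaling."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return {
        "qps": nodes * SSD_QPS_PER_NODE,
        "scan_mb_per_sec": nodes * SSD_SCAN_MB_PER_NODE,
    }


if __name__ == "__main__":
    # Hypothetical 10-node cluster, matching the example in the text.
    print(estimate_ssd_cluster(10))
```

Note that only throughput is multiplied. Adding nodes does not shorten the roughly 6-millisecond latency of an individual request.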
You’ll notice that the performance of magnetic hard drives (or HDDs) is not nearly as good, especially for reads. You can only get 5% as many reads per second, and the latency is 33 times as long. You can get just as many writes per second, but the latency is still 8 times as long. These numbers are for reading or writing a single row. If you scan multiple rows, then the throughput isn’t much lower than with SSDs.
But, overall, the HDD performance is really bad. So why would you ever use HDDs with Bigtable? Well, in most cases, you shouldn’t. Although they’re less than one-sixth the cost of SSDs, you’d have to use way more nodes in your cluster to get even close to the performance of SSDs, and the extra cost for the nodes would far outweigh the cost savings from storage.
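The trade-off between cheap storage and extra nodes can be sketched with some back-of-the-envelope arithmetic. The read rates below come from the figures above (10,000 reads per second per SSD node, and 5% of that for HDD). The prices, the workload, and all function names are placeholders invented for illustration, not real Google Cloud pricing, so substitute current numbers before drawing conclusions.

```python
import math

# Per-node read rates from the lecture: HDD delivers 5% of the SSD figure.
SSD_READ_QPS_PER_NODE = 10_000
HDD_READ_QPS_PER_NODE = 500

# PLACEHOLDER prices in arbitrary currency units per month (assumptions,
# NOT real pricing). HDD storage is set below one-sixth of the SSD price.
NODE_PRICE = 450.0
SSD_PRICE_PER_GB = 0.18
HDD_PRICE_PER_GB = 0.025


def nodes_for_read_qps(target_qps: int, qps_per_node: int) -> int:
    """Smallest node count that can serve the target read rate."""
    return max(1, math.ceil(target_qps / qps_per_node))


def monthly_cost(nodes: int, storage_gb: float, price_per_gb: float) -> float:
    """Total monthly cost: nodes plus storage."""
    return nodes * NODE_PRICE + storage_gb * price_per_gb


if __name__ == "__main__":
    target_qps, storage_gb = 20_000, 5_000  # hypothetical workload
    ssd_nodes = nodes_for_read_qps(target_qps, SSD_READ_QPS_PER_NODE)
    hdd_nodes = nodes_for_read_qps(target_qps, HDD_READ_QPS_PER_NODE)
    print("SSD:", ssd_nodes, "nodes,", monthly_cost(ssd_nodes, storage_gb, SSD_PRICE_PER_GB))
    print("HDD:", hdd_nodes, "nodes,", monthly_cost(hdd_nodes, storage_gb, HDD_PRICE_PER_GB))
```

For this hypothetical workload, the SSD cluster needs 2 nodes and the HDD cluster needs 40, so under these placeholder prices the node bill dwarfs whatever is saved on storage.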
The only time you might consider using HDDs is if you’re storing more than 10 terabytes of data, your application is not sensitive to response time, and you’re mostly running batch scans and writes, rather than random reads of individual rows. But frankly, if that’s the type of application you’re planning to build, then you’d probably be better off using something like BigQuery instead anyway.
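The three conditions above can be written down as a simple checklist. This is only a sketch of the rule of thumb from the lecture, with a hypothetical function name and parameters. It is not an official sizing tool, and even a result of True only means HDDs are worth considering.

```python
def might_consider_hdd(data_tb: float,
                       latency_sensitive: bool,
                       mostly_batch_scans_and_writes: bool) -> bool:
    """Return True only if all three conditions from the lecture hold."""
    return (
        data_tb > 10                       # more than 10 TB of data
        and not latency_sensitive          # response time is not critical
        and mostly_batch_scans_and_writes  # few random single-row reads
    )


if __name__ == "__main__":
    # Hypothetical workloads for illustration.
    print(might_consider_hdd(50, latency_sensitive=False, mostly_batch_scans_and_writes=True))
    print(might_consider_hdd(50, latency_sensitive=True, mostly_batch_scans_and_writes=True))
```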
There’s one more really important thing to keep in mind. Once you choose HDDs, you can’t switch to SSDs later. To make a change, you’d have to export all of the data, create a new SSD-based instance, and then import all the data into the new one. And that wouldn’t be much fun.
OK, we’ve covered two possible causes of poor performance in Bigtable. The first was choosing a poor row key and the second was using HDDs. There are lots of other potential sources of performance problems too.
If you have a small amount of data or access it for a short period of time, then your performance will be much lower than the numbers I showed earlier. Bigtable is designed for large workloads.
You could also have the opposite problem of too much work for too few nodes. To prevent this from happening, you need to monitor Bigtable’s performance and add more nodes when the existing ones are overloaded. You can do that manually through the console or programmatically using Stackdriver Monitoring (since renamed Cloud Monitoring). I’ll show you how to monitor Bigtable later.
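As a rough illustration of the "add nodes when overloaded" idea, the sketch below computes how many nodes a cluster would need to bring average CPU utilization back under a target. The 70% target, the node cap, and the function name are assumptions chosen for the example. Reading the actual metric from the monitoring service and resizing the cluster are deliberately left out.

```python
import math


def recommended_node_count(current_nodes: int,
                           avg_cpu_utilization: float,
                           target_utilization: float = 0.70,  # assumed target
                           max_nodes: int = 30) -> int:       # assumed cap
    """Suggest a node count that brings average CPU under the target.

    Only scales up; it never suggests removing nodes.
    """
    if current_nodes < 1:
        raise ValueError("current_nodes must be at least 1")
    if not 0.0 <= avg_cpu_utilization <= 1.0:
        raise ValueError("avg_cpu_utilization must be between 0 and 1")
    if avg_cpu_utilization <= target_utilization:
        return current_nodes  # not overloaded, nothing to do
    # Assume load spreads evenly, so utilization falls in proportion to nodes.
    needed = math.ceil(current_nodes * avg_cpu_utilization / target_utilization)
    return max(current_nodes, min(needed, max_nodes))


if __name__ == "__main__":
    # Hypothetical reading: 3 nodes running at 90% average CPU.
    print(recommended_node_count(3, 0.90))
```

In a real setup, the utilization value would come from the monitoring service, and the suggested count would be applied through the console or an API call.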
If you do add more nodes to your cluster, bear in mind that it will take up to 20 minutes under load before you’ll see an improvement in performance.
Another way to end up with too few nodes is to run a development instance, which only has one node. Fortunately, you can upgrade an existing instance from development to production. You can’t downgrade from production to development, though.
Finally, you could be having network issues. One reason for this would be if your clients are running in a different zone from your Bigtable cluster. You should always have them in the same zone, if possible.
Alright, that’s it for cluster configuration.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).