Bigtable is an internal Google database system that’s so revolutionary that it kickstarted the NoSQL industry. In the mid-2000s, Google had a problem. The web indexes behind its search engine had become massive, and rebuilding them took a long time. The company wanted to build a database that could deliver real-time access to petabytes of data. The result was Bigtable.
Google went on to use Bigtable to power many of its other core services, such as Gmail and Google Maps. Finally, in 2015, it made Cloud Bigtable available as a service that its customers could use for their own applications.
In this course, you will learn which of your applications could make use of Bigtable and how to take advantage of its high performance.
Learning Objectives
- Identify the best use cases for Bigtable
- Describe Bigtable’s architecture and storage model
- Optimize query performance through good schema design
- Configure and monitor a Bigtable cluster
- Send commands to Bigtable
Intended Audience
- Data professionals
- People studying for the Google Professional Data Engineer exam
Prerequisites
- Database experience
- Google Cloud Platform account (sign up for free trial at https://cloud.google.com/free if you don’t have an account)
The example code is at https://github.com/cloudacademy/cloud-bigtable-examples/tree/master/java/dataproc-wordcount.
As powerful as Bigtable is, it’s not a good choice for every application. That’s because Google doesn’t build “normal” applications. It builds applications that typically have over a billion users and are focused on organizing the world’s information. Those aren’t the only kinds of applications that will be well served by Bigtable, of course, but it gives you an idea of the types of services it’s meant for.
So what should Bigtable be used for? In a nutshell, it should be used for low-latency access (that is, fast access) to big data. Bigtable has some characteristics that will help you decide whether or not it’s a good fit for a particular application.
First, it’s only a good solution if you have at least one terabyte of data. For smaller amounts of data, the overhead is too high.
Second, Bigtable’s performance will suffer if you store individual data elements larger than 10 megabytes. If you need to store unstructured objects that are larger than that, such as video files, then Cloud Storage may be a better option.
Third, and this one is really important, Bigtable is not a relational database and does not support SQL or multi-row transactions. This makes it unsuitable for a wide range of applications, especially online transaction processing.
Fourth, it’s designed to store key/value pairs. If you need to store data with more structure than that, then you should use a different database.
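Here’s a rough sketch of how you might turn those four characteristics into a quick checklist. The one-terabyte and 10-megabyte thresholds come straight from this lesson; the `Workload` class, its field names, and the `bigtable_concerns` function are hypothetical names made up for illustration, not part of any Google API.

```python
from dataclasses import dataclass
from typing import List

TB = 10**12  # one terabyte, in decimal bytes
MB = 10**6   # one megabyte, in decimal bytes


@dataclass
class Workload:
    """Hypothetical description of an application's storage needs."""
    total_bytes: int
    largest_value_bytes: int
    needs_sql: bool = False
    needs_multi_row_transactions: bool = False
    needs_richer_structure: bool = False  # more than key/value pairs


def bigtable_concerns(w: Workload) -> List[str]:
    """Return the reasons (from this lesson) why Bigtable may be a poor fit."""
    concerns = []
    if w.total_bytes < 1 * TB:
        concerns.append("less than 1 TB of data, so the overhead is too high")
    if w.largest_value_bytes > 10 * MB:
        concerns.append("values larger than 10 MB; consider Cloud Storage")
    if w.needs_sql or w.needs_multi_row_transactions:
        concerns.append("no SQL or multi-row transactions")
    if w.needs_richer_structure:
        concerns.append("data needs more structure than key/value pairs")
    return concerns


# Example: 5 TB of small sensor readings with no relational requirements.
sensor_data = Workload(total_bytes=5 * TB, largest_value_bytes=2_000)
print(bigtable_concerns(sensor_data))
```

An empty list means none of the four limitations applies. It doesn’t prove that Bigtable is the best choice, only that nothing in this checklist rules it out.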
Considering all of those limitations, it might seem like Bigtable isn’t good for many scenarios. But there are certain use cases where it is an extremely good choice. Perhaps the most common is as part of a big data processing system that performs MapReduce operations. If you’re using Cloud Dataflow or Cloud Dataproc, then Bigtable is a great storage option because it has very high throughput and scalability. It also supports the HBase API, so it integrates easily with Apache Hadoop and Spark (both of which can run on Cloud Dataproc).
It’s also a good fit for real-time analytics. That is, if your application needs to perform analytics on events as they’re happening, then Bigtable will work well. This is a common use case in financial services and the Internet of Things. If, on the other hand, you need to do interactive analytics, where you run SQL queries on a data warehouse, then BigQuery is the right choice.
The biggest limitation of Bigtable is its lack of relational database capabilities, but Bigtable can scale so much better than traditional relational databases that Google came up with ways to bring the two worlds together. It added software on top of Bigtable that supports:
- More complex data than simple key/value pairs,
- Secondary indexes (instead of just one primary index),
- ACID properties for reliable transactions (that is, atomicity, consistency, isolation, and durability), and
- A SQL-like query language.
Google released this new database service as Cloud Datastore (although they actually released Datastore publicly before Bigtable).
Another difference between Datastore and Bigtable is the pricing structure. With Datastore, you pay for monthly storage as well as reads and writes. With Bigtable, you also pay for monthly storage, of course, but instead of paying for reads and writes, you have to spin up a cluster and pay for it as long as it’s running. The result is that for a small amount of data or infrequent access, Datastore is cheaper, but for large amounts of data and frequent access, Bigtable is cheaper.
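To see why the two pricing models cross over, here’s a minimal sketch that compares them. Every price in it is a made-up placeholder, not a real Google Cloud price, and the function names, the three-node cluster size, and the 730-hour month are all assumptions for illustration. The sketch also doesn’t model whether a cluster of that size could actually serve the load.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours

# PLACEHOLDER prices for illustration only; check the current price list.
DATASTORE_PER_GB = 0.20
DATASTORE_PER_100K_READS = 0.05
DATASTORE_PER_100K_WRITES = 0.15
BIGTABLE_PER_GB = 0.20
BIGTABLE_PER_NODE_HOUR = 0.50


def datastore_monthly_cost(stored_gb, reads, writes):
    """Pay for storage plus every read and write."""
    return (stored_gb * DATASTORE_PER_GB
            + reads / 100_000 * DATASTORE_PER_100K_READS
            + writes / 100_000 * DATASTORE_PER_100K_WRITES)


def bigtable_monthly_cost(stored_gb, nodes):
    """Pay for storage plus the cluster for as long as it runs."""
    return (stored_gb * BIGTABLE_PER_GB
            + nodes * BIGTABLE_PER_NODE_HOUR * HOURS_PER_MONTH)


# Small data set, infrequent access.
light_datastore = datastore_monthly_cost(10, reads=1_000_000, writes=100_000)
light_bigtable = bigtable_monthly_cost(10, nodes=3)

# Large data set, frequent access.
heavy_datastore = datastore_monthly_cost(
    10_000, reads=10_000_000_000, writes=2_000_000_000)
heavy_bigtable = bigtable_monthly_cost(10_000, nodes=3)

print("light:", "Datastore" if light_datastore < light_bigtable else "Bigtable")
print("heavy:", "Datastore" if heavy_datastore < heavy_bigtable else "Bigtable")
```

The cluster charge is a fixed cost, so it dominates when traffic is light, while per-operation charges grow with every read and write. That’s the shape of the trade-off, whatever the actual prices are.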
Although Datastore has many of the features of relational databases, it’s still missing some important ones. So Google created yet another Bigtable-based service. It’s called Cloud Spanner and it was only released to the public in 2017. It includes these additional features:
- A relational schema,
- Strong consistency for all queries (rather than eventual consistency),
- SQL support, and
- Multi-region deployments.
Cloud Spanner gives you the best of both worlds: massive scalability and strong consistency. So why wouldn’t you always use Cloud Spanner instead of Bigtable or Datastore? Because those additional capabilities come at a price. Cloud Spanner is Google’s most expensive database service. For example, if you just need to do high-speed analytics, then Bigtable would be cheaper and less complicated.
By the way, Google does offer one more database service. Cloud SQL is a managed service for MySQL and PostgreSQL, so if you want to run a traditional relational database, this is the best way to do it on Google Cloud Platform.
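The comparisons in this lesson can be summed up as a rough decision helper. This is only a sketch of the guidance given here, not an official selection tool: the `suggest_service` function and its flag names are hypothetical, and the order of the checks is an assumption about which requirement should win when several apply.

```python
def suggest_service(*,
                    wants_mysql_or_postgresql=False,
                    interactive_sql_warehouse=False,
                    relational_at_massive_scale=False,
                    needs_transactions_or_secondary_indexes=False,
                    low_latency_big_data=False):
    """Map requirements to the service this lesson recommends for them."""
    if wants_mysql_or_postgresql:
        return "Cloud SQL"        # traditional relational database
    if interactive_sql_warehouse:
        return "BigQuery"         # interactive SQL analytics
    if relational_at_massive_scale:
        return "Cloud Spanner"    # relational schema plus scalability
    if needs_transactions_or_secondary_indexes:
        return "Cloud Datastore"  # ACID transactions, secondary indexes
    if low_latency_big_data:
        return "Cloud Bigtable"   # fast access to big data
    return "no clear match"


# Example: real-time analytics on a large stream of events.
print(suggest_service(low_latency_big_data=True))
```

Cost isn’t modeled here, so a result of Cloud Spanner still needs the price check described above.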
And that’s it for this lesson.
Guy launched his first training website in 1995 and he’s been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).