Conclusion
Difficulty: Intermediate
Duration: 48m
Students: 2539
Ratings: 4.9/5
Description

Bigtable is an internal Google database system so revolutionary that it kickstarted the NoSQL industry. In the mid-2000s, Google had a problem: the web indexes behind its search engine had become massive, and rebuilding them took a long time. The company wanted to build a database that could deliver real-time access to petabytes of data. The result was Bigtable.

Google went on to use Bigtable to power many of its other core services, such as Gmail and Google Maps. Finally, in 2015, it made Cloud Bigtable available as a service that its customers could use for their own applications.

In this course, you will learn which of your applications could make use of Bigtable and how to take advantage of its high performance.

Learning Objectives

  • Identify the best use cases for Bigtable
  • Describe Bigtable’s architecture and storage model
  • Optimize query performance through good schema design
  • Configure and monitor a Bigtable cluster
  • Send commands to Bigtable

Intended Audience

  • Data professionals
  • People studying for the Google Professional Data Engineer exam

Prerequisites

The example code is at https://github.com/cloudacademy/cloud-bigtable-examples/tree/master/java/dataproc-wordcount.

Transcript

I hope you enjoyed learning about Google Cloud Bigtable. Let’s do a quick review of what you learned. Bigtable is designed for applications that need low latency access to large amounts of data. A couple of common uses are to support MapReduce operations and real-time analytics.

Bigtable is not a relational database. You can’t query it using SQL, which is why it’s called a NoSQL database. Only single-row transactions have strong consistency; operations that span multiple rows aren’t atomic, so they only have eventual consistency.
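To make that single-row guarantee concrete, here’s a minimal Java sketch using the HBase client for Bigtable (the same client family as the example code linked above). The project, instance, table, and column names are placeholders, not values from the course.

```java
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowAtomicity {
  public static void main(String[] args) throws Exception {
    try (Connection connection = BigtableConfiguration.connect("my-project", "my-instance");
         Table table = connection.getTable(TableName.valueOf("page-views"))) {
      // Read-modify-write on a counter cell. Because it touches only one row,
      // Bigtable applies it atomically; there is no equivalent guarantee across rows.
      long newCount = table.incrementColumnValue(
          Bytes.toBytes("com.example/index.html"),  // row key
          Bytes.toBytes("stats"),                   // column family
          Bytes.toBytes("views"),                   // column qualifier
          1L);                                      // amount to add
      System.out.println("View count is now " + newCount);
    }
  }
}
```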

Bigtable has an architecture designed for high performance. A Bigtable cluster has nodes that contain pointers to tablets, which are groups of rows. Tables are sparse. That is, empty cells don’t take up any space, so you can have millions of columns. The row key is the only index, so if a query doesn’t reference the row key, it will result in a full table scan. Rows contain key/value pairs. Columns are grouped into column families, and a column key consists of the column family plus the column qualifier. Each cell also carries a timestamp that identifies the version of the data.
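Here’s a short sketch of that storage model in Java, again using the HBase API with placeholder names. Each cell is addressed by row key, column family, and column qualifier, and every write creates a timestamped version.

```java
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StorageModelSketch {
  public static void main(String[] args) throws Exception {
    try (Connection connection = BigtableConfiguration.connect("my-project", "my-instance");
         Table table = connection.getTable(TableName.valueOf("web-pages"))) {
      // Write one cell: row key + column family + column qualifier + timestamp -> value.
      Put put = new Put(Bytes.toBytes("com.example/index.html"));     // row key
      put.addColumn(Bytes.toBytes("contents"),                        // column family
                    Bytes.toBytes("html"),                            // column qualifier
                    System.currentTimeMillis(),                       // version timestamp
                    Bytes.toBytes("<html>...</html>"));               // cell value
      table.put(put);

      // A read that specifies the row key goes straight to the right tablet.
      // Columns the row never wrote to cost nothing, because the table is sparse.
      Result row = table.get(new Get(Bytes.toBytes("com.example/index.html")));
      System.out.println(Bytes.toString(
          row.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"))));
    }
  }
}
```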

The most important part of schema design is choosing a row key. It should contain the information needed to support the most common queries. It should also distribute reads and writes evenly across nodes by avoiding hotspotting. Because row keys are stored in sorted order, adjacent keys should ideally share identical chunks, which helps Bigtable compress them. Finally, row keys should be relatively short.
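As a rough illustration of those guidelines, here’s a small, hypothetical row key helper in plain Java. The “device ID plus reversed timestamp” scheme is just one common pattern for spreading writes and keeping the newest data first; it isn’t the only valid choice.

```java
public class RowKeyDesign {
  // Lead with a value that spreads writes across many tablets (the device ID)
  // instead of a monotonically increasing timestamp, which would send every
  // new write to the same "hot" tablet.
  static String buildRowKey(String deviceId, long epochMillis) {
    // Reversing the timestamp sorts the newest readings first within each device,
    // which helps when the most common query is "latest data for device X".
    long reversedTs = Long.MAX_VALUE - epochMillis;
    // Zero-padding keeps the keys a fixed, relatively short length, and adjacent
    // keys for the same device share a long common prefix, which compresses well.
    return deviceId + "#" + String.format("%019d", reversedTs);
  }

  public static void main(String[] args) {
    System.out.println(buildRowKey("sensor-0042", System.currentTimeMillis()));
  }
}
```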

Columns that are usually retrieved together should be grouped into a column family. And when you’re working with time series data, you should use tall tables with a relatively small number of columns.
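Here’s a hedged sketch of that tall-table pattern for time series data, with placeholder table, family, and column names: each reading becomes its own narrow row, keyed by the sensor ID plus a timestamp, with just a few columns grouped in one family.

```java
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TallTableWrites {
  public static void main(String[] args) throws Exception {
    try (Connection connection = BigtableConfiguration.connect("my-project", "my-instance");
         Table table = connection.getTable(TableName.valueOf("sensor-readings"))) {
      // Tall table: one narrow row per reading instead of one wide row per sensor.
      String rowKey = "sensor-0042#" + System.currentTimeMillis();
      Put put = new Put(Bytes.toBytes(rowKey));
      put.addColumn(Bytes.toBytes("measurements"), Bytes.toBytes("temperature"), Bytes.toBytes("21.5"));
      put.addColumn(Bytes.toBytes("measurements"), Bytes.toBytes("humidity"), Bytes.toBytes("40"));
      table.put(put);
    }
  }
}
```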

When you’re configuring a cluster, you should use SSDs, rather than HDDs, in most cases, to optimize performance. You need to run large workloads to take advantage of Bigtable’s architecture. To ensure good performance, you should monitor your cluster and add nodes when necessary. For low latency, you should put your client applications in the same zone as your Bigtable cluster.
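If you’d rather script that setup than click through the web console, the Java admin client can create a production instance with an SSD cluster. This is only a sketch; the project, instance, cluster, and zone values are placeholders, and you’d size the node count for your own workload.

```java
import com.google.cloud.bigtable.admin.v2.BigtableInstanceAdminClient;
import com.google.cloud.bigtable.admin.v2.models.CreateInstanceRequest;
import com.google.cloud.bigtable.admin.v2.models.Instance;
import com.google.cloud.bigtable.admin.v2.models.StorageType;

public class CreateSsdCluster {
  public static void main(String[] args) throws Exception {
    try (BigtableInstanceAdminClient adminClient =
             BigtableInstanceAdminClient.create("my-project")) {
      // A production instance with a 3-node SSD cluster, placed in the zone
      // where the client applications will run to keep latency low.
      CreateInstanceRequest request =
          CreateInstanceRequest.of("my-instance")
              .addCluster("my-cluster", "us-central1-b", 3, StorageType.SSD)
              .setType(Instance.Type.PRODUCTION);
      Instance instance = adminClient.createInstance(request);
      System.out.println("Created instance: " + instance.getId());
    }
  }
}
```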

To control which Google Cloud Platform users have read and write permissions to your tables, set reader, user, and admin roles in IAM. They can only be assigned at the project level, though.

Pricing is based on the number of nodes in your cluster, the amount of data stored in your tables, and the amount of cross-region network traffic. You can monitor a cluster’s performance either manually in the web console or using Stackdriver Monitoring. You can query and manipulate the data in your tables using either the HBase shell or the cbt command.
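And if you’d rather query from Java code than from the HBase shell or cbt, a prefix scan looks roughly like this; the table name and key prefix are placeholders. Because the row key is the only index, prefix scans like this are the main way to read a slice of a table without a full table scan.

```java
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScan {
  public static void main(String[] args) throws Exception {
    try (Connection connection = BigtableConfiguration.connect("my-project", "my-instance");
         Table table = connection.getTable(TableName.valueOf("sensor-readings"))) {
      // Read only the rows whose keys start with the given prefix.
      Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes("sensor-0042#"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
    }
  }
}
```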

Now you know the best use cases for Bigtable, the details of its architecture and its storage model, how to optimize performance by designing a good schema and properly configuring a cluster, how to monitor a cluster, and how to use the HBase shell to send commands to Bigtable.

To learn more about Cloud Bigtable, you can read Google’s documentation. Also, watch for new big data courses on Cloud Academy, because we’re always publishing new courses.

If you have any questions or comments, please let me know in the Comments tab below this video or by emailing support@cloudacademy.com. Thanks and keep on learning!

About the Author
Students: 201667
Courses: 97
Learning Paths: 162

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).