Creating Data Storage Resources

Computing services such as virtual machine instances, container orchestration systems, serverless, etc., gain a lot of attention in the tech world, but storage and networking are also essential for almost all applications. Data storage is a broad topic covering a wide variety of storage mechanisms for different use cases. Networking is vital for service communication, and security is always important, though typically an afterthought.

As the technologies used to build distributed systems keep improving, data storage offerings continue to grow, evolve, and inspire new services. Having a better understanding of these different services can help us build better applications.

This course will help prepare you for the Google Professional Cloud Developer Certification exam, which requires a working knowledge of building cloud-native systems on GCP, and covers a wide variety of topics, from designing distributed systems to knowing how to create different storage resources.

This course focuses on the third section of the exam overview, concentrating specifically on the last four points, which cover data storage creation, networking, and security services.

Learning Objectives

  • Create data storage resources
  • Deploy and implement networking resources
  • Automate resource provisioning with Deployment Manager
  • Manage service accounts

Intended Audience

  • IT professionals who want to become cloud-native developers
  • IT professionals preparing for Google’s Professional Cloud Developer exam

Prerequisites

  • Software development experience
  • Proficient with at least one programming language
  • SQL and NoSQL experience
  • Networking experience (subnets, CIDR notation, and firewalls)
  • Familiarity with infrastructure-as-code concepts

Hello and welcome. In this lesson, we're going to be talking about how to create several different Google Cloud storage resources. As the exam guide showcases, storage is a very general topic. It's more than just SQL databases or blob storage. The exam guide lists several services that you should know before taking the exam, which include source code repositories, Cloud SQL, Datastore indexing, BigQuery datasets, Cloud Spanner, Cloud Storage buckets, and Pub/Sub topics. We have a lot of information to get through here, so let's dive right in.

Cloud Source Repositories is Google Cloud's hosted Git-based version control service. If you're not yet familiar with Git, I highly recommend taking some additional time to become familiar with at least the basic functionality before sitting the exam. With so many great hosted Git services on the market, why is it that Google wants you to know about Source Repositories? I think it's likely due to its integration with other GCP services. For example, Cloud Build triggers allow us to run builds in response to changes in our code base. Now, that provides the basis for an automated continuous integration pipeline.

There are other integrations, such as with App Engine and Cloud Debugger, that may be relevant depending on your deployment target. There are basically two methods for creating Cloud Source Repositories: the first is an empty repo and the second is a mirrored repo. Both are just hosted Git repositories. However, the way that you'll interact with the two may be different. A mirrored repository is just a read-only copy of an original. In contrast, an empty repo is just a standard Git repo that allows reads and writes. Creating a new repo is fairly simple. The only real information that we need is the name of the repo and the project. In the console, we just fill out a basic form that provides that info. With the SDK, we can issue the gcloud source repos create command.

Creating a mirrored repo allows us to sync an existing GitHub or Bitbucket repo. The syncing process is one way, from the external repo into this mirror. Creating a mirror requires providing Google Cloud with read-only access to that repo. Interacting with a Cloud Source Repository using standard Git commands requires authentication, and for that, there are three different options: SSH, the SDK, and manually generated credentials. SSH just requires your public SSH key to be uploaded to Google Cloud. The SDK option requires authenticating with the command line interface, and after that, standard Git commands will be authenticated. The manual option uses a link that can be found in the console. When you click on that link, it opens up another page containing a code snippet that you can run in a shell, and that's going to authenticate you.

So, summarizing: Cloud Source Repositories are created as either empty or mirrored repositories. After we authenticate, we can use standard Git commands to interact with the repo.
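As a rough sketch of the SDK route, the snippet below only assembles the two gcloud commands as argument lists, one to create an empty repo and one to clone it with SDK-managed credentials. The repo and project names are made-up placeholders, and nothing is executed unless you pass a list to subprocess.run yourself.

```python
import shlex


def create_repo_cmd(repo: str, project: str) -> list[str]:
    """Arguments for creating an empty Cloud Source Repository."""
    return ["gcloud", "source", "repos", "create", repo, f"--project={project}"]


def clone_repo_cmd(repo: str, project: str) -> list[str]:
    """Arguments for cloning the repo using the SDK's credentials."""
    return ["gcloud", "source", "repos", "clone", repo, f"--project={project}"]


# Hypothetical names, for illustration only.
create_cmd = create_repo_cmd("demo-repo", "my-project")
clone_cmd = clone_repo_cmd("demo-repo", "my-project")

# Print the commands in a copy-and-paste friendly form.
print(shlex.join(create_cmd))
print(shlex.join(clone_cmd))
```

After the clone, the working copy behaves like any other Git repo, so the usual add, commit, and push commands apply.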

Next up is creating Cloud SQL instances. Cloud SQL is a regional service capable of running in one or more zones, depending on replication. Now, what that means is that when we create a SQL instance, we select a region, and for a single-zone deployment, we can also select a zone. Regional deployments give us automatic failover, and they automatically use multiple zones. The region can't be changed after the instance is created, though we can change zones if we want to.

With Cloud SQL, an instance is an abstraction representing the database engine to use, the virtual machine type, the storage type, the network settings, database-specific flags, etc. Some of the instance settings are generic, and others depend on the database engine that you select. There are two instance classes: first generation and second generation. First generation instances are deprecated and have been replaced by second generation, so we're only going to focus on second gen here.

Creating an instance with the SDK requires issuing the gcloud sql instances create command with the required parameters. Creating an instance with the console provides a basic form. Instances require an identifier and a region, optionally a zone, a database password, as well as the database engine and version. There are some additional settings. However, the console does set some sensible defaults that make instance creation just a little bit easier. Let's take a look at these.

Under the Connectivity section, the Private IP check box allows an instance to be connected to a VPC network and reached over private IP addresses. Once you enable that, it can't be undone. The Public IP check box determines whether the instance should receive a public IP address, which makes it externally accessible on the internet and does add some security considerations. The Machine Type settings allow us to specify CPU and memory, as well as the storage type and capacity. There's also a setting here that allows the storage capacity to automatically increase as needed. When it's enabled, Cloud SQL checks the available storage capacity every 30 seconds and permanently increases it if needed.

The backup schedule allows us to set an acceptable time window for backups, assuming we enable them, which is recommended.

Now, this next setting is a bit inconspicuous. However, it allows us to run either a single standalone instance inside of a single zone, or to have GCP automatically create read-only replicas in another zone. Using regional means that if the primary instance experiences a failure, GCP can promote the replica to become the primary, so this one setting can increase your potential availability.

The Flags section here is for database engine specific flags, should we have any that we want to set. The Maintenance section contains settings related to scheduled instance maintenance. Cloud SQL is a managed service and that management has to happen at some point, so this setting gives us a bit of say in when that happens.

Okay, so, Cloud SQL instances are an abstraction that combines configuration settings for the desired database engine, machine type, storage type, instance location, and connectivity settings. Once created, an instance belongs to a region, and that can use one or more zones, depending on our availability settings. 
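To make the SDK path a little more concrete, here's a minimal sketch that assembles a gcloud sql instances create command from the settings we just walked through. The instance name, region, engine version, and tier are placeholder values chosen for illustration, and the helper function itself is hypothetical. It only builds the argument list and doesn't run anything.

```python
def sql_instance_create_cmd(
    name: str,
    region: str,
    database_version: str,
    tier: str,
    high_availability: bool = False,
    storage_auto_increase: bool = True,
) -> list[str]:
    """Build the arguments for `gcloud sql instances create`."""
    # "regional" asks for the multi-zone failover setup, "zonal" for a single zone.
    availability = "regional" if high_availability else "zonal"
    cmd = [
        "gcloud", "sql", "instances", "create", name,
        f"--region={region}",
        f"--database-version={database_version}",
        f"--tier={tier}",
        f"--availability-type={availability}",
    ]
    if storage_auto_increase:
        # Let Cloud SQL grow the disk when it starts running low.
        cmd.append("--storage-auto-increase")
    return cmd


# Placeholder values, for illustration only.
ha_cmd = sql_instance_create_cmd(
    "demo-db", "us-central1", "POSTGRES_13", "db-custom-1-3840",
    high_availability=True,
)
print(" ".join(ha_cmd))
```

Keeping the region as a required argument mirrors the fact that it's the one choice you can't revisit later.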

All right, next up, composite indexes with Datastore. All the queries in Datastore require an index. Indexes are automatically created for each individual property, in both ascending and descending order. These built-in indexes support what Google calls simple queries. The built-in indexes aren't displayed anywhere, so you just have to know that they exist. Complex queries require composite indexes. Composite indexes are specified in an index.yaml file, which lists the properties to index and the direction, either ascending or descending. The file can be created manually or it can be generated. The local development Cloud Datastore Emulator is actually able to create the index.yaml file for you based on the queries that you run in your local environment. Now, this makes it easy to generate the file without any guesswork, because it uses the queries you run in your dev environment to figure out exactly which indexes it needs to build.

Once you have an index.yaml file, the gcloud datastore indexes create command is used to upload the file and have the indexes built. So, start to finish: once you have complex queries, you need a composite index. You may be wondering how you'll know if you need a composite index. Since all of the queries require an index, if you attempt to make a query and it doesn't have a supporting index, you're going to get an error. Composite indexes are defined in an index.yaml file that can be either manually created or generated with the Datastore Emulator. The file is then uploaded with the gcloud command and the indexes are built by the database. The time it takes to build an index depends on the size of the existing dataset.
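If you want to see roughly what goes into that file, here's a small sketch that renders an index.yaml from a Python description of the composite indexes. The Task kind and its priority and created properties are invented for the example, and the renderer is a hypothetical helper rather than anything the Datastore tooling provides.

```python
def render_index_yaml(indexes: list[tuple[str, list[tuple[str, str]]]]) -> str:
    """Render composite index definitions in index.yaml form.

    Each entry is (kind, [(property_name, direction), ...]) where the
    direction is "asc" or "desc".
    """
    lines = ["indexes:"]
    for kind, properties in indexes:
        lines.append(f"- kind: {kind}")
        lines.append("  properties:")
        for name, direction in properties:
            if direction not in ("asc", "desc"):
                raise ValueError(f"unknown direction: {direction!r}")
            lines.append(f"  - name: {name}")
            lines.append(f"    direction: {direction}")
    return "\n".join(lines) + "\n"


# Hypothetical kind and properties, for illustration only.
index_yaml = render_index_yaml(
    [("Task", [("priority", "desc"), ("created", "asc")])]
)
print(index_yaml)
```

Writing that text to index.yaml and running gcloud datastore indexes create index.yaml is the upload step described above.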

Next up is creating a BigQuery dataset. A dataset in BigQuery is a top-level container for tables, views, and access control. There are a few ways to create a dataset, and they're all roughly equivalent. The required parameters are a dataset ID that's unique within the project and a location that can't be changed after creation. There are two types of location: regional and multi-regional. Because the location can't be changed after a dataset is created, you're going to wanna consider the location of your external data sources and where you might be using this data before selecting a location.

Because a dataset is a top-level container, once you have one you can add tables, of which there are three types: native tables, external tables, and views. Native tables use BigQuery's storage layer to persist the data. External tables use external data storage such as Cloud Storage, which may not be as fast as native tables, though it does have its own use cases. BigQuery also has views, which are similar to views inside of a SQL database. They're read-only, virtual tables that are generated based on the results of a query. With BigQuery, the lowest level of granularity for access control is the dataset. So, that means that you can't control access to specific tables, though you can control which users have read or write access to the dataset.

So, creating a dataset is done by specifying a dataset ID, and an immutable location. Once complete, access control settings and tables can be created in the new dataset. 
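Here's one way that might look with the bq command line tool, sketched as a helper that checks the dataset ID and then assembles the bq mk command. The project, dataset, and location values are placeholders, and the ID check encodes my understanding of the naming rule (letters, numbers, and underscores, up to 1,024 characters), so treat it as an assumption rather than an official validator.

```python
import re

# Assumed naming rule: letters, digits, and underscores, 1 to 1,024 characters.
_DATASET_ID = re.compile(r"[A-Za-z0-9_]{1,1024}")


def make_dataset_cmd(project: str, dataset_id: str, location: str) -> list[str]:
    """Build the arguments for creating a BigQuery dataset with `bq mk`."""
    if not _DATASET_ID.fullmatch(dataset_id):
        raise ValueError(f"invalid dataset ID: {dataset_id!r}")
    # The location is a global bq flag, so it goes before the mk command.
    return ["bq", f"--location={location}", "mk", "--dataset", f"{project}:{dataset_id}"]


# Placeholder values, for illustration only.
dataset_cmd = make_dataset_cmd("my-project", "sales_reporting", "US")
print(" ".join(dataset_cmd))
```

Since the location is immutable, it's worth treating it as a deliberate input like this rather than relying on a default.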

Next up is Cloud Spanner. SQL databases have been a staple in software engineering for decades, and with good reason. They provide strong consistency, transactions, data types, etc. One challenge with SQL databases has been horizontal scaling. Granted, these days it's fairly easy to create read replicas for failover, though there are use cases where you want more than just failover. In recent years, SQL alternatives have grown and evolved into NoSQL databases of all different types. These are typically distributed key-value stores with limited data types, and then maybe some additional functionality built on top. Most provide eventual consistency, where the data will make it to all of the nodes eventually, though that means sometimes you might get stale reads.

Now, there are many use cases where a SQL database would be the optimal choice if it could be globally distributed in a way that doesn't detract from the benefits of SQL. That's what Spanner aims to be. It's not a bunch of glue code on top of an existing Postgres, MySQL, or similar database engine. It's Google's own distributed SQL database. The reason why that's important is because it impacts how we as developers leverage the service. In Spanner, the top-level abstraction is called an instance. It's not like a virtual machine instance, in that a Spanner instance doesn't just represent a single server. A Spanner instance is a single implementation of a Spanner database, which consists of multiple server nodes.

Each Spanner instance contains one or more nodes. Each node provides two terabytes of disk space for the given instance, and it also provides a fixed amount of CPU and memory. Because Spanner is a distributed database, each node has a set number of replicas running either within one region or across multiple regions. Each replica is a virtual machine instance, meaning that you're charged for the number of nodes multiplied by the number of replicas. And the number of replicas depends on whether it's a regional or multi-regional deployment.
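The arithmetic here is simple enough to sketch. The snippet below uses the two terabytes per node quoted in this lesson and multiplies nodes by replicas to get the number of machines you'd be paying for. The replica counts passed in are hypothetical inputs for illustration, so check the configuration you actually pick.

```python
TB_PER_NODE = 2  # storage per node, as quoted in this lesson


def spanner_footprint(nodes: int, replicas: int) -> dict[str, int]:
    """Estimate storage capacity and billed machines for a Spanner instance."""
    if nodes < 1 or replicas < 1:
        raise ValueError("nodes and replicas must both be at least 1")
    return {
        "storage_tb": nodes * TB_PER_NODE,
        # Each replica of each node is a machine that shows up on the bill.
        "billed_machines": nodes * replicas,
    }


# Hypothetical replica counts, for illustration only.
smaller = spanner_footprint(nodes=2, replicas=3)
larger = spanner_footprint(nodes=2, replicas=5)
print(smaller, larger)
```

Note that the storage capacity is the same in both cases. Adding replicas raises the bill without adding usable space.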

Let's talk about actually using Spanner. Tables consist of columns and rows. Columns have a data type, and tables require a primary key. Okay, technically tables do not require a primary key. However, any table without a primary key can hold only one row, so they kind of require a primary key. When it comes to using Spanner, here's our first gotcha. Spanner is a distributed database and it uses a key space to determine which server is going to hold which data. As a concept, that only works if the keys are not monotonically increasing numbers. For example, imagine you have a primary key that is a 64-bit int. And it's just incremented by one for each new record. All these records are going to be grouped together and added in the same section of the key space meaning they'll all end up on the same server, and that causes what are referred to as hot spots, which can end up being a problem later on. Knowing that means that you need to put some thought into the primary key structure in advance.
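To see why sequential keys are a problem, here's a toy model, not Spanner's actual splitting logic. It carves a 64-bit key space into four contiguous ranges, pretends each range lives on its own server, and compares sequential IDs with keys derived from a hash of the ID. The number of ranges and the hashing scheme are assumptions made purely for the demonstration.

```python
import hashlib
from collections import Counter

KEY_BITS = 64
NUM_SPLITS = 4  # pretend each contiguous key range lives on its own server


def split_for(key: int) -> int:
    """Map an unsigned 64-bit key to one of NUM_SPLITS contiguous key ranges."""
    range_size = 2**KEY_BITS // NUM_SPLITS
    return key // range_size


def hashed_key(record_id: int) -> int:
    """Derive a well-spread 64-bit key from the first 8 bytes of a SHA-256 hash."""
    digest = hashlib.sha256(str(record_id).encode()).digest()
    return int.from_bytes(digest[:8], "big")


record_ids = range(1, 1001)
sequential = Counter(split_for(i) for i in record_ids)
hashed = Counter(split_for(hashed_key(i)) for i in record_ids)

print("sequential keys per split:", dict(sequential))  # every key lands in split 0
print("hashed keys per split:", dict(hashed))
```

The sequential IDs all pile into the first range, which is the hot spot. The hashed keys spread across all four ranges, at the cost of no longer being able to scan records in insertion order by key.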

Next, Spanner provides transaction mechanisms for both reading and writing data. When reading multiple pieces of data together, read-only transactions ensure that you only see data as it existed at the read timestamp. Read-write transactions provide a similar consistent view of the data up to that point, and they add a write lock to prevent the same data from being changed by another process. Using read transactions ensures strong consistency. However, there could be a use case where latency is more important than consistency, specifically with multi-region instances. Spanner does support stale reads, and that allows us to specify our acceptable level of staleness and get slightly older data, though at a faster rate.

The exact implementation of Spanner is beyond the scope of this course, however, if you nerd out on this sort of stuff, and I do, I highly recommend taking the time to delve into this further. 

There is a mechanism called TrueTime, and that serves as a basis for the external consistency in Spanner, and it's rather interesting. When planning to use Spanner, consider whether or not you really need a distributed SQL database. With so many instances up and running behind the scenes to serve as node replicas, the cost per hour is gonna add up quickly. If this really is an optimal solution, make sure you also consider regional versus multi-regional. Multi-region will increase the cost and write latency, though it does reduce global read latency, so that may be a trade-off for certain use cases. Also, compliance standards such as GDPR may impact your decision as to whether it's regional or multi-regional when it comes to data sovereignty. 

Okay, up next, let's talk about Cloud Storage. Cloud Storage is a blob storage service with built-in replication. When creating a bucket, there are just a few parameters to specify: bucket IDs need to be globally unique, location type determines how many regions data will be replicated into, and location determines specifically which region or regions to use. Because Cloud Storage supports multiple storage classes, every object requires a storage class, and so that it doesn't have to be specified for every object, there's an option for a default storage class at the bucket level. The storage class should reflect how often data is going to be accessed, and that will impact cost. Standard is used for data that is going to be frequently accessed. Nearline and Coldline are for progressively less frequently accessed data.
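As a rough illustration, the sketch below picks a default storage class from how often you expect to read the data and then assembles a gsutil mb command for the bucket. The 30-day and 90-day cutoffs are rule-of-thumb assumptions on my part, the bucket name and location are placeholders, and both helper functions are hypothetical.

```python
def pick_storage_class(days_between_reads: float) -> str:
    """Choose a default storage class from the expected access frequency."""
    # Assumed rule of thumb: roughly monthly reads suit Nearline,
    # roughly quarterly (or rarer) reads suit Coldline.
    if days_between_reads < 30:
        return "standard"
    if days_between_reads < 90:
        return "nearline"
    return "coldline"


def make_bucket_cmd(bucket: str, location: str, storage_class: str) -> list[str]:
    """Build the arguments for creating a bucket with `gsutil mb`."""
    return ["gsutil", "mb", "-c", storage_class, "-l", location, f"gs://{bucket}"]


# Placeholder values, for illustration only.
backup_class = pick_storage_class(days_between_reads=45)
bucket_cmd = make_bucket_cmd("example-backups-bucket", "us-east1", backup_class)
print(" ".join(bucket_cmd))
```

The class given here is only the bucket default, so individual objects can still be stored with a different class.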

The final learning objective for this lesson is creating a Pub/Sub topic. Pub/Sub is a message queue that allows us to subscribe to messages. Publishers can send messages to a topic, topics send the messages on to any subscriptions, and consumers can take those messages from the subscription and act on them. Creating a topic requires a name and an encryption key. Once a topic exists, it's capable of receiving messages. However, without any subscriptions, there's nothing consuming the messages. A single topic can have multiple subscriptions. Each subscription specifies its own rules about how message consumption should work. For example, how is the message going to get to the code? Will it be pushed to a URL, or will our app pull it on some schedule?
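To make the topic and subscription relationship concrete, here's a tiny in-memory model. It isn't the Pub/Sub client library, just a toy that shows a topic fanning each message out to every subscription, with each subscription tracking its own pulls and acknowledgements. All of the class and method names are invented for the example.

```python
from collections import deque


class Subscription:
    """Holds its own copy of every message published to the topic."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._pending = deque()  # delivered to the subscription, not yet pulled
        self._unacked = {}       # pulled by a consumer, not yet acknowledged

    def deliver(self, message_id: int, data: str) -> None:
        self._pending.append((message_id, data))

    def pull(self):
        """Return the next (message_id, data) pair, or None if nothing is waiting."""
        if not self._pending:
            return None
        message_id, data = self._pending.popleft()
        self._unacked[message_id] = data
        return message_id, data

    def ack(self, message_id: int) -> None:
        """Mark a pulled message as successfully processed."""
        self._unacked.pop(message_id, None)

    def outstanding(self) -> int:
        return len(self._pending) + len(self._unacked)


class Topic:
    """Fans each published message out to all current subscriptions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._subscriptions = []
        self._next_id = 0

    def subscribe(self, name: str) -> Subscription:
        subscription = Subscription(name)
        self._subscriptions.append(subscription)
        return subscription

    def publish(self, data: str) -> int:
        self._next_id += 1
        for subscription in self._subscriptions:
            subscription.deliver(self._next_id, data)
        return self._next_id


# Hypothetical topic and subscription names.
orders = Topic("orders")
billing = orders.subscribe("billing")
shipping = orders.subscribe("shipping")
orders.publish("order-1")

message_id, data = billing.pull()
billing.ack(message_id)
print(billing.outstanding(), shipping.outstanding())
```

Billing has pulled and acknowledged its copy, while shipping still has the same message waiting, which is the point of giving each subscription its own rules and state.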

To help clean up unused subscriptions, we can set an auto expiration that will have the subscription removed if it hasn't been used in a certain number of days. If we're going to pull data, then the consumer needs to be able to respond to indicate that the message was successfully processed. The acknowledgement deadline specifies how many seconds a consumer has to do that before the message is made available to other consumers again. Subscriptions attempt to ensure messages are successfully delivered. However, messages can't stay around forever, so we have to set a retention period for them, and the range is between 10 minutes and seven days.

All right, let's wrap up here. We have covered a lot in this section. If a lot of this was new to you, get into the console, start trying this stuff out for yourself, and make sure you're really comfortable with these different services. Firsthand experience really is key to helping a lot of this stuff stick. All right, thank you so much for watching. I will see you in another lesson.

About the Author

Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.