Managing Google Cloud Storage
Storage Management
Difficulty
Intermediate
Duration
11m
Students
650
Ratings
5/5
Description

Once you’ve set up your Google Cloud Storage buckets and applied the right security settings to them, you’ll still need to manage them on an ongoing basis. In this course, we’ll show you how to upload data to your buckets, transfer data from another provider or from your on-premises environment, and implement object lifecycle management.

Learning Objectives

  • Upload data to Cloud Storage
  • Transfer data to Cloud Storage from another provider or from your on-premises environment
  • Implement object lifecycle management in Cloud Storage

Intended Audience

  • Anyone who manages Cloud Storage data on Google Cloud Platform

Prerequisites

  • Experience creating Cloud Storage buckets on Google Cloud Platform
Transcript

Once you’ve set up your Cloud Storage buckets and applied the right security settings to them, you’ll still need to manage them on an ongoing basis.

One of your first tasks will likely be to get data into your Cloud Storage buckets. If you need to upload data from an on-premises location, then you have three options:

  1. The Cloud Storage console
  2. gsutil, or
  3. Offline media import / export

The easiest way is to click on a bucket in the Cloud Storage console and then click “Upload Files” or “Upload Folder”. You can even view the uploaded file from the console if it’s the type of file that a web browser knows how to display.

The second way is to use the “gsutil” command. For example, to upload a folder called “example-folder” from your desktop, you would use “gsutil cp -r Desktop/example-folder” and then put “gs://” and the name of the bucket. Now, if we go back to the Cloud Storage console, we can see the uploaded folder.
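For reference, the full command from that example (assuming a bucket named ca-example, the bucket used later in this demo) would look like this:

# Recursively copy the local folder and its contents into the bucket
gsutil cp -r Desktop/example-folder gs://ca-example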

If you have a slow or expensive Internet connection, then you may want to use the third option, which is to ship your data on offline media. One way is to send hard disks, storage arrays, tapes, or other media to a third-party provider, such as Iron Mountain, and let them upload your data. 

Another way is to use a Transfer Appliance supplied by Google. Here’s how it works. You submit a request for a Transfer Appliance, which Google then ships to you. When you receive it, you install it in your data center and transfer your data to it. Then you ship it back to Google so they can upload it for you.

There are still more steps to perform, though, because the data on the Transfer Appliance is encrypted. First, Google uploads your encrypted data to a staging bucket in Cloud Storage. Next, you have to launch and configure what’s called a rehydrator instance. Then you run a rehydrator job on the instance, which will decrypt the data and copy it to a Cloud Storage bucket of your choosing. Finally, you delete the instance and send Google a request to erase the data from the Transfer Appliance and the staging bucket.

If you need to transfer data from another cloud provider, then you can use the Google Cloud Storage Transfer Service. I’ll show you how it works.

Click “Transfer” in the left-hand menu. Then click the “Create transfer” button. Under “Select source”, there are three options: you can transfer from another Google Cloud Storage bucket, an Amazon S3 bucket, or from a URL. I’m going to transfer from another Cloud Storage bucket to simulate transferring from another cloud provider because it works the same way.

Click the Browse button and select the bucket. Then say which files to include in the transfer by specifying their prefix, such as “example”. You can also exclude files in the same way. Another option is to specify that you only want files modified recently, say in the last 24 hours. Of course, if you want to copy the entire contents of the bucket, then you don’t need to put in any filters.

Now choose the destination bucket. There are also some options for how the transfer handles overwrites and deletions. By default, an object will only be overwritten when the source version is different from the destination version. Also by default, no objects will be deleted from either the source or the destination. I’ll just leave it with the default settings.

Then you specify when you want the transfer to happen. You can either run it now or schedule it to run at a particular time every day. I’ll just run it now. The transfer is going to take a little while, so I’ll fast forward until it’s done. It’s a bit tedious to transfer files manually like this, which is why scheduled transfers are so nice. For example, you could have it run automatically every day to check for new files in the source bucket and transfer them to the destination bucket. Alternatively, you could do this with a cron job that runs the gsutil command, but it’s much cleaner to do it this way.
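As a rough sketch of that cron-based alternative (the bucket names here are placeholders, not from this demo), a daily crontab entry could use gsutil rsync to copy anything new from the source bucket to the destination:

# Every day at 2:00 AM, copy new or changed objects from the source bucket to the destination bucket
0 2 * * * gsutil -m rsync -r gs://source-bucket gs://destination-bucket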

If you have any experience with managing data, then you know that data just keeps growing and growing over time, and if you don’t implement something to keep that growth under control, then you will either run out of storage space or have runaway costs. Since Google Cloud has nearly unlimited storage resources, that means you could easily have runaway costs.

To prevent that, you can use object lifecycle management. Lifecycle rules give you two ways to control costs. Based on conditions such as an object’s age, creation time, or number of newer versions, you can either:

  • Delete objects, or
  • Move objects to a cheaper storage class, such as Nearline or Coldline Storage

You can manage object lifecycle policies through the Cloud Storage console, the “gsutil” command, or the Google API Client Libraries. I’ll show you how to do it with gsutil.

First, you need to create a lifecycle config file that contains the rules you want to set. Here’s an example in JSON format [vi lc1.json]:

{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {
          "age": 365,
          "isLive": true
        }
      },
      {
        "action": {"type": "Delete"},
        "condition": {
          "isLive": false,
          "numNewerVersions": 3
        }
      }
    ]
  }
}

The first rule says to delete any live object older than 365 days. This might seem like a short time to live, but it’s not as draconian as it looks because it won’t completely delete the object if you’ve enabled versioning on the bucket. It’ll make it a noncurrent version instead. However, if you don’t have versioning enabled, then a Delete action will completely delete objects matching the condition and there will be no way to get them back. This is why you should test your lifecycle rules on test data before applying them to production data.

The “isLive” parameter only matters if you’ve turned on versioning. If an object is live, it means that it’s the most current version of that object. If it’s not live, then it’s one of the noncurrent versions of that object.

Let’s see if versioning is enabled on this bucket. I’ll use the gsutil command to check. It’s gsutil versioning get and then the bucket name. It says it’s suspended. What the heck does that mean? It means versioning is disabled. I don’t know why they say suspended instead of disabled.
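For reference, here’s that check on the ca-example bucket:

# Check whether versioning is turned on; prints the bucket name followed by Enabled or Suspended
gsutil versioning get gs://ca-example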

To enable versioning, we just need to change “get” to “set on”. I have a file in the ca-example bucket called “examplefile”. Now, I’ll upload a different version of “examplefile” and see if it makes the old version noncurrent. OK, it’s uploaded, now I’ll run the “gsutil ls -la” command on that file. Yes, there are two versions of it now and they have different dates and sizes. Note that if you do an “ls -l” without the ‘a’ flag, then it won’t show the different versions, so make sure you include the ‘a’ flag.
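Here’s that sequence as commands, assuming the new version of examplefile is sitting on the desktop (the local path is just illustrative):

# Turn on versioning for the bucket
gsutil versioning set on gs://ca-example
# Upload a new version of the file, which makes the existing one noncurrent
gsutil cp Desktop/examplefile gs://ca-example
# List every version of the object; without the 'a' flag you'd only see the live one
gsutil ls -la gs://ca-example/examplefile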

OK, let’s get back to the lifecycle policy. The second rule says to delete any object that has at least 3 newer versions of itself, including the live version. Unlike the first rule, this one really will delete the object because when you delete a noncurrent object, it gets deleted forever. Although this rule has an explicit condition that the object must not be live, you don’t actually need to put in that condition because if an object has three newer versions, then it can’t be live. It has to be a noncurrent version.

So, looking at the big picture for these two rules, there are two possible scenarios, depending on whether versioning is enabled or not. If versioning is enabled, then the first rule will make any object that is more than one year old noncurrent, and the second rule will delete any object that has at least three newer versions of itself. If versioning is not enabled, then the first rule will delete any object that is more than one year old, and the second rule will not do anything.

Suppose that instead of deleting objects older than one year, you’d like to send them to Nearline Storage, which is significantly cheaper, and send objects in Nearline Storage that are older than 3 years to Coldline Storage, which is even less expensive.

Here’s a lifecycle config file that will implement this policy.

{
  "lifecycle": {
    "rule": [
      {
        "action": {
          "type": "SetStorageClass",
          "storageClass": "NEARLINE"
        },
        "condition": {
          "age": 365,
          "matchesStorageClass": ["MULTI_REGIONAL"]
        }
      },
      {
        "action": {
          "type": "SetStorageClass",
          "storageClass": "COLDLINE"
        },
        "condition": {
          "age": 1095,
          "matchesStorageClass": ["NEARLINE"]
        }
      }
    ]
  }
}

The first rule moves objects that are older than one year from Multi-Regional Storage to Nearline Storage.

The second rule says that if an object is in Nearline Storage and it is at least 1,095 days old (which is 3 years), then it should be moved to Coldline Storage.

To apply this lifecycle policy, you type “gsutil lifecycle set”, then the name of the config file, which is “lc2.json” in this case, and then the URL for the bucket, which is “gs://ca-example” in this case. Again, make sure you apply a lifecycle policy to test data before you put it into production or you risk losing valuable data.
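Spelled out, that command is:

# Apply the lifecycle config to the bucket
gsutil lifecycle set lc2.json gs://ca-example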

Once you’ve set up a lifecycle policy, you can monitor what it’s doing in two ways:

  • Expiration time metadata, and
  • Usage logs

To see the metadata for an object, use “gsutil ls -La” on the object. I’ll just show you the first dozen or so lines of the output because it prints all of the access control list information, which we’re not interested in right now. The output from this command may or may not contain expiration time metadata (and it doesn’t for this file), but the lifecycle policy should add that metadata when it knows the date and time that an object will be deleted.
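For the file from earlier in this demo, that command would be:

# Long listing of every version of the object, including its metadata
gsutil ls -La gs://ca-example/examplefile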

Bear in mind that updates to your lifecycle configuration may take up to 24 hours to go into effect. Not only will it take up to 24 hours before your new rules kick in, but your old rules may also stay active for up to 24 hours. So if you discover a mistake in your rules, that bug can keep doing damage for another 24 hours after you fix it, which is one more reason to test your rules on test data first.

Expiration time metadata is useful to see when your lifecycle policy is planning to delete an object, but it won’t show any of the other potential operations, such as moving an object to another storage class. If you want to see all of the operations that your lifecycle policy has actually performed, then you can look at the logs.

If you haven’t already set up usage logs for your bucket, then here are the commands you need to use. First, create a bucket to hold the logs. Remember to change “ca-example-logs” to your own log bucket name.

Then you have to give Google Cloud Storage WRITE permission so it can put logs in this bucket.

Next, you can set the default object ACL to, for example, “project-private”. You don’t have to do this, but it’s a good idea for security purposes to keep your logs private.

And finally, you enable logging with the “gsutil logging set on” command.
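As a sketch of those four steps, using the bucket names from this demo (cloud-storage-analytics@google.com is the group Google Cloud Storage uses to deliver the logs):

# 1. Create a bucket to hold the usage logs
gsutil mb gs://ca-example-logs
# 2. Give Cloud Storage WRITE permission on the log bucket
gsutil acl ch -g cloud-storage-analytics@google.com:W gs://ca-example-logs
# 3. (Optional) Make new log objects project-private by default
gsutil defacl set project-private gs://ca-example-logs
# 4. Turn on usage logging for the bucket you want to monitor
gsutil logging set on -b gs://ca-example-logs gs://ca-example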

Once you have logging set up, then you can go into your logging bucket in the console, and the usage logs will show up there after the lifecycle policy has made changes.
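If you’d rather check from the command line than the console, you can list the log bucket’s contents once logs start arriving:

# List the usage log objects that have been delivered so far
gsutil ls gs://ca-example-logs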

And that’s it for managing Google Cloud Storage.

About the Author
Students
201283
Courses
97
Learning Paths
167

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).