Data Loss Prevention
This course is a short introduction to Google Cloud Data Loss Prevention, which is a service that finds and de-identifies sensitive data, such as birthdates and credit card numbers.
Learning Objectives
- Describe Cloud Data Loss Prevention
- List the supported data sources and types
- Explain the concept of an information type
- Explain the concept of a job
Intended Audience
- Anyone who is interested in data security services on Google Cloud Platform
Prerequisites
- Some experience using Google Cloud Platform
Hello and welcome! This is Introduction to Cloud Data Loss Prevention. By the end of this lesson, you should be able to describe Cloud Data Loss Prevention, list the supported data sources and types, explain the concept of an information type, and explain the concept of a job. This lesson will be a mix of concepts and demonstrations. We won't go into a lot of depth. However, I'll include links for additional reading where I think it's useful. So, if you're interested in learning more, then let's get started.
For many of us, the internet is integrated into our daily lives, which results in more of our data being stored digitally. Some of the data we put out into the world is rather benign, and we just don't care if it's public. However, we all have data that we expect to remain private. That's information such as your Social Security Number, your passport number, medical history, and so on. Certain information is sensitive, and in the wrong hands, it can be abused. As companies collect our ever-increasing volume of data, it is up to them to ensure it's handled properly. And if I'm being honest, most companies just do not do a good enough job. According to USA Today, in 2018 billions of people were affected by data breaches, which, according to Forbes, cost around $8 million per breach in the U.S. alone.
When companies store an individual's sensitive data, it's their responsibility to keep that data private, which is why there are so many different global governmental regulations pertaining to data handling. Companies are required to adhere to some of these standards for data handling, and it's really up to them to know which ones and ensure compliance. To help with data security, Google Cloud offers Cloud Data Loss Prevention, also referred to as Cloud DLP, or in the context of Google Cloud, it might just be called DLP.
Cloud DLP is a service that allows customers to inspect their data for potentially sensitive information and optionally redact it. So, at the risk of oversimplifying, and of angering the engineers who work on this, imagine it as a giant find and replace for sensitive data.
DLP is a standalone REST API, and by that, I mean it's useful on its own independently, though it can also integrate with some of the other cloud services. It provides endpoints for inspecting data, for de-identifying data, and for even re-identifying data in certain cases. When I say data, what I mean by that is either structured or unstructured text, or images which may contain sensitive information, so imagine having a picture of a medical record. When you inspect your data with DLP, you need to provide a list of the types of sensitive data for which you want to search.
By default, DLP can detect over 90 different types of sensitive information, which are called information types, or just infoTypes. These include information such as names, dates of birth, Social Security Numbers, emails, phone numbers, and so on. And if none of the defaults work for your use case, you can also create custom infoTypes that are either based on a dictionary or a regular expression.
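As a rough sketch of how built-in and custom infoTypes can sit side by side, the Python snippet below builds the inspectConfig portion of a request body as a plain dictionary. The field names follow the DLP v2 REST API as I understand it. The custom infoType names (EMPLOYEE_ID, PROJECT_CODENAME), the regex pattern, and the word list are made up for illustration.

```python
import json

# Built-in infoTypes are referenced by name.
built_in = [{"name": "PERSON_NAME"}, {"name": "EMAIL_ADDRESS"}]

# Custom infoTypes: one based on a regular expression, one on a dictionary.
# The names, the pattern, and the words are hypothetical examples.
custom = [
    {
        "infoType": {"name": "EMPLOYEE_ID"},
        "regex": {"pattern": r"EMP-\d{6}"},
    },
    {
        "infoType": {"name": "PROJECT_CODENAME"},
        "dictionary": {"wordList": {"words": ["bluebird", "nightowl"]}},
    },
]

inspect_config = {"infoTypes": built_in, "customInfoTypes": custom}

# Print the fragment as JSON, ready to paste into a request body.
print(json.dumps(inspect_config, indent=2))
```

Nothing here calls the service. It only shows the shape of the configuration.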
After completing an inspection, DLP returns the findings, which show all of the matches along with the likelihood that each match is the expected information type. Match likelihood is based on categories ranging from very unlikely to very likely. Now, once you find sensitive data, that is your first step. Next, it needs to be secured, which involves transforming the matches into a form that no longer exposes the original values. DLP has different types of transformations, such as redaction, replacement, masking, encryption, date shifting, and more. Some of these transformations are one-way: once you apply them, they can't be reversed. Others are reversible.
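To make the likelihood categories concrete, here is a small local sketch in Python. It models the five categories as an ordered enum and filters a list of findings by a minimum likelihood. The findings list is invented sample data, and filter_findings is a hypothetical helper, not part of any DLP client library.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    """Likelihood categories, ordered from least to most likely."""
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


def filter_findings(findings, minimum):
    """Keep only the findings at or above the minimum likelihood."""
    return [f for f in findings if Likelihood[f["likelihood"]] >= minimum]


# Invented sample findings, for illustration only.
findings = [
    {"infoType": "PHONE_NUMBER", "likelihood": "VERY_LIKELY"},
    {"infoType": "PERSON_NAME", "likelihood": "UNLIKELY"},
    {"infoType": "EMAIL_ADDRESS", "likelihood": "POSSIBLE"},
]

kept = filter_findings(findings, Likelihood.POSSIBLE)
print([f["infoType"] for f in kept])  # ['PHONE_NUMBER', 'EMAIL_ADDRESS']
```

In the real service, this filtering happens server-side based on the minimum likelihood you set in the request.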
Let's check out an example using the API Explorer. I'm here on the Overview page for the already enabled DLP API. At the bottom is a link to the API Explorer, and this is gonna open up the Data Loss Prevention version two. The endpoint used to inspect data is listed under dlp.projects.content.inspect. Right, so this is our inspection endpoint.
Now, using this structured editor, you can select the properties, you can build out this JSON object. However, I'm going to use the freeform editor. I'm gonna paste in the JSON that comes directly from the DLP documentation.
All right, let's pause here. I wanna review what's happening. This is passing the inspect API a request to inspect this unstructured text. This section is telling the API to search for any phone numbers which includes any toll-free numbers. Then it asks only to show results with a minimum likelihood of POSSIBLE. We don't want everything, just if we really think it's possible. And then we have this setting here which will ask the DLP service to return whatever sensitive data it detected. That way, we can see, we can determine if it really is sensitive data. So, our goal here is to search through this text for any phone numbers.
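For reference, a request along the lines just described could be assembled like this in Python. This is a minimal sketch, not the exact JSON from the documentation. The project ID and the sample text are placeholders, and the infoType names should be checked against the current infoType reference. The snippet only builds and prints the request. It does not send it.

```python
import json

PROJECT_ID = "my-project"  # hypothetical project ID (the parent resource)

# REST endpoint behind dlp.projects.content.inspect.
url = f"https://dlp.googleapis.com/v2/projects/{PROJECT_ID}/content:inspect"

request_body = {
    # The unstructured text to inspect (made-up sample).
    "item": {"value": "My phone number is (206) 555-0123."},
    "inspectConfig": {
        # Search for phone numbers, including toll-free numbers.
        "infoTypes": [
            {"name": "PHONE_NUMBER"},
            {"name": "US_TOLLFREE_PHONE_NUMBER"},
        ],
        # Only return findings rated POSSIBLE or higher.
        "minLikelihood": "POSSIBLE",
        # Ask the service to return the matched text with each finding.
        "includeQuote": True,
    },
}

print("POST", url)
print(json.dumps(request_body, indent=2))
```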
Now, before running it, I just need to provide a parent resource, and in this case, it is going to be my project. Okay, so let's run this by clicking Authorize and execute, and logging in. Great!
Now, scrolling down, you can see the API Explorer has printed the Request, and further down is our Response. The status code of 200 tells us that the API call was successful. And the JSON object tells us that DLP did find a LIKELY match for a PHONE_NUMBER.
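If you wanted to pull the interesting parts out of a response like that in code, a sketch might look like the following. The sample_response dictionary is an invented, trimmed-down stand-in for what the inspect endpoint returns, and summarize is a hypothetical helper.

```python
# Invented, trimmed-down stand-in for an inspect response body.
sample_response = {
    "result": {
        "findings": [
            {
                "quote": "(206) 555-0123",
                "infoType": {"name": "PHONE_NUMBER"},
                "likelihood": "LIKELY",
            }
        ]
    }
}


def summarize(response):
    """Return (infoType, likelihood, quote) tuples for each finding."""
    findings = response.get("result", {}).get("findings", [])
    return [
        (f["infoType"]["name"], f["likelihood"], f.get("quote"))
        for f in findings
    ]


for info_type, likelihood, quote in summarize(sample_response):
    print(f"{likelihood} match for {info_type}: {quote}")
```

The quote field is only present when the request asked for it, which is why the helper reads it with get.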
All right, having used the inspect endpoint to find the sensitive data, let's try and de-identify something with the de-identify endpoint. Once again, I'm just gonna use the JSON found in the documentation, so you can grab this documentation, you can test it for yourself, and I'll paste that here, and I will type in my project as a parent resource.
Okay. So, this is going to inspect the unstructured text, which says My email is firstname.lastname@example.org. And if it finds an email address, it will replace it with the name of the infoType. So, running this shows the de-identified results, My email is, and then we have our infoType in brackets. The rest of these details here in the Overview section are useful for knowing exactly what has changed.
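A de-identify request of this kind could be sketched in Python as shown below. It is an approximation of the documentation example, not a copy of it. The project ID is a placeholder, and the snippet only builds the request body without sending it.

```python
import json

PROJECT_ID = "my-project"  # hypothetical project ID (the parent resource)

# REST endpoint behind dlp.projects.content.deidentify.
url = f"https://dlp.googleapis.com/v2/projects/{PROJECT_ID}/content:deidentify"

request_body = {
    "item": {"value": "My email is firstname.lastname@example.org"},
    # What to look for.
    "inspectConfig": {"infoTypes": [{"name": "EMAIL_ADDRESS"}]},
    # What to do with each match: replace it with the infoType name.
    "deidentifyConfig": {
        "infoTypeTransformations": {
            "transformations": [
                {
                    "infoTypes": [{"name": "EMAIL_ADDRESS"}],
                    "primitiveTransformation": {
                        "replaceWithInfoTypeConfig": {}
                    },
                }
            ]
        }
    },
}

print("POST", url)
print(json.dumps(request_body, indent=2))
```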
I mentioned previously there are multiple types of transformations. The one we just tested replaces the match with the name of the infoType. There are other options. We could do things such as mask with certain characters, so picture maybe a Social Security Number where we wanna mask out the last four. This example here will find an email address, replace all of the characters except for the at symbol and a period, and it's gonna replace them with a pound sign. And here it is.
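Here is a hedged sketch of what a masking transformation like that might look like, along with a tiny local function that imitates its effect. The mask_transformation dictionary uses the characterMaskConfig field names from the DLP v2 REST API as I understand them. The mask_locally function is a hypothetical stand-in that runs on your machine, not the DLP service, and the email address is made up.

```python
# Transformation fragment: mask every character with "#",
# but leave the "." and "@" characters alone.
mask_transformation = {
    "primitiveTransformation": {
        "characterMaskConfig": {
            "maskingCharacter": "#",
            "charactersToIgnore": [{"charactersToSkip": ".@"}],
        }
    }
}


def mask_locally(value, masking_character="#", skip=".@"):
    """Local imitation of the masking effect; not the DLP service."""
    return "".join(
        ch if ch in skip else masking_character for ch in value
    )


print(mask_locally("jane.doe@example.com"))  # ####.###@#######.###
```

The fragment would take the place of the replaceWithInfoTypeConfig transformation in a de-identify request.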
So, when it comes to actually de-identifying, check out the different types of transformations and see if one of those fits your use case. I mentioned before that DLP is a standalone API, which is true, though it does integrate with some of the Google Cloud Platform storage services, such as Cloud Storage, BigQuery, and Cloud Datastore.
So, this means we'll be able to use DLP to scan through the data in these different services and either find sensitive data or maybe redact it. Scanning through these different services is not the same as scanning through text because DLP needs to know which service we want to scan, what data in that service we want to scan, et cetera.
So, because there's all this additional configuration information involved in scanning these integrated services, DLP abstracts away all of those details through the concept of a job. There are two types of jobs. There are inspection and risk analysis jobs. Inspection jobs inspect the data for specified infoTypes. And risk analysis jobs scan de-identified data and try and determine if it can be re-identified through several different algorithms. And jobs can be run either immediately or scheduled to be run as recurring by the use of a job trigger.
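To give a feel for how a recurring inspection job might be expressed, the sketch below builds a job trigger request body in Python. The field names follow the DLP v2 REST API as I understand it. The project ID, bucket name, and daily schedule are assumptions chosen for illustration, and nothing is sent to the service.

```python
import json

PROJECT_ID = "my-project"  # hypothetical project ID

# REST endpoint for creating job triggers.
url = f"https://dlp.googleapis.com/v2/projects/{PROJECT_ID}/jobTriggers"

# The inspection job the trigger will run: what to scan and what to look for.
inspect_job = {
    "storageConfig": {
        "cloudStorageOptions": {
            "fileSet": {"url": "gs://example-bucket/*"}  # hypothetical bucket
        }
    },
    "inspectConfig": {"infoTypes": [{"name": "PHONE_NUMBER"}]},
}

ONE_DAY_SECONDS = 24 * 60 * 60

trigger_body = {
    "jobTrigger": {
        "inspectJob": inspect_job,
        # Run the job once per day (an assumed schedule).
        "triggers": [
            {"schedule": {"recurrencePeriodDuration": f"{ONE_DAY_SECONDS}s"}}
        ],
        "status": "HEALTHY",
    }
}

print("POST", url)
print(json.dumps(trigger_body, indent=2))
```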
Let's test this out by scanning files inside of a Cloud Storage bucket. I have two files, one is a text file, the other is an image, and they both display the same fake information.
I'm on the DLP page, which is accessible by clicking on Security in the main menu and selecting DLP. Clicking on Create at the top here and selecting Job and job trigger will open up a basic form.
So, we start off with a job ID. We use a job ID to look up the status of our job via the API. Selecting the Cloud Storage option here will open up some contextual fields. Since the bucket we're scanning is in this project, I'm going to use the include/exclude location type, which will allow me to select from the user interface.
This slider here allows for a percentage of objects to scan. Now, this is really useful if you're dealing with a massive amount of objects. Maybe you just wanna take a sampling and see if there happens to be in a random sampling some sensitive data. In our case, there are two files, so there isn't a whole lot to explore. We're gonna set this to 100%, and that way both files actually get scanned.
And I'm also going to disable this sampling here. This determines how much of the file is actually going to be inspected. Because we have such small files here, we can actually review them all. If these were very large files, we might wanna reconsider and just do a sampling, but for now, in this demo, we'll just do the whole thing. I'm gonna leave this file type set to All supported files, though notice you can set specific types.
This part's cool. We can actually use a regex pattern to include or exclude files and folders. So, each of these services has their own sets of options. If you wanted to scan BigQuery or Datastore, the options might look different. Just know, regardless of which integrated service you're going to scan, you use the concept of a job.
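The Cloud Storage settings from the form map roughly onto a storage configuration like the one sketched below in Python. The field names are from the DLP v2 REST API as I understand it. The bucket name and the regex patterns are hypothetical. Leaving out fileTypes is meant to mirror the All supported files choice.

```python
import json

storage_config = {
    "cloudStorageOptions": {
        "fileSet": {
            # Include or exclude objects by regular expression.
            "regexFileSet": {
                "bucketName": "example-bucket",        # hypothetical bucket
                "includeRegex": [r"reports/.*"],       # hypothetical pattern
                "excludeRegex": [r"reports/archive/.*"],  # hypothetical pattern
            }
        },
        # Scan 100% of the matching objects (the slider in the form).
        "filesLimitPercent": 100,
        # Inspect 100% of each file, that is, no per-file sampling.
        "bytesLimitPerFilePercent": 100,
        # "fileTypes" is omitted so that all supported file types are scanned.
    }
}

print(json.dumps(storage_config, indent=2))
```

For very large buckets or files, lowering the two percentage values is how you would trade coverage for cost.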
So, next up, we need to select the infoTypes that we actually want to detect. Okay, after clearing all of this out, we can set the infoTypes of PERSON_NAME and PHONE_NUMBER. And we don't need to create any infoType, so I'm gonna leave that setting alone.
The inspection rules here help to provide some context about our data. There are two types of inspection rules, exclusions and hot words. You use an exclusion to prevent a match based on either a value in a dictionary or through some sort of regular expression.
So, for example, imagine scanning text files, which happen to include the phone number of your company. So, here you could use an exclude rule that tells DLP not to consider that an actual match for phone number. The other type of rule is a hot word, which allows you to adjust the likelihood of a match based on its proximity to certain hot words.
For example, imagine you're scanning for a date of birth. Now, that looks a lot like any other date. However, maybe the likelihood of it being a date of birth increases if it's near the words born, or maybe near the abbreviation DOB. In this case, you could add hot words for DOB or born within 30 characters. And if you find a date near one of these terms, you adjust the likelihood that this is a match. The confidence threshold determines the minimum likelihood to return. The actions section determines what to do, if anything, after a job completes successfully. So, these are all pretty self-descriptive.
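The exclusion and hot word rules described above could be written as a rule set along these lines. This Python sketch uses field names from the DLP v2 REST API as I understand it. The company phone number, the hot word pattern, and the choice to raise matches to VERY_LIKELY are assumptions made for the example.

```python
import json

inspect_config = {
    "infoTypes": [{"name": "PHONE_NUMBER"}, {"name": "DATE_OF_BIRTH"}],
    # Only return findings at or above this confidence threshold.
    "minLikelihood": "POSSIBLE",
    "ruleSet": [
        {
            # Exclusion: ignore the company's own phone number.
            "infoTypes": [{"name": "PHONE_NUMBER"}],
            "rules": [
                {
                    "exclusionRule": {
                        "dictionary": {
                            "wordList": {"words": ["(206) 555-0100"]}  # made up
                        },
                        "matchingType": "MATCHING_TYPE_FULL_MATCH",
                    }
                }
            ],
        },
        {
            # Hot word: a date within 30 characters after "born" or "DOB"
            # is treated as a much more likely date of birth.
            "infoTypes": [{"name": "DATE_OF_BIRTH"}],
            "rules": [
                {
                    "hotwordRule": {
                        "hotwordRegex": {"pattern": r"(?i)(born|DOB)"},
                        "proximity": {"windowBefore": 30},
                        "likelihoodAdjustment": {
                            "fixedLikelihood": "VERY_LIKELY"
                        },
                    }
                }
            ],
        },
    ],
}

print(json.dumps(inspect_config, indent=2))
```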
The Save to BigQuery action does just what you think. It allows us to query the results of our scans for more detail in BigQuery.
Let's test that out. I've already created a BigQuery table for these scan results, so I'm just going to save the data there. And I do also want to include the matched data. So, for that, I'll select the Include quote option, and I'll configure these settings to point to the scan_results table.
Right, this next section allows the job to be run immediately or scheduled using a job trigger. The review section displays the JSON that is gonna be sent to the create endpoint which we could also use to create a job through the API. Let's check out the results in API Explorer. Though, in order to do that, we need the job ID, which we can just copy right here from the Detail page. This name field requires a fully qualified resource name.
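As an approximation of the JSON shown in the review section, the Python sketch below assembles a request body for creating an inspection job that saves its findings to BigQuery. The field names follow the DLP v2 REST API as I understand it. The project ID, job ID, bucket name, and dataset name are placeholders. Only the scan_results table name comes from this demo. The snippet builds the request but does not send it.

```python
import json

PROJECT_ID = "my-project"  # hypothetical project ID

# REST endpoint for creating jobs.
create_url = f"https://dlp.googleapis.com/v2/projects/{PROJECT_ID}/dlpJobs"

job_body = {
    "jobId": "storage-scan-demo",  # hypothetical job ID
    "inspectJob": {
        "storageConfig": {
            "cloudStorageOptions": {
                "fileSet": {"url": "gs://example-bucket/*"},  # hypothetical
                "filesLimitPercent": 100,
            }
        },
        "inspectConfig": {
            "infoTypes": [{"name": "PERSON_NAME"}, {"name": "PHONE_NUMBER"}],
            # Include the matched data in the saved findings.
            "includeQuote": True,
        },
        # After the job completes, save the findings to a BigQuery table.
        "actions": [
            {
                "saveFindings": {
                    "outputConfig": {
                        "table": {
                            "projectId": PROJECT_ID,
                            "datasetId": "dlp_demo",  # hypothetical dataset
                            "tableId": "scan_results",
                        }
                    }
                }
            }
        ],
    },
}

print("POST", create_url)
print(json.dumps(job_body, indent=2))
```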
Clicking Execute requires that I first authenticate. And here are the results. So, it shows that there are two matches for each infoType, which is what we expect, because there are two files with the same data. We have the image and we have the text. Back on the Job details page, you can see that the job is complete, and it found our two matches.
If we head over to BigQuery, we can see the results. And I already have this query set here, so I'm just going to run it, and here are our results, which shows the matched info, the likelihood, the infoType, et cetera. You may have noticed I have doubled the results here. I ran this off camera to test it out just to make sure everything is working, though you won't see that unless you also run it multiple times.
All right, so that's the basic functionality. Before you use DLP for yourself, I want you to keep in mind the cost. The pricing model for DLP is based on usage, though it's really not the most intuitive model, and it can get very expensive.
Regarding the pricing model, there are basically two units of measure, which are inspection units and transformation units. Charges are based on these two units. And reading directly from the pricing documentation, it says inspection units are based on the total number of bytes inspected and the number of predefined or custom infoTypes used for inspection. Now, I don't find that all that helpful on its own. So, if you're considering using DLP, I highly recommend you review the pricing page, get a bit more context, and use a pricing calculator first as well, just to make sure you're not getting charged more than you anticipated.
All right, let's wrap up here and summarize what we've covered so far. At the start, I said the learning objectives were that you would be able to describe Cloud Data Loss Prevention, list the supported data sources and types, explain the concept of an information type, and explain the concept of a job. So, I would describe it as a find and replace for sensitive data, which, as I mentioned before, is a bit reductive, though it is also accurate. So, that's my description.
All right, on the screen, I have here several data types and sources. How many of these do you recognize from the lesson? DLP is able to scan for structured and unstructured text. It's able to scan images. It also integrates with Cloud Storage, BigQuery, and Cloud Datastore.
Information types. These are an important concept in DLP because they represent a type of sensitive data. DLP provides over 90 built-in types and allows for custom types. InfoTypes are the types of sensitive data that you're searching for and probably hoping not to find, unless maybe you're using this for nefarious purposes, in which case you're hoping to find them.
Jobs represent scanning one of the supported GCP storage services. Jobs can be created and immediately run. They can also be scheduled with a job trigger.
All right, that will do it for this lesson. I hope this was a helpful intro to Cloud DLP. And I'll see you in the next lesson.
Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.