One of the hardest parts of data analysis is data cleanup. In this course, you will learn how you can use Cloud Dataprep to easily explore, clean, and prepare your data.
Learning objectives:
- What Cloud Dataprep can do
- The differences between the available editions
- How to import a dataset
- How to create a recipe
- How to create and execute a flow

Intended audience:
- GCP Data Scientists
- GCP Data Engineers
- Anyone preparing for a Google Cloud certification (such as the Professional Data Engineer exam)

Prerequisites:
- Access to a Google Cloud Platform account is recommended
At this point you should understand the basics behind Cloud Dataprep. It’s now time to give you a practical demonstration. First, I will show you how to enable Cloud Dataprep for a GCP project. Next, I will show you how to create a flow, by importing data and creating a few recipes. At the end, I will show you how to save the transformed data to a CSV file. So, let’s get started.
The first step is to log in to the Google Cloud Console. You should use an administrator account with the ability to make changes to a project. In this case, you can see that I already created a new project specifically for using Dataprep.
So now let’s go to the Dataprep tool to begin setup. You can find the link in the side menu, or you can search for Dataprep like I am. Now, if this is your first time using the tool, you are going to have to go through a fair number of steps. Mostly this is going to be a bunch of checkboxes you have to click to accept the various terms of service. Ideally, you should carefully read everything you are agreeing to.
This is all because the Dataprep tool is actually operated by a third party, Trifacta, and hosted outside of Google. So this is Google’s way of making you aware that a third party is involved. Now, Dataprep does not actually store or process your data, so this shouldn’t be too big of a concern for most. Dataprep will read samples of your data so you can visually explore it, but it doesn’t save that data. So, in theory, a Trifacta data breach should not compromise any of your data. All data is actually processed by Google tools on GCP.
So, you do need to grant Trifacta access to your project. That’s why it’s probably a good idea to create a separate project specifically for using Dataprep. That way, if something bad did happen, the damage would be limited to that one project.
So next you need to confirm the Google account you want to use. I just have the one, so it’s an easy choice here.
Here’s another confirmation screen. Just keep clicking, it will be over soon. I promise.
Ok, so at this point we see the URL has changed to a different domain. Now we are technically outside of Google. So, before we can start using the tool, we have to select the edition we want to use. Trifacta has made this easy, because we aren’t actually given a choice. They basically force you to take a 30-day free trial of the Professional edition. After 30 days, you need to decide which version you are willing to pay for.
For now, we will accept the default, mostly because it does not give us any other choice.
I bet you thought we were done with accepting terms. Nope.
Ok, we are almost done with setup. We have one last choice to make. Dataprep isn’t going to hold or store our data. Instead, we will keep everything on GCP. So here we need to select a default location for storage. This means when we upload a CSV file, it has a secure place to save it to. It will create a default location for you, so I’m just going to accept that. If you wish to specify something else, feel free to do so.
You also want to confirm that the correct project is specified here as well. It should have used the one you had active on GCP, but double check to make sure.
Now we are finally done with setup. You don’t need to go through all of that on subsequent visits. From now on, when you go to Dataprep it should take you right here.
From here we can start importing data. I guess the assumption is that since this is your first time, you want to immediately start uploading data. Like most tools, there are several different ways of accomplishing the same task. I’m actually going to leave this screen, and we will start on the flow screen instead. So let me take you there.
The flow screen will show you all of your existing flows. You will have a few examples already created for you. But I’m going to show you how to create a new one from scratch. Let me click on the Create button to start. And then I’ll choose to start with a Blank flow.
Despite the name, the flow isn’t literally blank. We are provided placeholders for a very simple, basic flow. We have a step for importing data, a step for transforming our data, and a step for saving our data. So we have an imported dataset that feeds into a recipe, which ends up in an output. This is about as simple as you can get.
So we can click on each step, fill out a little information, and create our very first data pipeline!
I will start with importing my data. Now I have already grabbed a public dataset of IMDb names. This is a collection of actors, along with some basic information about them.
So let me name this flow Actor Details, because that is what I am going to be looking at. We will import the data as a CSV, clean it up a bit to remove any issues, and then save it as a new CSV. So here I will upload my data. I can just drag and drop the file. Now I just have to wait for the upload to finish.
When the upload has finished I get a small preview. This allows me to verify that I uploaded the correct file. I see a few famous actors and actresses, so it’s looking good so far. I’ll rename this step and then confirm adding the import to the flow.
We defined our import and now we just need to create a recipe and configure the output. If I click on the dataset you can see the details. Again, we get a preview of the data. We can also see the location of where the data was uploaded and some other details.
We technically do have a recipe here, it is just empty. Right now it won’t actually make any changes to our data because there are no defined steps.
The output is currently using our default bucket. So I could actually run this flow now; it would just import the data, do nothing, and then save a new CSV to the default destination. I’m going to keep the default, but you can specify a different bucket or a different filename if you wish. Also note that there is a button for running the job now, and there is an option for creating a scheduled job. So you can run this in an ad hoc fashion as needed, or you can set up a recurring weekly or monthly job. You can also have separate destinations for each, so your ad hoc job results don’t have to overwrite your scheduled job’s results, or vice versa.
We are currently using the Professional edition, so remember that you might have more or fewer options available if you end up choosing another edition later.
Now we are 66% done. The last step is just to create a recipe. Luckily, this is where things start to get more interesting.
I’ll click on the recipe and then click on the Edit Recipe button. Normally, the first thing you will see when editing a recipe is a preview of your dataset. Well, it looks like you also get a little forced tutorial on your first time through. Let me skip through this really quickly. You may want to actually read all these pop-ups when you try it yourself, but I am trying to keep this demo short.
So now I can see my columns which include things like a unique ID, some name fields, a height field, and a bunch of other stuff. Dataprep has already done a bunch of work on this data. If you look at the column names, you will see they have all been categorized.
The ID field is currently set to string. It knew this by looking at a bunch of samples. The values were not numeric and they weren’t recognizable as something else like dates. So, it picked string. It also picked string for the name fields as well. This is exactly what you would expect.
Now for the height field, it picked integer. This is because it noticed that all the values were whole numbers. I assume this height field is in centimeters. So Dataprep is smart, and it can automatically identify data types. It seems to be pretty good at guessing correctly, but sometimes it can be wrong.
You do have the option to change the data types if you want. Let’s say I actually wanted to support heights such as 176.5 centimeters. I can easily do that by changing the type to a Decimal. Now I can add fractional numbers as well. If Dataprep guesses wrong, you can easily fix that here. Let me set this back to integer.
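Dataprep handles this type change entirely through its UI, but if you prefer to think in code, roughly the equivalent operation in pandas would be casting a column’s dtype. This is just an illustrative sketch; the column name and sample values are hypothetical:

```python
import pandas as pd

# A tiny hypothetical sample; "height" stands in for the real column.
df = pd.DataFrame({"height": [176, 182, 165]})

# Cast the integer column to a float (a "Decimal" in Dataprep terms)
# so fractional heights like 176.5 could be stored.
df["height"] = df["height"].astype(float)

# Casting back to integer mirrors undoing the change in the recipe.
df["height"] = df["height"].astype(int)
```

The key point is the same in both tools: the underlying values are unchanged, only their declared type is.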
We have a number of other fields and they have all been assigned data types.
The other thing Dataprep has already done is to scan the fields to determine if there are any issues with the data. So for fields with all green above them, no problems were detected. However, sometimes you will see some grey. This represents fields with missing or null values in them. In this case, some of the actors do not have any height specified. Now this is pretty common. You will often import records with some blank or null values. Depending upon what you need this might be an issue, but usually not. If my final analysis relies upon needing a height, then I probably should do something about the records with missing values.
We can see there are other fields with the same issue. The bio field has some empty values as well. Birth details too.
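The grey segments Dataprep draws above each column are just per-column missing-value counts. If you wanted to profile a dataset the same way in code, a minimal pandas sketch (with hypothetical rows mirroring the IMDb names data) could look like this:

```python
import pandas as pd

# Hypothetical sample rows; None marks a missing value.
df = pd.DataFrame({
    "name": ["Fred Astaire", "Lauren Bacall", "Brigitte Bardot"],
    "height": [177, None, 166],
    "bio": ["Dancer and actor", None, None],
})

# Count missing values per column -- the equivalent of the grey
# segments Dataprep shows in its data-quality bar.
missing = df.isna().sum()
print(missing)
```

Columns with a count of zero would be all green in Dataprep’s view.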
On the date of birth field, we see another color: red. Now this represents values that are mismatched or do not correspond to the set field data type. Date of birth has been classified as a date field. Which is exactly what I would expect. However, some of the records have a non-date value in this field. If we want to see more details, you can simply click on the red section.
Now we are provided some suggestions:
- We can delete the mismatched rows. That would throw away the entire record.
- We can keep the rows and ignore the problem.
- We can keep the rows and create a new field that will tell us if the value is a valid one.
- Or we can set any mismatched values to null. So we throw away the bad field, but keep the rest of the record intact.
Now, these are not your only options. These are just the ones Dataprep suggests. You can do something else entirely if you want. But often one of these suggestions will be a perfectly valid solution.
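All four of Dataprep’s suggestions boil down to common patterns you may recognize from code. A rough pandas sketch of each option, using hypothetical sample values where "1924" is a year with no month or day:

```python
import pandas as pd

# Hypothetical sample: one "date_of_birth" value is not a full date.
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "date_of_birth": ["1899-05-10", "1924", "1934-09-28"],
})

# errors="coerce" turns anything that doesn't match the format into NaT,
# which lets us reproduce each of Dataprep's suggestions.
parsed = pd.to_datetime(df["date_of_birth"], format="%Y-%m-%d", errors="coerce")

# 1. Delete the mismatched rows entirely.
dropped = df[parsed.notna()]

# 2. Keep the rows and ignore the problem (do nothing).

# 3. Keep the rows but add a flag column marking which values are valid.
df["dob_is_valid"] = parsed.notna()

# 4. Set mismatched values to null while keeping the rest of the record.
df["date_of_birth"] = parsed
```

Which option is right depends entirely on what your downstream analysis needs, which is exactly the decision the demo walks through next.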
To help me decide what to do, I want to see what some of the bad records actually look like. Maybe I can fix the data. You can view examples of the mismatched data by clicking on “Show only affected rows”.
Now I can see that it looks like we don’t have valid birthdays for certain actors. Sometimes we only have the year, but not the month and day. And it looks like some of the fields have the birth location in there as well. So I’m probably not going to be able to fix this. We simply don’t know some people’s exact birthdays.
Now let’s say my analysis requires having an actor’s age. And I was planning on calculating this by looking at “date of birth” and “date of death”. So, if I don’t have a valid date for either, then I want to throw the entire record away. So that’s what I will do. I’ll just delete the entire row.
By choosing a transformation, I have added a step to my recipe. Actually, this is the third step I’ve added so far. A little while back, I changed height from integer to Decimal, and then I changed it back to integer again. Those actually created recipe steps as well. So you can see that I currently have a recipe of three steps. You can add steps, edit steps, or delete them.
My first two steps are currently pointless. So let me use them to demonstrate how to delete a step. Just highlight the step, click on the dot-dot-dot menu, and select delete. So if you ever create a step by accident, you can get rid of it.
So this is how you build a recipe. You look through your data, figure out what changes are needed and Dataprep will build the steps needed.
So I’ve fixed “date of birth”. Next I am going to fix the “date of death” field as well. It looks like we have the same problem here. So the solution should be the same as well.
It looks like the rest of the fields are either green or grey, so I’m happy with the results. I have created a simple, two step recipe that will remove any records with invalid data in either the “date of birth” or “date of death” fields.
At this point I am going to return to my flow by clicking on the name. So I have all three parts completed. I have defined a dataset to import. I have created a recipe to transform my dataset. And I have an output defined to save the final dataset. If I wanted, I could go ahead and run this job.
I’ll click on the Run button and then do a final review of all the settings. It will tell you where it will save the output and what format it will be in. If I clicked the Run button now, it would build and run a Dataflow job. Now, this would actually take a while to complete, and the result would look basically the same as my input, minus a few fields. So I won’t bother actually running it. But this is how you use Dataprep.
Now this is one of the simplest use cases. You can do a lot more. For example, I can add another dataset. You can see there is a button for doing that here. You can import several datasets to merge the records into one big dataset. Or you might want to join two datasets to get a wider array of fields.
You can also add more recipes. Let’s say I wanted to add a second one. You can easily do that by clicking on this Plus button here. Now I have added a branch to my flow. So I am going to clean up the birth and death fields, and save that out to a file. But then I will continue to process the data and do some more steps as well.
Now I could also save the output of this second recipe, or I could just share the resulting data with another flow. So let me add a reference dataset here. This will allow me to create a new flow that will start out with this modified data. I don’t need to export the data to a file and then import it in the new flow. This reference dataset will be available to be used in any future flows I create. Let me name my new recipe.
And if I want to add this dataset to a new flow, it’s as simple as clicking on the Add to Flow button here.
I also wanted to show you a few more advanced transformations you can do. So let me edit the steps for my new recipe.
So recall that the height field seems to be in centimeters. Now, I am in America, and we typically measure height in feet. So let’s say I wanted to convert this number from centimeters to feet. This is easily accomplished by using a custom formula. So let me select “calculate” and then pick “custom formula”. Now I just need a formula that divides the number by 2.54, since there are 2.54 centimeters in an inch; that will get me to inches as a first step. So let me search for the divide function. And now I need to specify what I am dividing and what I will divide it by. So I’ll pick the height field first, and then enter 2.54.
So now I get a preview of what the new column will look like. On the left we have centimeters, and on the right, inches. I can name this new column “height in inches”. Inches is better, but we actually typically use feet. Since there are 12 inches in a foot, I can simply multiply the divisor by 12 (2.54 times 12 is 30.48) and then I get the height in feet. So let me change the name to reflect that.
Now I have a new step in my recipe to create a new column based off of a custom formula. So that I don’t get confused, I’ll rename the height field to “height in cm”. And now my recipe has two steps.
Now this looks really good, but there’s one problem: I don’t need that many decimal places. I think going down to a single decimal place is enough. I can understand someone being 5.5, or five and a half, feet tall. So I will add one more step and format the number to limit it to one decimal place.
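For reference, the whole height conversion, divide by 2.54 centimeters per inch, then by 12 inches per foot (so 30.48 overall), then round to one decimal place, could be sketched in pandas like this. The column names and sample values here are hypothetical:

```python
import pandas as pd

# Hypothetical heights in centimeters.
df = pd.DataFrame({"height_in_cm": [177, 166, 183]})

# 2.54 cm per inch times 12 inches per foot = 30.48 cm per foot,
# the same combined divisor the recipe ends up with.
# Round to one decimal place, matching the formatting step.
df["height_in_feet"] = (df["height_in_cm"] / 30.48).round(1)
```

This mirrors the recipe exactly: one derived column, one divisor, one rounding step.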
Ok, now I am happy. I can much more easily understand height in terms of feet. But I also still have the old field in centimeters as well.
There will be certain fields that you simply won’t need. For example, this bio column will not be used in my later analysis. So I can easily get rid of it by deleting the column. There are other fields I won’t need as well. I can dump the birth details column, since it appears to just be a combination of birth date and birth location. The same is true for the death details column.
I am also going to get rid of the spouse and children columns as well. I just want to pare my information down to what I actually need and will use.
There is one last thing I want to do. Let’s say that currently the location of birth contains more information than I need. Maybe I only care about the country, and not the city or region. I can modify this data by doing a replace.
So first, let me rename the column “country of birth”. And now I can use the replace function to get rid of everything other than the country. I’ll use a simple regex to do the replacement. Basically, everything up to and including the last comma can be trimmed. So I’ll write a regex that matches everything up to and including the last comma and a single space, and then replace that with nothing. That should give me just the countries.
Well it mostly worked. It looks like the countries have a double quote after them. But that should be easy to fix. Let me scroll down and see if I can spot any other issues.
Oh wait, I found a problem. It looks like some entries have square brackets in them to explain countries or cities that have been renamed. I’ll need to replace those first.
So let me match on square brackets and replace those with an empty string. Ok, it looks like that worked.
Now let me try to filter out everything before the comma and single space. That looks good. I am just seeing countries now.
I still need to take care of that final double quote so let me add one final replacement for that. Ok, perfect. Country of birth now looks exactly like I want.
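The three replacements above (strip the bracketed notes, trim everything up to the last comma and space, and drop the stray quote) translate directly into standard regex substitutions. Here is a rough Python sketch; the sample values are hypothetical stand-ins for the real field:

```python
import re

# Hypothetical raw values resembling the birth-location field,
# including the stray trailing double quote.
locations = [
    'Omaha, Nebraska, USA"',
    'Danzig, Germany [now Gdansk, Poland]"',
]

countries = []
for loc in locations:
    loc = re.sub(r"\[[^\]]*\]", "", loc)  # drop bracketed renaming notes
    loc = re.sub(r".*, ", "", loc)        # greedy .* trims up to the LAST comma+space
    loc = loc.replace('"', "").strip()    # remove the stray quote and leftover spaces
    countries.append(loc)

print(countries)
```

Note that the greedy `.*` is what makes the second substitution trim up to the last comma rather than the first, which is exactly the behavior the recipe relies on.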
So now I need to do the same for “country of death”. Now, I could reproduce all the same steps in exactly the same way. Or, I can select the steps I created for “country of birth” and duplicate them. Then I just need to modify the duplicated steps to change “country of birth” to “country of death”. So you can see how recipes are really useful. You can copy and paste the steps. You can reuse whole recipes or just parts of them.
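Duplicating recipe steps is essentially the same idea as factoring a transformation into a reusable function and applying it to more than one column. A minimal sketch in plain Python, with hypothetical column names and values:

```python
import re

# The same cleanup, factored into one function so it can be reused
# for both columns -- analogous to duplicating recipe steps.
def extract_country(value):
    value = re.sub(r"\[[^\]]*\]", "", value)  # drop bracketed notes
    value = re.sub(r".*, ", "", value)        # keep only text after the last comma
    return value.replace('"', "").strip()     # drop stray quotes and spaces

row = {
    "country_of_birth": 'Omaha, Nebraska, USA"',
    "country_of_death": 'Los Angeles, California, USA"',
}
cleaned = {col: extract_country(val) for col, val in row.items()}
```

Write the steps once, apply them wherever they are needed; that is the reuse that makes recipes so valuable.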
And now my data looks exactly the way I want it to. So I think that gives you a good idea of how to get started using Dataprep. This is just a fraction of what it can do. You have many more advanced options available, so when you get a chance, you should try it out yourself.
Daniel began his career as a Software Engineer, focusing mostly on web and mobile development. After twenty years of dealing with insufficient training and fragmented documentation, he decided to use his extensive experience to help the next generation of engineers.
Daniel has spent his most recent years designing and running technical classes for both Amazon and Microsoft. Today at Cloud Academy, he is working on building out an extensive Google Cloud training library.
When he isn’t working or tinkering in his home lab, Daniel enjoys BBQing, target shooting, and watching classic movies.