One of the hardest parts of data analysis is data cleanup. In this course, you will learn how you can use Cloud Dataprep to easily explore, clean, and prepare your data.
Learning Objectives
- What Cloud Dataprep can do
- Differences between editions
- How to import a dataset
- How to create a recipe
- How to create and execute a flow
Intended Audience
- GCP Data Scientists
- GCP Data Engineers
- Anyone preparing for a Google Cloud certification (such as the Professional Data Engineer exam)
Prerequisites
- Access to a Google Cloud Platform account is recommended
Before you can actually start using Cloud Dataprep, you need to get familiar with a few basic concepts. In this lesson, I am going to describe the most commonly used components of Dataprep.
First, let’s look at the typical workflow. Here is a chart to demonstrate.
There are three main steps: Ingestion, Preparation, and Analysis. Dataprep will help you cover the first two. So, first you need to ingest data from one or more sources. This will be the raw data you want to use in your final analysis. This data might be stored in a database, in CSV files, or somewhere else entirely, such as a Google Sheets spreadsheet.
After loading your data, you will then need to clean it up. This is accomplished by performing sets of transformations on it. Dataprep allows you to define the transformations, and then it handles the execution by using Dataflow or BigQuery under the hood. You won’t need to manage those systems yourself, or write any code.
And then finally, the resulting data will be stored either in a BigQuery table or as files in a Cloud Storage bucket. From here, you can conduct your analysis using any number of tools.
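For example, if your flow publishes its results to a BigQuery table, the analysis step could be as simple as querying that table with the BigQuery client library. The sketch below is only meant to illustrate that last step; the project, dataset, table, and column names are made-up placeholders, and none of this code is required by Dataprep itself.

```python
# Minimal sketch of the Analysis step, assuming the flow's output was
# published to a BigQuery table. The project, dataset, and table names
# ("my-project.analytics.cleaned_orders") are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
    SELECT customer_id, SUM(order_total) AS total_spent
    FROM `my-project.analytics.cleaned_orders`
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
"""

# Run the query and print the top customers from the cleaned data.
for row in client.query(query).result():
    print(row.customer_id, row.total_spent)
```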
In order to make the Ingestion and Preparation steps as easy as possible, Dataprep has defined three main components: Flows, Datasets, and Recipes.
A flow defines the relationship between your imported data and your transformations. A flow acts like a blueprint for your data pipeline. You specify the data to import, the groups of transformations to execute, and where and how to store the results. A flow also acts as a container that holds both datasets and recipes.
The simplest example of a flow would contain three objects:
- A dataset (which would be your imported data)
- A recipe (which is a list of transformations)
- An output (which saves the modified dataset)
Flows can be as simple or as complicated as you need. They can contain many different datasets and have many different steps of transformations. Flows can also be chained together.
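To make these relationships concrete, here is a small conceptual sketch in Python. It models the three objects of a minimal flow as plain data structures. This is not Dataprep’s API, and the names and paths are made up purely for illustration.

```python
# Conceptual model only -- NOT the Dataprep API. A flow groups an imported
# dataset (a pointer to the source data), a recipe (an ordered list of
# transformation steps), and an output (where and how to publish results).
from dataclasses import dataclass


@dataclass
class ImportedDataset:
    source_uri: str           # e.g. a CSV file in Cloud Storage; no data is copied


@dataclass
class Recipe:
    steps: list               # ordered list of transformation steps


@dataclass
class Output:
    destination: str          # e.g. a BigQuery table or Cloud Storage path
    file_format: str = "csv"  # format to publish the results in


@dataclass
class Flow:
    dataset: ImportedDataset
    recipe: Recipe
    output: Output


# The simplest possible flow: one dataset, one recipe, one output.
flow = Flow(
    dataset=ImportedDataset("gs://my-bucket/raw/orders.csv"),
    recipe=Recipe(steps=["remove empty rows", "trim whitespace", "parse dates"]),
    output=Output(destination="my-project.analytics.cleaned_orders"),
)
print(flow)
```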
A dataset represents a group of data you will be working with.
Data imported into Dataprep is called an imported dataset. This dataset type represents the original, unchanged data that exists inside a file or database. Imported dataset objects do NOT actually contain any data. They are simply a reference or pointer to the original data. Dataprep will read samples of your data for previewing purposes, but the actual data is never copied onto the Dataprep platform. This means you cannot modify or save data within Dataprep itself.
After importing your data, you will then perform some transformations on it. These modified datasets are called wrangled datasets. A wrangled dataset is basically just an imported dataset with some added instructions for modification. Just like imported datasets, wrangled datasets do not contain any data.
After you have finished modifying a dataset, usually you will want to save it somewhere. You accomplish this by creating an output. An output defines one or more publishing destinations for a wrangled dataset. They determine the format of the data, as well as the location.
I mentioned previously that you can chain flows together. This is accomplished by using a reference dataset. A reference dataset is a wrangled dataset from one flow that can be directly shared with other flows. There is no need to manually save it to an external database or file. This means you can avoid having to re-import your data, and it will save on storage costs as well.
These four types of datasets allow you to easily access, store and share your data as needed. You ingest external data into an imported dataset. You then make some changes to produce a wrangled dataset. And finally, you save your results using an output, or share them using a reference dataset.
In order to transform your datasets, you need to define a series of transformations. These transformations are broken down into steps and stored in objects called recipes. In traditional cooking, you use recipes to turn raw ingredients into edible meals. In the same way, Dataprep recipes allow you to transform your raw data into an easily consumable format.
Each recipe needs a dataset to work on. This can either be an imported dataset or a wrangled dataset. The recipe also contains an ordered list of operations to perform on the data. The result will be a new wrangled dataset.
Recipes are defined this way in order to be reusable. You can apply the same recipe on multiple datasets. Also, you can chain multiple recipes together. So, a flow could pass a dataset into recipe 1 and then take the result and pass it into recipe 2.
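As a rough analogy, the sketch below models a recipe as an ordered list of transformation steps and shows two recipes chained together, with the result of the first passed into the second. It uses pandas purely as a local stand-in, and the column names and steps are hypothetical; in practice you define these steps in the Dataprep UI and they are executed on Dataflow or BigQuery, with no code required.

```python
# Illustration only: a "recipe" as an ordered list of functions applied to a
# DataFrame, with pandas standing in for the transformations Dataprep would
# run on Dataflow or BigQuery. Column names and steps are hypothetical.
import pandas as pd

recipe_1 = [
    lambda df: df.dropna(),                   # drop rows with missing values
    lambda df: df.rename(columns=str.lower),  # normalize column names
]
recipe_2 = [
    lambda df: df.assign(total=df["price"] * df["quantity"]),  # derive a new column
]

def apply_recipe(df, recipe):
    """Apply each step in order, producing a new 'wrangled' DataFrame."""
    for step in recipe:
        df = step(df)
    return df

raw = pd.DataFrame({"Price": [10.0, None, 3.5], "Quantity": [2, 1, 4]})

# Chaining: the result of recipe 1 is passed into recipe 2.
wrangled = apply_recipe(apply_recipe(raw, recipe_1), recipe_2)
print(wrangled)
```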
So, you will be building your pipeline by first creating flows. Flows contain datasets and recipes for transforming those datasets. Once you have finished building your flows, you can execute them either on an ad hoc basis or as scheduled jobs.
Daniel began his career as a Software Engineer, focusing mostly on web and mobile development. After twenty years of dealing with insufficient training and fragmented documentation, he decided to use his extensive experience to help the next generation of engineers.
Daniel has spent his most recent years designing and running technical classes for both Amazon and Microsoft. Today at Cloud Academy, he is working on building out an extensive Google Cloud training library.
When he isn’t working or tinkering in his home lab, Daniel enjoys BBQing, target shooting, and watching classic movies.