Using Cloud Dataproc
The course is part of this learning path
Google Cloud Dataproc is a managed service for running Apache Hadoop and Spark jobs. It can be used for big data processing and machine learning.
But you could run these data processing frameworks on Compute Engine instances, so what does Dataproc do for you? Dataproc actually uses Compute Engine instances under the hood, but it takes care of the management details for you. It’s a layer on top that makes it easy to spin up and down clusters as you need them.
- Explain the relationship between Dataproc, key components of the Hadoop ecosystem, and related GCP services
- Create, customize, monitor, and scale Dataproc clusters
- Run data processing jobs on Dataproc
- Apply access control to Dataproc
- Data professionals
- People studying for the Google Professional Data Engineer exam
- Hadoop or Spark experience (recommended)
- Google Cloud Platform account (sign up for free trial at https://cloud.google.com/free if you don’t have an account)
This Course Includes
- 49 minutes of high-definition video
- Many hands-on demos
The github repository is at https://github.com/cloudacademy/dataproc-intro.
To make it easy for Dataproc to access data in other GCP services, Google has written connectors for Cloud Storage, Bigtable, and BigQuery. These connectors are automatically installed on all Dataproc clusters.
Connecting to Cloud Storage is very simple. You just have to specify a URL starting with gs:// and the name of the bucket. You can use this anywhere in your Hadoop and Spark programs where you would normally use a reference to files on HDFS.
Connecting to Bigtable is also quite easy because it is compatible with HBase. So anywhere that you make an HBase reference, you can simply specify a table in Bigtable instead of in HBase. There’s an example of this in my “Introduction to Google Cloud Bigtable” course.
Using the BigQuery connector is slightly more work. You need to use some Hadoop classes that are specific to BigQuery. They’re in com.google.cloud.hadoop.io.bigquery. I’ll show you an example in a bit, but if you want more detailed information, the documentation is at this address.
To see the BigQuery connector in action, we’re going to run a more complex job than the SparkPi and WordCount programs. This time, we’ll run a machine learning program that reads in data about thousands of births and trains a model to predict the birth weights of babies in the future, based on characteristics such as the ages of the parents.
There’s a public dataset in BigQuery that has data on millions of births. It’s called natality. That’s where we’re going to get the training data.
First, we’ll run a program that extracts 10,000 birth records. It makes sure that records are clean too. That is, it only extracts records that have data in all 5 of the fields that we want to use in our model. It writes these clean records to a new BigQuery table.
You can download the program by cloning my github repository. Copy this URL. Then go to Cloud Shell and type “git clone” and then paste the URL.
Now go into the dataproc-intro directory. You can just hit Tab after typing the first character and it’ll fill in the rest.
Before you can run the script, you need to set an environment variable. Type “export GCLOUD_PROJECT=” and then type your project name. The easiest way to get it is to go to the home page of the Cloud Console. You can copy it from here. Then paste it. Now run this command: “python natality.py”.
Go to the BigQuery console and have a look. There should be a dataset called “natality_regression” and there should be a table inside the dataset called “regression_input”. Click on it and then click on Preview.
You should see records with the 5 columns needed for the machine learning model, as well as the birth weight. There should be 10,000 records in this table.
Now that the input data’s ready, we can spin up a cluster and run the machine learning job. Go to the Clusters page on the Dataproc console and create a cluster. You can leave it with the default settings.
While it’s spinning up, let’s have a quick look at the code. Here’s the section that reads the records from BigQuery into a Spark RDD (which stands for Resilient Distributed Dataset). This statement is what reads the data, and all of the other ones before it set up the configuration that it needs, such as the name of the BigQuery table. I’m not going to go through this in detail, but the code is pretty straightforward, so it should be quite helpful if you need to read from BigQuery in your own PySpark job.
When it’s done spinning up, go to the Jobs page and click “Submit Job”. Select the cluster. Set the job type to PySpark. For the main Python file, paste this Cloud Storage file path. That’s all you need to fill in, so you can submit it now.
Click on the job to see what’s happening. It runs the 10,000 records through the neural network pretty quickly, so the job will be done soon.
OK, it’s done. If you’re not familiar with neural networks, then these results probably won’t mean much to you, but I’ll just point out one result. The R squared value is a number between 0 and 1 and it’s basically a measure of how well the model fits the input data. A 0 means it doesn’t fit the data at all and a 1 means it fits the data perfectly. You’d normally want it to be fairly high, so it has a good chance of predicting birth weights in the future, but you don’t want it to be too high. If it were a 1, then you’d almost certainly have an overfitting problem, which means that the model’s not generalized for anything other than this particular input data.
If you’d like to learn more about neural networks and you haven’t taken my “Introduction to Google Cloud Machine Learning Engine” course, then that would be a good place to start.
Before you go, remember to delete the cluster. You’ll probably want to delete the BigQuery dataset too. First delete the table. Then delete the dataset.
And that’s it for this lesson.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).