Using Cloud Dataproc
Google Cloud Dataproc is a managed service for running Apache Hadoop and Spark jobs. It can be used for big data processing and machine learning.
But you could run these data processing frameworks on Compute Engine instances yourself, so what does Dataproc do for you? Dataproc actually uses Compute Engine instances under the hood, but it takes care of the management details for you. It’s a layer on top that makes it easy to spin clusters up and down as you need them.
Learning Objectives
- Explain the relationship between Dataproc, key components of the Hadoop ecosystem, and related GCP services
- Create, customize, monitor, and scale Dataproc clusters
- Run data processing jobs on Dataproc
- Apply access control to Dataproc
Intended Audience
- Data professionals
- People studying for the Google Professional Data Engineer exam
Prerequisites
- Hadoop or Spark experience (recommended)
- Google Cloud Platform account (sign up for a free trial at https://cloud.google.com/free if you don’t have an account)
This Course Includes
- 49 minutes of high-definition video
- Many hands-on demos
The GitHub repository is at https://github.com/cloudacademy/dataproc-intro.
If you’ve taken one of my Dataflow or Bigtable courses, then this example job will be familiar. It’s a word count program. We’re going to run a Spark job that counts the number of occurrences of each word in the book “The Prince” by Machiavelli.
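If you haven't seen a word count program before, here is a minimal sketch of the idea in plain Python. It is not the Spark job from the lesson, and the sample sentence is made up for illustration. It only shows what "counting the occurrences of each word" means, assuming that words are whatever is separated by whitespace.

```python
from collections import Counter


def count_words(text):
    """Return a Counter mapping each whitespace-separated token to its count."""
    return Counter(text.split())


# Hypothetical sample text, not taken from "The Prince".
sample = "the prince must always be wary and the prince must always plan"
counts = count_words(sample)

# Print each word with its count, most frequent first.
for word, n in counts.most_common():
    print(f"{word}: {n}")
```

A Spark job does the same counting, but spreads the work across the worker nodes in the cluster.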
Before we get started, you’ll need to make sure you have the Cloud Dataproc API enabled. In the Google Cloud Console, go to the “APIs and services” page. If “Google Cloud Dataproc API” is in the list, then it’s already enabled. While you’re here, also check that “Google Cloud Storage”, the “Google Compute Engine API”, and the “BigQuery API” are enabled. You won’t need BigQuery for this example, but you will later on in the course, so you might as well enable it now too.
If any of those 4 APIs aren’t enabled, then click “Enable APIs & Services”. In my case, the Dataproc API isn’t enabled. I’ll type “Dataproc” in the search box. Only one API comes up, so I’ll click on it. And I’ll enable it.
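If you'd rather not click through the console, the same thing can probably be done from the command line with `gcloud services enable`. The sketch below only builds and prints the command, so it doesn't need gcloud installed. The service names are my assumption about which identifiers match the four APIs mentioned above (the Cloud Storage one in particular), so check them against the list in your own project before running anything.

```python
import shlex

# Assumed service identifiers for the four APIs named in the lesson.
SERVICES = [
    "dataproc.googleapis.com",
    "storage-component.googleapis.com",  # assumption: the "Google Cloud Storage" entry
    "compute.googleapis.com",
    "bigquery.googleapis.com",
]


def enable_services_command(services):
    """Build the argument list for enabling several services in one call."""
    return ["gcloud", "services", "enable", *services]


command = enable_services_command(SERVICES)
print(shlex.join(command))
```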
Alright, back to the word count example. You can submit a Dataproc job using the web console, the gcloud command, or the Cloud Dataproc API. We’re going to use the web console this time.
In the console, select Dataproc from the menu. The first thing we have to do is create a cluster to run the job on. Click the “Create cluster” button. You can name your cluster something else if you want, but I’m going to just leave it with the default name it chose.
The region setting determines the location of the endpoint for the Dataproc service. You can either use the single global endpoint or choose a specific regional endpoint. It usually doesn’t matter too much whether you leave this at global or change it to a specific region, so I’ll leave it at global.
I’ll choose us-central1-c for the zone, but you can pick one that’s close to you.
There are lots of choices for the machine type, but since this will be a pretty small job, you can choose the smallest machine type, which only has a single virtual CPU and about 4 gig of memory.
For the cluster mode, we could select “Single Node”, which would mean that there would only be one master node and no worker nodes. That would probably be fine for this job, but to see how a cluster normally works, let’s choose “Standard”.
Leave the primary disk size at 500 GB. Since we chose the Standard cluster mode, we now have to configure the worker nodes, and there have to be at least 2 of them. Set the machine type to the smallest again. You can leave everything else with the defaults and click the “Create” button.
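For reference, here is a rough sketch of what the same cluster settings might look like as a `gcloud dataproc clusters create` command. It only builds and prints the command. The cluster name `example-cluster` is hypothetical, and `n1-standard-1` is my assumption for "the smallest machine type" described above, so substitute whatever the console shows you.

```python
import shlex


def create_cluster_command(name, region, zone, machine_type, disk_gb, num_workers):
    """Build the argument list for creating a cluster with one master and N workers."""
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        f"--zone={zone}",
        f"--master-machine-type={machine_type}",
        f"--master-boot-disk-size={disk_gb}GB",
        f"--num-workers={num_workers}",
        f"--worker-machine-type={machine_type}",
        f"--worker-boot-disk-size={disk_gb}GB",
    ]


command = create_cluster_command(
    name="example-cluster",        # hypothetical name
    region="global",               # the endpoint chosen in the lesson
    zone="us-central1-c",          # the zone chosen in the lesson
    machine_type="n1-standard-1",  # assumption: the smallest type in the menu
    disk_gb=500,
    num_workers=2,
)
print(shlex.join(command))
```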
It usually takes a minute or so to spin up the cluster. While it’s spinning up, let’s get the job ready. Click on “Jobs”. Then click “Submit Job”.
Leave the region at “global”. There’s only one choice for the cluster, so select it. In the Job type menu, you’ll see that there are quite a few choices. As I mentioned earlier, you can run Hadoop, Spark, Hive, and Pig jobs, but there are a couple more choices here too. There’s SparkSql and PySpark. SparkSql is a little bit like Hive because it allows you to use SQL statements, but it’s within the Spark programming framework, so it’s not as easy to use as Hive. It is more powerful, though, since Spark gives you more flexibility. We’re going to run a regular Spark job, so select that.
What you need to enter in some of these fields will require quite a bit of typing, so I’ve created a file on GitHub where you can copy them from. The repository’s at this address. If you don’t want to type in the GitHub URL, then you can find a link to it near the bottom of the “About this course” tab below.
Here’s what you need to put in the jar files field. Copy it and paste it here. This jar file is on the master node in the cluster. It’s already there because it comes with the Spark installation. In the Main class or jar field, paste this.
This Arguments field is for arguments to the Spark job itself rather than to Dataproc. This job takes one argument that specifies what file to count the words in. Paste this, which is the book “The Prince”. Now click “Submit”.
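Since a job can also be submitted with the gcloud command, here is a sketch of how the fields in this form might map onto `gcloud dataproc jobs submit spark`. It only builds and prints the command. The jar path, main class, and input file are deliberately left as placeholders, because the real values are the ones you copy from the GitHub file, and the cluster name is hypothetical.

```python
import shlex


def submit_spark_job_command(cluster, region, main_class, jar, job_args):
    """Build the argument list for submitting a Spark job to an existing cluster."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={main_class}",
        f"--jars={jar}",
        "--",        # everything after this goes to the Spark job, not to Dataproc
        *job_args,
    ]


command = submit_spark_job_command(
    cluster="example-cluster",            # hypothetical name
    region="global",
    main_class="MAIN_CLASS_FROM_REPO",    # placeholder: copy from the GitHub file
    jar="JAR_PATH_FROM_REPO",             # placeholder: copy from the GitHub file
    job_args=["INPUT_FILE_FROM_REPO"],    # placeholder: copy from the GitHub file
)
print(shlex.join(command))
```

The `--` separator plays the same role as the Arguments field in the console: it keeps the job's own arguments apart from the options meant for Dataproc.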
You can see that the job is running. It should take less than a minute, even with the small cluster nodes we chose. I’ll fast forward to when it’s done.
It says that the job succeeded. If you click on the job ID, it’ll bring up the logs. They contain the output, which is a list of all of the words in the book and how many times they occurred. For example, the word “always” occurred 80 times.
You’ll also see other log entries here, like these. This job threw an exception, but it still ran, so I guess it wasn’t a problem.
Now that the job is finished, we should delete the cluster. Click on “Clusters”. Then check the box next to the cluster ID and click “Delete”. This is one of the great things about Dataproc. You can spin clusters up and down as you need them, so you won’t waste money on idle clusters.
If we did leave the cluster running, how much would it cost? The pricing for Dataproc itself is quite simple. For the smallest machine type, it’s one cent per node per hour. (It’s actually charged per second, though, with a 1-minute minimum.) So if we left the cluster running for a month, the Dataproc cost would be 3 nodes times 1 cent per hour times 24 hours per day times 30 days per month, which comes to $21.60.
However, you’ll also get charged for the underlying Compute Engine resources. For the smallest machine type, the price is 4.75 cents per hour. So for one month, the cost would be 3 nodes times 4.75 cents per hour times 24 hours times 30 days, which comes to $102.60.
You’ll also have to pay for the 500 gig of persistent disk attached to each of the 3 nodes. The cost is 4 cents per gig per month, so the cost would be 3 nodes times 500 gig times 4 cents per gig, which comes to 60 dollars. So the total cost is $184.20.
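If you want to check the arithmetic or plug in your own numbers, here is a small sketch of the same calculation. The prices are the ones quoted in this lesson, not current list prices, and the variable names are just for illustration. It uses Decimal so the cents come out exact.

```python
from decimal import Decimal

NODES = 3                                 # 1 master + 2 workers
HOURS_PER_MONTH = 24 * 30                 # the lesson assumes a 30-day month
DATAPROC_PER_NODE_HOUR = Decimal("0.01")  # Dataproc fee quoted in the lesson
COMPUTE_PER_NODE_HOUR = Decimal("0.0475") # Compute Engine price quoted in the lesson
DISK_GB_PER_NODE = 500
DISK_PER_GB_MONTH = Decimal("0.04")       # persistent disk price quoted in the lesson

dataproc_cost = NODES * DATAPROC_PER_NODE_HOUR * HOURS_PER_MONTH
compute_cost = NODES * COMPUTE_PER_NODE_HOUR * HOURS_PER_MONTH
disk_cost = NODES * DISK_GB_PER_NODE * DISK_PER_GB_MONTH
total = dataproc_cost + compute_cost + disk_cost

print(f"Dataproc:       ${dataproc_cost:.2f}")
print(f"Compute Engine: ${compute_cost:.2f}")
print(f"Disks:          ${disk_cost:.2f}")
print(f"Total:          ${total:.2f}")
```

The total matches the $184.20 figure above.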
So Dataproc itself is only a small fraction of the total cost, while the underlying Compute Engine instances and disks make up the vast majority of it. By the way, did you notice that preemptible instances are only about one-fifth the cost of regular instances? I’ll show you how to use those with Dataproc later in the course.
And that’s it for this lesson.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).