Google Cloud Dataproc is a managed service for running Apache Hadoop and Spark jobs. It can be used for big data processing and machine learning.
But you could run these data processing frameworks on Compute Engine instances, so what does Dataproc do for you? Dataproc actually uses Compute Engine instances under the hood, but it takes care of the management details for you. It’s a layer on top that makes it easy to spin up and down clusters as you need them.
Learning Objectives
- Explain the relationship between Dataproc, key components of the Hadoop ecosystem, and related GCP services
- Create, customize, monitor, and scale Dataproc clusters
- Run data processing jobs on Dataproc
- Apply access control to Dataproc
Intended Audience
- Data professionals
- People studying for the Google Professional Data Engineer exam
Prerequisites
- Hadoop or Spark experience (recommended)
- Google Cloud Platform account (sign up for free trial at https://cloud.google.com/free if you don’t have an account)
This Course Includes
- 49 minutes of high-definition video
- Many hands-on demos
The GitHub repository for this course is at https://github.com/cloudacademy/dataproc-intro.
As you’ve seen, spinning up a Hadoop or Spark cluster is very easy with Cloud Dataproc, and scaling up a cluster is even easier. To try this out, we’re going to run a job that’s more resource intensive than WordCount. It’s a program that estimates the value of pi. I admit that this is kind of a silly example because it’s not a very useful task, but it is resource-intensive, so it’s handy for our purposes.
We’re going to run this job twice, the first time with the default cluster configuration and the second time with more nodes.
Go to the Clusters page on the Dataproc console and click “Create cluster”. This time, let’s call it “cluster1”. Now leave everything else with the defaults and click the “Create” button.
While it’s spinning up, let’s get the job ready. Go to the Jobs page and click “Submit Job”. Select the cluster. Change the job type to “Spark”. Now copy the path of the jar file from the GitHub repository and paste it here. Then do the same with the name of the main class. Now, in the Arguments field, type 100000, without any commas or periods. Then submit the job.
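By the way, if you’d rather submit this job from the command line than from the console, a gcloud command along these lines should do the same thing. Treat it as a sketch: the region and the jar path (the standard Spark examples jar that ships on Dataproc nodes) are assumptions on my part, so use whatever path and class name are listed in the GitHub repository.

```
# Submit the SparkPi job to cluster1 with a single argument (100000).
# The --region and --jars values are assumptions; substitute your own.
gcloud dataproc jobs submit spark \
  --cluster cluster1 \
  --region us-central1 \
  --class org.apache.spark.examples.SparkPi \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 100000
```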
To explain what the argument means, I’ll have to give you some background on how the program estimates the value of pi. Imagine a quarter circle with a radius of 1 inside a square with a length of 1 and a width of 1. The program randomly “throws darts” at the square and counts how many of them land inside the circle. It does this by plugging each dart’s x and y values into the left side of the equation for the circle, which is x² + y² = 1. If the result is less than 1, the dart landed inside the circle. Since the fraction of darts that should fall inside the circle is pi over 4, the program can estimate pi with some simple math.
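To spell out that last bit of math: the fraction of darts that land inside the circle approximates the ratio of the circle’s area to the square’s area, which works out to pi over 4, so multiplying that fraction by 4 gives the estimate. Here’s a quick sketch, assuming the darts are thrown uniformly at the square:

```
\frac{\text{darts in circle}}{\text{total darts}}
  \approx \frac{\text{area of quarter circle}}{\text{area of square}}
  = \frac{(\pi/4) \cdot 1^2}{1 \times 1}
  = \frac{\pi}{4}
\qquad\Rightarrow\qquad
\pi \approx 4 \cdot \frac{\text{darts in circle}}{\text{total darts}}
```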
Now back to the argument. It specifies how many darts to throw. Actually, it says how many sets of 100,000 darts it should throw. So we specified 100,000 sets of 100,000 darts, which comes to 10 billion darts! Now you know why this program is so resource intensive. The program distributes these sets of 100,000 darts to the nodes in the cluster. So if we add more nodes, it should run faster.
OK, let’s see how the job’s doing. Click on the job. There’s a bit of output, but not a lot. That’s deceiving, though, because if you scroll to the right, you’ll see there’s a lot more output. It’s pretty hard to read by scrolling, so check the “Line wrapping” box instead. It’s steadily going through all the sets of random numbers.
We can also look at what’s happening with the cluster. Go back to the Clusters page and click on cluster1. It shows what percentage of the total CPU capacity of the cluster is being consumed. You can look at some other metrics too.
Another way to watch the performance of a cluster is to use Stackdriver Monitoring. Select it from the console menu. Then click on “Groups” and “Create Group”. For the Group Name, type the name of the cluster, which is cluster1 in this case. Put cluster1 in the filter field as well, so it’ll know what resource to monitor.
It automatically creates a graph showing the CPU usage on the 3 nodes in the cluster. The master node has the lowest usage. You can look at other metrics by clicking the “Add Chart” button. When you click in the “Metric Type” field, it gives you 5 metrics to choose from. When you click Save, it adds the chart below the first one.
You can do lots of other things with Stackdriver Monitoring too, like creating a dashboard, configuring uptime checks, and setting up alerts.
OK, the job should be done by now, so let’s have a look. Go back to the Dataproc console and click Jobs. It says that it succeeded and it took 3 minutes and 24 seconds. Click on the job and check the Line wrapping box again. Scroll to the bottom. There’s the estimate for the value of pi.
Now let’s try scaling up this cluster and run the job again. Go to the Clusters page and click on the cluster. Then click the Configuration tab. It shows the current configuration. Click the Edit button. To increase the number of nodes, you can just change the number in this field. Before we do that, though, there’s another option to consider.
We can use preemptible worker nodes. As you’ll recall, preemptible nodes cost only about one-fifth the price of regular nodes, so they’re a really good choice when you need to scale up. There are some disadvantages, though. Google Cloud can remove them at any time if it needs the resources. Because of that uncertainty, preemptible nodes don’t store any data, so you won’t lose any data if they’re removed. They’re used only for processing.
It’s also not possible to have a cluster with only preemptible nodes in it. You must have at least 3 regular nodes: the master and the first 2 workers.
The SparkPi job doesn’t need to store data on the nodes, so we can add preemptible nodes. Let’s add 4 of them. They’ll have the same configuration as the two regular worker nodes except that the disk will be smaller since they don’t need much local storage.
OK, now click Save, and that’s all you need to do to add more nodes. It takes a little while for them to spin up, though, and sometimes it takes longer for preemptible nodes if there aren’t enough spare resources available. Click on the VM instances tab to see them spinning up.
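Incidentally, you can make the same change from the command line with a cluster update. Here’s a sketch of what that looks like, assuming the --num-preemptible-workers flag and a cluster in us-central1; double-check the flag names against the current gcloud reference.

```
# Add 4 preemptible worker nodes to cluster1.
# The --region value is an assumption; use your cluster's region.
gcloud dataproc clusters update cluster1 \
  --region us-central1 \
  --num-preemptible-workers 4
```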
I should mention that we could have added these nodes while the previous job was running, so we wouldn’t have had to wait until the job was done to scale up the cluster. I didn’t do that for a couple of reasons. First, I wanted you to see the difference in how long it takes for the job to run with more nodes. And second, when you add nodes in the middle of a job, it might not provide any benefit if the job would have been done fairly soon anyway. Since it takes a while for the new nodes to spin up and for Spark to start using them, it wouldn’t have helped with our last job. But it can be very helpful for a long-running job, of course.
I’ll fast forward to when these extra nodes are ready. OK, they’re ready. By the way, you might have to click the Refresh button to see that the nodes have finished spinning up. Alright, now there’s an easy way to submit the job again. Go to the Jobs page, click on the previous job, and then click “Clone”. This creates a new job with the same configuration. Just click the Submit button. Now I’ll fast forward to when the job’s done.
Alright, it’s finished and it only took 1 minute and 10 seconds. That’s significantly faster than the last time, so the extra nodes obviously helped. That isn’t always the case. Sometimes a job will even take longer with more nodes if it has a bottleneck in it somewhere.
Removing nodes is just as easy. Go back to the Cluster page, click on the cluster, go to the Configuration tab, click the Edit button, and change the number of preemptible worker nodes back to zero.
There’s a way to remove nodes more gently called graceful decommissioning, but it’s still in beta at the moment. It removes a node only after all of its current tasks are completed. There isn’t a way to do it in the web console yet, so you have to use either gcloud or the API. With gcloud, you add this parameter when you do an update.
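Here’s a sketch of what that update might look like, assuming the flag is --graceful-decommission-timeout and that it lives under the beta command group; check the gcloud reference for the exact name and its current status.

```
# Scale the preemptible workers back to zero, but wait up to an hour
# for in-progress work to finish before each node is removed.
gcloud beta dataproc clusters update cluster1 \
  --region us-central1 \
  --num-preemptible-workers 0 \
  --graceful-decommission-timeout 1h
```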
There’s one more thing I want to show you in this lesson. It’s possible to feed more detailed metrics to Stackdriver Monitoring. To do that, you need to run the Stackdriver Monitoring Agent on the cluster nodes. You can only configure that when you create the cluster, though, so let’s create another one.
You can’t configure it from the web console, so we’ll have to do it from the command line. Go to Cloud Shell and copy and paste this command. The first part is the command to create a cluster. The parameter after that enables the Stackdriver Monitoring Agent. It asks you which zone you want it in, so pick one. We could have specified a zone in the command, but it’s easier this way.
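For reference, the command is along these lines. This is a sketch that assumes the Monitoring Agent is enabled through the dataproc:dataproc.monitoring.stackdriver.enable cluster property; the exact command is in the GitHub repository if you’d rather copy it from there.

```
# Create cluster2 with the Stackdriver Monitoring Agent enabled on its nodes.
# The property name is an assumption; confirm it in the repository or the docs.
gcloud beta dataproc clusters create cluster2 \
  --properties dataproc:dataproc.monitoring.stackdriver.enable=true
```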
Now go back to Stackdriver Monitoring and create a new group. Put “cluster2” in these two fields and add the group. You usually have to wait for a little while until the monitoring agents make contact, so I’ll fast forward. OK, click “Add Chart”. Now when you click in the “Metric type” field, it gives you a lot more options. There’s a section called “Agent” that includes metrics like CPU load average and zombie processes.
By the way, you can also enable the Stackdriver Logging Agent. Even if you don’t do this, Dataproc will still send lots of log entries to Stackdriver. Like almost every other Google Cloud Platform service, Dataproc writes to the audit logs in Stackdriver. There are two audit logs: Admin Activity and Data Access. These logs help you answer the questions of "who did what, where, and when?" Admin activity includes things like Dataproc clusters getting created and deleted.
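If you’d like to pull those audit entries from the command line instead of browsing them, a filter along these lines should work. It’s a sketch, assuming the cloud_dataproc_cluster resource type and the standard Admin Activity log name.

```
# Read the 10 most recent Admin Activity audit log entries for Dataproc clusters.
# The resource type and log name are assumptions; adjust them to match your project's logs.
gcloud logging read \
  'resource.type="cloud_dataproc_cluster" AND logName:"cloudaudit.googleapis.com%2Factivity"' \
  --limit 10
```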
Dataproc also sends lots of other log entries to Stackdriver, such as messages from Hadoop. So why would you need to enable the Stackdriver Logging Agent? Well, it serves a different purpose. It runs on the VMs and sends detailed system and application logs to Stackdriver Logging, much like the Monitoring Agent sends detailed metrics to Stackdriver Monitoring. So if you need to analyze the nitty-gritty details of what’s happening on the VMs in a Dataproc cluster, you can enable the Logging Agent.
Before we go, we should delete the clusters. There.
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).