Using Cloud Dataproc
Google Cloud Dataproc is a managed service for running Apache Hadoop and Spark jobs. It can be used for big data processing and machine learning.
But you could run these data processing frameworks on Compute Engine instances yourself, so what does Dataproc do for you? Dataproc actually uses Compute Engine instances under the hood, but it takes care of the management details for you. It’s a layer on top that makes it easy to spin clusters up and down as you need them.
Learning Objectives
- Explain the relationship between Dataproc, key components of the Hadoop ecosystem, and related GCP services
- Create, customize, monitor, and scale Dataproc clusters
- Run data processing jobs on Dataproc
- Apply access control to Dataproc
Intended Audience
- Data professionals
- People studying for the Google Professional Data Engineer exam
Prerequisites
- Hadoop or Spark experience (recommended)
- Google Cloud Platform account (sign up for a free trial at https://cloud.google.com/free if you don’t have an account)
This Course Includes
- 49 minutes of high-definition video
- Many hands-on demos
The GitHub repository is at https://github.com/cloudacademy/dataproc-intro.
Letting Dataproc do all the work of configuring a cluster is great, but what if you need a custom Hadoop or Spark implementation? There are a couple of ways to do this. The first way is to log in to the master node of the cluster and make whatever changes you like. This could include changing the configuration files for Hadoop or Spark and uploading scripts or other files to the node.
You could also run PySpark interactively instead of feeding it a script through Dataproc. One disadvantage of doing things this way is that there’ll be no record in Dataproc of any jobs that you run, so you can’t go back and look at the configurations and results unless you save them somewhere else yourself.
There is a bit of a record of what you’ve done while the cluster is running. It’s in the YARN console. It takes a bit of work to get to it, though, because you have to create an SSH tunnel to the master node and use a SOCKS proxy in your browser, which is kind of a pain.
I’ll show you how to do it, but you don’t need to follow along on your own system. First, you create the SSH tunnel with this “gcloud compute ssh” command:
gcloud compute ssh --zone=us-east1-d \
--ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" cluster-be75-m
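You have to fill in the right zone, then add the ssh flags: -D 1080 opens a SOCKS proxy on local port 1080, -N tells ssh not to run a remote command, and -n keeps it from reading standard input. The last argument is the hostname of your master node.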
Then, in your browser, you go to your settings, and then the advanced settings, and open your proxy settings. Check “SOCKS proxy” and set the proxy server to be localhost:1080. Then for the URL, you put in the hostname of the master node and then colon 8088. Now we can finally see the YARN console. Personally, I’d much rather just use the Dataproc console.
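If you'd rather not change your browser settings by hand, one alternative is to start a separate Chrome session that points at the proxy. This is only a sketch: it assumes the tunnel from the gcloud compute ssh command is already open on port 1080, that cluster-be75-m is your master node, and that Chrome lives at /usr/bin/google-chrome (that path is an assumption and differs by operating system). The profile directory is a made-up path.

```shell
# Assumptions: the SSH tunnel is already listening on localhost:1080, and
# cluster-be75-m is the master node's hostname (from the example above).
MASTER_HOST="cluster-be75-m"
PROXY="socks5://localhost:1080"
YARN_URL="http://${MASTER_HOST}:8088"

# Assumed Chrome location on Linux; the path differs on macOS and Windows.
CHROME="/usr/bin/google-chrome"

# A separate, throwaway profile directory (hypothetical path) keeps the
# proxy setting out of your normal browser profile.
PROFILE_DIR="/tmp/${MASTER_HOST}"

LAUNCH_CMD="${CHROME} --proxy-server=${PROXY} --user-data-dir=${PROFILE_DIR} ${YARN_URL}"

# Print the command for review instead of running it.
echo "${LAUNCH_CMD}"
```

When you run the printed command, the new browser window sends its traffic through the tunnel, so the master node's hostname resolves as if you were inside the cluster's network.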
OK, so how do you customize your cluster without going onto the master node? There are two ways to make changes: cluster properties and initialization actions. We actually set a cluster property already when we set up the Stackdriver Monitoring agent. But that was a special kind of property because it applied to a Google service. Normally, cluster properties are used to make changes to the configuration files for Hadoop, Spark, Hive, or Pig.
For example, you could set the spark.driver.maxResultSize property to 2 GB by adding a --properties argument when you create the cluster. First, you have to tell it which config file you’re modifying. In this case, it’s the spark-defaults.conf file, which is abbreviated as “spark”. The Dataproc documentation lists the config files you can modify and the prefix to use for each.
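Here's roughly what that looks like on the command line. Treat it as a sketch rather than the exact command from the demo: the cluster name and region are placeholders, and the value is written as 2g, which is how Spark expresses 2 gigabytes.

```shell
# Hypothetical cluster name and region; substitute your own.
CLUSTER_NAME="example-cluster"
REGION="us-east1"

# Format is PREFIX:PROPERTY=VALUE. The "spark" prefix selects
# spark-defaults.conf; "2g" is Spark's size notation for 2 gigabytes.
PROPERTIES="spark:spark.driver.maxResultSize=2g"

CREATE_CMD="gcloud dataproc clusters create ${CLUSTER_NAME} --region=${REGION} --properties=${PROPERTIES}"

# Print the command for review; run the printed line to create the cluster.
echo "${CREATE_CMD}"
```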
If you need to set more than one property, then you can separate them with commas.
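A comma-separated list might look like the sketch below. The second property, spark.executor.memory, is only there for illustration and its value is arbitrary; the cluster name and region are placeholders too.

```shell
# Hypothetical cluster name and region; substitute your own.
CLUSTER_NAME="example-cluster"
REGION="us-east1"

# Each entry is PREFIX:PROPERTY=VALUE, with commas and no spaces between
# entries. The second property and its value are illustrative only.
PROPERTIES="spark:spark.driver.maxResultSize=2g,spark:spark.executor.memory=4g"

CREATE_CMD="gcloud dataproc clusters create ${CLUSTER_NAME} --region=${REGION} --properties=${PROPERTIES}"

# Print the command for review; run the printed line to create the cluster.
echo "${CREATE_CMD}"
```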
You can only set properties when you create the cluster and you can only do it from the command line or the API. It’s not supported in the web console yet.
The other type of customization you can do is to set initialization actions. This is a way to run a script on your cluster when it starts up. You just need to put a script on Cloud Storage and then specify that file in the initialization actions parameter. You can do this from the web console, the command line, or the API.
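From the command line, that comes down to one extra flag on the create command. In this sketch the bucket, script path, cluster name, and region are all made up, so swap in your own.

```shell
# Hypothetical cluster name and region; substitute your own.
CLUSTER_NAME="example-cluster"
REGION="us-east1"

# Hypothetical Cloud Storage path to a script you have already uploaded.
INIT_SCRIPT="gs://example-bucket/init/setup.sh"

CREATE_CMD="gcloud dataproc clusters create ${CLUSTER_NAME} --region=${REGION} --initialization-actions=${INIT_SCRIPT}"

# Print the command for review; run the printed line to create the cluster.
echo "${CREATE_CMD}"
```

If you have more than one script, the flag takes a comma-separated list of Cloud Storage paths.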
Here’s where it is in the web console. Just type the full path to the script on Cloud Storage. If you need to add more scripts, hit Enter after each one and it’ll give you another field to fill in.
The script will be run on every node, so if you only want it to run on the master node, then you have to put some logic in the script to check which node you’re on. A typical script retrieves the node’s role from metadata and, if it’s the master, executes the code. Google provides lots of examples of useful initialization scripts in a GitHub repository.
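A role check like that could be sketched as follows. It assumes the get_metadata_value helper found on Dataproc images and the dataproc-role attribute, whose value is Master on the master node. The function names, the echo messages, and the fallback for running off-cluster are my own additions for illustration.

```shell
#!/bin/bash

# Look up this node's Dataproc role. The helper below is present on
# Dataproc images; off-cluster it is absent, so fall back to "unknown".
get_role() {
  if [ -x /usr/share/google/get_metadata_value ]; then
    /usr/share/google/get_metadata_value attributes/dataproc-role
  else
    echo "unknown"
  fi
}

# Run master-only steps when the role is "Master"; otherwise skip them.
setup_for_role() {
  if [ "$1" = "Master" ]; then
    # Master-only commands would go here (placeholder).
    echo "master node: running master-only setup"
  else
    echo "role $1: skipping master-only setup"
  fi
}

# ROLE can be preset in the environment to try the script off-cluster.
setup_for_role "${ROLE:-$(get_role)}"
```

Because the check lives in its own function, you can try it on your own machine with something like ROLE=Master before uploading the script to Cloud Storage.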
And that’s it for customization.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).