Using Cloud Dataproc
Google Cloud Dataproc is a managed service for running Apache Hadoop and Spark jobs. It can be used for big data processing and machine learning.
But you could run these data processing frameworks on Compute Engine instances, so what does Dataproc do for you? Dataproc actually uses Compute Engine instances under the hood, but it takes care of the management details for you. It’s a layer on top that makes it easy to spin up and down clusters as you need them.
Learning Objectives
- Explain the relationship between Dataproc, key components of the Hadoop ecosystem, and related GCP services
- Create, customize, monitor, and scale Dataproc clusters
- Run data processing jobs on Dataproc
- Apply access control to Dataproc
Intended Audience
- Data professionals
- People studying for the Google Professional Data Engineer exam
Prerequisites
- Hadoop or Spark experience (recommended)
- Google Cloud Platform account (sign up for a free trial at https://cloud.google.com/free if you don’t have an account)
This Course Includes
- 49 minutes of high-definition video
- Many hands-on demos
The GitHub repository is at https://github.com/cloudacademy/dataproc-intro.
I hope you enjoyed learning about Google Cloud Dataproc. Let’s do a quick review of what you learned. Dataproc is a managed service for running Apache Hadoop and Spark jobs. It supports Hadoop jobs written in MapReduce (which is the core Hadoop processing framework), Pig Latin (which is a simplified scripting language), and HiveQL (which is similar to SQL). Spark can either run on top of Hadoop or completely on its own. It’s generally faster than Hadoop MapReduce due to its support for in-memory processing.
Google also has its own alternatives to all of these components. Bigtable is similar to HBase (and is compatible with it). Dataflow is similar to Hadoop and Spark. BigQuery is similar to Hive on Hadoop. Machine Learning Engine is similar to MLlib on Spark. And Pub/Sub is similar to Kafka. The biggest difference is that these are all managed services and they typically scale better than the open-source alternatives.
A Dataproc cluster consists of one or more master nodes and at least two worker nodes. You can scale the number of nodes up or down at any time, but the worker nodes all have to have the same configuration. To save money, you can use preemptible nodes, but you always have to have at least two regular worker nodes in every cluster.
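As a rough illustration of those sizing rules, here is a minimal Python sketch. The `ClusterSize` type and both functions are hypothetical helpers, not part of any Google library. The dictionary keys are modeled on the cluster config fields in the Dataproc REST API (`masterConfig`, `workerConfig`, `secondaryWorkerConfig`), so check them against the current API reference before relying on them.

```python
from dataclasses import dataclass
from typing import Dict, List

# From the review above: every cluster needs at least two regular workers.
MIN_REGULAR_WORKERS = 2


@dataclass(frozen=True)
class ClusterSize:
    """Hypothetical helper type describing how many nodes a cluster has."""
    masters: int
    workers: int
    preemptible_workers: int = 0


def validate(size: ClusterSize) -> List[str]:
    """Return a list of rule violations (an empty list means the size is valid)."""
    problems = []
    if size.masters < 1:
        problems.append("a cluster needs at least one master node")
    if size.workers < MIN_REGULAR_WORKERS:
        problems.append("a cluster needs at least two regular worker nodes")
    if size.preemptible_workers < 0:
        problems.append("the preemptible worker count cannot be negative")
    return problems


def to_cluster_config(size: ClusterSize) -> Dict[str, Dict[str, int]]:
    """Build a config fragment shaped like the REST API's cluster config (assumed field names)."""
    return {
        "masterConfig": {"numInstances": size.masters},
        "workerConfig": {"numInstances": size.workers},
        # Preemptible nodes are added on top of the regular workers.
        "secondaryWorkerConfig": {"numInstances": size.preemptible_workers},
    }


if __name__ == "__main__":
    size = ClusterSize(masters=1, workers=2, preemptible_workers=4)
    print(validate(size))
    print(to_cluster_config(size))
```

Scaling a cluster then amounts to changing the worker counts and validating again, since the two-regular-worker minimum still applies after resizing.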
Pricing includes a relatively small hourly charge for Dataproc itself plus the cost of the Compute Engine instances in the cluster plus the cost of any other GCP services that are used, such as Cloud Storage or BigQuery.
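The three parts of that bill can be sketched as a simple sum. This is only an illustration: the function name is hypothetical, the rates in the usage example are made-up placeholders rather than Google's actual prices, and the sketch assumes the Dataproc charge is billed per vCPU-hour, so check the pricing page for real numbers.

```python
from typing import Dict


def estimate_cluster_cost(
    hours: float,
    node_count: int,
    vcpus_per_node: int,
    dataproc_rate_per_vcpu_hour: float,
    compute_rate_per_node_hour: float,
    other_services_cost: float = 0.0,
) -> Dict[str, float]:
    """Break an estimated bill into the three parts described above.

    All rates are supplied by the caller; nothing here is a real price.
    """
    # Dataproc's own charge (assumed here to be billed per vCPU-hour).
    dataproc_fee = hours * node_count * vcpus_per_node * dataproc_rate_per_vcpu_hour
    # The Compute Engine instances that make up the cluster.
    compute_cost = hours * node_count * compute_rate_per_node_hour
    total = dataproc_fee + compute_cost + other_services_cost
    return {
        "dataproc_fee": dataproc_fee,
        "compute": compute_cost,
        "other_services": other_services_cost,
        "total": total,
    }


if __name__ == "__main__":
    # Placeholder rates for illustration only.
    estimate = estimate_cluster_cost(
        hours=10,
        node_count=3,
        vcpus_per_node=4,
        dataproc_rate_per_vcpu_hour=0.01,
        compute_rate_per_node_hour=0.20,
        other_services_cost=1.50,
    )
    for part, amount in estimate.items():
        print(f"{part}: {amount:.2f}")
```

Because the Dataproc charge and the instance charge both scale with cluster hours, deleting clusters when jobs finish reduces both at once.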
Access control is handled in IAM by granting users either the Dataproc Viewer or Dataproc Editor role and granting service accounts the Dataproc Worker role.
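To make the role assignments concrete, here is a small Python sketch that builds the bindings portion of an IAM policy. The function name and all of the email addresses are hypothetical. The role IDs (`roles/dataproc.viewer`, `roles/dataproc.editor`, `roles/dataproc.worker`) and the `user:` and `serviceAccount:` member prefixes follow IAM's usual conventions, but verify them against the current documentation before applying a policy.

```python
from typing import Dict, List, Sequence


def build_dataproc_bindings(
    viewers: Sequence[str],
    editors: Sequence[str],
    worker_service_accounts: Sequence[str],
) -> Dict[str, List[dict]]:
    """Build an IAM policy fragment with one binding per non-empty role."""
    candidates = [
        ("roles/dataproc.viewer", [f"user:{email}" for email in viewers]),
        ("roles/dataproc.editor", [f"user:{email}" for email in editors]),
        (
            "roles/dataproc.worker",
            [f"serviceAccount:{email}" for email in worker_service_accounts],
        ),
    ]
    # Skip roles that have no members so the policy stays minimal.
    bindings = [
        {"role": role, "members": members}
        for role, members in candidates
        if members
    ]
    return {"bindings": bindings}


if __name__ == "__main__":
    # Hypothetical identities for illustration only.
    policy = build_dataproc_bindings(
        viewers=["analyst@example.com"],
        editors=["engineer@example.com"],
        worker_service_accounts=["cluster-sa@example-project.iam.gserviceaccount.com"],
    )
    for binding in policy["bindings"]:
        print(binding["role"], binding["members"])
```

Keeping human users on Viewer or Editor and reserving Worker for service accounts mirrors the split described above: people manage or inspect clusters, while the cluster's own identity does the processing work.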
You can see what’s happening with clusters using either the Dataproc console or Stackdriver Monitoring. You can also use Stackdriver Logging to see the Dataproc audit logs as well as messages from Hadoop.
All Dataproc clusters come with connectors for Cloud Storage, Bigtable, and BigQuery pre-installed.
Dataproc provides two ways to customize the software on your cluster: cluster properties and initialization actions. With cluster properties, you can change config file entries for the software on the nodes. With initialization actions, you can run one or more scripts on the nodes. Both of these customizations only happen when the cluster is created.
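Since both customizations are applied at creation time, they end up as options on the cluster creation request. The sketch below assembles a `gcloud dataproc clusters create` command line in Python without running it. The helper name, cluster name, region, bucket path, and property value are all hypothetical. The `--properties` and `--initialization-actions` flags are written as I understand the gcloud CLI to accept them, so confirm the syntax against the gcloud reference.

```python
import shlex
from typing import Dict, List, Sequence


def build_create_command(
    cluster_name: str,
    region: str,
    properties: Dict[str, str],
    init_actions: Sequence[str],
) -> List[str]:
    """Assemble (but do not run) a cluster creation command as an argument list."""
    args = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
    ]
    if properties:
        # Cluster properties use a "prefix:key=value" form, where the prefix
        # names the config file (for example "spark" for Spark's defaults file).
        joined = ",".join(f"{key}={value}" for key, value in sorted(properties.items()))
        args.append(f"--properties={joined}")
    if init_actions:
        # Initialization actions are scripts stored in Cloud Storage.
        args.append("--initialization-actions=" + ",".join(init_actions))
    return args


if __name__ == "__main__":
    command = build_create_command(
        cluster_name="demo-cluster",                              # hypothetical name
        region="us-central1",                                     # hypothetical region
        properties={"spark:spark.executor.memory": "4g"},         # example property
        init_actions=["gs://example-bucket/install-tools.sh"],    # hypothetical script
    )
    print(shlex.join(command))
```

Building the command as a list keeps each argument separate, which avoids shell quoting problems if it is later passed to `subprocess.run`.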
Now you know the relationship between Dataproc, key components of the Hadoop ecosystem and related GCP services, how to create, customize, monitor, and scale Dataproc clusters, how to run data processing jobs on Dataproc, and how to apply access control to Dataproc.
To learn more about Cloud Dataproc, you can read Google’s documentation. Also, watch for new big data courses on Cloud Academy, because we’re always publishing new courses.
If you have any questions or comments, please let me know in the Comments tab below this video or by emailing firstname.lastname@example.org. Thanks and keep on learning!
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).