Google Cloud Dataproc is a managed service for running Apache Hadoop and Spark jobs. It can be used for big data processing and machine learning.
But you could run these data processing frameworks on Compute Engine instances yourself, so what does Dataproc do for you? Dataproc actually uses Compute Engine instances under the hood, but it takes care of the management details for you. It’s a layer on top that makes it easy to spin up and tear down clusters as you need them.
Learning Objectives

- Explain the relationship between Dataproc, key components of the Hadoop ecosystem, and related GCP services
- Create, customize, monitor, and scale Dataproc clusters
- Run data processing jobs on Dataproc
- Apply access control to Dataproc
Intended Audience

- Data professionals
- People studying for the Google Professional Data Engineer exam
Prerequisites

- Hadoop or Spark experience (recommended)
- Google Cloud Platform account (sign up for free trial at https://cloud.google.com/free if you don’t have an account)
This Course Includes
- 49 minutes of high-definition video
- Many hands-on demos
The GitHub repository for this course is at https://github.com/cloudacademy/dataproc-intro.
Welcome to the “Introduction to Google Cloud Dataproc” course. I’m Guy Hummel, the Google Cloud Content Lead at Cloud Academy, and I’ll be showing you how to use this big data processing service. If you have any questions, feel free to connect with me on LinkedIn and send me a message, or send an email to email@example.com.
This course is intended for data professionals, especially those who need to design and build big data processing systems. This is an important course to take if you’re studying for the Google Professional Data Engineer exam.
To get the most from this course, it would be helpful to have some experience with either Hadoop or Spark, although it’s not strictly required. This is a hands-on course with lots of demonstrations. The best way to learn is by doing, so I recommend that you try performing these tasks yourself on your own Google Cloud account. If you don’t have one, then you can sign up for a free trial.
Cloud Dataproc is a managed service for running Apache Hadoop and Spark jobs. It can be used for big data processing and machine learning.
In this course, we’ll start with an overview of Dataproc, the Hadoop ecosystem, and related Google Cloud services.
Next, I’ll show you how to create a cluster, run a simple job, and see the results. I’ll also explain Dataproc pricing.
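As a preview of what that looks like in practice, both cluster creation and job submission come down to a couple of `gcloud` commands. The cluster name, region, and machine types below are illustrative placeholders, not values from the course:

```shell
# Create a small Dataproc cluster (names, region, and machine types are
# placeholders -- choose values appropriate for your project).
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-2 \
    --num-workers=2 \
    --worker-machine-type=n1-standard-2

# Submit the SparkPi example job that ships with Spark on the cluster.
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

# Delete the cluster when you're done so you stop incurring charges.
gcloud dataproc clusters delete example-cluster --region=us-central1
```

Since Dataproc bills per-second while a cluster runs, the create-run-delete pattern shown here is the usual way to keep costs down.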
Then we’ll go over how to increase security through access control.
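Dataproc access control is handled through Cloud IAM. As a hedged sketch (the project ID and user email are placeholders), granting someone permission to manage Dataproc jobs and clusters looks like this:

```shell
# Grant a user the Dataproc Editor role at the project level
# (project ID and email address are hypothetical examples).
gcloud projects add-iam-policy-binding my-project \
    --member=user:alice@example.com \
    --role=roles/dataproc.editor
```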
After that, you’ll see how to scale up a cluster with regular and preemptible nodes, and watch its performance with Stackdriver Monitoring.
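Scaling an existing cluster is a single command. In this sketch (cluster name and region are placeholders), primary workers store HDFS data, while preemptible workers add inexpensive compute capacity but can be reclaimed by Compute Engine at any time:

```shell
# Resize a running cluster: 4 primary workers plus 2 preemptible workers.
gcloud dataproc clusters update example-cluster \
    --region=us-central1 \
    --num-workers=4 \
    --num-preemptible-workers=2
```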
Then, I’ll take you through an example of running a Spark job that reads data from BigQuery.
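One way to do this is with the spark-bigquery connector, which exposes BigQuery tables as Spark DataFrames. This is a minimal PySpark sketch, not the course's exact code; it assumes the connector jar is available on the cluster, and the public Shakespeare sample table is just an illustration:

```python
from pyspark.sql import SparkSession

# Assumes this runs on a Dataproc cluster where the spark-bigquery
# connector jar is available (e.g. via the --jars flag at submit time).
spark = SparkSession.builder.appName("bq-example").getOrCreate()

# Read a public BigQuery sample table into a Spark DataFrame.
df = (spark.read.format("bigquery")
      .option("table", "bigquery-public-data:samples.shakespeare")
      .load())

# Count the words per corpus and print the result.
df.groupBy("corpus").count().show()
```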
And finally, I’ll show you how to customize the software on your Dataproc clusters.
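The standard mechanism for this is an initialization action: a script, staged in Cloud Storage, that Dataproc runs on every node as the cluster is created. The script contents and bucket name below are hypothetical:

```shell
# Write a trivial initialization action that installs a Python package
# on each node (the package choice here is just an example).
cat > install-deps.sh <<'EOF'
#!/bin/bash
pip install requests
EOF

# Stage the script in a Cloud Storage bucket (bucket name is a placeholder)...
gsutil cp install-deps.sh gs://my-bucket/install-deps.sh

# ...and reference it when creating the cluster.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/install-deps.sh
```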
By the end of this course, you should be able to explain the relationship between Dataproc, key components of the Hadoop ecosystem, and related GCP services; create, customize, monitor, and scale Dataproc clusters; run data processing jobs on Dataproc; and apply access control to Dataproc.
We’d love to get your feedback on this course, so please let us know what you think on the Comments tab below or by emailing firstname.lastname@example.org.
Now, if you’re ready to learn how to get the most out of Cloud Dataproc, then let’s get started.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).