Google Cloud Dataproc is a managed service for running Apache Hadoop and Spark jobs. It can be used for big data processing and machine learning.
But you could run these data processing frameworks on Compute Engine instances, so what does Dataproc do for you? Dataproc actually uses Compute Engine instances under the hood, but it takes care of the management details for you. It’s a layer on top that makes it easy to spin up and down clusters as you need them.
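As a rough sketch of what that convenience looks like, spinning a cluster up and back down is one gcloud command each. The cluster name, region, size, and machine type below are illustrative placeholders, not values from the course:

```shell
# Create a small Dataproc cluster (name, region, and sizes are placeholders).
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --worker-machine-type=n1-standard-2

# ...run your Hadoop or Spark jobs against the cluster...

# Delete the cluster when the work is done so you stop paying for the VMs.
gcloud dataproc clusters delete example-cluster --region=us-central1
```

Because clusters are this cheap to create and destroy, a common pattern is to treat them as ephemeral: one cluster per job or per workload, deleted as soon as the job finishes.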
Learning Objectives
- Explain the relationship between Dataproc, key components of the Hadoop ecosystem, and related GCP services
- Create, customize, monitor, and scale Dataproc clusters
- Run data processing jobs on Dataproc
- Apply access control to Dataproc
Intended Audience
- Data professionals
- People studying for the Google Professional Data Engineer exam
Prerequisites
- Hadoop or Spark experience (recommended)
- Google Cloud Platform account (sign up for free trial at https://cloud.google.com/free if you don’t have an account)
This Course Includes
- 49 minutes of high-definition video
- Many hands-on demos
What You’ll Learn
- Introduction: An introduction to the course
- What Is Cloud Dataproc?: Dataproc, Hadoop ecosystem, and related GCP services
- Running a Simple Job: A hands-on demo of running a Spark job on Dataproc
- Access Control: How to assign roles to control access to Dataproc
- Scaling a Cluster: A hands-on demo of adding nodes to a cluster for a bigger job
- Connecting to BigQuery: A hands-on demo of using BigQuery with a PySpark job
- Customization: How to customize software on the cluster nodes
- Conclusion: Review of key points
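To give a flavor of the BigQuery lesson, here is a minimal PySpark sketch of reading a BigQuery table into a Spark DataFrame. It assumes the spark-bigquery connector is available on the cluster (Dataproc can provide it); the app name is arbitrary, and the table is a BigQuery public sample dataset:

```python
# Minimal PySpark sketch: load a BigQuery table as a Spark DataFrame.
# Assumes the spark-bigquery connector is on the cluster's classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-example").getOrCreate()

# bigquery-public-data.samples.shakespeare is a public sample table.
words = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
)

words.select("word", "word_count").show(10)
```

A script like this would typically be submitted to a cluster with `gcloud dataproc jobs submit pyspark`, rather than run locally.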
The GitHub repository is at https://github.com/cloudacademy/dataproc-intro.
Do you have a question about this course? You can ask it in the Comments tab above, or email us at firstname.lastname@example.org.