Using Cloud Dataproc
Google Cloud Dataproc is a managed service for running Apache Hadoop and Spark jobs. It can be used for big data processing and machine learning.
But you could run these data processing frameworks on Compute Engine instances, so what does Dataproc do for you? Dataproc actually uses Compute Engine instances under the hood, but it takes care of the management details for you. It’s a layer on top that makes it easy to spin up and down clusters as you need them.
Learning Objectives

- Explain the relationship between Dataproc, key components of the Hadoop ecosystem, and related GCP services
- Create, customize, monitor, and scale Dataproc clusters
- Run data processing jobs on Dataproc
- Apply access control to Dataproc

Intended Audience

- Data professionals
- People studying for the Google Professional Data Engineer exam

Prerequisites

- Hadoop or Spark experience (recommended)
- Google Cloud Platform account (sign up for a free trial at https://cloud.google.com/free if you don’t have an account)
This Course Includes
- 49 minutes of high-definition video
- Many hands-on demos
The GitHub repository for this course is at https://github.com/cloudacademy/dataproc-intro.
Dataproc has a very simple access control model. You can give users either the Viewer or Editor role. A Viewer can see clusters, jobs, and operations, but can’t do anything else. That is, a Viewer can’t create, update, or delete clusters, and can’t submit, cancel, or delete jobs. An Editor can do all of those things.
These roles can only be granted at the project level, so you can’t, for example, give a person Editor access to one particular cluster. It’s all or nothing for every cluster in a project.
You can assign one of these roles in IAM in either the Dataproc menu or the Project menu. The difference is that if you assign a role at the project level, then the user will have those permissions across all Google Cloud Platform services. For example, if you assign a user the Editor role at the project level, then they’ll have edit permissions for Compute Engine and Cloud Storage as well.
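As a sketch of what the Dataproc-specific assignment looks like from the command line, here are the two role grants using gcloud. The project ID and user email addresses are hypothetical; the role IDs are the Dataproc Viewer and Editor roles described above.

```shell
# Grant the Dataproc Viewer role (read-only: can see clusters,
# jobs, and operations, but can't change anything):
gcloud projects add-iam-policy-binding my-project \
    --member="user:alice@example.com" \
    --role="roles/dataproc.viewer"

# Grant the Dataproc Editor role (can also create, update, and
# delete clusters, and submit, cancel, and delete jobs):
gcloud projects add-iam-policy-binding my-project \
    --member="user:bob@example.com" \
    --role="roles/dataproc.editor"
```

Because these are the Dataproc-scoped roles rather than the project-wide Viewer and Editor roles, the users get these permissions only for Dataproc, not for other GCP services.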
There’s actually one other Dataproc role, but it’s normally assigned only to service accounts. It’s the Worker role. It has completely different permissions than the other two roles because it’s used to execute jobs. The permissions are related to things like tasks, logs, and storage.
A service account is what an application uses to authenticate itself with other GCP services. For example, if a Hadoop application needs to use Cloud Storage, then it has to authenticate with Cloud Storage first. It uses a service account to do that. The service account has been granted a Cloud Storage role, which says what the service account is allowed to do with Cloud Storage.
The VMs in a Dataproc cluster use the Compute Engine default service account. If you want to use a different service account, then you can specify it when you create the cluster. However, at the moment, you can only do this using either the gcloud command or the Dataproc API. You can’t do it in the web console yet.
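A minimal sketch of specifying a different service account at cluster creation time, assuming hypothetical cluster, region, and service account names:

```shell
# The --service-account flag tells Dataproc which identity the
# cluster's VMs should use instead of the Compute Engine default
# service account.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --service-account=restricted-dataproc@my-project.iam.gserviceaccount.com
```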
Why would you want to use a different service account? You’d use one if you wanted to give different Dataproc clusters different levels of access to resources. For example, if you kept sensitive data in BigQuery, then you might want to only give certain clusters access to it. Since Dataproc clusters have access to BigQuery by default, you’d have to assign a special service account to clusters when you created them, so they wouldn’t have access to BigQuery.
When you create a special service account for Dataproc to use, you have to grant it the Worker role so it can execute jobs.
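Granting the Worker role is another IAM policy binding, this time on a service account member. The project and service account names here are hypothetical:

```shell
# roles/dataproc.worker gives the cluster's VMs the permissions
# they need to execute jobs, such as access to tasks, logs,
# and staging storage.
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:restricted-dataproc@my-project.iam.gserviceaccount.com" \
    --role="roles/dataproc.worker"
```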
And that’s it for access control.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).