Google Cloud Dataproc is a managed service for running Apache Hadoop and Spark jobs. It can be used for big data processing and machine learning.
Learning Objectives
- Explain the relationship between Dataproc, key components of the Hadoop ecosystem, and related GCP services
- Create, customize, monitor, and scale Dataproc clusters
- Run data processing jobs on Dataproc
- Apply access control to Dataproc
Intended Audience
- Data professionals
- People studying for the Google Professional Data Engineer exam
Prerequisites
- Hadoop or Spark experience (recommended)
- Google Cloud Platform account (sign up for free trial at https://cloud.google.com/free if you don’t have an account)
This Course Includes
- 49 minutes of high-definition video
- Many hands-on demos
The GitHub repository for this course is at https://github.com/cloudacademy/dataproc-intro.
Cloud Dataproc’s purpose in life is to run Apache Hadoop and Spark jobs. But you could run these data processing frameworks on Compute Engine instances, so what does Dataproc do for you? Dataproc actually uses Compute Engine instances under the hood, but it takes care of the management details for you. It’s a layer on top that makes it easy to spin up and down clusters as you need them.
The main benefits are that:
- It’s a managed service, so you don’t need a system administrator to set it up.
- It’s fast. You can spin up a cluster in about 90 seconds.
- It’s cheaper than building your own cluster because you can spin up a Dataproc cluster when you need to run a job and shut it down afterward, so you only pay when jobs are running. (There’s a code sketch of this pattern right after this list.)
- It’s integrated with other Google Cloud services, including Cloud Storage, BigQuery, and Cloud Bigtable, so it’s easy to get data into and out of it.
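To make that spin-up-run-shut-down pattern concrete, here’s a minimal sketch using the google-cloud-dataproc Python client library. The project ID, region, cluster name, and machine types are all placeholders, so adjust them for your own environment:

```python
# A minimal sketch of creating and deleting a Dataproc cluster with the
# google-cloud-dataproc client. Project, region, cluster name, and
# machine types are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"        # placeholder
region = "us-central1"           # placeholder
cluster_name = "ephemeral-demo"  # placeholder

# The client has to point at the regional Dataproc endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

# create_cluster returns a long-running operation; result() blocks until
# the cluster is actually ready.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Created cluster: {operation.result().cluster_name}")

# ...run your jobs here...

# Delete the cluster when the work is done so you stop paying for it.
client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
).result()
```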
Dataproc clusters come with several open-source components pre-installed, including Hadoop, Hive, Pig, and Spark.
If you’re not familiar with these components, their relationships with each other can be confusing. The simplest way to look at them is that Hadoop is the underlying data processing framework and Hive, Pig, and Spark provide different languages to run jobs on Hadoop. It’s not quite as simple as that in Spark’s case, but I’ll explain why in a minute.
The Hadoop core consists of a distributed filesystem called HDFS, a cluster management system called YARN, and a data processing system called MapReduce. To run a MapReduce job, you need to follow its programming model. MapReduce jobs are normally written in Java, but they can be written in other languages as well.
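To give you a feel for the MapReduce programming model, here’s a minimal word-count sketch written for Hadoop Streaming, the interface that lets you write the map and reduce steps in languages like Python instead of Java. The file names are just examples:

```python
#!/usr/bin/env python3
# mapper.py -- the "map" step: read raw lines from stdin and emit a
# (word, 1) pair for every word, one pair per line.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- the "reduce" step: Hadoop sorts the mapper output by key
# before it reaches the reducer, so all counts for a given word arrive
# consecutively and can simply be summed.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```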
Unfortunately, MapReduce jobs tend to be somewhat difficult to write, so a number of alternatives have been developed. The simplest is Hive, whose query language, HiveQL, is almost the same as SQL. Instead of having to write a complicated Java program, you can just write SQL-like statements and Hive will convert them into MapReduce jobs for you. With Hive, you can make Hadoop act like a data warehouse. Hive was originally developed at Facebook, but it was later open-sourced.
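As an illustration, here’s a sketch of submitting a HiveQL query to a Dataproc cluster with the same Python client library. The query and the `logs` table are hypothetical, and the cluster name matches the earlier sketch:

```python
# A sketch of submitting a HiveQL query as a Dataproc job. The project,
# region, cluster, and table names are hypothetical.
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "ephemeral-demo"},
    "hive_job": {
        "query_list": {
            "queries": ["SELECT page, COUNT(*) AS hits FROM logs GROUP BY page;"]
        }
    },
}

# Hive turns the SQL-like query into the underlying data processing jobs.
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
print(operation.result().driver_output_resource_uri)
```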
Next up in terms of complexity is Pig, which has a language called Pig Latin. It’s a scripting language that’s designed for writing data flows, just like MapReduce, but it’s far simpler and easier to read. It was originally developed at Yahoo, but it’s now open source.
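Submitting Pig Latin works the same way; assuming the `job_client`, `project_id`, and `region` from the previous sketch, a hypothetical word-count data flow might look like this:

```python
# A hypothetical Pig Latin word count submitted as a Dataproc pig_job
# (reusing job_client, project_id, and region from the Hive sketch).
pig_script = """
lines = LOAD 'gs://my-bucket/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;
"""

job = {
    "placement": {"cluster_name": "ephemeral-demo"},
    "pig_job": {"query_list": {"queries": [pig_script]}},
}
job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()
```

Notice how each line reads as a named step in a data flow, which is what makes Pig scripts easier to follow than the equivalent MapReduce code.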
Spark is the most complex and the most flexible alternative to MapReduce, but it’s different from Hive and Pig because it also has its own data processing engine. That is, it doesn’t use the MapReduce engine at all. Spark’s biggest advantage is that it can run jobs in memory instead of on disk, so it can be one or two orders of magnitude faster than disk-based MapReduce jobs. Spark can also run machine learning jobs using its MLlib library. You can even run Spark completely on its own, without any of the Hadoop components. That’s why I said earlier that it’s not just a different language for Hadoop.
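Here’s the same word count as a minimal PySpark sketch. The cache() call is what keeps the intermediate result in memory, which is where Spark’s speed advantage comes from; the bucket path is a placeholder:

```python
# A minimal PySpark word count, runnable with spark-submit on a Dataproc
# cluster. The gs:// path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.sparkContext.textFile("gs://my-bucket/input.txt")
counts = (
    lines.flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
counts.cache()  # keep the result in memory so later actions reuse it

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```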
So where does Cloud Dataproc fit into all of this? It deploys these components for you on managed Compute Engine instances and integrates with YARN, which makes cluster management much easier.
It’s worth noting that Google has its own alternatives to all of these components. In fact, Google’s internal products are what inspired the open source community to develop similar software.
In 2003, Google released a white paper on the Google File System, and in late 2004, they released a white paper on their internal software called MapReduce. A little over a year later, Apache Hadoop was created. HDFS was similar to the Google File System and they even called the data processing layer MapReduce, just like Google did.
In 2006, Google released a white paper on their internal NoSQL datastore called Bigtable, and this led to the creation of Apache HBase as a component for Hadoop.
Google released Bigtable as Cloud Bigtable in 2015 and made it compatible with the HBase API, so you can use Bigtable as a replacement for HBase when you run Hadoop jobs.
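As a sketch of what working with Bigtable looks like, here’s a small read/write example. It uses the native google-cloud-bigtable Python client rather than the HBase API, and the project, instance, table, and column names are all placeholders:

```python
# A sketch of writing and reading a Bigtable row with the
# google-cloud-bigtable client. Names are placeholders, and the "stats"
# column family is assumed to already exist on the table.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("my-table")

# Write one cell: row key -> column family "stats" -> column "hits".
row = table.direct_row(b"page#home")
row.set_cell("stats", b"hits", b"42")
row.commit()

# Read it back.
result = table.read_row(b"page#home")
print(result.cells["stats"][b"hits"][0].value)
```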
Google also has a complete replacement for Hadoop and Spark called Cloud Dataflow. It’s similar to Spark, but it uses a programming framework called Apache Beam that’s superior to Spark’s programming model, although Spark is catching up.
If you want to use a fast, managed data warehouse service, then you can use Google BigQuery instead of Hadoop with Hive.
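For instance, here’s a sketch of running an interactive query with the google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical:

```python
# A sketch of running a SQL query on BigQuery. The project, dataset, and
# table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

query = """
    SELECT page, COUNT(*) AS hits
    FROM `my-project.web.logs`
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
"""

# BigQuery runs the query on its own managed infrastructure; there is no
# cluster to create or tear down.
for row in client.query(query).result():
    print(row.page, row.hits)
```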
If you want a powerful, managed machine learning service, then you can use Google Cloud Machine Learning Engine instead of Spark with MLlib.
Yet another open-source system that works with Hadoop is Apache Kafka. It’s a fault-tolerant publish/subscribe messaging system. If you need to continuously ingest big streams of data into Hadoop and you want to make sure that none of the data is lost, then Kafka is a good choice. Google has an alternative for this too. It’s called Cloud Pub/Sub and it can handle millions of messages per second.
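To show the publish side of that model, here’s a minimal sketch using the google-cloud-pubsub Python client; the project and topic names are placeholders, and the topic is assumed to already exist:

```python
# A minimal sketch of publishing messages to Cloud Pub/Sub. The project
# and topic names are placeholders.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "ingest-topic")

for i in range(3):
    # publish() is asynchronous; result() blocks and returns the
    # server-assigned message ID once the message is durably stored.
    future = publisher.publish(topic_path, f"event {i}".encode("utf-8"))
    print("published message", future.result())
```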
And that’s it for this lesson.