Most organizations are already gathering and analyzing big data or plan to do so in the near future. One common way to process huge datasets is to use Apache Hadoop or Spark. Google even has a managed service for hosting Hadoop and Spark. It’s called Cloud Dataproc. So why do they also offer a competing service called Cloud Dataflow? Well, Google probably has more experience processing big data than any other organization on the planet and now they’re making their data processing software available to their customers. Not only that, but they’ve also open-sourced the software as Apache Beam.
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
In this course, you will learn how to write data processing programs using Apache Beam and then run them using Cloud Dataflow. You will also learn how to run both batch and streaming jobs.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
Learning Objectives
- Write a data processing program in Java using Apache Beam
- Use different Beam transforms to map and aggregate data
- Use windows, timestamps, and triggers to process streaming data
- Deploy a Beam pipeline both locally and on Cloud Dataflow
- Output data from Cloud Dataflow to Google BigQuery
Resources
The GitHub repository for this course can be found at https://github.com/cloudacademy/beam.
Welcome to the “Introduction to Google Cloud Dataflow” course. My name’s Guy Hummel and I’ll be showing you how to process huge amounts of data in the cloud. I’m the Google Cloud Content Lead at Cloud Academy and I’m a Google Certified Professional Cloud Architect and Data Engineer. If you have any questions, feel free to connect with me on LinkedIn and send me a message, or send an email to support@cloudacademy.com.
This course is intended for data professionals, especially those who need to design and build big data processing systems. This is an important course to take if you’re studying for the Google Professional Data Engineer exam.
To get the most from this course, you should have experience with Java, because I’ll be showing you lots of examples of code written in Java. I’ll also show you how to run these examples on the Dataflow service, so if you don’t already have a Google Cloud account, I recommend signing up for a free trial. It’s good for a year and lets you run up to $300 worth of services.
Cloud Dataflow executes data processing pipelines. A pipeline is a sequence of steps that reads data, transforms it in some way, and writes it out. Since Dataflow is designed to process very large datasets, it distributes these processing tasks to a number of virtual machines in a cluster, so they can process different chunks of the data in parallel.
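As a rough illustration of that read, transform, write shape, here is a minimal sketch in plain Java. It deliberately does not use the Beam SDK: the class `PipelineSketch`, its methods, and the sample records are all hypothetical, and a parallel stream on a single machine only stands in for the cluster of workers that Dataflow would manage.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Step 1 (read): hypothetical in-memory records standing in for a file or table.
    static List<String> read() {
        return Arrays.asList("alpha,3", "beta,5", "gamma,8");
    }

    // Step 2 (transform): parse one record and keep only its numeric field.
    static int parseCount(String line) {
        return Integer.parseInt(line.split(",")[1].trim());
    }

    // Runs read -> transform -> "write" (here, the results are simply returned).
    // parallelStream() lets different chunks of the input be processed concurrently,
    // loosely mirroring how work is spread across VMs.
    static List<Integer> run(List<String> input) {
        return input.parallelStream()
                .map(PipelineSketch::parseCount)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(read())); // prints [3, 5, 8]
    }
}
```

In a real Beam program, the same three steps would be expressed as transforms applied to a `Pipeline` object rather than as stream operations.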
Cloud Dataflow is certainly not the first big data processing engine. It’s not even the only one available on Google Cloud Platform. For example, one alternative is to run Apache Spark on Google’s Dataproc service. So why would you choose Dataflow? There are a few reasons.
First, it’s essentially serverless. That is, you don’t have to manage the compute resources yourself. Dataflow will automatically spin clusters of virtual machines up and down when you run processing jobs. You can just focus on writing the code instead of building clusters. Apache Spark, on the other hand, requires more configuration, even if you run it on Cloud Dataproc.
Second, Google has separated the processing code from the environment where it runs. In 2016, they open-sourced Dataflow’s Software Development Kit, which was released as Apache Beam. Now you can write Beam programs and run them on your own systems or on the Cloud Dataflow service. In fact, if you look at Google’s Dataflow documentation, you’ll see that it tells you to go to the Apache Beam website for the latest version of the Software Development Kit.
Third, it was designed to process data in both batch and streaming modes with the same programming model. This is a big deal. Other big data SDKs typically require that you use different code depending on whether the data comes in batch or streaming form. Competitors like Spark are addressing this, but they’re not quite there yet.
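To make the idea of one programming model for both modes more concrete, here is a small plain-Java sketch. It is not Beam code, and every name in it (`UnifiedModelSketch`, `WindowedCounter`, the 60-second window size) is an assumption chosen for illustration. The point it tries to show is that if the processing logic handles one timestamped element at a time and groups results by fixed time windows, the same logic can serve a complete dataset or elements that arrive one by one.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UnifiedModelSketch {
    // Hypothetical window size for this sketch: 60 seconds.
    static final long WINDOW_MILLIS = 60_000L;

    // Maps a timestamp to the start of the fixed window that contains it.
    static long windowStart(long timestampMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, WINDOW_MILLIS);
    }

    // Counts elements per window, one element at a time, so it does not
    // care whether the input is bounded (batch) or still arriving (streaming).
    static class WindowedCounter {
        private final Map<Long, Integer> counts = new TreeMap<>();

        void add(long timestampMillis) {
            counts.merge(windowStart(timestampMillis), 1, Integer::sum);
        }

        Map<Long, Integer> snapshot() {
            return new TreeMap<>(counts);
        }
    }

    // Batch stand-in: the whole dataset is available up front.
    static Map<Long, Integer> countBatch(List<Long> timestamps) {
        WindowedCounter counter = new WindowedCounter();
        for (long ts : timestamps) {
            counter.add(ts);
        }
        return counter.snapshot();
    }

    // Streaming stand-in: elements are pulled one by one as they "arrive".
    static Map<Long, Integer> countStream(Iterator<Long> arriving) {
        WindowedCounter counter = new WindowedCounter();
        while (arriving.hasNext()) {
            counter.add(arriving.next());
        }
        return counter.snapshot();
    }

    public static void main(String[] args) {
        List<Long> timestamps = Arrays.asList(0L, 59_999L, 60_000L, 125_000L);
        System.out.println(countBatch(timestamps));
        System.out.println(countStream(timestamps.iterator()));
    }
}
```

Both calls in `main` go through the same `WindowedCounter` logic and produce the same per-window counts. A real streaming system also has to decide when a window's result is ready to emit, which this sketch leaves out.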
We’ll start with how to build and execute a simple pipeline locally. Then I’ll show you how to run it on Cloud Dataflow.
Next, we’ll look at how to build more complex pipelines using custom and composite transforms.
Finally, I’ll show you how to deal with time, using windows and triggers. You’ll also see how to integrate a pipeline with Google BigQuery.
By the end of this course, you should be able to write a data processing program in Java using Apache Beam; use different Beam transforms to map and aggregate data; use windows, timestamps, and triggers to process streaming data; deploy a Beam pipeline both locally and on Cloud Dataflow; and output data from Cloud Dataflow to Google BigQuery.
We would love to get your feedback on this course, so please let us know what you think on the Comments tab below or by emailing support@cloudacademy.com.
Now, if you’re ready to learn how to get the most out of Dataflow, then let’s get started.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).