Most organizations are already gathering and analyzing big data or plan to do so in the near future. One common way to process huge datasets is to use Apache Hadoop or Spark. Google even has a managed service for hosting Hadoop and Spark. It’s called Cloud Dataproc. So why do they also offer a competing service called Cloud Dataflow? Well, Google probably has more experience processing big data than any other organization on the planet and now they’re making their data processing software available to their customers. Not only that, but they’ve also open-sourced the software as Apache Beam.
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
In this course, you will learn how to write data processing programs using Apache Beam and then run them using Cloud Dataflow. You will also learn how to run both batch and streaming jobs.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
- Write a data processing program in Java using Apache Beam
- Use different Beam transforms to map and aggregate data
- Use windows, timestamps, and triggers to process streaming data
- Deploy a Beam pipeline both locally and on Cloud Dataflow
- Output data from Cloud Dataflow to Google BigQuery
The GitHub repository is at https://github.com/cloudacademy/beam.
So far, we’ve only used the Count transform, which is about as simple as a transform can be. Now we’ll look at how to build more complex transforms.
On the Beam website, there’s a walkthrough of a program called MinimalWordCount. It’s basically the same as MinimalLineCount except that, as you’ve probably guessed, it counts words instead of lines. The reason I didn’t start with MinimalWordCount is that it contains a custom transform to split the lines into words.
Click on “MinimalWordCount.java”. This looks a lot longer than LineCount, but it’s mostly comments.
It does take considerably longer to run than LineCount, though, and since the Cloud Shell VM has a very small amount of CPU and memory, it’s not suitable for running this program. So we’re going to run it on Dataflow. But to do that, we need to accept command-line arguments for the runner and the output bucket. I’ve added essentially the same code that’s in LineCount here. Now it’ll run on Dataflow.
Here’s the transform that splits each line into words. It uses the ParDo class, which stands for “Parallel Do” because it gives you a way to run an operation on many elements of data in parallel, just like the pre-written transforms do.
You need to call its “of” method and pass it a DoFn (short for “Do Function”) with the input type and output type, which is a String for both in this case. Then you override the processElement method, which takes a ProcessContext parameter. ProcessContext contains all of the information needed to process an element.
Now we can finally put in the custom code for this transform. Here it splits the element (which is a line of text) into words using a regular expression. Then for each word in the line, if it’s not empty, it adds it to the output, which will be put in a PCollection.
When you create your own custom transforms, you can put in whatever code you want. Just perform some operation on “element” and save the transformed version of it as “output”.
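To make the split-and-filter step concrete, here is a minimal sketch in plain Java of the logic that sits inside the custom transform. It deliberately does not use the Beam API: the class name ExtractWordsSketch, the method extractWords, and the regular expression are assumptions for illustration, and a returned list stands in for the output PCollection.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the body of the custom transform (not Beam code).
class ExtractWordsSketch {
    // Splits one line on runs of non-letter characters and keeps the non-empty pieces.
    // The regex is an assumption; the lesson does not spell out the exact pattern.
    static List<String> extractWords(String element) {
        List<String> output = new ArrayList<>();
        for (String word : element.split("[^\\p{L}]+")) {
            if (!word.isEmpty()) {
                output.add(word); // in Beam, this is where the word would be emitted
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // prints [Blow, winds, and, crack, your, cheeks]
        System.out.println(extractWords("Blow, winds, and crack your cheeks!"));
    }
}
```

The empty-string check matters because a line that starts with punctuation or whitespace produces an empty first piece when split this way.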
Oh, and one more thing. You might be wondering about this ExtractWords text. It’s simply a name you can assign to this apply statement. You don’t have to name your applies, but it can be helpful because Dataflow will show this name in logs and on the pipeline graph in the console. I’ll show you an example in the next lesson.
The next transform is Count, but this one is a bit different from what we did in the LineCount program, which used the “globally” method. This one uses the perElement method. It does a count on each unique element in the PCollection, which in this case is a word in King Lear. For example, if there are 6 copies of the word “happy” in the PCollection, then it will reduce them down to a key/value pair of “happy” and 6.
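As a rough picture of what a per-element count produces, here is a plain-Java sketch that reduces a list of words to one key/value pair per distinct word. It is not the Beam Count transform itself, and it runs on a single machine rather than in parallel; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine illustration of counting each unique element.
class CountPerElementSketch {
    static Map<String, Long> countPerElement(List<String> words) {
        // TreeMap is used only so the printed order is stable; Beam makes no ordering promise.
        Map<String, Long> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum); // add 1 to the running count for this word
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("happy", "king", "happy", "fool", "happy");
        // prints {fool=1, happy=3, king=1}
        System.out.println(countPerElement(words));
    }
}
```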
Now, here’s the transform that converts the previous PCollection into one that can be printed to a text file. This is also more complicated than the one in the LineCount program, which only had to print one number. This one has to print key/value pairs, so for each element, it creates a string containing a key, a colon, and a value.
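The formatting step is just a function from a key/value pair to a string. Here is a small plain-Java sketch of it, using Map.Entry as a stand-in for the pair type; the names are hypothetical, and the space after the colon is an assumption about the exact output format.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical illustration of turning one (word, count) pair into a printable line.
class FormatResultsSketch {
    static String format(Map.Entry<String, Long> pair) {
        // key, a colon, and the value; the space after the colon is an assumption
        return pair.getKey() + ": " + pair.getValue();
    }

    public static void main(String[] args) {
        // prints happy: 6
        System.out.println(format(new SimpleEntry<>("happy", 6L)));
    }
}
```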
First, go into the beam/examples/java8 directory. Then paste this command.
When it’s done, go to your bucket in Cloud Storage. You’ll see that it created a number of output files starting with “wordcounts”. Have a look at one of them. You’ll see a different list of words than this because the Beam model doesn’t guarantee the order of elements.
By the way, it’s actually possible to write this word count program without using ParDo. This version, called MinimalWordCountJava8, uses the FlatMapElements transform to split each line into words. You still have to tell it how to do the split, though, so it’s not a completely out-of-the-box transform. Then it uses the Filter transform to remove empty words, which was the second operation performed by the custom transform in the original version of this program.
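If you know Java 8 streams, the flat-map-then-filter shape will look familiar. The sketch below uses java.util.stream as an analogy only; it is not Beam’s FlatMapElements or Filter, and the class name, method name, and regular expression are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical analogy using Java streams: flat-map each line to its pieces, then filter.
class FlatMapFilterSketch {
    static List<String> extractWords(List<String> lines) {
        return lines.stream()
                // step 1: one line becomes zero or more pieces (you supply the split logic)
                .flatMap(line -> Arrays.stream(line.split("[^\\p{L}]+")))
                // step 2: drop the empty pieces
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [Blow, winds, and, crack]
        System.out.println(extractWords(Arrays.asList("  Blow, winds", "", "and crack")));
    }
}
```

Splitting the work into two named steps does the same job as the single custom transform; the difference is mainly how the code reads.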
Anyway, you can often choose between using ParDo or a more specific transform, such as FlatMapElements. The choice usually comes down to readability of the code. In this example, the two versions take about the same amount of time to run.
And that’s it for this lesson.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).