Most organizations are already gathering and analyzing big data or plan to do so in the near future. One common way to process huge datasets is to use Apache Hadoop or Spark. Google even has a managed service for hosting Hadoop and Spark. It’s called Cloud Dataproc. So why do they also offer a competing service called Cloud Dataflow? Well, Google probably has more experience processing big data than any other organization on the planet and now they’re making their data processing software available to their customers. Not only that, but they’ve also open-sourced the software as Apache Beam.
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
In this course, you will learn how to write data processing programs using Apache Beam and then run them using Cloud Dataflow. You will also learn how to run both batch and streaming jobs.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
Learning Objectives
- Write a data processing program in Java using Apache Beam
- Use different Beam transforms to map and aggregate data
- Use windows, timestamps, and triggers to process streaming data
- Deploy a Beam pipeline both locally and on Cloud Dataflow
- Output data from Cloud Dataflow to Google BigQuery
Resources
The GitHub repository for this course can be found at https://github.com/cloudacademy/beam.
If you need to perform many transforms in your pipeline, you can make your code more readable and reusable by grouping related transforms together into higher-level transforms. These are called composite transforms.
For example, here’s the WordCount program, which is more sophisticated than MinimalWordCount. It combines the transform that splits lines into words and the transform that counts the number of occurrences of each word into a new composite transform called CountWords. This simplifies the pipeline code and makes it easier to understand.
The CountWords class is defined up here. To create a composite transform, you need to make it a subclass of PTransform and specify the input PCollection type and the output PCollection type. This one inputs a PCollection of Strings and outputs a PCollection of key/value pairs of type String and Long.
Then you need to override the “expand” method. This is where you put your two or more apply statements to transform one PCollection into another. The first one here splits a line into words and the second one counts the number of times each word occurs.
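Here’s roughly what that looks like, as a simplified sketch of the CountWords transform based on the Beam WordCount example (in the real program it’s a static nested class, and ExtractWordsFn is the DoFn, defined elsewhere in the example, that splits lines into words):

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// A composite transform: takes a PCollection of lines (Strings) and
// produces a PCollection of (word, count) pairs.
public class CountWords
    extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {

  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
    // First apply: split each line into individual words.
    PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));

    // Second apply: count the number of occurrences of each word.
    return words.apply(Count.perElement());
  }
}
```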
Now that you know how this works, we’ll move on to a more complex program. It’s called UserScore and it’s back in the java8 directory tree.
This is a mobile gaming example where there are many players and they can each play the game multiple times per day. Each time someone plays the game, their activity is recorded in an event log. Each log entry includes a score. The UserScore program reads in these event logs, parses each log entry, adds up all of the scores for each player, and outputs the total score for each to text files.
I’m not going to go through all of the code in detail, but I’ll show you some of the highlights. First, have a look at the pipeline code to get a big-picture view of what it’s doing. After reading in the data, it parses the game events, then it calculates the score for each user, and then writes the scores to text files.
This pipeline code makes it easy to understand what it’s doing at a high level. The names for the applies are particularly helpful for seeing what it does at a glance.
Now let’s look at how it performs these steps. First, to parse the game events, it uses the ParDo transform and passes it an instance of ParseEventFn, which is defined above. You’ll recall that when you use the ParDo class, you need to pass it a subclass of DoFn. That’s what ParseEventFn is.
In the comments up here, it tells you what an event log entry looks like. It has fields like username, teamname, and score. It also gives you an example of an entry here.
The ProcessElement method reads each line into a custom data type called GameActionInfo. Then it outputs all of the GameActionInfo objects to a PCollection.
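Here’s a condensed sketch of that DoFn, based on the version in the Beam mobile gaming example (GameActionInfo is the custom data type defined elsewhere in the program, and the sample log entry in the comment comes from the example’s documentation):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Parses CSV log entries like:
//   user2_AsparagusPig,AsparagusPig,10,1445230923951,2015-11-02 09:09:28.224
public class ParseEventFn extends DoFn<String, GameActionInfo> {

  private static final Logger LOG = LoggerFactory.getLogger(ParseEventFn.class);

  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] components = c.element().split(",", -1);
    try {
      String user = components[0].trim();
      String team = components[1].trim();
      Integer score = Integer.parseInt(components[2].trim());
      Long timestamp = Long.parseLong(components[3].trim());
      // Wrap the parsed fields in a GameActionInfo object and emit it
      // to the output PCollection.
      c.output(new GameActionInfo(user, team, score, timestamp));
    } catch (ArrayIndexOutOfBoundsException | NumberFormatException e) {
      // Corrupt entries are logged and skipped rather than failing the job;
      // these are the parse errors you'll see in the job logs later.
      LOG.info("Parse error on " + c.element() + ": " + e.getMessage());
    }
  }
}
```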
After parsing the data, it applies a composite transform called ExtractAndSumScore. That’s defined here. It’s a subclass of PTransform and it specifies that the input PCollection contains objects of type GameActionInfo and the output PCollection contains key/value pairs of type String and Integer. That’s because each output element will be a player and a score.
In the expand method, it has two applies. The first one uses MapElements to extract the player and score from each GameActionInfo object. The second one uses Sum.integersPerKey to add up all of the scores for each player. It specifies String here because the key, which is the player’s name, is a String. The sum is an integer.
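In code, that looks roughly like this. It’s a simplified sketch: the real version takes the key field (“user” or “team”) as a constructor argument, while this one is hard-coded to key by user, and it assumes GameActionInfo exposes getUser() and getScore() accessors as in the Beam example:

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// A composite transform: takes GameActionInfo objects and produces
// (player, total score) pairs.
public class ExtractAndSumScore
    extends PTransform<PCollection<GameActionInfo>, PCollection<KV<String, Integer>>> {

  @Override
  public PCollection<KV<String, Integer>> expand(PCollection<GameActionInfo> gameInfo) {
    return gameInfo
        // First apply: map each GameActionInfo to a (user, score) pair.
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
            .via((GameActionInfo gInfo) -> KV.of(gInfo.getUser(), gInfo.getScore())))
        // Second apply: add up all of the scores for each player. String is
        // specified here because the key (the player's name) is a String.
        .apply(Sum.<String>integersPerKey());
  }
}
```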
Alright, let’s run it. This is a pretty big job because, by default, it reads in two data files from Cloud Storage that contain a total of about 22 GB of data. This is the sort of job that Dataflow can handle easily.
First, go into the java8 directory. Then paste this command.
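The exact command is shown in the demo, but for reference, a typical invocation of the UserScore example on Dataflow looks something like this (the project ID and bucket names are placeholders, and the flags may differ slightly in the course repo):

```
mvn compile exec:java \
  -Dexec.mainClass="org.apache.beam.examples.complete.game.UserScore" \
  -Dexec.args="--runner=DataflowRunner \
      --project=YOUR_PROJECT_ID \
      --tempLocation=gs://YOUR_BUCKET/temp \
      --output=gs://YOUR_BUCKET/scores"
```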
While you’re waiting, have a look in the Dataflow console. Remember when I mentioned that the names you give to applies will show up on the pipeline graph? Well, here they are. The TextIO.Read apply wasn’t given a name in the code, so this box just shows the name of the transform itself. The rest of the boxes, on the other hand, all show the names they were given in the code.
If you click on a box, it’ll give you information about what has happened so far with that transform. If you click the down arrow in the ExtractUserScore box, it’ll show the sub-transforms within the composite transform. The second sub-transform, Combine.perKey(sumInteger), is actually a composite transform as well, so you can click on its down arrow to see its sub-transforms too.
If you want to see the logs for the job, click on Logs. If you want to search them, then click on Stackdriver. If you scroll up in the logs, you’ll see lots of messages about corrupt data. That’s because the UserScore program logs parse errors.
Another interesting detail is the autoscaling graph on the right. You can see that Dataflow spun up 5 workers for this job. It knew that it would be a bigger job than the LineCount job we ran earlier, so it decided to use 5 workers instead of 1.
It’s going to take a while longer to finish, so I’ll fast forward to the end. OK, it took about 6 minutes. Since this was a longer job that used more workers, you might be concerned about how much it will cost. We can figure that out by looking at the resource metrics below the graph.
Here’s the current pricing for Dataflow. My jobs run in the Iowa region, which is us-central1. As you can see, there are quite a few components to the price. The job we just ran was a batch job, so we need to use the prices in the first row.
This job didn’t use any SSD storage and it didn’t use Dataflow Shuffle, so we just need to plug the first three numbers into the pricing formula. Here’s the formula: (total vCPU time in vCPU-hours × $0.056) + (total memory time in GB-hours × $0.003557) + (total PD time in GB-hours × $0.000054). Now if we plug in the numbers from this job...we get about 4 cents. I think that’s a pretty good deal.
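To see how that plays out, here’s the arithmetic as a tiny Java snippet. The metric values are hypothetical, chosen to roughly match five default n1-standard-1 workers (1 vCPU, 3.75 GB of RAM, and 250 GB of persistent disk each) running for about six minutes; the real numbers come from the job’s resource metrics panel.

```java
public class DataflowCostEstimate {
  public static void main(String[] args) {
    // Hypothetical resource metrics for illustration only; read the actual
    // values from the job's resource metrics panel in the Dataflow console.
    double vcpuHours  = 0.5;    // total vCPU time (5 workers × 1 vCPU × ~0.1 hr)
    double memGbHours = 1.875;  // total memory time (5 × 3.75 GB × ~0.1 hr)
    double pdGbHours  = 125.0;  // total PD time (5 × 250 GB × ~0.1 hr)

    double cost = vcpuHours  * 0.056      // batch vCPU price per vCPU-hour
                + memGbHours * 0.003557   // batch memory price per GB-hour
                + pdGbHours  * 0.000054;  // batch PD price per GB-hour

    System.out.printf("Estimated cost: $%.4f%n", cost);  // ≈ $0.0414, about 4 cents
  }
}
```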
Now let’s have a look at the output. If you go to the bucket you created, there should be three files starting with the word “scores”. Have a look at the first one. Each line has the total score for a particular user.
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).