A Simple Pipeline
More Complex Pipelines
Dealing with Time
The course is part of these learning paths
Most organizations are already gathering and analyzing big data or plan to do so in the near future. One common way to process huge datasets is to use Apache Hadoop or Spark. Google even has a managed service for hosting Hadoop and Spark. It’s called Cloud Dataproc. So why do they also offer a competing service called Cloud Dataflow? Well, Google probably has more experience processing big data than any other organization on the planet and now they’re making their data processing software available to their customers. Not only that, but they’ve also open-sourced the software as Apache Beam.
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
In this course, you will learn how to write data processing programs using Apache Beam and then run them using Cloud Dataflow. You will also learn how to run both batch and streaming jobs.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
- Write a data processing program in Java using Apache Beam
- Use different Beam transforms to map and aggregate data
- Use windows, timestamps, and triggers to process streaming data
- Deploy a Beam pipeline both locally and on Cloud Dataflow
- Output data from Cloud Dataflow to Google BigQuery
The Github repository is at https://github.com/cloudacademy/beam.
As you saw in the last lesson, you don’t need to use Cloud Dataflow to run Beam jobs. When you need to run a big data processing job, though, it’s a lot easier to use Dataflow than to build your own processing cluster.
The MinimalLineCount program definitely doesn’t qualify as a big job, but let’s start with it so you can see the basics of running a Beam program on Dataflow.
MinimalLineCount has all of its options hardcoded in it, so in order to run it on Dataflow, we need to make some changes. I created another program called MinimalLineCountArgs that has these changes.
This version of the program reads command-line arguments to set the options rather than hardcoding them. To get the program to read from the command line, we call the fromArgs method on the PipelineOptionsFactory. We also need to append the “as” method to tell the PipelineOptionsFactory to return an object that implements PipelineOptions.
Alright, let’s run it.
To simplify the command, let’s set a couple of environment variables first. Since we’re also going to use these variables throughout the course, it would be easiest to add them to your .profile so you don’t have to run them again if you start a new session.
If you’re familiar with the vi editor, then you might want to use that, but if you’re not, then use nano. Type “nano .profile”. Now go to the bottom of the file.
The first variable to set is PROJECT. Type “PROJECT=”. If you’re not sure what your project ID is, then go to the home page of the Cloud Console at console.cloud.google.com. You can copy it from here. Now paste it here.
Then create a variable called BUCKET to hold the name of the Cloud Storage bucket you’re going to use to hold the temporary files. Since Cloud Storage buckets need to have globally unique names across all customers, it’s usually a good idea to embed your project ID in the bucket name. Let’s call it dataflow dash and your project ID, which we can get from the PROJECT variable now.
To save the file in nano, hit Control-O. Press Enter to write to the same filename. Then hit Control-X to exit. There are a few ways to run the .profile and set the new variables, but let’s do it by opening a new Cloud Shell session. Click the plus sign. Then close the other session.
Now create the Cloud Storage bucket. Type “gsutil mb”, which stands for “make bucket”, and then $BUCKET.
Before we run the program, we need to get back into the java8 directory, so type “cd beam/examples/java8 directory”. You can copy the cd command from the course.md file.
OK, it’s finally time to type the command to run the job on Dataflow. You’ll definitely want to copy this one from the course.md file. The first part is the same. We use Maven to compile LineCount. Then we use “-Dexec.args” for the arguments that will be sent to the program. First, we put in the runner.
The runner defines what back-end will be used to run the pipeline. When we ran the MinimalLineCount pipeline on our local machine in the last lesson, we didn’t specify a runner. That’s because the default runner is DirectRunner, which runs locally. This time we need to run it on the Dataflow service, so we need to set it to “DataflowRunner”.
Then we put in the project argument and then the tempLocation, which is where it will put the temporary files as well as the staging files.
Alright, this time it’s going to spin up a virtual machine on Google Cloud, so let’s have a look at what’s happening. On the Cloud Console, scroll down to the bottom of the menu and select Dataflow.
It’s going to take a while to prepare the Dataflow job, so I’ll fast forward.
You should see your job running. It’ll have a name starting with “minimallinecountargs”. If you click on it, you’ll see a graph of your pipeline. On the right, it has some information about the job, such as the elapsed time. You can see the elapsed time going up every second. It also shows that it has spun up one worker machine to run this pipeline. Considering how small this job is, it’s not surprising that it only spun up one machine.
OK, the job status says “Succeeded”. Now, where did it write the output file? Well, actually it didn’t because we didn’t specify an output location in Cloud Storage. So the job was kind of a waste of time since we can’t see the output. Let’s go back to the code.
To fix it, you could just put a Cloud Storage path here, but let’s stop hardcoding things and use a more flexible solution. Have a look at LineCount. It defines input and output arguments that you can enter on the command line.
The way it does that is it creates a sub-interface of PipelineOptions. I called it LineCountOptions. Then it creates get and set methods for the custom options. In this case, it defines two options: inputFile and output.
Each of them has a Description annotation that will be used as help text. The inputFile option also has a Default annotation, so if you don’t specify an inputFile argument on the command line, it’ll use this kinglear file.
The output option doesn’t have a default. In fact, it’s a required option. You have to specify it on the command line or the program won’t run. We’re not setting a default because the name of my Cloud Storage bucket is different from yours, so there isn’t a default that will work for everyone.
Now, to get the program to accept the input and output options, we set the options variable to be of type LineCountOptions. Then we append the withValidation method, which checks for required command-line arguments. We also need to make the “as” method pass LineCountOptions.
And finally, we need to change the read and write transforms to use the command line arguments by calling the get methods for the inputFile and output arguments.
Let’s go back to the course.md file to get the command to run it. We use the same command as before except we change the program name to “LineCount” and add the output option. We set it to $BUCKET/linecount.
If you get an error message saying that the one or more Google Cloud APIs are not enabled, then go to the URL it specifies and enable them.
OK, it takes a little while to run, so I’ll fast forward to when it’s done. There. This time the output file will be in the Cloud Storage bucket, so go to the Cloud Storage console...and then click on the dataflow dash project bucket. You should see a file that starts with “linecount”. If you click on it, you’ll see that it contains the number 5525, just like when you ran it locally.
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).