Most organizations are already gathering and analyzing big data or plan to do so in the near future. One common way to process huge datasets is to use Apache Hadoop or Spark. Google even has a managed service for hosting Hadoop and Spark. It’s called Cloud Dataproc. So why do they also offer a competing service called Cloud Dataflow? Well, Google probably has more experience processing big data than any other organization on the planet and now they’re making their data processing software available to their customers. Not only that, but they’ve also open-sourced the software as Apache Beam.
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
In this course, you will learn how to write data processing programs using Apache Beam and then run them using Cloud Dataflow. You will also learn how to run both batch and streaming jobs.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
- Write a data processing program in Java using Apache Beam
- Use different Beam transforms to map and aggregate data
- Use windows, timestamps, and triggers to process streaming data
- Deploy a Beam pipeline both locally and on Cloud Dataflow
- Output data from Cloud Dataflow to Google BigQuery
The Github repository for this course can be found at https://github.com/cloudacademy/beam.
As I mentioned in the introduction, Dataflow’s purpose is to run data processing pipelines. Here’s a more detailed view of what a pipeline looks like.
First, the pipeline reads data from an external source, which could be files or one of these Google Cloud services or a custom source. The data is read into something called a PCollection. The ‘P’ stands for “parallel” because a PCollection is designed to be distributed across multiple machines.
Then it performs one or more operations on the PCollection, which are called transforms. Each time it runs a transform, a new PCollection is created. That’s because PCollections are immutable. That is, you can’t change a PCollection. You can only read from it.
After all of the transforms are executed, the pipeline writes the final PCollection to an external sink, which can be any of the same data locations that are supported by the read operation.
I should mention that the read and write operations are technically transforms too. They’re just a special kind of transform because they don’t have a PCollection for both input and output.
Of course, pipelines can be more complicated than this, with multiple branches, as well as multiple sources and sinks, but we’re going to stick with this straightforward type of pipeline for most of this course.
Here’s a pipeline that’s about as small as possible while still doing something halfway useful. It reads the contents of Shakespeare’s King Lear play from a file in Cloud Storage and puts it in a PCollection, with each line of the play being an element in the PCollection. Then it counts the number of elements in the PCollection, converts the number to a string, and writes it to a file called “linecount”.
To put this pipeline in action, you’ll need a few things. Apache Beam supports two languages: Java and Python. However, Python support was only added recently, and you can’t use Python for streaming jobs yet, so I’m going to use Java in this course.
To run Beam pipelines using Java, you need JDK 1.7 or later. You can use either Maven or Eclipse. If you’re already an Eclipse user, then you should seriously consider installing the Cloud Dataflow plugin for Eclipse. In this course, though, I’m going to use Maven instead because there’s a way you can run Beam and Dataflow without having to install anything on your local machine.
You can do it by using Cloud Shell. To start it, go to the Google Cloud Console and click the Cloud Shell icon. It usually takes about 10 seconds to start up. Cloud Shell is great because it has everything you need already pre-installed and it’s free. Google also added a code editor to Cloud Shell, so it’s a pretty usable environment for writing code, too.
If you’d rather install everything on your own computer, then follow the instructions here.
Regardless of whether you’re using Cloud Shell or your own computer, I’m going to get you to download something. It’s the Github repository here. If you don’t want to type in the URL, then you can find it on the “About this course” tab below.
I forked this repository from the master Beam examples repo. It’s mostly the same, but I’ve added a few files and made some minor changes.
At the Cloud Shell prompt, type “git clone” and then the github URL. OK, it’s done downloading. You can open Cloud Shell in its own tab by clicking here.
Now open the code editor by clicking on the pencil icon. In the file tree at the left, click on “beam”. Since the file I want to show you is buried pretty deeply in the directory structure, it’s quicker to just search for the file. In the File menu, select “Open” and then type “minimalline”. It should find three files. Click on “MinimalLineCount.java”.
This is how you would create the pipeline using Beam. First, you have to import a bunch of Beam classes. Then, you create the pipeline, but you have to specify the pipeline options, which is why you set an options variable. In this case, we’re just using default options. I’ll show you how to set different options later.
Then you chain a series of apply methods to the pipeline. The first one uses TextIO.read to input the kinglear text file from Cloud Storage.
The next one uses a transform called Count.globally to simply count all of the elements in the PCollection, which in this case will tell us the number of lines in King Lear.
To see which pre-written transforms are available, have a look at the org.apache.beam.sdk.transforms package and its subpackages. On beam.apache.org, click on “Documentation” and then “Java SDK API Reference”. Then click on the transforms package. Here’s the Count transform...and here’s the globally method.
As you can see, there are a lot of pre-written transforms. Here’s a list of some commonly used ones.
OK, back to the code. Now we have to convert the count to a string so we can print it to a file. Here we use a transform called MapElements. This one’s a bit different because we have to pass it an instance of class SimpleFunction, which is also a transform. Then we have to override SimpleFunction’s apply method and tell it how to convert an element. In this case, we’re inputting an element of type Long and outputting an element of type String.
Did that seem that seem like quite a bit of work just to convert a Long to a String? I agree, so I wrote another version that does it in a simpler way. Have a look at MinimalLineCountLambda. This one uses a lambda function, which is much simpler to read. The catch with this is that lambda functions are only available in Java 8, so this won’t work if you’re compiling with Java 7.
The next transform is TextIO.write, which writes the number to a file called “linecount”.
Now, all of that code constructed the pipeline, but it didn’t actually run it. To do that, you need to use the pipeline’s run method. We’re also going to add the waitUntilFinish method so that the program won’t end until the pipeline is finished running. If you don’t use this method, then the program will put the process in the background and exit, so you’ll have to check periodically to see if the pipeline is finished running.
OK, let’s run it. Some of the commands you’ll be using in this course are quite long, so I’ve put them in the course.md file. You can copy and paste them from there.
First, go into the beam/examples/java8 directory. Then type “mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.MinimalLineCount”.
It spits out a lot of output, which you can ignore unless there’s a problem. When it’s finished, have a look in the directory to see if there’s a file called “linecount”. There is...sort of. It has 0 of 1 at the end of the filename. That’s because Beam shards output into multiple files by default. It does this because there is usually a huge amount of output from Beam jobs. In this case, there’s only one line in the output, so there’s only one file.
Look at what’s in the file. You can hit Tab after typing in the first part of the filename and it will fill in the rest of the name. It says 5,525. That’s how many lines there are in King Lear. Of course, we could have figured that out much more easily by using this command. But let’s face it, this is a pretty trivial example of a Beam program. We’ll create more useful pipelines later on in this course.
Before we go, I want to show you one more thing. Here’s how you could write this code in Python. It’s a lot easier to read, isn’t it? It’s easier to code, too. I expect that when Beam’s Python SDK has the same features as the Java SDK, it will become very popular. In the meantime, we’ll continue going through Java code in this course.
It probably wasn’t obvious when we ran the MinimalLineCount program, but we didn’t actually use the Dataflow service. Everything ran on the Cloud Shell VM. In the next lesson, I’ll show you how to run it on Dataflow.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).