Most organizations are already gathering and analyzing big data or plan to do so in the near future. One common way to process huge datasets is to use Apache Hadoop or Spark. Google even has a managed service for hosting Hadoop and Spark. It’s called Cloud Dataproc. So why do they also offer a competing service called Cloud Dataflow? Well, Google arguably has more experience processing big data than any other organization on the planet, and now they’re making their own data processing technology available to their customers. Not only that, but they’ve also open-sourced the programming model behind it as Apache Beam.
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
In this course, you will learn how to write data processing programs using Apache Beam and then run them using Cloud Dataflow. You will also learn how to run both batch and streaming jobs.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
Learning Objectives
- Write a data processing program in Java using Apache Beam
- Use different Beam transforms to map and aggregate data
- Use windows, timestamps, and triggers to process streaming data
- Deploy a Beam pipeline both locally and on Cloud Dataflow
- Output data from Cloud Dataflow to Google BigQuery
Resources
The GitHub repository for this course can be found at https://github.com/cloudacademy/beam.
Now that you’ve seen how LeaderBoard works, it’s time to run it. It’s not as simple to run as the other examples, though, because it uses two other Google Cloud services. The streaming data comes from an example program called Injector, which publishes it to Cloud Pub/Sub. The output, as you’ve seen, gets written to BigQuery.
Pub/Sub is Google’s messaging service. It lets you send and receive messages between applications. Like all of Google’s cloud services, it’s built for scale. It can send over a million messages per second!
Please make sure you have access to these two services before you run the program. You also need to create a BigQuery dataset to hold the output data. Type “bq mk”, which stands for “BigQuery make”, and then use “game” for the dataset name.
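Here’s the full command:

```
# Create a BigQuery dataset named "game" in your current project.
bq mk game
```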
OK, now to run the injector, you also need to download a service account credentials file and authenticate using it. You wouldn’t think this would be necessary, but if you don’t, you’ll hit a Pub/Sub quota limit, because Injector pushes data into Pub/Sub at a furious pace. If you don’t authenticate with the service account credentials, then by default you’ll authenticate with your user account, which has much lower Pub/Sub quota limits.
First, go to the API Console Credentials page. The link is in the course.md file. You should open the link in a new tab.
From the project drop-down, select your project. In the Create credentials drop-down, select “Service account key”. Then select one of the service accounts. Leave the key type on the JSON option and click Create.
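If you’d rather do this from the command line, a gcloud equivalent would look something like this (the service account name, project ID, and output filename are placeholders). Note that a key created this way lands directly in Cloud Shell, so you wouldn’t need to upload it in the next step:

```
# Hypothetical example: replace SA_NAME and PROJECT_ID with your own values.
gcloud iam service-accounts keys create ~/credentials.json \
    --iam-account=SA_NAME@PROJECT_ID.iam.gserviceaccount.com
```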
When it’s done downloading, you need to upload it to Cloud Shell. From the menu in the upper-right-hand corner, select “Upload file”. Then pick the credentials file you just downloaded.
Now you need to set an environment variable that points to the file. To save yourself some typing, first do a “pwd”...and then do an “ls”. Now copy “export GOOGLE_APPLICATION_CREDENTIALS=”. If you didn’t also copy the opening quote, then add one. Then copy and paste your home directory and add a slash. Now copy and paste the credentials filename and add a closing quote.
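The finished line should look something like this (the home directory and filename are placeholders for your own):

```
# Placeholder path: use your own home directory and key filename.
export GOOGLE_APPLICATION_CREDENTIALS="/home/your_user/your-service-account-key.json"
```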
OK, we’re finally ready to run the injector. As usual, go into the java8 directory. Now paste this command. I’ll fast forward a bit.
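The command itself is in the course.md file. It follows the pattern of the standard Beam mobile gaming examples, so it should look roughly like this (treat the class name and arguments as an approximation rather than the exact command):

```
# Approximate injector command, based on the standard Beam gaming examples.
# $PROJECT is assumed to hold your project ID; "game" is the Pub/Sub topic.
mvn compile exec:java \
  -Dexec.mainClass=org.apache.beam.examples.complete.game.injector.Injector \
  -Dexec.args="$PROJECT game none"
```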
Alright, the injector is pumping data into Pub/Sub. Click the plus sign to start another Cloud Shell session. Now we can run LeaderBoard. Since this is a new shell session, we have to go into the java8 directory again. Then paste this command.
The new options in this command are “--dataset=game”, which is the name of the BigQuery dataset, and “--topic=projects/$PROJECT/topics/game”, which is the Pub/Sub topic. That’s where LeaderBoard retrieves the data from, and it’s where Injector publishes the data.
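The full command follows the same pattern as the earlier Dataflow runs. As a rough sketch (the project variable, temp bucket, and runner options here are assumptions based on the standard Beam gaming examples, so check course.md for the exact version):

```
# Approximate LeaderBoard command; only --dataset and --topic are new.
# $PROJECT and YOUR_DATAFLOW_BUCKET are placeholders.
mvn compile exec:java \
  -Dexec.mainClass=org.apache.beam.examples.complete.game.LeaderBoard \
  -Dexec.args="--project=$PROJECT \
    --tempLocation=gs://YOUR_DATAFLOW_BUCKET/temp \
    --runner=DataflowRunner \
    --dataset=game \
    --topic=projects/$PROJECT/topics/game"
```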
It seems to be running properly. Let’s have a look in the Dataflow console. The graph shows that it’s reading from Pub/Sub, calculating both the user scores and team scores, and writing out the results. If you click on the down arrow, you’ll see that it’s writing to BigQuery. I’ll fast forward about 10 minutes.
OK, now let’s have a look in BigQuery. Here’s the “game” dataset. If you click on the arrow, you should see new tables called “leaderboard_team” and “leaderboard_user”. Let’s see if any data shows up in the preview. Nope. It’ll take a while before any records show up in the preview, but you can still access the records by doing a query on the table.
First click on “Show Options” and uncheck “Use Legacy SQL”. Now click the “Query Table” button again. You’ll notice that the SQL statement changed a bit because now it’s standard SQL instead of legacy SQL. Add an asterisk after the SELECT, then click “Run Query”. It’ll warn you about reading all of the records in the table, but there should only be about 60 records at this point, so it’s not a concern. Click “Hide Options” again to get that out of the way.
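By the way, if you’d rather run the same query from Cloud Shell instead of the web UI, the bq equivalent would be something like this:

```
# Query the team leaderboard table using standard SQL.
bq query --use_legacy_sql=false 'SELECT * FROM game.leaderboard_team'
```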
If we go back to Dataflow and click on the WriteTeamScoreSums box, it’ll tell us how many elements have been added to the table so far. Here it says 61. Oh, now it’s 63, because data keeps streaming in. There were 60 elements when we ran the query.
Now have a look at the leaderboard_user table. It doesn’t have anything in the preview yet either. Let’s run a quick query on it too. Put in an asterisk again. You’ll see that it has a lot more records in it because it has scores for individual users rather than teams.
Now let’s have a look in Pub/Sub. Click on the topic name, and you’ll see that it has two subscriptions. One of them was created by the LeaderBoard program, and the other was created internally by Dataflow itself. There isn’t much else to see here in Pub/Sub.
When you’re done looking at everything, you can stop the injector with a Control-C. To stop the Dataflow job, click the “Stop job” button. It gives you two options: Cancel and Drain.
Cancel will immediately stop your job, but any buffered data may be lost. Drain will stop ingesting data, but it will attempt to finish processing any remaining buffered data. Since we’re not worried about losing data in this example, you can choose “Cancel”.
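If you prefer the command line, gcloud can do the same thing. The job ID and region here are placeholders:

```
# Cancel immediately (buffered data may be lost):
gcloud dataflow jobs cancel JOB_ID --region=REGION
# Or drain, which finishes processing buffered data first:
gcloud dataflow jobs drain JOB_ID --region=REGION
```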
This was the final hands-on example in this course, so if you don’t want to incur any more charges for the data that you generated, then you can go ahead and delete the files in your dataflow bucket in Cloud Storage and the leaderboard tables in BigQuery.
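From Cloud Shell, that cleanup would look something like this (the bucket name is a placeholder, so use your own):

```
# Remove the files generated in your Dataflow bucket.
gsutil -m rm gs://YOUR_DATAFLOW_BUCKET/**
# Remove the leaderboard tables from the game dataset.
bq rm -f -t game.leaderboard_team
bq rm -f -t game.leaderboard_user
```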
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).