Most organizations are already gathering and analyzing big data or plan to do so in the near future. One common way to process huge datasets is to use Apache Hadoop or Spark. Google even has a managed service for hosting Hadoop and Spark. It’s called Cloud Dataproc. So why do they also offer a competing service called Cloud Dataflow? Well, Google probably has more experience processing big data than any other organization on the planet, and now they’re making their data processing software available to their customers. Not only that, but they’ve also open-sourced the software as Apache Beam.
Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines, distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job.
In this course, you will learn how to write data processing programs using Apache Beam and then run them using Cloud Dataflow. You will also learn how to run both batch and streaming jobs.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
- Write a data processing program in Java using Apache Beam
- Use different Beam transforms to map and aggregate data
- Use windows, timestamps, and triggers to process streaming data
- Deploy a Beam pipeline both locally and on Cloud Dataflow
- Output data from Cloud Dataflow to Google BigQuery
The GitHub repository for this course can be found at https://github.com/cloudacademy/beam.
The UserScore and HourlyTeamScore programs wrote their output to text files, but LeaderBoard does something different. It writes to BigQuery tables.
BigQuery is Google’s data warehouse service. It’s a good place to put the output data from LeaderBoard because it can hold massive amounts of data at low cost and it lets you analyze that data by running SQL queries on it.
To write user scores to BigQuery, LeaderBoard uses the WriteToBigQuery class from the utils package. This is a transform, so to see a high-level view of what it does, let’s look at the expand method. First, it applies a transform that converts each data record into a row.
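As a rough sketch of that first step, here is what a record-to-row conversion can look like, assuming the input elements are key-value pairs of user name and score (the class name and field names here are illustrative, not the exact code in utils, which is written more generically):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Illustrative sketch: convert each (user, score) pair into a BigQuery TableRow.
// The real WriteToBigQuery transform in utils handles arbitrary fields generically.
class ConvertToRowFn extends DoFn<KV<String, Integer>, TableRow> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    TableRow row = new TableRow()
        .set("user", c.element().getKey())
        .set("total_score", c.element().getValue());
    c.output(row);
  }
}
```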
Then it applies the BigQueryIO.writeTableRows transform to write those rows to a table in BigQuery. Here’s what the transform needs to know. First, you have to tell it which table to write to. You need to include the project ID, the dataset ID, and the table name. The dataset must already exist, but the table doesn’t necessarily have to. If you set CreateDisposition to CREATE_IF_NEEDED, then it will automatically create the table if it doesn’t exist.
Then you need to tell it what the schema of the table should look like. The method that does that (getSchema) is right here. It goes through a for loop and adds all of the fields that are needed. Here’s the line that adds a field. It specifies the name of the field’s key and the data type of the field’s value.
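A schema-building method in that style can be sketched like this. The field names and types below are examples I’ve chosen for illustration, not necessarily the exact fields LeaderBoard defines:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

static TableSchema getSchema() {
  // Field definitions: name -> BigQuery data type (illustrative names).
  Map<String, String> fieldInfo = new LinkedHashMap<>();
  fieldInfo.put("user", "STRING");
  fieldInfo.put("total_score", "INTEGER");

  // Loop through the definitions and add a TableFieldSchema for each one,
  // specifying the field's name (the key) and the data type of its value.
  List<TableFieldSchema> fields = new ArrayList<>();
  for (Map.Entry<String, String> entry : fieldInfo.entrySet()) {
    fields.add(new TableFieldSchema()
        .setName(entry.getKey())
        .setType(entry.getValue()));
  }
  return new TableSchema().setFields(fields);
}
```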
Finally, you need to specify the WriteDisposition, which says whether the table must be empty, whether existing data should be overwritten, or whether new data should be appended to it. Since LeaderBoard is constantly streaming new data to the table, this needs to be set to append.
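Putting those three settings together, the write step looks roughly like this. This is a sketch rather than the exact code in the repository, and the project ID is a placeholder:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

// Write the rows to BigQuery: create the table if it doesn't exist,
// and append to it, since the streaming job continuously adds new data.
static void writeToLeaderboardTable(PCollection<TableRow> rows) {
  rows.apply(BigQueryIO.writeTableRows()
      .to("my-project:game.leaderboard_user")  // project:dataset.table ("my-project" is a placeholder)
      .withSchema(getSchema())                 // the schema-building method
      .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
      .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
}
```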
That’s it for writing to BigQuery. Now, although LeaderBoard doesn’t read from BigQuery, I’ll show you how that works too.
If you want to read a whole table, then it’s very easy. Here’s what the code looks like. You just need to specify your project ID, then a colon, then the name of the dataset, which is “game” for LeaderBoard, then a dot and the name of the table, which is leaderboard_user for the table that holds the user scores.
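A minimal sketch of that read, assuming a Pipeline object named pipeline and with “my-project” standing in for your own project ID:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());

// Read every row of the leaderboard_user table in the game dataset.
// "my-project" is a placeholder for your own project ID.
PCollection<TableRow> userScores = pipeline.apply(
    BigQueryIO.readTableRows().from("my-project:game.leaderboard_user"));
```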
However, you probably wouldn’t want to read the entire user table because it would be huge if the game had been running for a while. It would be better to run a query. Instead of using the from method, you use the fromQuery method. This example selects the user and total_score fields. It’s written in standard SQL rather than BigQuery’s legacy SQL, so you have to add the usingStandardSql method.
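The query-based version can be sketched like this, again with a placeholder project ID. Note that standard SQL references tables as `project.dataset.table` in backticks:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());

// Run a standard SQL query instead of reading the whole table,
// selecting only the user and total_score fields.
PCollection<TableRow> userScores = pipeline.apply(
    BigQueryIO.readTableRows()
        .fromQuery("SELECT user, total_score FROM `my-project.game.leaderboard_user`")
        .usingStandardSql());
```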
To be honest, this query wouldn’t be much better than reading in the whole table, because it only leaves out one field, and it’s a pretty important field because it contains the processing time for each event. What you’d really want to do is get LeaderBoard to create a partitioned table and then write a query that reads from a subset of partitions. But that’s way outside the scope of this course. If you want to learn more about partitioned tables, then you can take my Optimizing Google BigQuery course.
OK, there were a lot of new concepts in the LeaderBoard program, weren’t there? In the next lesson, I’ll show you how to run it.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).