hands-on lab

Comparing Google Cloud Big Data Services

Intermediate
Up to 45m
Lab description

In this lab, you will see first-hand how three of Google Cloud's big data services differ: BigQuery, Dataproc, and Dataflow. You will create a BigQuery dataset and table, then load data into BigQuery three ways: by uploading it directly, by running a Dataproc job, and by running a Dataflow job.

The following provides a brief high-level comparison between the services to review before starting the lab.

Comparing BigQuery, Dataproc, and Dataflow

BigQuery is a serverless data warehouse that can ingest, store, and query petabyte-scale data. BigQuery provides you with SQL capabilities for querying the data.
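As a rough illustration of the ingest-then-query workflow, the sketch below uses the google-cloud-bigquery Python client to load a CSV file from Cloud Storage and then run a SQL aggregate over it. The function names, the table ID, and the bucket URI passed in are hypothetical placeholders rather than values from the lab, and the code assumes the client library is installed and credentials are already configured.

```python
def build_row_count_query(table_id: str) -> str:
    """Return a SQL query that counts the rows in a table."""
    # Backticks allow fully qualified names that contain dashes (common in project IDs).
    return f"SELECT COUNT(*) AS row_count FROM `{table_id}`"


def load_csv_and_count(gcs_uri: str, table_id: str) -> int:
    """Load a CSV file from Cloud Storage into BigQuery, then count its rows.

    Both arguments are placeholders, for example
    "gs://my-bucket/data.csv" and "my-project.my_dataset.my_table".
    """
    # Imported lazily so the query helper above works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # relies on application default credentials
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes the file has a single header row
        autodetect=True,      # let BigQuery infer the schema
    )
    load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
    load_job.result()  # block until the load job finishes

    rows = client.query(build_row_count_query(table_id)).result()
    return next(iter(rows)).row_count
```

Everything after the load is plain SQL, which is the point of the comparison: BigQuery needs no cluster or pipeline code to answer the question.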

Cloud Dataproc is a managed Hadoop platform that includes Apache Spark, Flink, Hive, and other open-source tools and frameworks. Unlike BigQuery, whose query interface is SQL-only, Dataproc gives you the full capabilities of a programming language. If you need complex scripts to transform your data, Dataproc may be a good solution.
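To make the "full programming language" point concrete, here is a minimal PySpark sketch of the kind of job a Dataproc cluster could run: arbitrary Python logic applied to a column before the result is written to BigQuery. It is not the lab's source code. The `name` column, the function names, and all URIs and table names are hypothetical, and the write step assumes the Spark BigQuery connector is available on the cluster.

```python
def normalize_name(value):
    """Hypothetical cleanup rule: collapse whitespace and title-case a name."""
    if value is None:
        return None
    return " ".join(value.split()).title()


def run_job(input_uri: str, output_table: str, temp_bucket: str) -> None:
    """Read a CSV file, clean one column with Python code, write to BigQuery."""
    # Imported lazily so normalize_name can be used and tested without PySpark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("csv-to-bigquery-sketch").getOrCreate()

    # Assumes the file has a header row that includes a "name" column.
    df = spark.read.csv(input_uri, header=True, inferSchema=True)

    # Wrap the plain Python function as a user-defined function (UDF).
    normalize_udf = udf(normalize_name, StringType())
    cleaned = df.withColumn("name", normalize_udf(col("name")))

    # Assumes the Spark BigQuery connector is installed on the cluster.
    (
        cleaned.write.format("bigquery")
        .option("table", output_table)
        .option("temporaryGcsBucket", temp_bucket)
        .mode("append")
        .save()
    )
    spark.stop()
```

Note that the script itself creates the Spark session and names the connector, so it only runs on a Spark platform such as a Dataproc cluster.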

Cloud Dataflow is a serverless, fully managed data processing service for running Apache Beam pipelines. You do not have to manage the underlying cluster: Dataflow handles load balancing and autoscales the number of workers for each job. With Dataproc, your code is tightly coupled to the job runner, that is, the underlying platform. Cloud Dataflow instead lets you focus on your business logic rather than on how the underlying layer works. It also offers a range of ready-made templates to choose from when creating a job, which makes the process even easier.
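The separation between business logic and runner can be sketched with the Apache Beam Python SDK. In the example below, `parse_csv_line` is ordinary Python with no knowledge of where it runs, and the runner is selected only through the pipeline arguments. The function names, column names, schema string, and URIs are hypothetical and are not taken from the lab.

```python
import csv


def parse_csv_line(line: str, fieldnames: list) -> dict:
    """Business logic: turn one CSV line into a dict keyed by column name."""
    values = next(csv.reader([line]))  # csv handles quoted commas correctly
    return dict(zip(fieldnames, values))


def run(input_uri, table, schema, fieldnames, pipeline_args=None):
    """Build and run a Beam pipeline that loads CSV lines into BigQuery.

    pipeline_args selects the runner, for example ["--runner=DataflowRunner", ...]
    for Dataflow; without it Beam falls back to its local runner.
    """
    # Imported lazily so parse_csv_line can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadLines" >> beam.io.ReadFromText(input_uri, skip_header_lines=1)
            | "ParseCsv" >> beam.Map(parse_csv_line, fieldnames)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table,
                schema=schema,  # e.g. "id:STRING,name:STRING" (hypothetical)
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

The same `run` function could be launched locally for a quick check and on Dataflow for the real job by changing only `pipeline_args`.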

The pricing models for the services also differ. BigQuery pricing depends on the amount of data stored, the amount ingested, and the amount processed while executing queries. Dataproc pricing is based on the number of virtual CPUs in the cluster. Dataflow pricing is based on the CPU, memory, and persistent disk resources used while executing jobs.
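The different billing dimensions can be made explicit with a small sketch. The functions below only encode which quantities each service charges for, as described above. Every rate is a caller-supplied placeholder rather than a real price, and multiplying Dataproc's vCPU count by hours of cluster uptime is an assumption added here for illustration.

```python
def bigquery_cost(stored_gb, ingested_gb, queried_tb,
                  storage_rate, ingest_rate, query_rate):
    """Cost driven by data volume: stored, ingested, and scanned by queries."""
    return (stored_gb * storage_rate
            + ingested_gb * ingest_rate
            + queried_tb * query_rate)


def dataproc_cost(vcpus, hours, vcpu_hour_rate):
    """Cost driven by cluster size: vCPUs times hours the cluster is up.

    Billing by uptime is an assumption of this sketch.
    """
    return vcpus * hours * vcpu_hour_rate


def dataflow_cost(vcpu_hours, memory_gb_hours, disk_gb_hours,
                  vcpu_rate, memory_rate, disk_rate):
    """Cost driven by resources consumed while jobs are executing."""
    return (vcpu_hours * vcpu_rate
            + memory_gb_hours * memory_rate
            + disk_gb_hours * disk_rate)
```

Seen this way, BigQuery costs scale with data volume, Dataproc costs with the size of the cluster you provision, and Dataflow costs with what a job actually consumes. Consult the current Google Cloud pricing pages for real rates.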

Learning Objectives

Upon completion of this lab, you will be able to:

  • Choose among the big data services available in Google Cloud
  • Create ETL pipelines that load data from Cloud Storage (GCS) into BigQuery
  • Query the data stored in BigQuery

Intended Audience

This lab is intended for:

  • Candidates for the Associate Cloud Engineer Certification Exam
  • ETL Developers
  • Data Engineers
  • Cloud Architects

Prerequisites

The following prerequisite is beneficial but not required for completing this lab:

  • A basic understanding of SQL and Python

Updates

March 30th, 2024 - Resolved Dataproc cluster creation issue

August 10th, 2023 - Addressed a user ban issue and added a warning

July 25th, 2023 - Updated Python source code to resolve an issue preventing the PySpark Dataproc job from finishing

March 9th, 2023 - Updated lab to use Apache Beam 2.45.0

 

About the author
Logan Rakai
Lead Content Developer - Labs
Students: 216,433; Labs: 223; Courses: 9; Learning paths: 56

Logan has been involved in software development and research since 2007 and has been in the cloud since 2012. He is an AWS Certified DevOps Engineer - Professional, AWS Certified Solutions Architect - Professional, Microsoft Certified Azure Solutions Architect Expert, MCSE: Cloud Platform and Infrastructure, Google Cloud Certified Associate Cloud Engineer, Certified Kubernetes Security Specialist (CKS), Certified Kubernetes Administrator (CKA), Certified Kubernetes Application Developer (CKAD), and Certified OpenStack Administrator (COA). He earned his Ph.D. studying design automation and enjoys all things tech.


Lab steps
Signing In to the Google Cloud Console
Creating a BigQuery Dataset and Uploading CSV File
Creating a Dataproc Cluster and Submitting a Job
Creating a Dataflow Job to Fetch Data and Store it in BigQuery