Big Data & Machine Learning


2h 13m

If you’re going to work with modern software systems, then you can’t escape learning about cloud technologies. And that’s a rather broad umbrella. Across the three major cloud platform providers, we have a lot of different service options, and there’s a lot of value in them all.

However, the area where I think Google Cloud Platform excels is in providing elastic, fully managed services. To me, Google Cloud Platform is the optimal cloud platform for developers. It provides so many services for building out highly available, highly scalable web applications and mobile back-ends.

Google Cloud Platform has quickly become my personal favorite cloud platform. Now, opinions are subjective, but I’ll share why I like it so much.

I’ve worked as a developer for years, and for much of that time, I was responsible for getting my code into production environments and keeping it running. I worked on a lot of smaller teams where there were no operations engineers.

So, here’s what I like about Google Cloud Platform: it allows me to think about the code and the features I need to develop, without worrying about the operations side, because many of the service offerings are fully managed.

So things such as App Engine allow me to write my code, test it locally, run it through the CI/CD pipeline, and then deploy it. And once it’s deployed, for the most part, unless I’ve introduced some software bug, I don’t have to think about it. Google’s engineers keep it up-and-running, and highly available. And having Google as your ops team is really cool!

Another thing I really like is the ease of use of things such as BigQuery and the Machine Learning APIs. If you’ve ever worked with large datasets, you know that some queries take forever to run. BigQuery can query massive datasets in just seconds, which allows me to get the data I need quickly, so I can move on to other things.

And with the machine learning APIs, I can use a REST interface to do things like language translation, or speech to text, with ease. That lets me integrate these capabilities into my applications, which gives the end-users a better experience.

So for me personally, I love that I can focus on building out applications and spend my time adding value to the end-users.

If you’re looking to learn the fundamentals of a platform that’s not only developer-friendly but cost-friendly, then this is the right course for you!

Learning Objectives

By the end of this course, you'll know:

  • The purpose and value of each product and service
  • How to choose an appropriate deployment environment
  • How to deploy an application to App Engine, Kubernetes Engine, and Compute Engine
  • The different storage options
  • The value of Cloud Firestore
  • How to get started with BigQuery


This is an intermediate-level course because it assumes:

  • You have at least a basic understanding of the cloud
  • You’re at least familiar with building and deploying code

Intended Audience

  • Anyone who would like to learn how to use Google Cloud Platform



Hello, and welcome. In this lesson, we'll talk about the GCP services used for big data, machine learning, and IoT. Our big data discussion will cover BigQuery, Pub/Sub, Dataflow, Dataproc, Datalab, and Dataprep. For machine learning, we'll talk about the AI Platform as well as the Vision, Speech and Translate APIs. And for IoT, we'll talk about IoT Core. So if you're ready, let's dive in.

There are plenty of companies out there that know big data and machine learning, and Google ranks towards the top of that list. These are the things that Google has been using as part of their core business for some time. And through the Google Cloud Platform, they've given us the ability to use the same tools that they do. The big data services are designed to scale the same way that Google's internal services do, which means you don't need to worry about traffic spikes causing problems, because the services are designed to be elastic. These services are fully managed, so they don't require any effort from our operations teams. BigQuery is an analytics database that allows you to stream data at about 100,000 rows per second. Pub/Sub is a scalable and flexible enterprise messaging queue. Dataflow allows you to perform stream and batch processing. And Dataproc is a managed Hadoop, MapReduce, Spark, Pig, and Hive service.

Let's dive into each of these just a bit more. BigQuery is a fully managed, petabyte-scale, low-cost analytics data warehouse. You query it with standard SQL, which keeps the learning curve small if you already know SQL. Let's check it out. We can query some publicly available datasets. Let's use the GitHub data and see how many projects per language there are. So we're just gonna write out some SQL. And now let's let this run and see what happens. Now let's change the order. And we'll switch it to the Amount field. Look how fast that runs. Let's run another query against the Wikipedia data. Let's check and see how many of the titles have the word Nintendo in them. Nice, so we have quite a few records returned. And now if we change it to cloud, okay, we get a massive amount of data back in just a few seconds. So BigQuery makes querying massive datasets very simple. So this is a very cool thing.
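The exact SQL typed in the demo isn't captured in this transcript, so here is a minimal sketch of the same kind of query: count projects per language and order by the count. To keep it runnable anywhere, it uses Python's built-in sqlite3 module as a local stand-in rather than BigQuery itself. The table name, column names, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical sample rows standing in for the public GitHub dataset.
SAMPLE_REPOS = [
    ("repo-a", "Python"),
    ("repo-b", "Go"),
    ("repo-c", "Python"),
    ("repo-d", "Java"),
    ("repo-e", "Python"),
    ("repo-f", "Go"),
]

# The same shape of SQL you might type into a query editor:
# group by language, count the rows, and order by the count.
QUERY = """
    SELECT language, COUNT(*) AS amount
    FROM repos
    GROUP BY language
    ORDER BY amount DESC, language ASC
"""


def projects_per_language(rows):
    """Load rows into an in-memory SQLite table and run the aggregate query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE repos (name TEXT, language TEXT)")
        conn.executemany("INSERT INTO repos VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for language, amount in projects_per_language(SAMPLE_REPOS):
        print(f"{language}: {amount}")
```

Changing the ORDER BY clause is the equivalent of "changing the order" in the demo. The point is only that the SQL itself is ordinary; what BigQuery adds is the ability to run it over very large tables.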

Next up, we have Cloud Pub/Sub, which is a fully-managed, real-time messaging service that allows you to send and receive messages between independent applications. You can use Pub/Sub to decouple systems and components hosted on Google Cloud Platform or elsewhere. By building on the same technology that Google uses, Pub/Sub is designed to provide at-least-once delivery with low latency, and on-demand scaling for up to one million messages per second. Pub/Sub allows you to publish messages and to subscribe to them. For example, you could publish a message that contains the IDs of objects that need to be invalidated in a distributed cache. And then you could have some code that subscribes to those messages and refreshes the cache for those IDs. Pub/Sub is one of those tools that's really multi-purpose. I like to use it to decouple services. For example, if I have a website that accepts image uploads, I'll publish the location of the file on Cloud Storage to a resize topic, and then my resize service can pick it up and process it. And then I can publish a message to a thumbnail topic and generate a thumbnail. And finally, I could publish a message to a completion topic and the notification service can inform the user that their image has finished processing. So, you can use it for different things. But it is a very fast, highly available messaging option. And it's a really good service to know. So I suggest you test that out.
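To make the image upload example concrete, here is a minimal sketch of the resize, thumbnail, and completion topics wired together. It uses a toy in-memory broker, not the Pub/Sub client library, and delivers messages synchronously, whereas the real service delivers them asynchronously. The class, function names, and the bucket path are hypothetical.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a message broker; not the Pub/Sub client library."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a callable to receive every message on a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver the message to every handler subscribed to the topic."""
        for handler in list(self._subscribers[topic]):
            handler(message)


def build_image_pipeline(broker, log):
    """Wire three independent services together using only topic names."""

    def resize(message):
        log.append(f"resized {message['path']}")
        broker.publish("thumbnail", message)

    def thumbnail(message):
        log.append(f"thumbnail for {message['path']}")
        broker.publish("completion", message)

    def notify(message):
        log.append(f"notified user about {message['path']}")

    broker.subscribe("resize", resize)
    broker.subscribe("thumbnail", thumbnail)
    broker.subscribe("completion", notify)


def run_demo():
    broker = InMemoryBroker()
    log = []
    build_image_pipeline(broker, log)
    # The website only publishes the file location; it knows nothing
    # about which services consume it.
    broker.publish("resize", {"path": "gs://example-bucket/cat.png"})
    return log


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

Each service only knows the name of the topic it reads from and the one it writes to, which is the decoupling the example is about.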

Next up, we have Cloud Dataflow, which is a service that allows you to create data pipelines. Dataflow provides a programming model for both batch and streaming data processing pipelines. It'll allow you to create ETL, batch computation, and continuous computation pipelines. It integrates with Cloud Storage, Cloud Pub/Sub, BigQuery, and Bigtable, and has SDKs for Java and Python.

Next we have Dataproc, which is a managed way to run Hadoop, Spark, Hive, and Pig on the Google Cloud Platform. Dataproc will allow you to quickly create clusters that are billed by the minute and will scale up and down as needed. With Dataproc, you can easily migrate on-premises Hadoop jobs to the cloud.

Next, we have Datalab, which is an interactive tool that allows you to explore, transform, analyze and visualize your data. It's currently in beta, though that shouldn't stop you from checking it out if you require big data processing. You can use Python, SQL, and JavaScript to interact with your data. Datalab is built on Jupyter and is deployed as an App Engine application. Check it out, we can load a notebook and run some code inside of it. And then we can save it and share it as needed. And we can interact with BigQuery, allowing us to get the most from Datalab. 

So between BigQuery, Pub/Sub, Dataflow, Dataproc, and Datalab, we have quite the set of services for big data. And the list doesn't stop here. If you ask data analysts what the most time-consuming part of their job is, they will likely tell you it's data cleaning or data munging. Cleaning and transforming data takes up a lot of time. And the less control you have over the data collection, the more likely it is that you're going to need to prepare it before you can use it.

Cloud Dataprep is a service provided by a Google partner called Trifacta. And it's used to visually explore, clean and prepare data for analysis and machine learning. All right, first off, we have this nice user interface that looks a bit like a spreadsheet. Notice this reddish color here in this ID column. Dataprep noticed that all of our rows are using a number for the ID except for this fourth row here, which contains the letter A. By clicking on this red section here, it's going to show the rows that are problematic, and we just have the one. In this example, let's imagine the data is correct and that this A represents an actual ID; it's just a string rather than a number. For that, we can click on this drop-down under the column and change the data type. And no more warnings.

Now, if this was a real dataset, we'd likely want to normalize this price data. Notice that these each have different formats. Let's see if we can find a simple way to remove any superfluous formatting. Okay, let's try this, let's try number extraction. And yeah, that looks pretty good. So if we wanted to keep this, we could click Add. And notice it adds a new column. It also has a red warning. It's trying to tell us that this column has both floats and ints. So it's not exactly one data type. If this was a real project, we could clean it up. But for a demo, this is fine. 
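The price values from the demo aren't in the transcript, so here is a rough sketch of what a number extraction step does, using a regular expression. This is an approximation for illustration, not Dataprep's actual implementation. The function name is hypothetical, and it assumes commas are thousands separators.

```python
import re

# Optional minus sign, digits, and an optional decimal part.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_number(raw):
    """Return the first number in raw as int or float, or None if absent."""
    cleaned = raw.replace(",", "")  # assumption: commas are thousands separators
    match = _NUMBER.search(cleaned)
    if match is None:
        return None
    text = match.group()
    # Keep whole numbers as int, which reproduces the mixed-type column
    # that triggered the warning in the demo.
    return float(text) if "." in text else int(text)


if __name__ == "__main__":
    # Hypothetical inputs in a few different formats.
    for price in ["$19.99", "USD 5", "1,299.50 dollars", "n/a"]:
        print(repr(price), "->", extract_number(price))
```

Because whole numbers come back as ints and the rest as floats, the resulting column has two types, which is the situation the red warning on the new column describes. Converting everything to float would be one way to clean that up.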

Dataprep has a lot of possible methods for transforming data. It can replace data, extract data, it can count pattern matches, it can split columns, it can merge columns, it can format data, it can pivot and unpivot. It can filter rows, it can run functions, etc. Once you've added all of your changes to the recipe, you can run it by clicking the Run Job in the top corner here, which will result in a newly created file with all of the changes that we specified. 

Okay, next up, let's take a look at some of the Google Cloud AI services. We can break these services into roughly two groups: services that we use to create our own machine learning solutions, and services that serve as building blocks for adding machine learning to our own applications. AI Hub and AI Platform are both used for building machine learning solutions. AI Hub is Google's hosted repository of plug-and-play AI components. And the AI Platform is used to train and deploy machine learning models. You can use AI Platform to label your training data, to create models with Jupyter notebooks, and to train and deploy your models.

The building block services include the Vision API, the Speech API, and the Translation API, as well as several others. Let's talk about the Vision API, which is a pre-trained model for analyzing images. You can do things like facial detection, logo detection, label detection, etc. It quickly classifies images into thousands of categories. And it can find and read printed words contained within images. You can analyze images uploaded in the request or images stored in Google Cloud Storage.

Let's actually check it out, I have a picture that I've uploaded. It's a picture of an espresso-based beverage. And if we use the API Explorer, we can test it out and see if the Vision API can recognize it. Okay, let's run this. And it looks like it's done a pretty good job. It has a few guesses of cappuccino, latte, coffee, etc. So I think it's done a nice job at identifying that this is a coffee-based beverage. There's a lot more power here baked in that we haven't shown. So, check it out if you need to work with image recognition. 
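For reference, here is a minimal sketch of the kind of request the API Explorer builds for label detection, plus a helper that reads labels out of a reply. It only builds and parses JSON and makes no network call or authentication. The bucket path and the sample reply are hand-written for illustration and are not real API output; the three labels are the guesses mentioned in the demo.

```python
import json

# REST endpoint for image annotation requests.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_label_request(image_uri, max_results=5):
    """Build a label detection request body for an image stored by URI."""
    return {
        "requests": [
            {
                "image": {"source": {"imageUri": image_uri}},
                "features": [
                    {"type": "LABEL_DETECTION", "maxResults": max_results}
                ],
            }
        ]
    }


def label_descriptions(response):
    """Collect label descriptions from a reply shaped like the API's JSON."""
    labels = []
    for result in response.get("responses", []):
        for annotation in result.get("labelAnnotations", []):
            labels.append(annotation["description"])
    return labels


if __name__ == "__main__":
    # Hypothetical Cloud Storage path.
    body = build_label_request("gs://example-bucket/espresso.jpg")
    print("POST", VISION_ENDPOINT)
    print(json.dumps(body, indent=2))

    # Hand-written sample reply, not real output.
    sample_reply = {
        "responses": [
            {
                "labelAnnotations": [
                    {"description": "Cappuccino"},
                    {"description": "Latte"},
                    {"description": "Coffee"},
                ]
            }
        ]
    }
    print(label_descriptions(sample_reply))
```

A real reply also carries a score for each label, which is where the ranking of the guesses comes from.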

Next up, let's talk about the Speech API, which will allow you to convert audio to text by applying powerful neural network models in an easy-to-use API. Currently, it recognizes over 80 languages and variants. You can transcribe the text of users dictating to an application's microphone, enable command-and-control through voice, or transcribe audio files, among other use cases. Let's use the API Explorer and see this in practice. We'll try to transcribe a simple audio file. And the result is in and it says, how old is the Brooklyn Bridge. So that's really cool. This is a really easy-to-use API. We can pass in an audio file and have it return the text along with a confidence score for how accurate it thinks the transcription is.

Next up, we have the Translation API, which is an API for translating an arbitrary string into any supported language. The Translate API is a highly responsive API, so websites and applications can integrate with Translate for very fast, dynamic translation of source text. It also supports language detection for those cases where you don't know what the source language is. Let's check out the API Explorer and test this out. Let's try and have it detect some English text. I'll type in hello world. And it has a guess of English, though it doesn't seem convinced that that's thoroughly accurate. Let's adjust the text again and see if we can make it more accurate. Okay, it's not, but still, it's able to determine this is English. Let's try something else. Let's do some translation. Let's set the string to hello world again. And we'll set an output language and a source language. And there we have it. And if we change the language, we can rerun this, and see that it translates again without issue.

These machine learning APIs are the same ones that Google uses for its apps. If you've ever used the voice search functionality in the Google app, or the image detection in Google Photos, then you've already interacted with some of these machine learning APIs. 

So far, we've looked at the different services for big data and machine learning. Now let's pivot to an area of tech that actually produces enough data to be considered big, and that is, of course, IoT. According to Google, the purpose of Cloud IoT Core is to provide a fully-managed service with which we can easily and securely connect, manage and ingest data from globally dispersed devices. Let's break that down and talk about how it accomplishes its purpose. IoT Core can receive data from devices. When devices send data, it's called device telemetry. The type of telemetry data a device sends depends on what that device does. For example, a remote patient monitoring device would send different vital signs as its telemetry data, fleet management devices might send geolocation telemetry, etc.

IoT Core supports two protocols which are HTTP and MQTT. HTTP is stateless, it tends to have fewer firewall issues, and it's commonly understood by engineers which can make it a good choice in some circumstances. MQTT is designed for IoT. It is stateful, it has lower bandwidth usage, and it increases throughput. When devices connect to IoT Core, they interact with the Protocol Bridge. The HTTP and MQTT bridges provide different device connection options based on the capabilities of the protocol it's using. When sending telemetry data, the protocol bridge takes that data and then it routes it to Pub/Sub. And what that does is it allows you to consume that data through different services. IoT Core has the ability to send data to devices, and that's in the form of either configuration data or command data. 

The difference between configuration and command data is that configuration data is used to specify the desired state of a device. Configuration data is limited to one update per device per second. An example of this might be sending a new configuration in JSON format, which specifies the maximum allowed temperature before turning on a fan. Commands are a bit different. Commands are one-time instructions that are sent only to devices that are currently connected. Commands are only supported for MQTT. An example of a command might be instructing a device to reboot itself.

IoT Core supports per-device authentication using public-private keys with support for key rotation. Using IoT Core starts by creating a registry, which is where devices with a similar purpose are defined, along with the settings for which protocols to use, etc. All right, let's see how we can send commands and configuration changes to devices. I've already created a registry and a device inside of IoT Core. So now I want to set up a device simulator that's running on Cloud Shell. And then I'll switch over to IoT Core so we can test it out.
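Here is a minimal sketch of how device-side code might treat the two kinds of data differently, using the fan and reboot examples above. It is a plain Python simulation with no IoT Core or MQTT library involved. The class name, the max_temp_c field, and the "reboot" command string are all hypothetical, since it's up to the device to define what it understands.

```python
import json


class SimulatedDevice:
    """Hypothetical device; the payload formats here are made up."""

    def __init__(self):
        self.max_temp_c = 30.0   # current desired-state setting
        self.fan_on = False
        self.reboot_count = 0

    def apply_config(self, payload):
        """Configuration describes desired state, so the device stores it."""
        config = json.loads(payload)
        self.max_temp_c = float(config["max_temp_c"])

    def handle_command(self, payload):
        """Commands are one-time instructions: act on them, store nothing."""
        if payload == "reboot":
            self.reboot_count += 1

    def read_temperature(self, temp_c):
        """Turn the fan on when the reading exceeds the configured maximum."""
        self.fan_on = temp_c > self.max_temp_c
        return self.fan_on


if __name__ == "__main__":
    device = SimulatedDevice()
    print("fan at 28C:", device.read_temperature(28.0))

    # New configuration lowers the threshold; it stays in effect afterwards.
    device.apply_config(json.dumps({"max_temp_c": 25}))
    print("fan at 28C after config:", device.read_temperature(28.0))

    # A command is handled once and leaves no lasting setting behind.
    device.handle_command("reboot")
    print("reboots:", device.reboot_count)
```

The design difference shows up in what persists: the configuration value keeps affecting behavior until the next update, while the command changes nothing about the device's settings.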

Okay, let's clear the screen, I'll paste in the command to start the simulator. By default, this sends 100 messages, and then it stops. So while this is running, this is going to behave like an actual device because it's using the same IoT Core library that we'd use in our real device. Let's start by sending a command. On the device details page for the device we want to interact with, I'll click Send Command, which opens up this modal. And remember, it's up to the code on the device to know how to handle commands. 

This simulator is just going to print it out to the terminal. So we'll send the text, GO FASTER. And we'll switch tabs. And notice here it is. Again, this requires MQTT. And the devices need to be connected in order to use commands. Let's specify some configuration data next. I'm going to use JSON for this, though really, it could be in any format. Again, it's up to the device. And here we are. I'll send this, and switching over, here it is.

Alright, that's going to do it for this lesson. I hope this has been a helpful introduction to the big data, machine learning and IoT solutions on Google Cloud. Thank you so much for watching, and I'll see you in the next lesson.

About the Author
Ben Lambert
Software Engineer

Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.