CloudAcademy
Building Convolutional Neural Networks on Google Cloud

Scaling

Overview

Difficulty: Advanced
Duration: 38m
Students: 515

Description

Once you know how to build and train neural networks using TensorFlow and Google Cloud Machine Learning Engine, what’s next? Before long, you’ll discover that prebuilt estimators and default configurations will only get you so far. To optimize your models, you may need to create your own estimators, try different techniques to reduce overfitting, and use custom clusters to train your models.

Convolutional Neural Networks (CNNs) are very good at certain tasks, especially recognizing objects in pictures and videos. In fact, they’re one of the technologies powering self-driving cars. In this course, you’ll follow hands-on examples to build a CNN, train it using a custom scale tier on Machine Learning Engine, and visualize its performance. You’ll also learn how to recognize overfitting and apply different methods to avoid it.

Learning Objectives

  • Build a Convolutional Neural Network in TensorFlow
  • Analyze a model’s training performance using TensorBoard
  • Identify cases of overfitting and apply techniques to prevent it
  • Scale a Cloud ML Engine job using a custom configuration

Intended Audience

  • Data professionals
  • People studying for the Google Certified Professional Data Engineer exam

Prerequisites

This Course Includes

  • Many hands-on demos

Resources

The GitHub repository for this course is at https://github.com/cloudacademy/ml-engine-doing-more.



Transcript

Google gives you tremendous flexibility in what compute resources you can use for your machine learning jobs. So far, we’ve only used ML Engine’s STANDARD_1 scale tier, which deploys 4 worker nodes. But suppose that you wanted to train a model using 8 worker nodes. None of the predefined scale tiers would work, because the BASIC tier only has one worker and the PREMIUM_1 tier has 19 workers!

 

To have anything other than 1, 4, or 19 workers, you need to specify a custom tier. To do that, you need to create a configuration file in YAML format. I’ve included an example called custom.yaml in the github repository.

 

It’s pretty simple. You start off with “trainingInput:”. Then you specify the masterType. “Master” refers to the master node in the cluster; you have to specify its type even if you just want to use the standard type.

 

Then you specify the workerType and the workerCount. Here are the available options for the machine type. They have various quantities of CPUs, GPUs, and memory. I just put the standard machine type for all of the nodes.

 

Next, you specify the type and number of parameter servers. Parameter servers maintain the shared model state for the workers. You don’t have to set this, but your job will run much more slowly without any. The STANDARD_1 tier has 3 parameter servers, so since we’ve doubled the number of workers, we might want to double that too. But have a look at the CPU utilization of the parameter servers in the STANDARD_1 job we ran: one of them has quite high utilization and the other two are fairly low, so maybe we don’t need to double the number. Let’s try it with 3 again.
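Putting those pieces together, a custom.yaml along the lines of the one in the repository would look something like this (the exact machine type names are ML Engine’s identifiers; “standard” is the basic type mentioned above):

```yaml
trainingInput:
  masterType: standard     # master node machine type (required for CUSTOM tier)
  workerType: standard     # machine type for each worker
  workerCount: 8           # twice as many workers as STANDARD_1
  parameterServerType: standard
  parameterServerCount: 3  # same as STANDARD_1, based on the utilization we saw
```

Note that the CUSTOM scale tier itself is set on the command line rather than in this file, and the indentation must use spaces, not tabs.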

 

OK, that’s all you need. I should point out that YAML won’t accept tabs at the beginning of a line, so make sure you indent with spaces.

 

Now go back into the cnn-mnist directory. Before we enter the command to run the job, we need to set the job name to something different from the last time or it’ll fail.

 

Now, to run the job with this custom tier, all you have to do is set the scale tier to CUSTOM and then add “--config custom.yaml”. This assumes that the YAML file is in the current directory.
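As a sketch, the full submission would look roughly like this. The module name, package path, bucket, and region are placeholders standing in for the values used earlier in the course; only the --scale-tier and --config parts are new:

```shell
# Job names must be unique, so append a timestamp to avoid a name clash.
JOB_NAME=cnn_mnist_custom_$(date +%s)

# Submit with the CUSTOM tier; custom.yaml must be in the current directory.
gcloud ml-engine jobs submit training $JOB_NAME \
    --scale-tier CUSTOM \
    --config custom.yaml \
    --module-name trainer.task \
    --package-path trainer \
    --job-dir gs://your-bucket/$JOB_NAME \
    --region us-central1
```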

 

The previous job took thirteen and a half minutes. This job was much quicker because we used twice as many workers. But what’s interesting is that it only consumed a little bit more in ML Units, so it didn’t cost much more to run this job than the previous one. It certainly didn’t cost twice as much. I’ve even seen it consume slightly fewer ML Units before. That’s because this job scaled really well with more workers. So, adding more workers can be a great way to run your training jobs more quickly, without having to pay much, if any, more.

 

To verify how well your job scaled, you can have a look at the CPU utilization graph for the workers in the job. All 8 workers had roughly the same CPU utilization, so the job parallelized very well. If a job doesn’t parallelize well, then you’ll see much lower utilization for many of the workers.

 

Also notice that the parameter servers didn’t seem to parallelize any better than before. One of them has a very high CPU utilization and the other ones are fairly low. I’ve tried running this job with only one parameter server and not only did it use fewer ML Units, but it didn’t take any longer to run. Try it yourself and see if you get the same results.

 

By the way, I’ve occasionally seen a job keep running long after it went through all 20,000 steps. If that happens to you, then stop the job.
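A hung job can be stopped from the console’s Jobs page, or from the command line (assuming $JOB_NAME is still set from the submission):

```shell
# Check the job's current state, then cancel it if it's stuck.
gcloud ml-engine jobs describe $JOB_NAME
gcloud ml-engine jobs cancel $JOB_NAME
```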

 

Another scaling option is to use GPUs. Since GPUs are designed to perform many mathematical operations in parallel, they can be ideal for training machine learning models. Unfortunately, it’s not usually as simple as just using GPUs instead of CPUs. You’ll often need to use some combination of them and also specify the architecture used for your training job.

 

Here’s an example of some code that says to use a CPU for the input pipeline because that operation would be slower on a GPU. The way you specify a CPU or a GPU is with the tf.device function.
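A minimal sketch of that pattern is below. The random tensor is just a stand-in for the course’s real MNIST input pipeline; the point is the tf.device context manager pinning the pipeline ops to the CPU:

```python
import tensorflow as tf

# Pin the input pipeline to the CPU: data loading and preprocessing
# ops are generally faster there than on a GPU.
with tf.device('/cpu:0'):
    # Stand-in for the real input pipeline: 100 fake 28x28 grayscale images.
    dataset = tf.data.Dataset.from_tensor_slices(
        tf.random.uniform([100, 28, 28, 1]))
    dataset = dataset.shuffle(100).batch(32)

# Ops created outside the block follow TensorFlow's default device
# placement, so the model itself can still run on a GPU if one exists.
```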

 

You don’t have to explicitly say which operations should be on a GPU, but if you don’t, you may get poor results. For example, I ran the same MNIST job using 8 standard GPU nodes, but it took longer and used three times as many ML Units. Google recommends using GPUs for large models with many mathematical operations. And even then, you should run some small tests first to make sure GPUs are well suited to the task.

 

Google has also developed a more cutting-edge option: Tensor Processing Units or TPUs. These chips were designed by Google to accelerate TensorFlow jobs. This is a game changer for machine learning, but it’s going to take a while to mature. Support for TPUs is still being added to ML Engine, so you have to use Compute Engine instances to take advantage of them in the meantime. You also have to write TPU-specific code.

 

And that’s it for scale options.

About the Author


Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).