Building Convolutional Neural Networks on Google Cloud

Training a CNN

The course is part of these learning paths

Machine Learning on Google Cloud Platform

Contents

  • Introduction
  • Convolutional Neural Networks
  • Improving a Model
  • Scaling (4m 50s)
  • Conclusion
Overview

Difficulty: Advanced
Duration: 38m
Students: 534

Description

Once you know how to build and train neural networks using TensorFlow and Google Cloud Machine Learning Engine, what’s next? Before long, you’ll discover that prebuilt estimators and default configurations will only get you so far. To optimize your models, you may need to create your own estimators, try different techniques to reduce overfitting, and use custom clusters to train your models.

Convolutional Neural Networks (CNNs) are very good at certain tasks, especially recognizing objects in pictures and videos. In fact, they’re one of the technologies powering self-driving cars. In this course, you’ll follow hands-on examples to build a CNN, train it using a custom scale tier on Machine Learning Engine, and visualize its performance. You’ll also learn how to recognize overfitting and apply different methods to avoid it.

Learning Objectives

  • Build a Convolutional Neural Network in TensorFlow
  • Analyze a model’s training performance using TensorBoard
  • Identify cases of overfitting and apply techniques to prevent it
  • Scale a Cloud ML Engine job using a custom configuration

Intended Audience

  • Data professionals
  • People studying for the Google Certified Professional Data Engineer exam

Prerequisites

This Course Includes

  • Many hands-on demos

Resources

The GitHub repository for this course is at https://github.com/cloudacademy/ml-engine-doing-more.



Transcript

Now it’s time to see how this convolutional neural network performs. We’re going to start by running the script locally before we run it on ML Engine.

 

I just want to point out a couple of things first. Because the MNIST dataset is so commonly used for learning TensorFlow, the TensorFlow developers have made it really easy to load. These lines load the training and eval data, including the images and the labels, and put them into NumPy arrays. The nice thing about this is that the script is self-contained, so there's no need to load the data yourself.
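The images come back as flat vectors, so before they go into the convolutional layers the script reshapes them into image-shaped tensors. Here's a minimal sketch of that pattern using stand-in random arrays instead of the real loader:

```python
import numpy as np

# Hypothetical stand-ins for the arrays the script loads; in the real
# script these come from TensorFlow's built-in MNIST loader.
train_data = np.random.rand(128, 784).astype(np.float32)        # flattened 28x28 images
train_labels = np.random.randint(0, 10, size=128).astype(np.int32)

# A CNN expects a 4-D tensor: [batch, height, width, channels].
images = train_data.reshape(-1, 28, 28, 1)

print(images.shape)   # (128, 28, 28, 1)
```

The `-1` lets NumPy infer the batch dimension, so the same line works for the training and eval sets even though they have different sizes.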

 

I’ve put a version of this script in the GitHub repository for this course. You can find a link to the repository at the bottom of the course overview below this video.

 

I’ve only made minor changes to the script. One change is here. This deletes the directory where it saves the model. Without this line, you’d have to remember to remove the directory every time you wanted to run this script again. Otherwise, it would use the results of your last run as a starting point for your new training run.
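That change is a one-liner with Python's standard library. This sketch uses a temporary path as a stand-in for the script's actual model directory:

```python
import os
import shutil
import tempfile

# Stand-in for the model directory the script uses; the real script
# points at a fixed path where the estimator saves its checkpoints.
model_dir = os.path.join(tempfile.gettempdir(), "mnist_convnet_model")
os.makedirs(model_dir, exist_ok=True)

# Remove any checkpoints from a previous run so training starts fresh.
# ignore_errors=True means this is a no-op if the directory doesn't exist yet.
shutil.rmtree(model_dir, ignore_errors=True)

print(os.path.exists(model_dir))   # False
```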

 

Since there’s no setup required for this script, you can just run it. If you’re running this on a Mac, you may get lots of warning messages like these ones, but you can just ignore them. I’ll fast forward to when it’s done.

 

There. It got an accuracy of .70. That’s pretty good, but we can do better. I didn’t mention this before, but I changed the number of steps in this script from 20,000 to 1,000, because 20,000 steps can take hours to run on a laptop.

 

Instead of running it on your laptop with that many steps, this would be a good opportunity to run it on ML Engine, since it’s designed to scale. However, as you know, we can’t just take a simple script like this and run it in a distributed fashion. We need to use some special functions to make that work.

 

Google has some example code that trains this MNIST model in a distributed fashion, but it was written for TensorFlow version 1.2. Version 1.4 has a new function called train_and_evaluate that makes this much easier than it was in the previous version. So I adapted the code to use this new function.
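The shape of the adapted code is roughly the following. This is an illustrative skeleton only, not the exact code in the repo, and the names (cnn_model_fn, train_input_fn, eval_input_fn, the params fields) are placeholders:

```python
# Illustrative skeleton -- see task.py in the repo for the real code.
import tensorflow as tf

def run_experiment(params):
    estimator = tf.estimator.Estimator(model_fn=cnn_model_fn,
                                       model_dir=params.job_dir)
    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn,
                                        max_steps=params.train_steps)
    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
    # One call handles training, periodic evaluation, and -- on ML Engine --
    # distribution across the workers defined by the scale tier.
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```

The point of train_and_evaluate is that the same code runs locally and distributed; when ML Engine sets the TF_CONFIG environment variable, the call picks up the cluster configuration automatically.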

 

You can find the code in the GitHub repository for this course. It’s called cnn-mnist. It has a directory called trainer that contains two script files: task.py and model.py. The task.py file contains the main method and the train_and_evaluate method, and it also supports many command-line arguments. In model.py, you’ll find the same model code that we used in the simpler script.

 

Recall that to run this on ML Engine, you need to have an __init__.py file in the same directory, even if the file is empty.
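So the trainer package needs to look something like this. The directory names match what was described above; creating the empty file is just a touch command:

```shell
# The trainer package must be importable, so create an empty __init__.py
# alongside task.py and model.py.
mkdir -p cnn-mnist/trainer
touch cnn-mnist/trainer/__init__.py
ls cnn-mnist/trainer
```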

 

OK, before we run it, we need to do quite a bit of setup. You can copy these commands from the GitHub readme file. First, we’ll set a bunch of environment variables. This is a handy command to set the PROJECT variable to your Google Cloud project ID. You don’t have to use this, of course; you could just set it to your project ID manually if you’d like.

 

Then set the bucket name to your project ID dash ml. Next, set the data directory to $BUCKET/data/. Then set the region to whichever region is closest to you and that supports ML Engine jobs. I’ll use us-central1. Finally, set the job name to mnist_dist1.
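Put together, the variables might look like this. The project ID here is a placeholder, and the exact variable names should be copied from the readme; the commented-out line is the gcloud lookup mentioned above:

```shell
# Substitute your own project ID, or look it up with:
#   PROJECT=$(gcloud config list project --format "value(core.project)")
PROJECT=my-project-id
BUCKET=gs://${PROJECT}-ml      # project ID dash ml
DATA_DIR=${BUCKET}/data/
REGION=us-central1             # any region that supports ML Engine jobs
JOB_NAME=mnist_dist1
echo $DATA_DIR
```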

 

Alright, now this version of the script doesn’t load the MNIST data automatically, so we have to get it manually. It does include a script that retrieves it, though, so go into the cnn-mnist directory and run “python scripts/create_records.py”. This creates a number of files in /tmp/data. It takes a minute or so to run, so I’ll fast forward.

 

In order for ML Engine to use these data files, we need to upload them to a Cloud Storage bucket. If you took my introductory ML Engine course, then you may have already created a bucket with the name of your project dash ml. If you didn’t already create it, or if you deleted it, then you can create it with this command.

 

Now you can copy the data files into the bucket. Use “gsutil cp” and then the name of the file to copy and then “$DATA_DIR”. Then do the same for the file with the test data in it.
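The bucket creation and upload steps might look something like this. These commands require an authenticated gcloud/gsutil setup, so they're shown for reference only, and the data file names are guesses; use the ones create_records.py actually produced in /tmp/data:

```shell
# Create the bucket if it doesn't already exist.
gsutil mb -l $REGION $BUCKET

# Copy the training and test data into the bucket
# (file names here are hypothetical).
gsutil cp /tmp/data/train.tfrecords $DATA_DIR
gsutil cp /tmp/data/test.tfrecords $DATA_DIR
```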

 

OK, now we can finally run it. Here’s the command. You’ll definitely want to copy this one from the readme file.

 

We’re using the STANDARD_1 scale tier, which means it’ll deploy 4 workers. But it will still take quite a while, so I’ll fast forward.
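The submission command is along these lines. The gcloud flags shown are the standard ones for ML Engine training jobs, but the arguments after the bare `--` are guesses at what task.py accepts, so copy the real command from the readme:

```shell
# For reference only -- copy the exact command from the repo's readme.
gcloud ml-engine jobs submit training $JOB_NAME \
    --package-path trainer \
    --module-name trainer.task \
    --staging-bucket $BUCKET \
    --region $REGION \
    --scale-tier STANDARD_1 \
    -- \
    --data-dir $DATA_DIR
```

Everything before the bare `--` configures the job itself; everything after it is passed through as command-line arguments to task.py.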

 

Alright, it took 13 and a half minutes, which is a lot quicker than it would have been on my laptop. To see the accuracy, click “View logs”. The accuracy was about .96, so running it for 20,000 steps instead of 1,000 made a big difference, since we only had an accuracy of .70 before.

 

And that’s it for this lesson.

About the Author

Students: 14,155
Courses: 41
Learning paths: 22

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).