
Building a CNN in TensorFlow



Once you know how to build and train neural networks using TensorFlow and Google Cloud Machine Learning Engine, what’s next? Before long, you’ll discover that prebuilt estimators and default configurations will only get you so far. To optimize your models, you may need to create your own estimators, try different techniques to reduce overfitting, and use custom clusters to train your models.

Convolutional Neural Networks (CNNs) are very good at certain tasks, especially recognizing objects in pictures and videos. In fact, they’re one of the technologies powering self-driving cars. In this course, you’ll follow hands-on examples to build a CNN, train it using a custom scale tier on Machine Learning Engine, and visualize its performance. You’ll also learn how to recognize overfitting and apply different methods to avoid it.

Learning Objectives

  • Build a Convolutional Neural Network in TensorFlow
  • Analyze a model’s training performance using TensorBoard
  • Identify cases of overfitting and apply techniques to prevent it
  • Scale a Cloud ML Engine job using a custom configuration

Intended Audience

  • Data professionals
  • People studying for the Google Certified Professional Data Engineer exam



The GitHub repository for this course is at https://github.com/cloudacademy/ml-engine-doing-more.


Although convolutional neural networks are fairly complex, they’re relatively easy to build in TensorFlow if you use the tf.layers module. It provides a high-level API and takes care of the implementation details behind the scenes.


I’ll take you through some sample code from the TensorFlow website that learns how to recognize the digits 0 through 9 in images from the classic MNIST dataset. Each image is 28 by 28 pixels in size and there are 55,000 of them in the training set. They are all monochrome images, so they don’t have any color data. Each pixel is represented by a grayscale value between 0 and 1.


In the introductory course, we only used predefined models, such as DNNClassifier, but now we need to build a custom model, so we have to use the more generic Estimator class. It gives you a foundation for building your model, but you have a lot of freedom in what you can put in it.


Here’s where we define the model. Before we can build any of the layers, we have to get the input data into the right format. By default, the tf.layers methods expect the input data to be in a tensor with the shape batch size, height, width, and number of channels. The height and the width refer to the number of pixels in the input images, which is 28 by 28, in this case. The number of channels refers to the number of color channels. Since these are monochrome images, there is only 1 color channel.


The batch size says how many images it should run through the model before calculating the average error, also known as the loss, and adjusting the weights to reduce the loss. It’s set to -1 here, which tells TensorFlow to work out that dimension from the number of input values instead of using a fixed number. That matters because the batch size is a parameter you can set differently each time you train the model. In other words, it’s a hyperparameter, so you don’t want to hardcode it here.
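The arithmetic behind a -1 dimension can be sketched in plain Python. This only illustrates the idea and is not TensorFlow's implementation; resolve_shape is a hypothetical helper, and the batch of 100 images is an assumed example value.

```python
def resolve_shape(num_values, shape):
    """Replace a single -1 in `shape` with the size implied by num_values."""
    if shape.count(-1) > 1:
        raise ValueError("only one dimension may be -1")
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if num_values % known != 0:
        raise ValueError("values do not divide evenly into the known dimensions")
    if -1 not in shape and known != num_values:
        raise ValueError("shape does not match the number of values")
    return [num_values // known if dim == -1 else dim for dim in shape]


# Assumed example: a batch of 100 flattened 28x28 monochrome images.
flat_values = 100 * 28 * 28
input_shape = resolve_shape(flat_values, [-1, 28, 28, 1])
print(input_shape)  # [batch size, height, width, channels]
```

Feeding in a different number of images changes only the first dimension, which is why the batch size does not need to be hardcoded in the model.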


Now that the input data is in the right format, we can build the layers, starting with the first convolutional layer. It’s very simple. You just need to call tf.layers.conv2d and give it a few parameters. After giving it the input data, you specify how many filters to create. Remember, filters are basically feature detectors. We’re going to create 32 of them.


Next, we tell it the kernel size. Kernel is another name for filter. So we’re saying to use a 5-by-5 weights matrix to apply to the image. Then we specify the padding. This refers to the zero-padding around the border of the image. When you set it to “same”, it means to add the amount of padding necessary to make the output matrices (that is, the feature maps) the same size as the input image. Finally, we tell it to use the ReLU activation function. Recall that this is the non-linear function to apply after doing the convolution.
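To make "same" padding and ReLU concrete, here is a minimal plain-Python sketch of what a single filter does to a single image. It is not how tf.layers.conv2d is implemented; conv2d_same_relu is a hypothetical helper, and the all-ones image and averaging kernel are made-up inputs chosen so the result is easy to check by hand.

```python
def conv2d_same_relu(image, kernel):
    """Slide one square, odd-sized kernel over one 2-D image with stride 1.

    Positions outside the image count as zero ("same" padding), so the
    output has the same height and width as the input. ReLU is applied to
    each output value. Like deep learning libraries, this does not flip
    the kernel (strictly speaking, it is a cross-correlation).
    """
    height, width = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2  # zero-padding on each side for an odd kernel and stride 1
    output = []
    for row in range(height):
        out_row = []
        for col in range(width):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    r, c = row + i - pad, col + j - pad
                    if 0 <= r < height and 0 <= c < width:
                        total += image[r][c] * kernel[i][j]
            out_row.append(max(0.0, total))  # ReLU: negatives become 0
        output.append(out_row)
    return output


# Made-up inputs: a 28x28 image of ones and a 5x5 averaging kernel.
image = [[1.0] * 28 for _ in range(28)]
kernel = [[1.0 / 25] * 5 for _ in range(5)]
feature_map = conv2d_same_relu(image, kernel)
print(len(feature_map), len(feature_map[0]))
```

The layer described above would do this with 32 different learned kernels, producing 32 feature maps per image.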


Now we create a pooling layer. Notice that it’s called “max_pooling”. That’s because we want it to select the biggest value in each window when it does the pooling. Alternatively, you could get it to take the average of the values in each window instead of the maximum value, but max pooling tends to work better in practice.


First, you tell it to use the convolution layer as input. Then you tell it the pool size, which is a 2-by-2 window, in this case, and that the stride length is 2. Recall that this means the feature maps output by this pooling layer will have half the width and half the height of the ones that came out of the convolution layer (14 by 14 instead of 28 by 28), so they’re one-quarter the size.
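As a rough illustration of 2-by-2 max pooling with a stride of 2, the following plain-Python sketch pools a small made-up feature map. max_pool_2x2 is a hypothetical helper, not part of tf.layers, and it assumes the input has an even height and width.

```python
def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for row in range(0, height - 1, 2):      # stride of 2 down the rows
        pooled.append([
            max(feature_map[row][col], feature_map[row][col + 1],
                feature_map[row + 1][col], feature_map[row + 1][col + 1])
            for col in range(0, width - 1, 2)  # stride of 2 across the columns
        ])
    return pooled


# Made-up 4x4 feature map; the result is 2x2, one-quarter as many values.
toy_map = [[1, 3, 2, 0],
           [4, 2, 1, 5],
           [0, 1, 7, 2],
           [6, 2, 3, 3]]
pooled_map = max_pool_2x2(toy_map)
print(pooled_map)  # [[4, 5], [6, 7]]
```

Swapping max for an average over the four values would give the average-pooling alternative mentioned earlier.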


Now we add a second convolutional layer, but this time, we create 64 filters instead of 32. Then we add another pooling layer that’s exactly the same as the first one.


Now it’s time to add a fully connected layer, but first, we have to flatten the tensor from the pooling layer. Because each of the two pooling layers halved the width and height, the 28-by-28 images have become 7-by-7 feature maps by this point. We need to take all 64 of the 7-by-7 feature maps and turn them into a single row of neurons. Then we feed that into tf.layers.dense (dense is another name for fully connected) and tell it to create 1,024 neurons. Each of those neurons will be connected to every neuron in the previous layer, which has 3,136 neurons (7 times 7 times 64). This is why it’s called a dense layer. There are over 3 million connections between the previous layer and this one. You also have to tell it to use the ReLU activation function.
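The sizes quoted above can be checked with a few lines of arithmetic. This sketch assumes, as described earlier, that the convolutional layers use "same" padding (so they leave the width and height unchanged) and that each pooling layer uses a 2-by-2 window with a stride of 2. pooled_size is a hypothetical helper.

```python
def pooled_size(size, pool=2, stride=2):
    """Width (or height) of a feature map after one pooling layer."""
    return (size - pool) // stride + 1


size = 28                      # input images are 28x28
size = pooled_size(size)       # after the first pooling layer
size = pooled_size(size)       # after the second pooling layer

filters = 64                   # feature maps from the second convolutional layer
flattened = size * size * filters
dense_units = 1024
connections = flattened * dense_units  # weights only, not counting biases

print(size, flattened, connections)  # 7 3136 3211264
```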


Finally, we create one more dense layer that has one neuron for every potential object classification. Since we are classifying each image as one of the digits from 0 through 9, we only need 10 neurons in this layer. Each of these 10 neurons is connected to all 1,024 of the neurons in the previous layer.


Now there’s a bit of code to interpret the results from the last layer. We get the prediction for which digit is in the image by seeing which of the 10 neurons has the highest value. So, for example, if the neuron with the highest value is the one that represents an 8, then the model is predicting that the digit in the image is an 8.


Since the model will never be completely certain that it’s an 8, we also list the probability that it’s a 0, the probability that it’s a 1, etc. We do that using the softmax function, which simply squashes each of the values into a number between 0 and 1, such that they all add up to 1.
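Both steps, picking the highest-valued neuron and turning the 10 values into probabilities, can be sketched in plain Python. The output values below are made up for illustration, and this softmax helper is a minimal stand-in rather than TensorFlow's own function.

```python
import math


def softmax(values):
    """Squash values into numbers between 0 and 1 that add up to 1."""
    largest = max(values)
    # Subtracting the largest value avoids overflow and does not change the result.
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up outputs of the 10 neurons in the last layer (index = digit).
outputs = [0.1, -1.2, 0.3, 0.0, -0.5, 0.2, -0.8, 0.4, 3.1, 0.6]

predicted_digit = outputs.index(max(outputs))  # neuron with the highest value
probabilities = softmax(outputs)

print(predicted_digit)                 # 8
print(round(sum(probabilities), 6))    # 1.0
```

Because softmax preserves the ordering of the values, the most probable digit is always the same one chosen by the highest-valued neuron.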


You need more code to actually train and evaluate the model, of course, but it’s not unique to CNNs, so I’m not going to go through it here.


In the next lesson, we’ll run this model.

About the Author

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).