
Convolutional Neural Networks



Once you know how to build and train neural networks using TensorFlow and Google Cloud Machine Learning Engine, what’s next? Before long, you’ll discover that prebuilt estimators and default configurations will only get you so far. To optimize your models, you may need to create your own estimators, try different techniques to reduce overfitting, and use custom clusters to train your models.

Convolutional Neural Networks (CNNs) are very good at certain tasks, especially recognizing objects in pictures and videos. In fact, they’re one of the technologies powering self-driving cars. In this course, you’ll follow hands-on examples to build a CNN, train it using a custom scale tier on Machine Learning Engine, and visualize its performance. You’ll also learn how to recognize overfitting and apply different methods to avoid it.

Learning Objectives

  • Build a Convolutional Neural Network in TensorFlow
  • Analyze a model’s training performance using TensorBoard
  • Identify cases of overfitting and apply techniques to prevent it
  • Scale a Cloud ML Engine job using a custom configuration

Intended Audience

  • Data professionals
  • People studying for the Google Certified Professional Data Engineer exam



The GitHub repository for this course is at https://github.com/cloudacademy/ml-engine-doing-more.


“Convolutional neural networks” is quite a mouthful. CNNs are also more complex than the neural networks I showed in the introductory course.


CNNs are very good at certain tasks, especially recognizing objects in pictures and videos. In fact, they’re one of the technologies powering self-driving cars. I’ll start with a high-level overview of what a CNN does before going into the details.


When you’re building a machine learning model to recognize objects in images, one of the biggest problems is the huge amount of information. To analyze an image, you need to process every pixel of it. The simplest model would just make each pixel an individual feature. This approach can be surprisingly effective. For example, on the classic MNIST dataset of handwritten single-digit numbers, this simple model can achieve a 92% accuracy rate.
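As a rough illustration of the flat approach, here’s a minimal Python sketch. The 28-by-28 size matches MNIST images, but the all-zero image and the `flatten_image` helper are hypothetical placeholders, not part of the course code.

```python
def flatten_image(image):
    """Turn a 2-D grid of pixel values into one flat list of features."""
    return [pixel for row in image for pixel in row]


# Placeholder image: 28 rows of 28 pixels, all set to 0.
image = [[0] * 28 for _ in range(28)]
features = flatten_image(image)
print(len(features))  # 784 features, one per pixel
```

Each of those 784 values would get its own weight in the simple model, which is why the amount of information grows so quickly with image size.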


To do better than that, though, you need to use a different approach. You need a model that looks at groups of pixels together and tries to detect higher-level features. For example, if your model can see that an area of pixels looks like a nose, then it’s a lot easier to recognize it as part of a face. That’s what CNNs do.


With a “flat” pixel representation, your model would start with a random weight for each pixel and then adjust the weights on every training pass. With a convolutional model, you’d leave the pixels in a matrix format, and the model would apply a smaller matrix of weights to different sections of the image. It would multiply the weights against the pixel values, add them up, and store the total for that section. Then it would slide the weights matrix over by one pixel and do the same thing.


With a 5-by-5 image and a 3-by-3 weights matrix, this process would result in a 3-by-3 matrix of totals, known as a feature map. The idea is that this summary of pixels could detect image features, such as straight lines. Of course, your model has to adjust these weights so that they produce feature maps that can detect something like a straight line. I realize that it’s kind of hard to visualize how a feature map can show useful information, so later on I’ll show you a visualization of what this process produces.
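The sliding-window process can be sketched in plain Python. This is only an illustration: the `convolve` helper, the 5-by-5 image with a vertical line, and the hand-picked 3-by-3 weights are all assumptions made for the example. In a real model, the weights would be learned rather than chosen by hand.

```python
def convolve(image, kernel):
    """Slide `kernel` over `image` one pixel at a time, with no padding."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply weights against the pixels under the window and add them up.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 5-by-5 image with a vertical line down the middle column.
image = [[0, 0, 1, 0, 0] for _ in range(5)]
# Hand-picked weights that respond strongly to a vertical line.
kernel = [[-1, 2, -1] for _ in range(3)]

feature_map = convolve(image, kernel)
for row in feature_map:
    print(row)  # each row is [-3, 6, -3]
```

The largest totals land in the middle column of the 3-by-3 feature map, which is where the line sits in the image.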


It’s often useful to make the feature map the same size as the original image. There’s an easy way to do that. You simply add a border of extra pixels around the image before you perform the convolution. These pixels are given a value of 0, so this is called zero-padding. Then the feature map ends up being the same size as the image.
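Here’s a small sketch of zero-padding, assuming a 3-by-3 weights matrix, so a border of one pixel is enough. The `zero_pad` and `convolve` helpers and the all-ones image are hypothetical and exist only for this example.

```python
def zero_pad(image, border=1):
    """Surround `image` with `border` rows and columns of zeros."""
    width = len(image[0]) + 2 * border
    padded = [[0] * width for _ in range(border)]
    for row in image:
        padded.append([0] * border + list(row) + [0] * border)
    padded.extend([0] * width for _ in range(border))
    return padded


def convolve(image, kernel):
    """Slide `kernel` over `image` one pixel at a time, with no padding."""
    k = len(kernel)
    size_r, size_c = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(size_c)
        ]
        for r in range(size_r)
    ]


image = [[1] * 5 for _ in range(5)]   # hypothetical 5-by-5 image
kernel = [[1] * 3 for _ in range(3)]  # hypothetical 3-by-3 weights

padded = zero_pad(image)              # 7 by 7 after adding the border
feature_map = convolve(padded, kernel)
print(len(padded), len(feature_map))  # 7 5
```

Because the padded image is 7 by 7, sliding a 3-by-3 window across it gives 7 - 3 + 1 = 5 positions in each direction, which matches the original 5-by-5 image.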


Detecting a single, high-level feature, such as a line, is great, but that won’t give you enough information to recognize a complex picture, so you need to find a way to detect lots of different high-level features. The way to do that is to apply a whole bunch of weight matrices against the image. These are called filters because each one will filter the image in a specific way to look for a specific high-level feature. These filters start off with random weights, but through the machine learning process, the weights get adjusted to produce useful filters as the model learns how to recognize objects.
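To make the idea of multiple filters concrete, here’s a minimal sketch that creates a handful of randomly initialized 3-by-3 filters and applies each one to the same image. The number of filters (4), the weight range, and the helper names are assumptions for illustration, and the training step that would adjust the weights is not shown.

```python
import random


def convolve(image, kernel):
    """Slide `kernel` over `image` one pixel at a time, with no padding."""
    k = len(kernel)
    size_r, size_c = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(size_c)
        ]
        for r in range(size_r)
    ]


def random_filter(size, rng):
    """Create a size-by-size filter with random starting weights."""
    return [[rng.uniform(-1, 1) for _ in range(size)] for _ in range(size)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[0, 0, 1, 0, 0] for _ in range(5)]  # hypothetical 5-by-5 image

filters = [random_filter(3, rng) for _ in range(4)]
feature_maps = [convolve(image, f) for f in filters]

# One feature map per filter, each 3 by 3.
print(len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]))
```

Each filter produces its own feature map, so the output of the layer is a stack of maps rather than a single one.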


Here’s a visualization created by researcher Rob Fergus that shows two different filters being applied to an image. You can see that the resulting feature maps are essentially simplified versions of the original image. They’re slightly different from each other because they’re looking for different things.


The convolutional layer performs one more operation as well. It applies an activation function to every neuron (that is, every cell) in each feature map. Activation functions are an important concept in neural networks. An activation function potentially changes the output of a neuron. In the early days of neural nets, networks typically used a sigmoid function, which takes the output of a neuron and squashes it down into a value between 0 and 1.


Unfortunately, researchers discovered that the sigmoid function creates a fundamental problem in deep neural networks: its gradient becomes very small for large positive or negative inputs, which slows down learning. As a result, it’s usually not used in hidden layers anymore. Instead, most neural networks use the ReLU function, which stands for rectified linear unit. The name makes it sound more complicated than it is. It’s actually a very simple function. If the output of a neuron is negative, ReLU changes it to 0. If it’s positive, ReLU leaves it unchanged. That’s all it does.
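Both activation functions fit in a few lines of Python. This is a sketch for illustration only: the function names and the sample feature map are made up for the example.

```python
import math


def sigmoid(x):
    """Squash any value into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Replace negative values with 0 and leave positive values unchanged."""
    return max(0, x)


# Hypothetical feature map with some negative totals.
feature_map = [[-3, 6, -3], [-3, 6, -3], [-3, 6, -3]]
activated = [[relu(value) for value in row] for row in feature_map]

print(activated[0])  # [0, 6, 0]
print(sigmoid(0))    # 0.5
```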


So, why on earth would that be a useful function in a neural net? Well, it’s because it’s a nonlinear function. The real world is full of nonlinear things, so if you want to model them, you need to add some non-linearity.


After applying all of these filters, you’ll usually want to reduce the amount of information, while still retaining the features discovered by the filters. This is known as dimensionality reduction because you’re reducing the number of dimensions in the data. We do this with something called a pooling layer.


It’s actually kind of similar to a convolutional layer. We look at a small section (2 by 2 in this case) of each feature map, but instead of doing a multiplication against weights, we just choose the largest value in this mini-matrix. Then we slide the window over and do it again. We normally move it over by more than one cell. The number of cells it slides over is called the stride, and it’s typically set to 2. Using a stride of 2 with a 2-by-2 window halves both the width and the height of the feature map, so the new map is only one-quarter the size of the old one.
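Here’s a minimal sketch of max pooling with a 2-by-2 window and a stride of 2. The `max_pool` helper and the 4-by-4 feature map values are hypothetical and chosen only to make the result easy to check by eye.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep only the largest value in each size-by-size window."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4-by-4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 4],
]

pooled = max_pool(feature_map)
print(pooled)  # [[4, 5], [3, 7]]
```

The 16 values in the original map are reduced to 4, one-quarter of the original size, while the strongest response in each region is kept.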


With these layers, our model should be able to learn to detect a number of low-level features, such as lines, in the input images. To get it to learn higher-level features, such as squares, that are a combination of low-level features, we would add another convolution layer and another pooling layer. You might even do this many times if you have complex image recognition requirements.


Now, at this point, we have lots of feature maps that aren’t connected to each other, so we need to add a layer that looks at all of them. Then it can make a prediction about what’s in the image. We do that with what’s called a fully connected layer. This is a layer where every neuron is connected to every neuron in the previous layer of feature maps.


In a simple model, you’d connect every neuron in the prediction layer to every neuron in the final pooling layer. In the prediction layer, you’d have one neuron for every type of object you wanted to detect in your images. For example, if you only wanted to detect dogs, cats, hills, and birds, then you’d have four neurons in the prediction layer, one for each type of object. If you wanted to detect a thousand different types of things, then you’d put a thousand neurons in this layer.
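A prediction layer like this can be sketched as a list of neurons, each with one weight per input. Everything below is hypothetical: the pooled values, the hand-picked weights and biases, and the `fully_connected` helper are only there to show the wiring. A trained model would learn its weights.

```python
def fully_connected(inputs, weights, biases):
    """Compute one output per neuron; each neuron sees every input value."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


labels = ["dog", "cat", "hill", "bird"]

# Hypothetical values from the final pooling layer, flattened into one list.
pooled_values = [4, 5, 3, 7]

# One row of weights per prediction neuron (hand-picked, not learned).
weights = [
    [0.1, 0.0, 0.0, 0.0],  # dog
    [0.0, 0.2, 0.0, 0.0],  # cat
    [0.0, 0.0, 0.1, 0.0],  # hill
    [0.0, 0.0, 0.0, 0.3],  # bird
]
biases = [0.0, 0.0, 0.0, 0.0]

scores = fully_connected(pooled_values, weights, biases)
predicted = labels[scores.index(max(scores))]
print(predicted)  # bird
```

With four object types there are four rows of weights, one per neuron. Detecting a thousand types would simply mean a thousand rows.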


However, you’ll usually see another fully connected layer added before the prediction layer. This fully connected layer would have more neurons than the prediction layer, but fewer than the pooling layer. The idea is that it would aggregate features from the pooling layer, which would improve the accuracy of the prediction layer.
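To see how the layer sizes relate, here’s a back-of-the-envelope sketch. All of the specific numbers (a 28-by-28 input, 32 filters, a 3-by-3 filter with one pixel of zero-padding, and 512 neurons in the extra fully connected layer) are assumptions picked for illustration, not values from the course.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """Width (or height) of a feature map after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window=2, stride=2):
    """Width (or height) of a feature map after pooling."""
    return (size - window) // stride + 1


image_side = 28    # hypothetical input: 28-by-28 pixels
num_filters = 32   # hypothetical number of filters

conv_side = conv_output_size(image_side, kernel=3, padding=1)  # stays 28
pool_side = pool_output_size(conv_side)                        # drops to 14

pooled_neurons = pool_side * pool_side * num_filters  # 14 * 14 * 32 = 6272
hidden_neurons = 512                                  # hypothetical size
prediction_neurons = 4                                # dogs, cats, hills, birds

print(pooled_neurons, hidden_neurons, prediction_neurons)  # 6272 512 4
```

The extra fully connected layer sits between the two in size, so it condenses thousands of pooled values before the prediction layer makes its final choice.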


That’s it for what convolutional neural networks do. In the next lesson, I’ll show you how to build one.

About the Author

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).