Training Your First Neural Network
Scaling Up with ML Engine
Machine learning is a hot topic these days and Google has been one of the biggest newsmakers. Recently, Google’s AlphaGo program beat the world’s No. 1 ranked Go player. That’s impressive, but Google’s machine learning is being used behind the scenes every day by millions of people. When you search for an image on the web or use Google Translate on foreign language text or use voice dictation on your Android phone, you’re using machine learning. Now Google has launched Cloud Machine Learning Engine to give its customers the power to train their own neural networks.
If you look in Google’s documentation for Cloud Machine Learning Engine, you’ll find a Getting Started guide. It gives a walkthrough of the various things you can do with ML Engine, but it says that you should already have experience with machine learning and TensorFlow. Those are two very advanced subjects, which normally take a long time to learn, but I’m going to give you enough of an overview that you’ll be able to train and deploy machine learning models using ML Engine.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
- Describe how an artificial neural network functions
- Run a simple TensorFlow program
- Train a model using a distributed cluster on Cloud ML Engine
- Increase prediction accuracy using feature engineering and both wide and deep networks
- Deploy a trained model on Cloud ML Engine to make predictions with new data
- The GitHub repository for this course is at https://github.com/cloudacademy/mlengine-intro.
- Nov. 16, 2018: Updated 90% of the lessons due to major changes in TensorFlow and Google Cloud ML Engine. All of the demos and code walkthroughs were completely redone.
In the last lesson, we ran the script with the model_type set to wide. That means we didn’t use any hidden layers. It wasn’t a deep neural network. The model had a large number of features all in one row, many of which were combinations of features. This sort of model is good for memorizing specific feature interactions that work well. But because it doesn’t have any hidden layers, it can’t come up with any generalizations about feature combinations that don’t appear in the training data. That’s where deep models shine.
A team of Google researchers came up with a way to combine the strengths of the two models into one, the wide and deep model.
To use it, you have to define the features you want in the wide model and the features you want in the deep model. In the last lesson, we went through the feature definitions for the wide model, so now we just need to do the deep model.
As usual, the continuous columns are easy to work with. It’s the categorical columns that will need to be transformed again.
Remember how deep networks can learn about interactions between features? Well, there’s a trick you can use to get them to learn about similarities between categories within a feature, too. For example, two of the occupation categories are Exec-managerial and Prof-specialty, that is, professional specialty. These two categories are likely similar in how they interact with other features to predict income, but if each category is treated as completely independent, there’s no way for the network to learn that.
There are two different ways to represent a categorical feature that the DNNClassifier will accept. The first way is called indicator_column. This is a one-dimensional array, or vector. Each number in the vector stands for a category and it can be either a 0 or a 1.
In the census data, the simplest categorical column is gender because it only has two possible values. I hope I’m not offending anyone by showing only two genders, but that’s how many there are in the census data, so that’s what we have to work with.
Using what’s known as one-hot encoding, a female is represented as [1, 0] because the first number is the female category, and a male is represented as [0, 1] because the second number is the male category.
It’s called one-hot encoding because only one of the values is a 1 and the rest are zeroes. For example, if you had a column with 10 categories, then the third category would look like this: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0].
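To make the encoding concrete, here is a minimal sketch in plain Python. The `one_hot` helper and the vocabulary lists are hypothetical names made up for illustration; they are not part of the course script or of TensorFlow.

```python
def one_hot(category, vocabulary):
    """Return a one-hot list for `category`, based on its position in `vocabulary`."""
    if category not in vocabulary:
        raise ValueError(f"Unknown category: {category!r}")
    index = vocabulary.index(category)
    # Exactly one position is 1; every other position is 0.
    return [1 if i == index else 0 for i in range(len(vocabulary))]


# Hypothetical vocabulary for the two-valued gender column.
GENDER_VOCABULARY = ["Female", "Male"]
female = one_hot("Female", GENDER_VOCABULARY)
male = one_hot("Male", GENDER_VOCABULARY)

# A made-up column with 10 categories; encode the third one.
TEN_CATEGORIES = [f"category_{n}" for n in range(1, 11)]
third = one_hot("category_3", TEN_CATEGORIES)

print(female, male, third)
```

Note that the length of each vector always equals the number of categories, which is why the representation grows quickly for features with many categories.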
There are two problems with one-hot encoding. First, it has high dimensionality, meaning that instead of having just one value, like a continuous feature, it has many values, or dimensions. This makes computation more time-consuming, especially if a feature has a very large number of categories. The second problem is that it doesn’t encode any relationships between the categories. They are completely independent from each other, so the network has no way of knowing which ones are similar to each other.
Both of these problems can be solved by representing a categorical feature with an embedding column. The idea is that each category has a smaller vector with, let’s say, 5 values in it. But unlike a one-hot vector, the values are not usually 0. The values are weights, similar to the weights that are used for basic features in a neural network. The difference is that each category has a set of weights (5 of them in this case).
You can think of each value in the embedding vector as a feature of the category. So, if two categories are very similar to each other, then their embedding vectors should be very similar too.
The DNNClassifier continually adjusts the embedding weights just like it adjusts the regular node weights, so it learns which weights are the best fit for the data you give it.
For example, suppose you have a feature column called “type of animal” and two of the possible categories for that feature are dog and cat. As the network adjusts the weights for these two embedding vectors, it learns that dogs and cats are quite similar, and that they have three things in common: fur, four legs, and a tail. You don’t tell the model what similar characteristics to look for between categories. It figures them out on its own by adjusting the weights in the embedding vector until similar animals have similar effects on the output of the network.
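One way to picture “similar embedding vectors” is to compare them with cosine similarity. The sketch below uses hand-picked, made-up 5-value vectors purely for illustration. In a real model the values would be learned during training, and the `EMBEDDINGS` table, the goldfish category, and the `cosine_similarity` helper are all assumptions, not something taken from the course script.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embedding vectors (5 weights per category).
# These numbers are invented; a trained network would learn its own.
EMBEDDINGS = {
    "dog": [0.9, 0.8, 0.7, 0.1, -0.2],
    "cat": [0.8, 0.9, 0.6, 0.2, -0.3],
    "goldfish": [-0.7, -0.9, 0.1, 0.8, 0.5],
}

dog_vs_cat = cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["cat"])
dog_vs_goldfish = cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["goldfish"])

print(f"dog vs cat:      {dog_vs_cat:.3f}")
print(f"dog vs goldfish: {dog_vs_goldfish:.3f}")
```

With these invented numbers, the dog and cat vectors point in nearly the same direction, while the dog and goldfish vectors do not. The individual values in a learned embedding are generally not labeled with human-readable meanings such as “has fur”; the similarity shows up in how the vectors compare as a whole.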
Now it should be clear how embedding columns solve both problems of one-hot columns. First, they perform what’s called dimensionality reduction. That is, they reduce the number of dimensions from a potentially very large number in a one-hot vector to a smaller number, such as 5, in an embedding vector. Second, they provide a way for the network to learn similarities between categories in a feature, which can help the network make generalizations.
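The dimensionality reduction can be sketched in plain Python. Looking up a category’s embedding is mathematically the same as multiplying its one-hot vector by a table of weights, but the result has far fewer values than the one-hot vector. The sizes here (1,000 categories, 5 embedding values) and all of the names are hypothetical, and the table is filled with random numbers instead of trained weights.

```python
import math
import random

NUM_CATEGORIES = 1000  # hypothetical feature with many categories
EMBEDDING_DIM = 5      # size of each embedding vector

# One row of weights per category. Random here; learned in a real model.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(NUM_CATEGORIES)
]


def embed_by_lookup(index):
    """Fetch the embedding row for a category index directly."""
    return embedding_table[index]


def embed_by_matmul(index):
    """Multiply a one-hot vector by the table; gives the same row."""
    one_hot = [0.0] * NUM_CATEGORIES
    one_hot[index] = 1.0
    return [
        sum(one_hot[row] * embedding_table[row][col] for row in range(NUM_CATEGORIES))
        for col in range(EMBEDDING_DIM)
    ]


looked_up = embed_by_lookup(42)
multiplied = embed_by_matmul(42)
print(len(looked_up), "values instead of", NUM_CATEGORIES)
```

The direct lookup is also much cheaper than the multiplication, since it skips all of the multiplications by zero.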
OK, now you can see why this script defines these features using the embedding_column function. Note that the default in this script is to use a dimension of 8 for both of the embedding vectors.
OK, now let’s run the script using the deep model. Use the same command as before, but change the model_type to deep. This time, it’ll ignore the wide model definition and just run the deep model. The accuracy is about 83.2%. That’s a bit higher than what it was for the wide model, which was about 82.8%. So, is that as high as we can get? Let’s try combining the two and see what happens.
Remember when I mentioned that a team of Google researchers came up with a way to combine the two models? Their idea is that the wide model can memorize specific feature interactions and the deep model can make generalizations about categories, so when you combine the two, you get the best of both worlds. They generously open-sourced their implementation, so now you can run a wide and deep model by using the DNNLinearCombinedClassifier.
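Here is a rough sketch of the combining idea, assuming the approach described in the Wide & Deep research: the wide (linear) part and the deep part each produce a score, the two scores are added together, and the sum goes through a sigmoid to give a probability. This is a toy illustration in plain Python with made-up weights and feature values, not the DNNLinearCombinedClassifier implementation, and every function name in it is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def wide_logit(features, weights, bias):
    """Linear model: one weight per (possibly crossed) feature, no hidden layers."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def deep_logit(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """Tiny network with a single hidden layer of ReLU nodes."""
    hidden = [
        relu(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Made-up inputs: one-hot style features for the wide part,
# dense values (such as embedding outputs) for the deep part.
wide_features = [1, 0, 1]
deep_features = [0.4, -0.1]

wide_score = wide_logit(wide_features, weights=[0.5, -0.3, 0.2], bias=0.0)
deep_score = deep_logit(
    deep_features,
    hidden_weights=[[1.0, 0.5], [-1.0, 2.0]],
    hidden_biases=[0.0, 0.1],
    output_weights=[2.0, 1.0],
    output_bias=-0.7,
)

# The two scores are summed before the final sigmoid.
probability = sigmoid(wide_score + deep_score)
print(f"wide={wide_score:.2f} deep={deep_score:.2f} probability={probability:.3f}")
```

Because both parts feed a single output, they can be trained together against the same loss, so each part learns to cover what the other misses.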
If you run the script again without putting in a model type, it will default to running the wide and deep model.
This time the accuracy is about 83.3%, which is a tiny improvement over the 83.2% we got before. It probably doesn’t do better than that because this is a relatively small dataset. The wide and deep model does result in a bigger improvement for some tasks, though, such as recommendation systems. For example, the research team achieved good results using it to recommend apps on Google Play.
One of the reasons I went through the wide and deep model with you is that it’s what the example code for ML Engine uses, which I’ll show you in the next lesson.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).