
Feature Engineering



Machine learning is a hot topic these days and Google has been one of the biggest newsmakers. Google’s machine learning is being used behind the scenes every day by millions of people. When you search for an image on the web or use Google Translate on foreign language text or use voice dictation on your Android phone, you’re using machine learning. Now Google has launched AI Platform to give its customers the power to train their own neural networks.

This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.

Learning Objectives

  • Describe how an artificial neural network functions
  • Run a simple TensorFlow program
  • Train a model using a distributed cluster on AI Platform
  • Increase prediction accuracy using feature engineering and hyperparameter tuning
  • Deploy a trained model on AI Platform to make predictions with new data



  • December 20, 2020: Completely revamped the course due to Google AI Platform replacing Cloud ML Engine and the release of TensorFlow 2.
  • November 16, 2018: Updated 90% of the lessons due to major changes in TensorFlow and Google Cloud ML Engine. All of the demos and code walkthroughs were completely redone.

The iris dataset is great for learning the basics of neural networks, but it’s far simpler than most datasets you’ll apply machine learning to. It has only four features, and all of them are decimal numbers. Let’s have a look at a more complicated dataset.

It’s called the PetFinder dataset. It includes records of thousands of stray dogs and cats in Malaysia. It has fields for characteristics like breed, gender, and color. The label is how long it took for these pets to be adopted. A ‘4’ means that it was never adopted. The goal for our model will be to use the other features to predict whether a pet was adopted or not. This is a simpler classification than predicting the adoption speed.
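That relabeling step can be sketched in plain Python. The helper function and list below are just an illustration, not the course’s actual code:

```python
# A value of 4 in AdoptionSpeed means the pet was never adopted;
# any other value means it was adopted. Collapse this into a
# binary label: 1 = adopted, 0 = not adopted.
def was_adopted(adoption_speed):
    return 0 if adoption_speed == 4 else 1

labels = [was_adopted(speed) for speed in [0, 2, 4]]
# labels is [1, 1, 0]
```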

Most of the columns are pretty self-explanatory, but in case you’re wondering, PhotoAmt is the number of photos that were uploaded for this pet. With this many features, the first thing we should do is decide which ones to include in our model. This is known as feature selection. Looking through these features, it seems like all of them might be useful. For example, people would probably prefer a healthy pet and one that doesn’t have an adoption fee. However, the description field is an exception. Although it probably contains lots of valuable information, it’s freeform text, so it would be difficult to use in our model. We’ll include all of the features except the description.

Here’s a list that shows each column’s type, either numerical or categorical (other than Description, which is text, and AdoptionSpeed, which is a classification because it’s the label field).

Most of the columns in the dataset are categorical, which means that their values are categories. For example, the fur length column contains three categories: short, medium, and long.

Let’s see how to deal with these features using TensorFlow. The code is in the pets/trainer directory in the GitHub repository for this course. 

The numeric features are easy to deal with, but the categorical features require a bit more work. Take the MaturitySize field, for example. How would you convert “Medium” into something a neural network could use? One way would be to just assign a number to each category.

You could have an input node in your network that feeds in that number, and it might even be useful for making predictions. For example, suppose that people generally prefer smaller pets. Then the lower the size number, the higher the likelihood that the pet would be adopted. I’m not saying that’s actually the case, but there could be a correlation like that between size and adoption speed.

However, that approach wouldn’t work so well with a column like “Color1”, would it? How would you put the colors in numerical order in such a way that it would help our model learn? There probably is some sort of relationship between pet color and adoption speed, but it’s probably not straightforward, and that’s exactly the sort of relationship that we want our neural network to figure out, not one that we would tell it in advance.

A better way to handle categories is to make each one a separate feature. Rather than having to define each one of them separately, though, which would take a lot of tedious coding, you just specify which columns are categorical, and TensorFlow will turn them into multiple features.

The “categorical_column_with_vocabulary_list” method will do this for you. Normally, you would have to create a list of all the possible categories for each feature and pass it to the method, but there’s an easier way to come up with the list. The “unique” method lists all of the unique strings in a column, which is what we need.
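A minimal sketch of that pattern, assuming the data is loaded into a pandas DataFrame (the tiny DataFrame below stands in for the real PetFinder data, and the variable names may differ from the course script):

```python
import pandas as pd
import tensorflow as tf

# Hypothetical stand-in for the real PetFinder DataFrame;
# the course script loads the full CSV instead.
df = pd.DataFrame({"Type": ["Dog", "Cat", "Dog"]})

# unique() gives the list of categories, which becomes the vocabulary.
vocab = df["Type"].unique()
type_column = tf.feature_column.categorical_column_with_vocabulary_list("Type", vocab)

# Wrapping it in an indicator column one-hot encodes it for the model.
type_indicator = tf.feature_column.indicator_column(type_column)
```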

This loop turns each of the columns listed above into an indicator column. To create a separate feature for each category, it uses a one-dimensional array, or vector, to represent the categories. Each number in the vector stands for a category, and it can be either a 0 or a 1.

In the pets dataset, the simplest categorical column is Type because it only has two possible values: dog and cat.

Using what’s known as one-hot encoding, a dog is represented as [1, 0] because the first number is the dog category, and a cat is represented as [0, 1] because the second number is the cat category.

It’s called one-hot encoding because only one of the values is a 1 and the rest are zeroes. For example, if you had a column with 10 categories, then the third category would look like this: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0].
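Here’s a minimal sketch of one-hot encoding in plain Python:

```python
def one_hot(category, categories):
    # 1 in the position of the matching category, 0 everywhere else.
    return [1 if c == category else 0 for c in categories]

dog = one_hot("Dog", ["Dog", "Cat"])   # [1, 0]
cat = one_hot("Cat", ["Dog", "Cat"])   # [0, 1]

# With 10 categories, the third one is a 10-value vector with a single 1.
third = one_hot("c3", ["c1", "c2", "c3", "c4", "c5",
                       "c6", "c7", "c8", "c9", "c10"])
```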

One-hot encoding is a simple way to turn a categorical column into multiple features, but it isn’t always a good solution. It has two problems. First, it has high dimensionality, meaning that instead of having just one value, like a numeric feature, it has many values, or dimensions. If a column had a very large number of categories, then the vector would be huge, which would make calculations take too long.

The second problem is that it doesn’t encode any relationships between the categories. They are completely independent from each other, so the network has no way of knowing which ones are similar to each other.

Both of these problems can be solved by representing a categorical feature with an embedding column. The idea is that each category has a smaller vector with, let’s say, 5 values in it. But unlike a one-hot vector, the values are not usually 0. The values are weights, similar to the weights that are used for basic features in a neural network. The difference is that each category has a set of weights (5 of them in this case).

You can think of each value in the embedding vector as a feature of the category. So, if two categories are very similar to each other, then their embedding vectors should be very similar too.
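To make that concrete, here’s a toy illustration. The 5-value vectors below are made up; in a real model, the optimizer learns them:

```python
# Hypothetical learned 5-value embedding vectors for three categories.
embeddings = {
    "Dog":   [0.9, 0.1, 0.8, 0.3, 0.7],
    "Cat":   [0.8, 0.2, 0.9, 0.3, 0.6],  # similar to "Dog"
    "Snake": [0.1, 0.9, 0.0, 0.8, 0.1],  # quite different
}

def dot(a, b):
    # Dot product as a rough similarity measure.
    return sum(x * y for x, y in zip(a, b))

# Similar categories produce similar vectors, and so larger dot products.
dog_cat = dot(embeddings["Dog"], embeddings["Cat"])
dog_snake = dot(embeddings["Dog"], embeddings["Snake"])
```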

The optimizer continually adjusts the embedding weights just like it adjusts the regular node weights, so it learns which weights are the best fit for the data you give it.

For example, suppose that the type column included lots of other animals besides just dogs and cats, such as various kinds of mammals, birds, and reptiles. With such a large number of categories, we might want to use an embedding column. As the network adjusts the weights for the embedding vectors, it might learn that dogs and cats are quite similar, and that they share three things in common. They both have fur, four legs, and a tail. You don’t tell the model what similar characteristics to look for between categories. It figures them out on its own by adjusting the weights in the embedding vector until similar animals have similar effects on the output of the network.

Now it should be clear how embedding columns solve both problems of one-hot columns. First, they perform what’s called dimensionality reduction. That is, they reduce the number of dimensions from a potentially very large number in a one-hot vector to a smaller number, such as 5, in an embedding vector. Second, they provide a way for the network to learn similarities between categories in a feature, which can help the network make generalizations.

Okay, so we’re going to turn Breed1 into an embedding column because there are many breeds in the dataset, and it would be good to have the model find relationships between them. We’re going to use a dimension of 8 for the embedding vectors.
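A minimal sketch of that setup (the breed names here are placeholders; the real vocabulary comes from the unique values in the dataset):

```python
import tensorflow as tf

# Hypothetical vocabulary; the course script builds it from the data.
breed_column = tf.feature_column.categorical_column_with_vocabulary_list(
    "Breed1", ["Poodle", "Persian", "Mixed Breed"])

# Represent each breed as a learned 8-value embedding vector.
breed_embedding = tf.feature_column.embedding_column(breed_column, dimension=8)
```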

Now that we’ve converted all of the columns into features, we can consider creating new features based on existing ones. This is one of the core activities of what’s known as feature engineering. Let’s see if we can create a new feature from the numeric ones.

Sometimes the relationship between a numeric column and the target feature you’re trying to predict is linear, such as between square footage and home value. Generally speaking, the higher the square footage, the higher the home value. In other cases, though, the relationship is not so simple.

For example, what do you think is the relationship between a pet’s age and how quickly it’s adopted? One possibility is that as a pet gets older, it takes longer to be adopted. This would be a simple linear relationship that would be easy to model. However, we don’t know if that’s actually the case. There might be a peak age, such as 6 months, at which pets are adopted the quickest. There’s no way to model a falling and then rising time with a linear relationship, so we would have to change the age feature somehow to allow the network to model this sort of relationship properly.

The way to do that is to convert age from a numeric feature to a categorical feature. One approach would be to divide pets into an under-6-month category and an over-6-month category. Then the neural network could treat these as separate features and discover different relationships between them and their adoption rates. But since we don’t know whether 6 months is the right dividing point, it would make more sense to create a number of different age ranges.

In this script, there are 5 age ranges, based on these boundaries. The first range is everything before the first boundary, so it would be ages 0 to 5 months. The next range is 6 to 11 months, and so on. The last range is 24 months and older.

The function that does this conversion is called “bucketized_column” because TensorFlow refers to these ranges as buckets.
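A sketch of that call. The boundary values below are inferred from the ranges described above and may differ from the course script:

```python
import tensorflow as tf

age = tf.feature_column.numeric_column("Age")

# Boundaries at 6, 12, 18, and 24 months produce 5 buckets:
# [0, 6), [6, 12), [12, 18), [18, 24), and 24 and older.
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[6, 12, 18, 24])
```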

Another way to create new features is to combine existing categorical features. For example, maybe people prefer different ages for dogs and cats. To test that theory, we could create a new feature that’s a combination of animal type and age buckets. To do this in code, you need to create what’s called a crossed_column. This combines two columns to create a new column with all of the possible combinations.

When you cross two columns, the number of combinations could be huge, which could create problems. That’s not the case with animal type and age buckets in this dataset because they have a small number of categories, but if we were combining columns that had a lot of categories, the number of combinations would be very large. This is a similar problem to what we saw with one-hot encoding. To reduce the dimensionality of the new feature, we need to encode the categories in a different way. In this case, we need to create a hash table.

Here’s how it works. For each combination of animal type and age buckets, it assigns an ID to the category by using a hash function. If you create a hash table that has a smaller number of possible values than the number of combinations for the new feature, then you’ve reduced the dimensionality of the feature. Of course, that means that more than one category will have the same ID, so the model will treat them as being the same category, even though they’re not. But that usually doesn’t affect the usefulness of the crossed feature too much. I’ve set the size of the hash table to 5 with the hash_bucket_size parameter. You wouldn’t normally have such a small hash table, but this is just for demonstration purposes.
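A sketch of the cross with the demonstration-sized hash table (column names follow the dataset; the bucket boundaries are the assumed values from earlier):

```python
import tensorflow as tf

age = tf.feature_column.numeric_column("Age")
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[6, 12, 18, 24])

# Cross the age buckets with the animal type. Each (type, bucket)
# combination is hashed into one of only 5 IDs, so some combinations
# will collide. A real model would use a larger hash table.
crossed = tf.feature_column.crossed_column([age_buckets, "Type"], hash_bucket_size=5)
```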

Alright, let’s run this script to see how well it does. First, we need to install some Python libraries that it needs.

Now go into the pets directory. Then copy this command from the readme file. It’ll take about 30 seconds, so I’ll fast forward to when it’s done. Okay, the final accuracy was about 74%, which isn’t as good as the iris accuracy, but this is a much more complex problem to model.

Are you curious about how much the engineered features improved the accuracy? You can find out by commenting out the bucketized age and the crossed column. Now save the file and run it again.

It comes to about 74%. So, it looks like the engineered features improved the accuracy by about half a percent, although that’s not necessarily the case. Why? Well, every training run is different because the initial weights are set randomly, so the accuracy varies even if you don’t change the code. Regardless, the engineered features don’t seem to have had a big effect on the accuracy of the model.

Quite often, it takes a lot of experimentation with different features to get significant improvements, but for some problems, even a small improvement is well worth the time it takes. For others, it’s not. It all depends on what you’re trying to achieve, which is why you should decide before you get started what is an acceptable accuracy rate, so you don’t spend a huge amount of time trying to squeeze a bit more accuracy out of your model.

One thing I didn’t mention is that I didn’t put any hidden layers in the model, so it’s not a deep neural network. It’s what’s known as a linear model rather than a deep model. The advantage of deep neural networks is that they can often “discover” features and relationships between features all by themselves. When it works, you don’t have to do as much feature engineering manually because the neural net will essentially do the feature engineering for you. However, that doesn’t always work.
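As a rough sketch of the distinction (the layer sizes are arbitrary, and the course script wires these up to the feature columns rather than raw inputs):

```python
import tensorflow as tf

# Linear model: inputs feed a single output node, with no hidden layers.
linear_model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Deep model: the same output node preceded by two hidden layers.
deep_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```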

Let’s add a couple of hidden layers to see if it performs better than our last run. Okay, it’s done. The accuracy is still pretty close to what it was before, so it looks like the hidden layers didn’t make much difference. For most real-world problems, you’ll need to experiment with both feature engineering and deep networks to find a combination that works well.

Before we move on, you should undo the changes you made to the script and save it again. I’m assuming your editor has an undo feature.

And that’s it for this lesson.

About the Author

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).