
Feature Engineering

This course is part of the Machine Learning on Google Cloud Platform learning path.

Contents

  • Introduction
  • Training Your First Neural Network
  • Improving Accuracy
  • Conclusion
Overview

Difficulty: Intermediate
Duration: 55m
Students: 1152
Rating: 4.9/5

Description

Machine learning is a hot topic these days and Google has been one of the biggest newsmakers. Recently, Google’s AlphaGo program beat the world’s No. 1 ranked Go player. That’s impressive, but Google’s machine learning is being used behind the scenes every day by millions of people. When you search for an image on the web or use Google Translate on foreign language text or use voice dictation on your Android phone, you’re using machine learning. Now Google has launched Cloud Machine Learning Engine to give its customers the power to train their own neural networks.

If you look in Google’s documentation for Cloud Machine Learning Engine, you’ll find a Getting Started guide. It gives a walkthrough of the various things you can do with ML Engine, but it says that you should already have experience with machine learning and TensorFlow first. Those are two very advanced subjects, which normally take a long time to learn, but I’m going to give you enough of an overview that you’ll be able to train and deploy machine learning models using ML Engine.

This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.

Learning Objectives

  • Describe how an artificial neural network functions
  • Run a simple TensorFlow program
  • Train a model using a distributed cluster on Cloud ML Engine
  • Increase prediction accuracy using feature engineering and both wide and deep networks
  • Deploy a trained model on Cloud ML Engine to make predictions with new data

Resources

Updates

  • Nov. 16, 2018: Updated 90% of the lessons due to major changes in TensorFlow and Google Cloud ML Engine. All of the demos and code walkthroughs were completely redone.

Transcript

The iris dataset is great for learning the basics of neural networks, but it’s far simpler than most datasets that you’ll use machine learning on. It only has four features and all of them are decimal numbers. Let’s have a look at a more complicated dataset. It comes from the 1996 US census and it includes lots of different features about each person in it. Here are the columns.

 

The target for this dataset is the last column, income. The goal for our model will be to use the other features to predict whether a person makes more or less than $50,000.

 

One common question is, “What the heck is the fnlwgt column?” It looks like it might stand for “final weight”, but people don’t have to give their weight in a census, and I don’t know what units that weight would be in. Maybe grams? Well, it’s actually a statistical weight that indicates how many people in the US population are represented by this record. So, for the first record, there are about 77,000 people who have similar characteristics to this person. It’s a little more complicated than that, but we don’t need to know the details for our purposes here.

 

With so many features, we need to decide which ones to use in our model and whether or not we can modify any of them or extract new features from the existing ones. That can take a lot of thought and experimentation, but one thing we can be pretty sure of is that we don’t need to include the final weight feature.

 

Here’s a list that shows what type each column is, either continuous or categorical. Continuous means that it can be any numeric value in a range. For example, age can be any integer from 0 to 122 (since that’s how old the oldest person who ever lived was). The columns in the iris dataset are all continuous as well.

 

Most of the columns in the census dataset are categorical, which means that their values are categories. For example, the education column contains 16 categories, such as High School Grad, Bachelors, Masters, etc.

 

Let’s see what they did with the features. Google’s tutorial code is in the census directory. I’ve made a few changes to it so I can show you some things later that aren’t in Google’s tutorial. The code is broken into two scripts: task and model. The model script builds the model and the task script runs it.

 

Before we get started, I should point out that these scripts are used for both this lesson and the “Wide & Deep Learning” one, so quite a bit of the code isn’t needed for this lesson. In this video, we’ll only talk about the wide model. I’ll explain the deep model parts of the code in the next lesson.

 

Continuous and categorical features are usually treated differently. Let’s start with the continuous columns. They’re defined here. The definitions are straightforward. They’re defined as numeric_columns, just like the features in the iris dataset. OK, that takes care of the continuous columns. Now, what do we do with the categorical columns? Well, those are not so straightforward.
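
To make that concrete, here’s a rough sketch of continuous column definitions using the TensorFlow 1.x feature-column API. The variable names are illustrative and may not match the script exactly:

    import tensorflow as tf

    # Continuous census columns are plain numeric columns,
    # just like the four iris features.
    age            = tf.feature_column.numeric_column('age')
    education_num  = tf.feature_column.numeric_column('education_num')
    capital_gain   = tf.feature_column.numeric_column('capital_gain')
    capital_loss   = tf.feature_column.numeric_column('capital_loss')
    hours_per_week = tf.feature_column.numeric_column('hours_per_week')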

 

Take education, for example. How would you convert “Bachelors” into something a neural network could use? One way would be to just assign a number to each category. In fact, that has already been done for you in the census dataset. There’s a column called “education_number” with values ranging from 1 to 16, and each one represents an education level. For example, the number 13 represents “Bachelors”.

 

You could have an input node in your network that feeds in that number, and it would even be useful in making predictions, because generally speaking, the higher a person’s education level, the higher their income. However, that wouldn’t work so well with a column like “marital_status”, would it? How would you put these categories in numerical order in such a way that it would help our model learn? There probably is some sort of relationship between marital status and income, but it’s probably not straightforward, and that is exactly the sort of relationship that we want our neural network to figure out, not one that we would tell it in advance.

 

A better way to handle categories is to make each one a separate feature. Rather than having to define each one of them separately, though, which would take a lot of tedious coding, you just specify which columns are categorical, and TensorFlow will turn them into multiple features.

 

There are two ways to do this. If there are a small number of categories in a column, and you already know what they are, then you can specify them using the “categorical_column_with_vocabulary_list” method. This is what this script does with most of the categorical columns. For example, the gender column only has two categories, or keys, so it lists them here. This converts the gender feature into two features, called “female” and “male”.
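
Here’s a rough sketch of that kind of definition (again assuming the TF 1.x feature-column API; the exact category strings depend on how the data is read in):

    # A categorical column whose small set of values is known in advance.
    gender = tf.feature_column.categorical_column_with_vocabulary_list(
        'gender', ['female', 'male'])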

 

For the occupation and native_country columns, this script uses a shortcut, the “categorical_column_with_hash_bucket” function. This is handy when you don’t know what all of the categories are ahead of time or there are too many categories to list easily.

 

As the neural network reads in data during the training process, whenever it encounters a new category in a categorical column, it creates a new feature for that category. It assigns an ID to the category by using a hash function. That’s why you need to specify the size of the hash table with the hash_bucket_size parameter. These have been set to 100.

 

The exact size of the hash table doesn’t matter too much, but it’s best to make it comfortably bigger than the number of categories, which is why it’s set to 100 here. If the table is too small, you could get a lot of collisions, that is, a lot of categories ending up with the same feature ID, so the model would treat them as the same category even though they’re not.
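
As a sketch, the hash-bucket definitions look something like this (variable names are illustrative):

    # Categorical columns where the values aren't listed explicitly.
    # Each value is hashed into one of 100 buckets; 100 is comfortably
    # larger than the real number of categories, so collisions are unlikely.
    occupation = tf.feature_column.categorical_column_with_hash_bucket(
        'occupation', hash_bucket_size=100)
    native_country = tf.feature_column.categorical_column_with_hash_bucket(
        'native_country', hash_bucket_size=100)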

 

Now that you know how categorization works, you might want to consider creating new categorical features based on existing features. This is often a good way to make better use of a continuous column.

 

Sometimes the relationship between a continuous column and the target feature you’re trying to predict is linear, such as between square footage and home value. Generally speaking, the higher the square footage, the higher the home value. In other cases, though, the relationship is not so simple. For example, at first you might think that a person’s income would generally rise as they get older, but that’s not always the case, because most people have a lower income after they retire. There’s no way to model a rising and then falling income with a linear relationship, so we have to change the age feature somehow to allow the network to model it properly.

 

The way to do that is to convert age from a continuous feature to a categorical feature. One approach would be to divide people into an under 65 category and a 65 and over category. Then the neural network could treat these as separate features and discover different relationships between them and their income levels.

 

That would definitely help, but we could categorize the age column further by creating a number of different age ranges. This would help with modeling other age/income relationships, such as how income growth often slows down in the later years of a person’s career.

 

In this script, there are 11 age ranges, based on these boundaries. The first range is everything before the first boundary, so it would be ages 17 and under. The next range is 18 to 24, and so on.

 

The function that does this conversion is called “bucketized_column” because TensorFlow refers to these ranges as buckets.
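
As a sketch, it looks something like this. The boundary values shown are the ones used in Google’s published census sample and may differ slightly from the script in this course:

    # Convert the continuous age column into 11 ranges (buckets).
    # Ten boundaries produce eleven buckets; the first bucket is
    # everything below 18, the next is 18 to 24, and so on.
    age_buckets = tf.feature_column.bucketized_column(
        age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])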

 

Another way to create new features is to combine existing categorical features. For example, I mentioned earlier that a higher education level generally correlates with a higher income, but the size of the increase often depends on other factors, especially the type of occupation the person has. For example, having a Master's degree may have more of an impact on income if the person is a manager than if they’re a cleaner. If you don’t do any feature engineering, then the model will only learn one weight for having a Master’s degree, regardless of the person’s occupation.

 

To enable the model to learn different weights for different combinations of education and occupation, you can create what’s called a crossed_column. This combines two columns to create a new column with all of the possible combinations. That’s what the script does here with education and occupation. Because it’s creating a new categorical column, you need to create another hash table, but this time, the number of categories could be huge, so you have to create a much larger hash table. Here, it’s set to a size of 10,000, in scientific notation.

 

You can even do this with more than two columns, which is what it does here with age_buckets, education, and occupation. With three columns, the number of combinations is even bigger, so the hash bucket size is set to 1,000,000.
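
Here’s a sketch of both crossed columns, with the hash table sizes just described, using the age_buckets column from the earlier sketch (variable names are illustrative):

    # Cross education and occupation so the model can learn a separate
    # weight for every education/occupation combination.
    education_x_occupation = tf.feature_column.crossed_column(
        ['education', 'occupation'], hash_bucket_size=int(1e4))

    # A three-way cross has far more possible combinations,
    # so it gets a much larger hash table.
    age_buckets_x_education_x_occupation = tf.feature_column.crossed_column(
        [age_buckets, 'education', 'occupation'], hash_bucket_size=int(1e6))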

 

Alright, let’s run this script to see how well it does. Go into the census/estimator directory. Then copy this command from the readme file. Note the “--model_type=wide” argument. If you don’t set model_type to wide, then it will run the wide and deep model, which we’ll be going over in the next lesson.
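
For context, a “wide” model in TensorFlow’s wide-and-deep terminology is a linear model over these sparse columns. Assuming the script follows that standard pattern, its core looks roughly like this, continuing from the column sketches above (the real model.py will differ in its details):

    # The wide model is a linear classifier over the sparse and crossed columns.
    wide_columns = [
        gender, occupation, native_country, age_buckets,
        education_x_occupation, age_buckets_x_education_x_occupation,
        # ...plus the remaining categorical columns
    ]
    estimator = tf.estimator.LinearClassifier(feature_columns=wide_columns)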

 

It will take about 30 seconds, so I’ll fast forward to when it’s done. You can see that it prints out a lot more than the iris script did. You need to scroll up to see the results. This one lists a bunch of different measures, but the one we care about the most is the very first one, which is accuracy. It came to 82.8%, which isn’t as good as the iris accuracy, but this is a much more complex problem to model.

 

The next metric is kind of interesting, too. The accuracy_baseline is the accuracy rate you want to beat. In this case, the baseline is what the accuracy would be if you always guessed that a person made less than $50,000, which is true for 76% of the people in the dataset. That puts our accuracy in perspective, doesn’t it? 82.8% is good, but considering that we could get 76% with a simple guess, it’s not quite as good as it seemed at first.

 

Are you curious about how much the engineered features improved the accuracy? You can find out by deleting them from the wide_columns array, which tells the model which columns to include. You can just remove the crossed columns and replace the bucketized age with the regular age column.
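
As an illustration of that edit (the actual wide_columns list in the script may be named and ordered differently):

    # Before: engineered features included
    wide_columns = [
        gender, occupation, native_country,
        age_buckets,
        education_x_occupation, age_buckets_x_education_x_occupation,
    ]

    # After: crossed columns removed, bucketized age replaced by plain age
    wide_columns = [
        gender, occupation, native_country,
        age,
    ]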

 

Now save the file and run it again. It comes to about 82.6%. So, the engineered features improved the accuracy by 0.2 percentage points, which is helpful, but not fantastic. Quite often, it takes a lot of experimentation with different features to get significant improvements, but for some problems, even a small improvement is well worth the time it takes. For others, it’s not. It all depends on what you’re trying to achieve, which is why you should decide before you get started what an acceptable accuracy rate is, so you don’t spend a huge amount of time trying to squeeze a bit more accuracy out of your model.

 

Before we move on, you should undo the change you made to the script and save it again. I’m assuming your editor has an undo feature.

 

And that’s it for this lesson. In the next lesson, we’ll try to improve the accuracy in a different way.

About the Author

Students: 16235
Courses: 41
Learning paths: 21

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).