Machine learning is a hot topic these days, and Google has been one of the biggest newsmakers. Google’s machine learning is used behind the scenes every day by millions of people. When you search for an image on the web, use Google Translate on foreign-language text, or use voice dictation on your Android phone, you’re using machine learning. Now Google has launched AI Platform to give its customers the power to train their own neural networks.
This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.
Learning Objectives
- Describe how an artificial neural network functions
- Run a simple TensorFlow program
- Train a model using a distributed cluster on AI Platform
- Increase prediction accuracy using feature engineering and hyperparameter tuning
- Deploy a trained model on AI Platform to make predictions with new data
Resources
- The GitHub repository for this course is at https://github.com/cloudacademy/aiplatform-intro.
Updates
- December 20, 2020: Completely revamped the course due to Google AI Platform replacing Cloud ML Engine and the release of TensorFlow 2.
- November 16, 2018: Updated 90% of the lessons due to major changes in TensorFlow and Google Cloud ML Engine. All of the demos and code walkthroughs were completely redone.
I hope you enjoyed learning how to use AI Platform. Let’s do a quick review of what you learned.
With machine learning, you feed lots of real-world data into a program, and the program tries to make generalizations about the data. It then uses these generalizations to make predictions when it’s given new data. A regression model predicts a number. A classification model predicts a category.
A neural network takes in a variety of features, assigns initial weights to them, and runs a batch of items through the network. It then adjusts the weights to try to minimize the error, which is called the loss. Then it runs another batch of items through the network with the new weights to see what happens to the loss. It keeps going through this loop as many times as you tell it to, and hopefully, it comes up with a combination of weights that minimizes the loss and maximizes the accuracy of its predictions. Finally, it checks the final weights against a new batch of items.
An optimizer tells the model how to adjust the weights after every training pass. Cross-entropy is a common loss function for classification models. Categorical cross-entropy is what you need to use when you’re classifying into more than two categories. Sparse categorical cross-entropy means that each label is represented by a single integer rather than a one-hot vector.
A dataset is typically divided into a training dataset that contains 70-80% of the data and a test dataset that contains the other 20-30%. After you’ve trained your model with the training data, you evaluate its accuracy based on the test data, which it hasn’t seen before.
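Putting those pieces together, here’s a minimal sketch of the whole process (the synthetic data, layer sizes, and optimizer choice are illustrative assumptions, not the course’s exact code):

```python
import numpy as np
import tensorflow as tf

# Illustrative synthetic dataset: 1,000 items with 4 numeric features and a
# label that is one of 3 categories. The labels are single integers, which is
# why sparse categorical cross-entropy is the right loss function here.
features = np.random.rand(1000, 4).astype("float32")
labels = np.random.randint(0, 3, size=(1000,))

# 80/20 split: train on the first 800 items, hold out the rest for testing.
train_x, test_x = features[:800], features[800:]
train_y, test_y = labels[:800], labels[800:]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# The optimizer adjusts the weights after every training pass to minimize the loss.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Each epoch runs every training batch through the network and updates the weights.
model.fit(train_x, train_y, batch_size=32, epochs=5)

# Check the final weights against data the model hasn't seen before.
model.evaluate(test_x, test_y)
```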
TensorFlow is an open-source set of Python libraries that makes it easier to create neural networks. It comes with several high-level APIs, including Keras and tf.estimator, that greatly simplify many tasks, require much less code, and are easier to understand than using TensorFlow’s low-level API.
In the Keras API, a Sequential model contains a linear sequence of layers. A typical Sequential model includes an Input layer, some hidden Dense layers, and an Output layer. With a Dense layer, every node in the previous layer is connected to every node in the Dense layer.
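For instance, here’s a minimal sketch of such a model (the layer sizes and counts are arbitrary assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                     # Input layer: 10 features
    tf.keras.layers.Dense(64, activation="relu"),    # hidden Dense layer
    tf.keras.layers.Dense(32, activation="relu"),    # hidden Dense layer
    tf.keras.layers.Dense(3, activation="softmax"),  # Output layer: 3 categories
])

model.summary()  # prints each layer and how many weights it contains
```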
This example is considered to be a deep neural network because it has more than three layers. The great thing about deep networks is that they can often discover relationships between the features without any human intervention.
Your training data needs to include a correct answer for each data record. This is the prediction that you want your model to make, and it’s called the label.
Overfitting means that the model essentially memorized the training data and doesn’t generalize to new data.
Google AI Platform is a collection of services that you can use to develop, train, and deploy your machine learning models in the cloud. AI Platform Notebooks lets you run Jupyter notebooks on a virtual machine in GCP. AI Platform Training gives you access to powerful compute resources for training models. AI Platform Prediction gives you an easy way to deploy a trained model as a service.
AI Platform Training supports TensorFlow, scikit-learn, and XGBoost. If you want to train a model that uses a different machine learning framework, such as PyTorch, then you can use a custom container that has the desired machine learning framework and its dependencies installed in it.
Deciding which features to include in your model is known as feature selection. Columns in a dataset are generally either numerical or categorical.
An indicator column creates a separate feature for each category, using a one-dimensional array, or vector, to represent the categories. Each position in the vector stands for a category, and its value can be either a 0 or a 1. This is known as one-hot encoding because only one of the values is a 1 and the rest are zeroes.
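For example, with TensorFlow’s feature column API, it might look like this (the “color” feature and its vocabulary are made-up examples):

```python
import tensorflow as tf

# A made-up categorical feature with three possible values.
color = tf.feature_column.categorical_column_with_vocabulary_list(
    "color", ["red", "green", "blue"])

# One-hot encode it: "green" becomes the vector [0, 1, 0].
color_one_hot = tf.feature_column.indicator_column(color)
```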
A one-hot vector has high dimensionality, meaning that instead of having just one value, like a numeric feature, it has many values, or dimensions. This can make calculations take too long.
With an embedding column, each category has a smaller vector with values that are not usually 0. The values are weights, and each category has a set of weights. If two categories are very similar to each other, then their embedding vectors should be very similar too.
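A hypothetical sketch, again using the feature column API (the “product” feature and the embedding size are made-up examples):

```python
import tensorflow as tf

# A made-up categorical feature with five possible values.
product = tf.feature_column.categorical_column_with_vocabulary_list(
    "product", ["book", "laptop", "phone", "desk", "chair"])

# Represent each category as a dense vector of 2 learned weights
# instead of a 5-dimensional one-hot vector.
product_embedding = tf.feature_column.embedding_column(product, dimension=2)
```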
Creating new features based on existing ones is one of the core activities of feature engineering. You can convert a numeric feature to a categorical feature by dividing the values into ranges, or buckets. In TensorFlow, this is called a bucketized_column.
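For example (the “age” feature and the bucket boundaries are made-up):

```python
import tensorflow as tf

# A made-up numeric feature.
age = tf.feature_column.numeric_column("age")

# Divide the values into ranges: under 25, 25-49, 50-64, and 65 and up.
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[25, 50, 65])
```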
You can also create new features by combining existing categorical features. In TensorFlow, this is called a crossed_column. If the columns have a lot of categories, then the number of combinations can be very large. To reduce the dimensionality of the new feature, you hash the combinations into a fixed number of buckets.
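A sketch of that (the features, vocabularies, and bucket count are made-up):

```python
import tensorflow as tf

# Two made-up categorical features.
city = tf.feature_column.categorical_column_with_vocabulary_list(
    "city", ["london", "paris", "tokyo"])
weekday = tf.feature_column.categorical_column_with_vocabulary_list(
    "weekday", ["mon", "tue", "wed", "thu", "fri", "sat", "sun"])

# Cross them, hashing the 21 possible combinations into 10 buckets
# to keep the dimensionality of the new feature down.
city_x_weekday = tf.feature_column.crossed_column(
    [city, weekday], hash_bucket_size=10)
```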
Another way to improve a model is by experimenting with the hyperparameters. These are the settings for the training run, such as the number of hidden layers and the batch size.
AI Platform provides a way to tune hyperparameters automatically. First, you tell it what you’re trying to optimize, which is called the hyperparameter metric. This is typically set to “accuracy”. Then you tell it which hyperparameters you want to tune. Finally, you need to tell it which search algorithm to use. A random search chooses hyperparameter values at random. A grid search tries all values within a grid that you define. A Bayesian search makes intelligent guesses as to which are the best values to try.
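On the trainer side, one way to report that metric back to AI Platform is Google’s cloudml-hypertune package. A minimal sketch (the metric value and step are placeholders):

```python
import hypertune  # from the cloudml-hypertune package

# After evaluating the model, report the metric that AI Platform should
# optimize. The tag must match the hyperparameter metric named in the
# tuning configuration.
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag="accuracy",
    metric_value=0.95,   # placeholder: use the model's actual accuracy
    global_step=1000)
```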
When you run a distributed job, AI Platform spins up a training cluster, which is a group of virtual machines. Each of these is called a training instance or a node. When a trainer runs on an instance, it’s called a replica. One of these replicas is designated as the master. It manages the other replicas and it reports the status of the entire job. One or more of the other replicas are designated as workers. They each run a portion of the job. And finally, one or more of the replicas are designated as parameter servers. They keep track of the weights, or parameters, for the whole model. They’re only used if you choose an asynchronous strategy.
TensorFlow supports two types of distributed training: synchronous and asynchronous. With synchronous training, all of the workers keep a copy of the parameters, and the parameters are updated on all workers at the end of every training step. With asynchronous training, the workers run independently and send their parameter updates to the parameter servers.
The tf.distribute.Strategy API lets you specify which of these approaches you want to use; there’s a short sketch of the pattern after the strategy descriptions below.
ParameterServerStrategy is an asynchronous strategy.
The synchronous strategies differ mostly in the type of hardware they use. The main options are GPUs and TPUs. A TPU is a chip that was specially designed by Google to run machine learning jobs, and it’s much faster than a GPU.
Here are the synchronous strategies. MirroredStrategy distributes a job across multiple GPUs on a single machine. TPUStrategy distributes a job across multiple TPU cores. MultiWorkerMirroredStrategy distributes across multiple GPUs on multiple workers. CentralStorageStrategy stores the parameters on the CPU.
If you don’t specify a strategy, then it uses the Default strategy, which runs the job on one device. Similarly, the OneDeviceStrategy runs a job on one device, but it explicitly puts the parameters on the device.
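Whichever strategy you choose, the usage pattern is the same: create the strategy and build the model inside its scope. A minimal sketch with MirroredStrategy (the model itself is a placeholder):

```python
import tensorflow as tf

# Synchronous training across all GPUs on this machine.
strategy = tf.distribute.MirroredStrategy()

# The model's variables have to be created inside the strategy's scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit() then splits each batch across the GPUs automatically.
```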
In AI Platform, a model is a resource where you put different versions of a trained model. AI Platform supports two types of prediction services. Online prediction returns its predictions very quickly, so it’s typically used by applications that need a real-time response. Batch prediction is optimized for big jobs, and it takes longer to start up, but it tends to be cheaper overall.
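For example, an application can call the online prediction service through the Google API client library. A sketch (the project, model, and instance values are placeholders):

```python
from googleapiclient import discovery

# Build a client for the AI Platform Prediction service.
service = discovery.build("ml", "v1")

# Placeholders: substitute your own project and model names. If you omit a
# version, AI Platform uses the model's default version.
name = "projects/my-project/models/my-model"

# Each instance is one data record to make a prediction on.
response = service.projects().predict(
    name=name,
    body={"instances": [{"age": 42, "color": "red"}]}
).execute()

print(response["predictions"])
```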
To learn more about Cloud AI Platform, you can read Google’s online documentation. Also, watch for new machine learning courses on Cloud Academy, because we’re always creating new courses.
Please give this course a rating, and if you have any questions or comments, please let us know. Thanks!
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).