Distributed Training on AI Platform



Machine learning is a hot topic these days and Google has been one of the biggest newsmakers. Google’s machine learning is being used behind the scenes every day by millions of people. When you search for an image on the web or use Google Translate on foreign language text or use voice dictation on your Android phone, you’re using machine learning. Now Google has launched AI Platform to give its customers the power to train their own neural networks.

This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.

Learning Objectives

  • Describe how an artificial neural network functions
  • Run a simple TensorFlow program
  • Train a model using a distributed cluster on AI Platform
  • Increase prediction accuracy using feature engineering and hyperparameter tuning
  • Deploy a trained model on AI Platform to make predictions with new data

Updates

  • December 20, 2020: Completely revamped the course due to Google AI Platform replacing Cloud ML Engine and the release of TensorFlow 2.
  • November 16, 2018: Updated 90% of the lessons due to major changes in TensorFlow and Google Cloud ML Engine. All of the demos and code walkthroughs were completely redone.

The biggest reason for using AI Platform is that you can train a model on a cluster of servers instead of just one. This is known as distributed training. Most real-world datasets for machine learning are much larger than the ones we’ve been using in this course. Trying to train models using these datasets would take far too long on a single machine, so you’ll likely need to do distributed training most of the time. 

To run a distributed training job on AI Platform, you need to add the --scale-tier flag to the job submission command. This tells AI Platform what the distributed environment should look like. If you don’t set this flag, it defaults to the BASIC tier, which runs on only one VM instance. The STANDARD_1 tier runs on several workers and a few parameter servers.

If you want anything other than one of the predefined tiers, then you’ll need to create a CUSTOM tier. This allows you to specify the number of workers and parameter servers, as well as what types of machines to use.
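As a sketch, a CUSTOM-tier job can be described in a configuration file. The field names below follow the AI Platform training input schema, but the machine types and counts are purely illustrative; choose values that fit your workload:

```yaml
# config.yaml -- example training input for a CUSTOM scale tier.
# Machine types and counts here are illustrative only.
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-8
  workerType: n1-standard-8
  workerCount: 4
  parameterServerType: n1-standard-4
  parameterServerCount: 2
```

You would then point the job submission at this file with the --config flag of gcloud ai-platform jobs submit training.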

So what are workers and parameter servers? When you run a distributed job, AI Platform spins up a training cluster, which is a group of virtual machines. Each of these is called a training instance or a node. Since we’ve already used the term node in neural networks, I’ll stick with the term training instance.

AI Platform installs your Python package and its dependencies on each instance. Each running copy of the trainer is called a replica. One of these replicas is designated as the master: it manages the other replicas and reports the status of the entire job.

One or more of the other replicas are designated as workers. They each run a portion of the job.

And finally, one or more of the replicas are designated as parameter servers. I’ll explain what these do in a minute.

TensorFlow supports two types of distributed training: synchronous and asynchronous. With synchronous training, all of the workers keep a copy of the parameters, and the parameters are updated on all workers at the end of every training step. Thus, the workers run their training steps at the same time and stay in sync with each other. The disadvantage of this strategy is that making all the workers stay in sync can slow down the training job.

With asynchronous training, the workers run independently and send their parameter updates to one or more parameter servers. The disadvantage of asynchronous training is that you need more machines, and there is a lot of network traffic between the workers and the parameter servers.

The tf.distribute.Strategy API lets you specify which of these approaches you want to use. 

At the moment, the only asynchronous strategy it supports is ParameterServerStrategy, which works in the way I just described. 

The API offers quite a few synchronous strategies that mostly depend on the type of hardware you want to use.

One option is to use machines with GPUs. Due to the massively parallel architecture of graphics processing units, they can dramatically speed up the training time for image classification, video analysis, and other highly parallel workloads. 

Another option is to use a TPU, which stands for Tensor Processing Unit. This is a chip that was specially designed by Google to run machine learning jobs. Benchmarks have shown that it can be cheaper and much faster than using 8 GPUs. Whether you use GPUs or TPUs, you do have to make some code changes, though.

The first synchronous strategy is MirroredStrategy, which distributes a job across multiple GPUs on a single machine. TPUStrategy distributes a job across multiple TPU cores. MultiWorkerMirroredStrategy distributes a job across multiple GPUs on multiple workers.
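As a minimal sketch (assuming TensorFlow 2.x), using a synchronous strategy mostly comes down to constructing it and then building the model inside its scope. The tiny model here is a throwaway example:

```python
import tensorflow as tf

# MirroredStrategy uses all local GPUs; with none present, it
# falls back to a single replica on the CPU.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Model creation must happen inside the strategy's scope so that
# the parameters (variables) are mirrored across the replicas.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then runs each training step synchronously
# across all replicas.
```

Swapping in TPUStrategy or MultiWorkerMirroredStrategy follows the same pattern; only the strategy constructor (and the cluster setup it needs) changes.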

The CentralStorageStrategy is also synchronous, but it works a bit differently. Instead of mirroring the parameters to all of the GPUs in a machine, it stores them on the CPU.

There are also a couple of strategies that don’t actually distribute a job. If you don’t specify a strategy, then TensorFlow uses the default strategy, which runs the job on one device. Similarly, the OneDeviceStrategy runs a job on one device, but it explicitly puts the parameters on that device. This strategy is mostly used for testing before switching to a “real” distribution strategy.
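A minimal sketch of the non-distributing strategies (again assuming TensorFlow 2.x; the variable is just a placeholder):

```python
import tensorflow as tf

# OneDeviceStrategy explicitly places everything on one named device.
strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")
print(strategy.num_replicas_in_sync)  # nothing is distributed

with strategy.scope():
    v = tf.Variable(1.0)  # placed on /cpu:0

# Outside any strategy scope, the "default" strategy is in effect:
default = tf.distribute.get_strategy()
```

Because the code structure is identical to the distributed case, you can develop and test with OneDeviceStrategy and then swap in a real distribution strategy without restructuring your trainer.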

And that’s it for this lesson.

About the Author

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).