Introduction to Google Cloud Machine Learning Engine

Distributed Training on ML Engine

The course is part of these learning paths

Machine Learning on Google Cloud Platform


Machine learning is a hot topic these days and Google has been one of the biggest newsmakers. Recently, Google’s AlphaGo program beat the world’s No. 1 ranked Go player. That’s impressive, but Google’s machine learning is being used behind the scenes every day by millions of people. When you search for an image on the web or use Google Translate on foreign language text or use voice dictation on your Android phone, you’re using machine learning. Now Google has launched Cloud Machine Learning Engine to give its customers the power to train their own neural networks.

If you look in Google’s documentation for Cloud Machine Learning Engine, you’ll find a Getting Started guide. It gives a walkthrough of the various things you can do with ML Engine, but it says that you should already have experience with machine learning and TensorFlow first. Those are two very advanced subjects, which normally take a long time to learn, but I’m going to give you enough of an overview that you’ll be able to train and deploy machine learning models using ML Engine.

This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.

Learning Objectives

  • Describe how an artificial neural network functions
  • Run a simple TensorFlow program
  • Train a model using a distributed cluster on Cloud ML Engine
  • Increase prediction accuracy using feature engineering and both wide and deep networks
  • Deploy a trained model on Cloud ML Engine to make predictions with new data



Updates

  • Nov. 16, 2018: Updated 90% of the lessons due to major changes in TensorFlow and Google Cloud ML Engine. All of the demos and code walkthroughs were completely redone.


Alright, it’s finally time to get back to Cloud ML Engine. The biggest reason for using ML Engine is that you can train a model on a cluster of servers instead of just one. This is known as distributed training. Most real-world datasets for machine learning are much larger than the ones we’ve been using in this course. Trying to train models using these datasets would take far too long on a single machine, so you’ll likely need to do distributed training most of the time.


We’re still going to use a small dataset to learn how to run distributed training, though. Google uses the census example in its Getting Started guide.


One of the great things about the tf.estimator library is that it supports distributed training, so if you use it, you don’t have to worry about how to distribute your code across a compute cluster. Specifically, the tf.estimator.train_and_evaluate function takes care of training a model using a cluster. If you used the low-level TensorFlow API, you’d have to do all of this yourself.
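To make that concrete, here's a rough sketch of the pattern, with a toy numeric feature and random data standing in for the census inputs (the estimator type, spec classes, and `train_and_evaluate` call are from the tf.estimator API; everything else is illustrative):

```python
# Minimal sketch of distributed-ready training with tf.estimator.
import numpy as np
import tensorflow as tf

def input_fn():
    # Tiny in-memory dataset standing in for the real census data.
    features = {'x': np.random.rand(100).astype(np.float32)}
    labels = np.random.randint(0, 2, 100).astype(np.int32)
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    return ds.shuffle(100).batch(10).repeat()

estimator = tf.estimator.DNNClassifier(
    feature_columns=[tf.feature_column.numeric_column('x')],
    hidden_units=[16, 8],
    model_dir='/tmp/census_sketch')

# train_and_evaluate reads the TF_CONFIG environment variable that
# ML Engine sets on each training instance, so this same code runs
# unchanged on one machine or across a cluster.
train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=50)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=5)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```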


You’ll notice that the majority of the task script just defines the command-line arguments you can pass to it. You’ve seen some of them already, like train-files and model-type, but there are lots of other ones, too. Many of them are what are called hyperparameters. These are essentially the settings for the training run. For example, the number of hidden layers is a hyperparameter. It’s part of the model, but it’s something you set ahead of time, rather than something that the model learns during its training run. Another example is the embedding_size, which is the number of embedding dimensions for categorical columns.
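A trainer script typically exposes these settings as command-line flags. A minimal sketch of that pattern (the flag names follow the census example; the defaults are illustrative, not the ones Google ships):

```python
# How a trainer script commonly defines its hyperparameter flags.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--train-files', nargs='+', help='Training data paths')
parser.add_argument('--model-type', default='wide_deep')
# Hyperparameters: set before training, not learned during it.
parser.add_argument('--num-layers', type=int, default=4,
                    help='Number of hidden layers')
parser.add_argument('--embedding-size', type=int, default=8,
                    help='Embedding dimensions for categorical columns')

# Simulate passing a flag on the command line.
args = parser.parse_args(['--embedding-size', '16'])
print(args.embedding_size)  # 16
```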


Deciding what settings to use for the various hyperparameters is often a guess. To tune a hyperparameter, you need to do an entire training run and see how it performs, then adjust the hyperparameter and do another training run, and so on. This can be time-consuming and tedious, so ML Engine provides a way to tune hyperparameters automatically. It does require code changes, though. It can also be costly because ML Engine has to run many trials with different values for the hyperparameter in question to see which value gives the best results. For this reason, you can set the maximum number of trials to run. If you’re interested in tuning hyperparameters, check out Google’s Overview of Hyperparameter Tuning.


Alright, now we’re going to run it, but first you need to copy the data files to the Google Cloud bucket you created earlier.


If the BUCKET environment variable that you set earlier still exists, then you can type “gsutil cp -r gs://cloudml-public/census/data $BUCKET”. If it’s not set anymore, then please set it again to the URL of your Cloud Storage bucket before you run this command. You’ll also need to set the REGION environment variable again if it’s not still set.
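Put together, the setup looks like this (the bucket name and region below are placeholders; substitute the ones you created earlier):

```shell
# Placeholder values -- use your own bucket and region from earlier.
BUCKET=gs://my-ml-engine-bucket
REGION=us-central1
# Copy Google's sample census data into your own bucket.
gsutil cp -r gs://cloudml-public/census/data $BUCKET
```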


Now create two more environment variables pointing to the two data files. One more environment variable that will be helpful is JOB. Set it to what you want to call this job. Remember that each ml-engine job has to have a unique name. I’ll call mine “census1”.
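For example (the file paths are assumptions based on where the copy command above puts the data; adjust them to match your bucket):

```shell
BUCKET=gs://my-ml-engine-bucket   # placeholder -- your bucket from earlier
# Paths assume the census data was copied into $BUCKET/data.
TRAIN_DATA=$BUCKET/data/adult.data.csv
EVAL_DATA=$BUCKET/data/adult.test.csv
JOB=census1                       # must be unique for every ml-engine job
```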


OK, we’re finally ready to type the command to run the training job. Make sure you’re in the census/estimator directory first. You’ll definitely want to copy this command from my GitHub page.
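The exact command is on the GitHub page, but a representative version, modeled on Google's Getting Started guide, looks like this (the variable values are placeholders from the earlier steps, and the step counts are illustrative):

```shell
# Placeholder values carried over from the earlier steps.
BUCKET=gs://my-ml-engine-bucket
REGION=us-central1
JOB=census1
TRAIN_DATA=$BUCKET/data/adult.data.csv
EVAL_DATA=$BUCKET/data/adult.test.csv

# Flags before the bare "--" go to ml-engine; the rest go to the script.
gcloud ml-engine jobs submit training $JOB \
    --job-dir $BUCKET/$JOB \
    --module-name trainer.task \
    --package-path trainer/ \
    --region $REGION \
    --scale-tier STANDARD_1 \
    -- \
    --train-files $TRAIN_DATA \
    --eval-files $EVAL_DATA \
    --train-steps 1000 \
    --eval-steps 100
```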


While we’re waiting for the job to run, I’ll explain the big command we entered. Most of the arguments are pretty self-explanatory, but there are a couple of things to note. First, the arguments at the top are for ml-engine and the ones at the bottom are for the TensorFlow script. You have to separate them with two dashes, like this.


Second, the scale-tier flag is what you use to tell ml-engine what the distributed environment should look like. If you don’t set this flag, then it defaults to the BASIC tier, which runs on only one VM instance. The STANDARD_1 tier, which we specified here, runs on many workers and a few parameter servers.


So what are workers and parameter servers? Let’s take a step back and go over what a distributed environment looks like. When you run a distributed job, ML Engine spins up a training cluster, which is a group of virtual machines. Each of these is called a training instance or a node. Since we’ve already used the term node in neural networks, I’ll stick with the term training instance.


ML Engine installs your Python package and its dependencies on each instance. When this trainer runs, it’s called a replica. One of these replicas is designated as the master. It manages the other replicas and it reports the status of the entire job.


One or more of the other replicas are designated as workers. They each run a portion of the job.


And finally, one or more of the replicas are designated as parameter servers. They coordinate the shared model state between the workers.


If we weren’t using the high-level API in the script, then we’d have to write code to create the cluster and divide up the work between the instances. With the tf.estimator library and ML Engine, all we have to do is specify a scale-tier and it takes care of all of that for us.


If you want anything other than a BASIC or STANDARD_1 tier, then you’ll probably need to create a CUSTOM tier. This allows you to specify the number of workers and parameter servers, as well as what types of machines to use.
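A CUSTOM tier is defined in a config file passed to the job with the `--config` flag. A sketch of what that file might contain (the machine types and counts here are examples, not recommendations):

```yaml
# config.yaml -- hypothetical CUSTOM tier
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  workerCount: 4
  parameterServerType: standard
  parameterServerCount: 2
```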


One interesting option is to use GPU-enabled machines. Due to the massively parallel architecture of graphics processing units, they can dramatically speed up the training time for image classification, video analysis, and other highly parallel workloads. You do have to make some code changes to make it work, though.


OK, I’ve fast forwarded to when this job completed, so we can check the results. Let’s look at the logs in the Cloud Console. First, run this command. Then go to this URL in your browser.


Here, you can see the status of your job. If you click on this link, you can see the logs in Stackdriver, which is a lot nicer than looking through them in your terminal window.
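If you'd rather stay in the terminal, the standard ways to check on a job are the `describe` and `stream-logs` subcommands (the job name below is the placeholder from earlier):

```shell
JOB=census1   # the job name you chose earlier
# Show the job's status and configuration.
gcloud ml-engine jobs describe $JOB
# Stream the job's logs into the terminal (Stackdriver shows the same logs).
gcloud ml-engine jobs stream-logs $JOB
```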


The last log entries don’t show the accuracy, so you have to scroll up a bit to see it. It’s easy to miss, but it’s right after the evaluation phase finished. The accuracy was 0.837, which is pretty close to what we got when we ran it locally.


You would normally run distributed training jobs on ML Engine only when your local machine would take too long on its own, so this isn’t a very realistic example, but it does show you how to run a distributed job, and it shouldn’t cost much.


Speaking of which, how can you tell how much a job will cost? On Google’s pricing page, it says that training in the US costs 49 cents per hour per training unit. So what’s a training unit? It’s essentially a measure of the size of the scale tier that you use to run a job. For this job, we used the STANDARD_1 tier, which is 5.9234 training units.


To calculate the price, we first need to multiply the number of training units by the number of hours to get the number of training unit-hours. However, there’s a 10-minute minimum charge, so since this job only took 5 minutes to run, it’s billed as 10 minutes, which is 10/60 of an hour. This comes to about 0.99 training unit-hours.


On the job details page, it has a field called “Consumed ML units”, which says 0.99. It should really say “Consumed training unit-hours” because that’s what it actually is. Anyway, to get the cost of the job, we multiply the 0.99 training unit-hours by 49 cents, so we get about 49 cents. That’s pretty cheap considering the size of the cluster that it spun up to run this job.
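The arithmetic from the last two paragraphs, worked through in full:

```python
# Reproducing the lesson's cost calculation for the STANDARD_1 job.
training_units = 5.9234     # size of the STANDARD_1 scale tier
billed_hours = 10 / 60      # 5-minute job, billed at the 10-minute minimum
unit_hours = round(training_units * billed_hours, 2)
print(unit_hours)           # 0.99 -- the "Consumed ML units" figure
cost = unit_hours * 0.49    # $0.49 per training-unit-hour (US pricing)
print(round(cost, 2))       # 0.49
```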


And that’s it for this lesson.

About the Author


Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).