Training a Model with AI Platform

Contents

  • Introduction
      • Introduction (preview, 1m 44s)
  • Training Your First Neural Network
      • TensorFlow (12m 40s)
  • Improving Accuracy
  • Summary
      • Summary (8m 4s)

Difficulty: Intermediate
Duration: 1h 3m
Students: 3405
Rating: 4.7/5

Description

Machine learning is a hot topic these days, and Google has been one of the biggest newsmakers. Google’s machine learning is used behind the scenes every day by millions of people. When you search for an image on the web, use Google Translate on foreign-language text, or use voice dictation on your Android phone, you’re using machine learning. Now Google has launched AI Platform to give its customers the power to train their own neural networks.

This is a hands-on course where you can follow along with the demos using your own Google Cloud account or a trial account.

Learning Objectives

  • Describe how an artificial neural network functions
  • Run a simple TensorFlow program
  • Train a model using a distributed cluster on AI Platform
  • Increase prediction accuracy using feature engineering and hyperparameter tuning
  • Deploy a trained model on AI Platform to make predictions with new data

Updates

  • December 20, 2020: Completely revamped the course due to Google AI Platform replacing Cloud ML Engine and the release of TensorFlow 2.
  • November 16, 2018: Updated 90% of the lessons due to major changes in TensorFlow and Google Cloud ML Engine. All of the demos and code walkthroughs were completely redone.
Transcript

Now that you have some experience with TensorFlow scripts, it’s time to see how to run one on AI Platform.

So what is Google AI Platform? It’s a collection of services that you can use to develop, train, and deploy your machine learning models in the cloud. The suite is evolving, so I’m not going to cover every service, but here are three of the core ones.

First, when you’re developing a machine learning model, it can be very helpful to experiment with the code interactively and keep a record of what you’re doing. The easiest way to do that is to use a Jupyter notebook, which is an open-source, web-based application for running and sharing code. Google provides a service called AI Platform Notebooks that lets you run Jupyter notebooks on a virtual machine in GCP.

When you’ve developed a model that’s too big to train on a single virtual machine, you can use AI Platform Training, which gives you access to powerful compute resources.

Once you’ve trained a model successfully, you can deploy it as a service so that your applications can send it new instances of data and receive predictions about this new data. The AI Platform Prediction service gives you an easy way to set this up.

In this course, we’re going to focus on AI Platform’s training and prediction services. Okay, now I’ll show you how to run the iris script on AI Platform Training.

If you haven’t already installed the Google Cloud SDK on your computer, then do that first. The installation instructions are at https://cloud.google.com/sdk. You’ll probably need to do that outside of the virtual Python environment, though, so it would be best to do it in another terminal.
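Once the SDK is installed, a typical way to confirm it’s working and get authenticated looks something like this (a general sketch, not the exact steps shown in the demo):

    gcloud --version    # confirm the SDK is installed
    gcloud init         # log in and select your Google Cloud project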

To run a TensorFlow program on AI Platform, it has to be in a Python package rather than just an individual script file. Fortunately, it’s very easy to turn it into a package. All you have to do is create a file called “__init__.py” in the directory where your script resides. You don’t need to put anything in the file, but it needs to be there. I’ve included that file in this directory, so you don’t need to create it yourself. Okay, now you have to be in the parent directory of the package to run it, so go to the iris directory if you’re not there already.
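The resulting layout looks roughly like this (the script name iris.py follows from the module name used below; the annotations are just comments):

    iris/                 # run the gcloud commands from this parent directory
      trainer/
        __init__.py       # empty file that marks "trainer" as a Python package
        iris.py           # the TensorFlow training script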

The command to run it on AI Platform is “gcloud ai-platform”. Before running the TensorFlow program in the cloud, we should first test whether it will work with AI Platform. The way to do that is to put “local” after “gcloud ai-platform”. This runs your Python module locally on your own computer, but in an environment similar to the one it would use if you ran it in Google Cloud. You won’t be charged for anything you run locally, so it’s a good way to test your package before submitting a training job to the cloud.

After “local”, type “train” because you’re training a model. Next type “--module-name” and the name of your module, which is the directory name, “trainer”, dot, then the name of your script, but without the “.py” extension at the end, so just “iris”. Then “--package-path” and the path of the directory. Since the “trainer” directory is in the current directory, you can just say “trainer”, but if you were somewhere else, then you’d need to put in the full pathname. Then add the job-dir argument. 
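Putting those pieces together, the local test command looks something like this (the “output” directory passed to --job-dir is just an example value, not necessarily the one used in the demo):

    gcloud ai-platform local train \
      --module-name trainer.iris \
      --package-path trainer \
      --job-dir output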

It should take about 10 seconds before it starts to run. Okay, we got approximately the same result as before. Now, to run it in the cloud, first you need to have a Cloud Storage bucket so it has a place to upload your package. If you don’t already have one that you can use, then you should create one that starts with your project ID. Since Cloud Storage bucket names have to be globally unique across all Google Cloud customers, starting the bucket name with your project ID is a good way to make sure it’s a unique name. This command will get your project ID and put it in an environment variable.
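One common way to do that, assuming the variable is called PROJECT_ID, is:

    PROJECT_ID=$(gcloud config get-value project)   # read the active project ID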

Then you can create the bucket name with this command.
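For example, by appending a short suffix to the project ID (the “-ml” suffix and the BUCKET variable name are illustrative assumptions):

    BUCKET=gs://${PROJECT_ID}-ml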

You need to create the bucket in the same region as where you’re going to run your AI Platform jobs. To see which regions support AI Platform, go to this link.

Find the region in the list that’s closest to you, and then set the REGION variable to it. Now create your bucket in that region using this command.
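As a sketch, assuming us-central1 happens to be the closest supported region:

    REGION=us-central1
    gsutil mb -l $REGION $BUCKET    # create the bucket in that region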

Now that you have a bucket, you can submit your job. First, decide on a name for the job. You can call it whatever you want, but you won’t be able to use the same job name again in the future. One way to ensure it’s always a unique name is to include a timestamp, but let’s just use a simple name for now, like “iris1”. Let’s put the job name in an environment variable as well.
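For example:

    JOB_NAME=iris1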

The command to submit your job is “gcloud ai-platform jobs submit training”, then the job name.

Then add the same module-name and package-path arguments as before, then “--staging-bucket $BUCKET”, then “--region” and the name of the region where you created the bucket. Then you have to tell it which version of Python to use and which version of TensorFlow to use. 

Technically, this is the AI Platform Training runtime version rather than the TensorFlow version, but those two numbers are always the same. I’m mentioning this because it actually supports a couple of non-TensorFlow options as well: scikit-learn and XGBoost. The version numbers of those frameworks don’t match the runtime version. You can see all of the version information at this page.

I should also mention that if you want to train a model that uses a different machine learning framework than the ones supported here, such as PyTorch, then you can still use AI Platform Training, but you have to do it differently. Instead of using the --runtime-version argument, you need to use the --master-image-uri argument and point to a custom Docker container that has the desired machine learning framework and its dependencies installed in it.
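A minimal sketch of that variation might look like the following, where the container image path is hypothetical:

    gcloud ai-platform jobs submit training $JOB_NAME \
      --region $REGION \
      --master-image-uri gcr.io/$PROJECT_ID/pytorch-trainer:latest   # custom container with PyTorch installed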

Okay, let’s get back to the command. The final argument is the job directory. We’re going to save the trained model to your Cloud Storage bucket.
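Assembled from the arguments described above, the full command looks roughly like this (the Python and runtime version numbers and the job directory path are illustrative, so substitute the values that match your setup):

    gcloud ai-platform jobs submit training $JOB_NAME \
      --module-name trainer.iris \
      --package-path trainer \
      --staging-bucket $BUCKET \
      --region $REGION \
      --python-version 3.7 \
      --runtime-version 2.3 \
      --job-dir $BUCKET/iris_job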

If you haven’t enabled the machine learning API on this project, then it’ll ask you to do it now. Once it’s enabled, it automatically starts your job. This time it will take a lot longer than 10 seconds because it needs to spin up an environment to run your job. This command tells you what state the job is in, among other things. It says it’s preparing. It also gives you a link to view the logs in the console.
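The status check referred to here is presumably the describe command:

    gcloud ai-platform jobs describe $JOB_NAME   # shows the job state and a link to the logs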

It’s going to take a while for the job to be provisioned, so I’m going to fast forward to when it’s done.

Okay, it’s done. If you look at the timestamps of the log entries, you’ll see that it spent the vast majority of the time getting the environment and the job set up, and then it took about a minute to actually run the TensorFlow script. So there’s a lot of overhead when you run an AI Platform job and it can take way longer than running it on your local machine. Not only that, but you have to pay for it too. So why would you run your training jobs in the cloud instead of on your own machine? Well, because most machine learning jobs take far longer to run than this one, and if you tried to run them on your own computer, it could take days.

In the next lesson, we’ll look at a more complex model.

About the Author

Students: 192820
Courses: 99
Learning Paths: 169

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).