Introduction to Azure Machine Learning

Overfitting

Overview
Difficulty: Beginner
Duration: 50m
Students: 159
Rating: 4.6/5

Description

Machine learning is a notoriously complex subject that usually requires a great deal of advanced math and software development skills. That’s why it’s so amazing that Azure Machine Learning lets you train and deploy machine learning models without any coding, using a drag-and-drop interface. With this web-based software, you can create applications for predicting everything from customer churn rates to image classifications to compelling product recommendations.

In this course, you will learn the basic concepts of machine learning and then follow hands-on examples of choosing an algorithm, running data through a model, and deploying a trained model as a predictive web service.

Learning Objectives

  • Create an Azure Machine Learning workspace
  • Train a machine learning model using the drag-and-drop interface
  • Deploy a trained model to make predictions based on new data

Intended Audience

  • Anyone who is interested in machine learning

Prerequisites

  • General technical knowledge
  • A Microsoft Azure account is recommended (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)

Resources

The GitHub repository for this course is at https://github.com/cloudacademy/azureml-intro.



Transcript

In the last lesson, we built a model that had 98% accuracy, which seems great, but probably isn’t. The problem is that we used the same data to evaluate the model that we used to train it. Why is this a problem? Suppose you have these data points and you want to do a regression on them. If you were to do a linear regression, then it would look something like this. But suppose you tried to get a higher accuracy by doing a nonlinear regression and you ended up with this. This model would perfectly fit the training data, but if you were to run some new data points through the model, it would likely have lower accuracy than the linear model.
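The effect described above can be reproduced with a small sketch. This is not the model from the lesson: the data points are invented, the "nonlinear regression" is assumed to be a polynomial that passes through every training point, and error is measured as mean squared error rather than the accuracy score shown in the designer.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns a predict(x) function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """Lagrange polynomial that passes through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Invented toy data: the underlying trend is y = 2x, and each training
# label is nudged up or down by 1 to imitate noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 1.0, 5.0, 5.0, 9.0]
# New points the models never saw, taken from the clean trend.
test_x = [0.5, 1.5, 2.5, 3.5]
test_y = [1.0, 3.0, 5.0, 7.0]

line = fit_line(train_x, train_y)
curve = fit_interpolant(train_x, train_y)

print("train MSE  line:", mse(line, train_x, train_y),
      " curve:", mse(curve, train_x, train_y))
print("test MSE   line:", mse(line, test_x, test_y),
      " curve:", mse(curve, test_x, test_y))
```

On this toy data the curve has zero error on the training points, while the line does not. On the new points the ranking flips: the line's error is about 0.04 and the curve's is about 1.39.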

This is called overfitting, and it’s something you always have to watch out for in machine learning. One critical way to catch overfitting is to evaluate the model using a separate test dataset. That way, the accuracy score will be noticeably lower than it was on the training data if there’s an overfitting problem. The easiest way to come up with a test dataset is simply to split the original dataset into two pieces. You’d typically put 70-80% of the data in the training dataset and the other 20-30% in the test dataset.

Let’s do that with the automobile data. In the “Data Transformation” section, find the “Split Data” module. There it is. I’ll make some room for it first.

This is where you tell it how much of the data to put in the first dataset. It’s currently set to 0.5, which means that half of the data will go into the first dataset and half will go into the second one. We want 80% to go in the training dataset, so change this to 0.8. Also make sure that “Randomized split” is checked. Otherwise, it will simply put the first 80% of the rows in the training dataset, which would be a problem if the dataset were sorted in some way. In the comment field, type “80 / 20 split”. See how the comment shows up in the module? Comments are a handy way to see exactly what a module does without having to look at its properties.
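To see why the randomized option matters, here is a minimal sketch of the two behaviors in plain Python. It is not the Split Data module's implementation: the function name split_rows, the seed parameter, and the price values are all invented for illustration.

```python
import random


def split_rows(rows, fraction=0.8, randomized=True, seed=42):
    """Return (first, second), where first holds `fraction` of the rows.

    `seed` is an assumption added here so the shuffle is repeatable.
    """
    rows = list(rows)  # copy so the caller's list is left untouched
    if randomized:
        random.Random(seed).shuffle(rows)
    cut = round(len(rows) * fraction)
    return rows[:cut], rows[cut:]


# Invented stand-in for a dataset sorted by price: 100 ascending values.
prices = list(range(5000, 45000, 400))

seq_train, seq_test = split_rows(prices, 0.8, randomized=False)
rand_train, rand_test = split_rows(prices, 0.8, randomized=True)

# Without shuffling, the test set holds only the 20 most expensive rows,
# so the model would be evaluated on prices it never trained on.
print("sequential test range:", min(seq_test), "to", max(seq_test))
# With shuffling, the test rows are drawn from across the whole range.
print("randomized test range:", min(rand_test), "to", max(rand_test))
```

Both calls produce 80 training rows and 20 test rows. Only the sequential split leaves the test set covering a price range the training set never touches.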

Now delete the existing arrows under the dataset module by right-clicking them and selecting Delete. All right, now connect the dataset to the Split Data module. Then connect the left output, which is 80% of the data, to the Train module, and connect the right output, which is the remaining 20%, to the Score module.
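The wiring above follows a simple data flow: split, train on the first output, score on the second, then evaluate. The sketch below mirrors that flow in plain Python under several assumptions: the function names are hypothetical, the model is a one-feature least-squares line rather than the one used in the lesson, and the rows are invented and perfectly linear so the result is easy to check.

```python
import random


def split_rows(rows, fraction, seed):
    """Shuffle, then return (first, second) with `fraction` in first."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * fraction)
    return rows[:cut], rows[cut:]


def train_linear(rows):
    """Fit y = slope * x + intercept by least squares on (x, y) rows."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def score(model, rows):
    """Pair each actual label with the model's prediction."""
    slope, intercept = model
    return [(y, slope * x + intercept) for x, y in rows]


def evaluate(scored):
    """Mean absolute error over (actual, predicted) pairs."""
    return sum(abs(actual - pred) for actual, pred in scored) / len(scored)


# Invented dataset: 50 rows that follow y = 3x + 2 exactly.
rows = [(float(x), 3.0 * x + 2.0) for x in range(50)]

train_rows, test_rows = split_rows(rows, 0.8, seed=1)  # Split Data
model = train_linear(train_rows)                       # left output -> Train
scored = score(model, test_rows)                       # right output -> Score
print("mean absolute error on held-out rows:", evaluate(scored))
```

The point of the layout is that the rows used for scoring never reach the training function, so the evaluation reflects data the model has not seen.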

Now click Submit. Because we’ve changed the pipeline since the last time we ran it, let’s change the run description so we’ll know what we changed. Type “80/20 split” for the description. Then click Submit.

All right, it’s done. Let’s have a look at the evaluation. This time the accuracy is about 93.5%. That’s still very good, but it’s much lower than the 98% we got without splitting the dataset. So we likely had an overfitting problem before. That’s not surprising. You should always evaluate a model using data that it didn’t see during the training phase.

By the way, if you want to see where the run description shows up, click on Designer, then click on Pipeline runs. If you want to look at a previous run, then the description here is very helpful for finding the run you’re looking for.

And that’s it for overfitting.

About the Author
Students: 55097
Courses: 61
Learning paths: 63

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).