Cleaning Data
Difficulty
Beginner
Duration
50m
Students
5407
Ratings
4.7/5
Description

Machine learning is a notoriously complex subject that usually requires a great deal of advanced math and software development skills. That’s why it’s so amazing that Azure Machine Learning lets you train and deploy machine learning models without any coding, using a drag-and-drop interface. With this web-based software, you can create applications for predicting everything from customer churn rates to image classifications to compelling product recommendations.

In this course, you will learn the basic concepts of machine learning and then follow hands-on examples of choosing an algorithm, running data through a model, and deploying a trained model as a predictive web service.

Learning Objectives

  • Create an Azure Machine Learning workspace
  • Train a machine learning model using the drag-and-drop interface
  • Deploy a trained model to make predictions based on new data

Intended Audience

  • Anyone who is interested in machine learning

Prerequisites

  • General technical knowledge
  • A Microsoft Azure account is recommended (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)

Resources

The GitHub repository for this course is at https://github.com/cloudacademy/azureml-intro.



Transcript

We were able to get a very high accuracy with the model we created, but can we improve it without overfitting? Let’s take another look at the dataset. Click on the Automobile price data module. In the description, it says, “Clean missing data module required.” Whoever created this dataset is saying that it has missing data that we need to clean up. Despite what it says, cleaning the missing data is not actually required, but it might help our model.

Let’s see if we can find this missing data. Let’s have a look at the make column again. Under Statistics, it tells us how many missing values there are in this column. It says ‘0’, so there’s no problem here.

How about normalized-losses? Wow, it has 41 missing values. What does “normalized losses” mean, anyway? Well, it’s a statistic used by insurance companies related to loss payments. If we scroll down, we might expect to see some empty cells, but every cell is actually filled in. The missing values are the ones that say “NaN”, which stands for “Not a Number”. Since this column has so many missing values, let’s remove the entire column.
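Outside the drag-and-drop interface, the same check for missing values per column can be sketched in a few lines of plain Python. This is only an illustration: the helper names `is_missing` and `count_missing` are made up for this example, and the three sample rows are invented rather than taken from the real Automobile price data.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(rows):
    """Return {column_name: number_of_missing_values} for a list of row dicts."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if is_missing(value) else 0)
    return counts


# Invented sample rows (assumption: not the real dataset).
sample_rows = [
    {"make": "audi", "normalized-losses": 164.0, "price": 13950.0},
    {"make": "bmw", "normalized-losses": float("nan"), "price": 16430.0},
    {"make": "alfa-romero", "normalized-losses": float("nan"), "price": float("nan")},
]

missing_counts = count_missing(sample_rows)
print(missing_counts)
```

Note that NaN has to be detected with `math.isnan`, because a NaN value never compares equal to anything, including itself.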

The module we need is in the Data Transformation section. It’s called Select Columns in Dataset. Let’s make some room for it, and let’s delete the arrow that currently connects the dataset to the Split Data module.

Connect the dataset module to its input. Now click the module to open up the properties blade again. This is where we specify which columns need to be removed. Click Edit column. Before we tell it which columns to exclude, we need to tell it which ones to include. From the dropdown, select All columns. Then click the Plus button.

Now we can tell it to exclude the normalized-losses column. From the dropdown, select Exclude. Then in this dropdown, select Column names. Now click in this field and select normalized-losses. Then click Save. Now let’s add a comment saying what this module does.
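The "include all columns, then exclude one" rule we just configured can be sketched in plain Python as follows. This is a rough equivalent under stated assumptions, not what the module runs internally: `exclude_columns` is a hypothetical helper, and the sample row is invented.

```python
def exclude_columns(rows, excluded):
    """Start from all columns, then drop the ones named in `excluded`."""
    excluded = set(excluded)
    return [
        {column: value for column, value in row.items() if column not in excluded}
        for row in rows
    ]


# Invented sample row (assumption: not the real dataset).
sample_rows = [
    {"make": "audi", "normalized-losses": 164.0, "price": 13950.0},
]

selected_rows = exclude_columns(sample_rows, ["normalized-losses"])
print(selected_rows)
```

The original rows are left untouched, so it is easy to go back to the full dataset if removing the column turns out to hurt the model.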

Now we just need to connect this module to the Split Data module and click Submit to run the pipeline again. Change the run description to “Exclude normalized-losses”.

All right, it’s done. Let’s see what happened with the accuracy. It went down to 92.5%. We got 93.5% before, so this actually made it worse. It seems that despite the fact that there are so many values missing from the normalized-losses column, the values that it does have must be useful for making predictions. This is why these are called experiments. You have to try lots of different things and see what happens.

Let’s delete that module and try something else. Click on the dataset again and then visualize it. This time, let’s look at the price column. It’s missing 4 values. Considering that there are 205 rows in this dataset, that’s not very many missing values, but the fact that this is the label column might make a difference. The label column contains the answers that the model is trying to predict, so if it has missing values, that might mess up the model.

We definitely can’t remove the entire column like we did with the normalized-losses column, but we can remove the rows that don’t have a price.

The module we need to do that is called Clean Missing Data. First, let’s delete this one. Close the blade, and connect the dataset module to this one. Then click on the module. Click Edit column. Leave it set to Column names. Then scroll down and select price. And save it.

If we wanted to specify that the column would only be cleaned if it were missing a certain percentage of its values, then we could specify that here, but we know we want to clean this column, so let’s just leave that.
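To get a feel for what such a percentage condition would look like, here is a small sketch that uses the counts we saw earlier: 205 rows in total, 4 missing prices, and 41 missing normalized-losses values. The function names and the `min_ratio` and `max_ratio` parameters are assumptions made for this illustration, not the module's actual settings.

```python
def missing_ratio(missing_count, total_rows):
    """Fraction of rows in which the column has no value."""
    return missing_count / total_rows


def within_ratio_bounds(ratio, min_ratio=0.0, max_ratio=1.0):
    """True if the column's missing ratio falls inside the allowed range."""
    return min_ratio <= ratio <= max_ratio


TOTAL_ROWS = 205
price_ratio = missing_ratio(4, TOTAL_ROWS)
losses_ratio = missing_ratio(41, TOTAL_ROWS)

print(f"price: {price_ratio:.1%} missing")
print(f"normalized-losses: {losses_ratio:.1%} missing")

# With the default bounds, every column qualifies for cleaning.
print(within_ratio_bounds(price_ratio))
# With a hypothetical 10% minimum, the price column would be skipped.
print(within_ratio_bounds(price_ratio, min_ratio=0.10))
```

With the bounds left wide open, which is what we are doing here, the price column is always cleaned.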

The cleaning mode is where we specify what we want to do with missing values. For example, if we select Replace with mean, it’ll calculate the average of the other values in the column and use that number for all of the missing values. The option we want is Remove entire row. Let’s add a comment saying, Remove rows with missing price.
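The difference between those two cleaning modes can be sketched in plain Python like this. The helper names are hypothetical and the prices are invented, so treat it as an illustration of the idea rather than the module's implementation.

```python
import math
from statistics import mean


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def replace_with_mean(rows, column):
    """Fill missing values with the average of the values that are present."""
    column_mean = mean(row[column] for row in rows if not is_missing(row[column]))
    return [
        {**row, column: column_mean if is_missing(row[column]) else row[column]}
        for row in rows
    ]


def remove_rows_with_missing(rows, column):
    """Drop every row that has no value in the given column."""
    return [row for row in rows if not is_missing(row[column])]


# Invented sample rows (assumption: not the real dataset).
sample_rows = [
    {"make": "audi", "price": 10000.0},
    {"make": "bmw", "price": float("nan")},
    {"make": "volvo", "price": 20000.0},
]

filled_rows = replace_with_mean(sample_rows, "price")
kept_rows = remove_rows_with_missing(sample_rows, "price")
print(filled_rows)
print(kept_rows)
```

Filling with the mean keeps all three rows but invents a price for the second one, which is risky for a label column. Removing the row loses a little data but never gives the model a made-up answer to learn from.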

You probably noticed that there are two output circles at the bottom of the Clean Missing Data module. So which one do we need to use? The one on the left is the cleaned dataset, which is what we want. The one on the right is the cleaning transformation. You’d use that if you wanted to apply the same cleaning rules to another dataset. We don’t need to do that, so just connect the left-hand output to the Split Data module. Then click Submit, change the run description to “Remove rows with missing price”, and click Submit again.
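The idea of getting back both a cleaned dataset and a reusable cleaning transformation can be sketched as a function that returns two things. This example deliberately uses the Replace with mean mode instead of the Remove entire row mode we chose, because a mean has to be learned from the first dataset, which makes the value of a reusable transformation easier to see. The name `fit_mean_fill` and the sample values are assumptions for illustration only.

```python
import math
from statistics import mean


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_mean_fill(rows, column):
    """Learn the column mean from `rows`; return (cleaned_rows, transform)."""
    learned_mean = mean(row[column] for row in rows if not is_missing(row[column]))

    def transform(other_rows):
        # Reuses the mean learned above instead of recomputing it.
        return [
            {**row, column: learned_mean if is_missing(row[column]) else row[column]}
            for row in other_rows
        ]

    return transform(rows), transform


# Invented sample rows (assumption: not the real dataset).
training_rows = [{"price": 10000.0}, {"price": float("nan")}, {"price": 20000.0}]
new_rows = [{"price": float("nan")}, {"price": 30000.0}]

# "Left output": the cleaned data. "Right output": the reusable rule.
cleaned_training, cleaning_transformation = fit_mean_fill(training_rows, "price")
cleaned_new = cleaning_transformation(new_rows)
print(cleaned_training)
print(cleaned_new)
```

The missing price in the second dataset is filled with the mean learned from the first one, so both datasets are cleaned by exactly the same rule.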

Okay, it’s done. Let’s see how it did. It went up to 95.1%! That’s a pretty big jump from 93.5%. Let’s make sure this pipeline is saved. Click the save icon.

And that’s it for cleaning data.

About the Author
Students
216268
Courses
98
Learning Paths
164

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).