Using Azure ML Studio
Machine learning is a notoriously complex subject, which usually requires a great deal of advanced math and software development skills. That’s why it’s so amazing that Azure Machine Learning Studio lets you train and deploy machine learning models without any coding, using a drag-and-drop interface. With this web-based software, you can create applications for predicting everything from customer churn rates, to image classifications, to compelling product recommendations.
In this course, you will learn the basic concepts of machine learning, and then follow hands-on examples of choosing an algorithm, running data through a model, and deploying a trained model as a predictive web service.
Learning Objectives
- Prepare data for use by an Azure Machine Learning Studio experiment
- Train a machine learning model in Azure Machine Learning Studio
- Deploy a trained model to make predictions
Intended Audience
- Anyone who is interested in machine learning
Prerequisites
- No mandatory prerequisites
- Azure account recommended (sign up for free trial at https://azure.microsoft.com/free if you don’t have an account)
This Course Includes
- 54 minutes of high-definition video
- Many hands-on demos
When we built an iris classification model, we didn’t need to do anything with the source dataset. It was in the right format and contained everything we needed. This is rarely the case in the real world. In fact, data scientists typically spend a large percentage of their time preparing data for their machine learning experiments.
In this lesson, we’re going to use a dataset that will require a little bit of massaging. It contains information about different automobiles. Our goal is to create a model that will predict the prices of specific automobiles.
Let’s start with a blank experiment again. First, give it a name, such as “Auto Price Experiment”.
In the real world, you’ll usually need to import your data into ML Studio. If you open the “Data Input and Output” menu, you’ll see an “Import Data” module.
Here are the different types of sources you can import data from. Half of them are Azure services, such as DocumentDB, and then there are more generic ones like “Web URL via HTTP”. This will let you import any dataset that’s accessible over the web. When you use the Azure Blob Storage option, here are the data formats that are supported. The most commonly used one is CSV.
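If you’re curious what a CSV import boils down to outside the drag-and-drop interface, here’s a minimal Python sketch that uses only the standard library. The sample text, its values, and the `parse_csv` helper are made up for illustration. This is not how the Import Data module is implemented.

```python
import csv
import io

# Hypothetical CSV text standing in for a file fetched from a URL or blob.
SAMPLE_CSV = """make,num-of-doors,price
audi,four,13950
bmw,two,16430
"""

def parse_csv(text):
    """Parse CSV text into a list of dicts, one per row, keyed by header."""
    return list(csv.DictReader(io.StringIO(text)))

rows = parse_csv(SAMPLE_CSV)
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

Note that every value comes back as a string here, so numeric columns such as price would still need converting before training.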
For this experiment, we’re going to use a sample dataset, though, so I’ll get rid of this module. If you’re following along, then open the list of sample datasets. The one we want is near the top. It’s called “Automobile price data”. Drag it onto the canvas. Now right-click to have a look at the data.
It has 205 rows, which isn’t much bigger than the iris dataset, but it has 26 columns, which is way more than the 5 columns in the iris dataset. It has automobile features you’d expect, like make and number of doors, but it also has some obscure ones, like aspiration. The last column is “price”, which is the label or target column. It’s what the machine learning model should try to predict based on the other features.
Before we do any pre-processing on this dataset, let’s run it through a model as-is and see what kind of accuracy we get.
First we’ll split it into training and test datasets. Add the “Split Data” module and set the fraction of rows in the first output to .8 again. Now connect the dataset to it. Then add “Train Model” and connect the left-hand output of Split Data, which is the training dataset, to its right-hand input. Now launch the column selector and type “price” because that’s the label column that the model should try to predict.
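To make the 80/20 split concrete, here’s a rough Python sketch of the same idea, assuming a randomized split. The `split_rows` helper and its `seed` parameter are hypothetical names used only for this example, and the list of integers is a stand-in for the real rows.

```python
import random

def split_rows(rows, fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(205))  # stand-in for the 205 rows of the dataset
train, test = split_rows(rows)
print(len(train), len(test))  # 164 41
```

With 205 rows, that leaves only 41 rows for testing, which is worth keeping in mind when comparing scores between runs.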
Now we need to choose an algorithm. We’re trying to predict price, which is a number, so we should choose a regression algorithm. Let’s go with the regression version of Decision Forest and hook it up to the left-hand input of Train Model.
Now add Score Model. Hook it up to Train Model and to the right-hand output from the Split Data module, which is the test dataset.
And finally, add “Evaluate Model” and connect it up. OK, let’s click run and see what we get.
Alright, when it’s finished, right-click on “Evaluate Model”. This looks different from what we saw with the iris model because we’re doing regression this time, instead of classification. The most important number to look at is the “Coefficient of Determination”. It’s also known as R squared. This is normally a number between 0 and 1, although it can go below 0 if the model fits worse than simply predicting the average price. A 1 means that the model perfectly fits the data, so the .88 that we got is quite good.
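If you want to see where that number comes from, here’s a small Python sketch of the standard R squared formula, which is 1 minus the ratio of the squared prediction errors to the squared deviations from the mean. The `r_squared` function and the sample prices are made up for illustration. This isn’t ML Studio’s own code.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes the actual values are not all identical (SS_tot > 0).
    """
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [10, 20, 30]                          # hypothetical true prices
perfect = r_squared(actual, [10, 20, 30])      # every prediction is exact
mean_only = r_squared(actual, [20, 20, 20])    # always predicts the average
close = r_squared(actual, [12, 18, 33])        # small errors on each row
print(perfect, mean_only, round(close, 3))
```

A perfect fit scores 1, a model that always predicts the average scores 0, and the near-miss predictions land in between at about .915.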
Let’s see if we can do better, but first, do you want to see what a more complex decision forest looks like? Visualize the trained model. That’s definitely a more interesting decision tree than before, isn’t it? This isn’t even the whole tree. It’s only showing the first eight layers or so.
Alright, now let’s see if we can improve the accuracy. Have a look at the source dataset again. I didn’t mention this the last time, but there’s a little bar graph at the top of each column. It shows the distribution of values in that column. To get a close-up view, click on the column. Here it says there are 22 unique values in this column. The graph below shows the 10 most common values. It also says that there are zero missing values. This is important because ideally we should have no missing values anywhere in the dataset.
As you can see, though, the normalized-losses column has lots of missing values. How many? Wow, it has 41 missing values. We should probably do something about that, but let’s see how the rest of the columns look first. The other columns seem to be in better shape. But there’s a missing value in price. That’s not good because the data for that row is useless if it doesn’t have a price. Let’s see how many missing values it has. Four. Well, that’s not too bad, but we should still remove those rows.
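Counting missing values column by column is easy to sketch in Python. In this example, a missing value is represented as `None`, and both the `count_missing` helper and the three sample rows are hypothetical.

```python
def count_missing(rows):
    """Return {column: number of rows where that column is None}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

# Made-up rows, loosely modeled on the columns discussed above.
rows = [
    {"make": "audi", "normalized-losses": 164, "price": 13950},
    {"make": "bmw", "normalized-losses": None, "price": 16430},
    {"make": "audi", "normalized-losses": None, "price": None},
]
missing = count_missing(rows)
print(missing)
```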
Instead of looking through the menus for a module that will do that, just type “missing” in the search box and see what comes up. “Clean Missing Data” looks like what we want, so drag it over. I’ll just make some room for it and delete the link. Now connect it to the dataset. If you don’t do that first, then you won’t be able to configure the module properly, because it won’t know what data it’s supposed to clean.
Now click on the module and launch the column selector. We need to tell it which columns to clean. Under “Begin With”, click “No Columns”. This tells it that we’re going to choose which columns we want to clean rather than starting with all of the columns and telling it which ones we don’t want to clean. Select “column names” if it isn’t selected already. Now start typing “price” and it will come up in the list. Select it. And click the checkmark.
So we’ve told it that we want to clean the price column. Now we need to tell it what to do when it finds a missing value in that column. Change the cleaning mode to “Remove entire row”.
OK. This module has two outputs, so which one do we need to use? The one on the left is the cleaned dataset, which is what we want. The one on the right is the cleaning transformation. You’d use that if you wanted to apply the same cleaning rules to another dataset. We don’t need to do that, so just connect the left-hand output to the Split Data module.
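Here’s roughly what the “Remove entire row” cleaning mode does when it’s applied to the price column, sketched in plain Python. The `drop_rows_missing` helper and the sample rows are made up for this example, with `None` standing in for a missing value.

```python
def drop_rows_missing(rows, columns):
    """Keep only rows that have a value in every one of the given columns."""
    return [row for row in rows
            if all(row.get(c) is not None for c in columns)]

# Made-up rows: the last one has no price.
rows = [
    {"make": "audi", "normalized-losses": 164, "price": 13950},
    {"make": "bmw", "normalized-losses": None, "price": 16430},
    {"make": "audi", "normalized-losses": None, "price": None},
]
cleaned = drop_rows_missing(rows, ["price"])
print(len(rows), "->", len(cleaned))
```

Notice that the second row survives even though it’s missing normalized-losses, because we only asked for the price column to be cleaned.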
OK, let’s see how much of a difference it makes to remove those four rows that don’t have a price. Before we do that, though, it’s a good idea to make a quick note in the Summary to say what we changed for this run of the experiment. As you’ll see later, this will be very helpful when we need to look through various runs. Click on the background to see the Summary field. Type something like, “Removed rows with missing price”. Alright, now run it.
When it’s done, right-click on Evaluate. Wow, the R squared value jumped to .96. That’s a huge improvement from the .88 we got on the previous run. It’s obvious that it really messes up the model when you have missing values in the label column.
Can we improve the accuracy even more? Well, one of the most important ways to improve accuracy is to select the right features to put in the model. Remember the normalized-losses column? It had 41 missing values, so it’s probably not a very good feature to use. And what does normalized losses mean, anyway? Well, it’s a statistic used by insurance companies related to loss payments. Let’s remove it.
Search for “column” to see what module we can use. “Select Columns in Dataset” is what we need. Make room for it again and connect it to the module above. Then click on it and launch the column selector. There are a couple of ways to do this, but let’s go to “With Rules” and then begin with all columns (because we want to include all of the columns except normalized-losses), and then change the dropdown to “Exclude” and click in the blank. Select “normalized-losses”. That’s it, so click the checkmark.
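The “begin with all columns, then exclude” rule can be sketched in a few lines of Python. The `exclude_columns` helper and the sample rows are hypothetical and only meant to show the idea.

```python
def exclude_columns(rows, excluded):
    """Return copies of the rows without the excluded column names."""
    excluded = set(excluded)
    return [{name: value for name, value in row.items() if name not in excluded}
            for row in rows]

# Made-up rows for illustration.
rows = [
    {"make": "audi", "normalized-losses": 164, "price": 13950},
    {"make": "bmw", "normalized-losses": None, "price": 16430},
]
selected = exclude_columns(rows, ["normalized-losses"])
print(list(selected[0].keys()))
```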
Now connect the module to the one below. Then add what we changed to the summary. I’ll say, “Removed normalized-losses column”. Now run it again.
OK, when it’s done, right-click on Evaluate. Excellent, the R squared value went up to .97.
Is there anything else we could do to improve the accuracy? Maybe we should try to do something with other missing values. Now that the normalized-losses column is gone, the next two columns with the most missing values are bore and stroke. They’re each missing four values. If you scroll down, you can see that they’re both missing in the same rows. We could either try removing those rows or removing these two columns entirely. It’s actually possible to test both at the same time.
We’re going to need a bit more room on the canvas, so let’s shrink the scale a little bit. First, select all of the modules except the top and bottom ones by clicking and dragging. Then right-click and copy. And right-click again and paste. Drag the copies over to the side. I’ll just make this fit on the page better.
Now connect the source dataset to the first module in the copy. And connect the second Score module to the Evaluate module.
Now, in the left path, we’re going to remove the bore and stroke columns, and in the right path, we’re going to remove the rows that have missing values for bore and stroke. In the left path, click on the “Select Columns in Dataset” module and launch the column selector. Then add “bore” and “stroke” to the excluded columns.
Now in the right path, click on the “Clean Missing Data” module and launch the column selector. Then type bore and stroke again. Now add these changes to the summary. I’ll type, “Left: removed bore & stroke columns. Right: Removed rows missing bore & stroke.” You can see that there’s a character limit for the summary, so either we could put this in the description or we could just leave out some spaces here, which is what I’m going to do. Now click Run and it’ll evaluate both paths.
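To see how the two paths differ in what they feed the model, here’s a toy Python comparison. The three rows and their values are made up, and `None` marks a missing value.

```python
# Made-up rows: the middle one is missing both bore and stroke.
rows = [
    {"bore": 3.19, "stroke": 3.40, "price": 13950},
    {"bore": None, "stroke": None, "price": 10945},
    {"bore": 3.31, "stroke": 3.19, "price": 16430},
]
candidates = ["bore", "stroke"]

# Left path: keep every row, drop the two columns.
left = [{k: v for k, v in row.items() if k not in candidates} for row in rows]

# Right path: keep every column, drop rows missing either value.
right = [row for row in rows if all(row[c] is not None for c in candidates)]

print("left: ", len(left), "rows,", len(left[0]), "columns")
print("right:", len(right), "rows,", len(right[0]), "columns")
```

The left path gives up features but keeps all of the training examples, while the right path keeps the features but has fewer examples to learn from.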
OK, let’s see how they did. Well, neither one of those ideas worked. Removing the bore and stroke columns dropped the R squared back to .96 and removing the rows that were missing bore and stroke values dropped it down to .83! That’s a pretty weird result, but when you’re using such a small amount of data like this, removing a few rows can sometimes make a big difference in unexpected ways. The bottom line is that our previous run was the best, so let’s go back to it.
Click “Run History”. The first one is the editable experiment we were just in. That is, it’s not a run. It’s what we’d need to click on to go back to editing our experiment. All of the other ones are runs and they say “Locked” because you can’t edit them. The first of those is the run we just did. The one after that is the previous run, which is the one we want, so click on it.
You can tell that this was the previous run because there’s only one path. You can also see in the summary what we did in this experiment. You can verify that this was the best run by checking the evaluation. Yes, it’s .97, which was our best result. Save this version by clicking “Save As”. Change the name to just “Auto Price”.
And that’s it for this lesson.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).