Using the Designer
Machine learning is a notoriously complex subject that usually requires a great deal of advanced math and software development skills. That’s why it’s so amazing that Azure Machine Learning lets you train and deploy machine learning models without any coding, using a drag-and-drop interface. With this web-based software, you can create applications for predicting everything from customer churn rates to image classifications to compelling product recommendations.
In this course, you will learn the basic concepts of machine learning and then follow hands-on examples of choosing an algorithm, running data through a model, and deploying a trained model as a predictive web service.
Learning Objectives
- Create an Azure Machine Learning workspace
- Train a machine learning model using the drag-and-drop interface
- Deploy a trained model to make predictions based on new data
Intended Audience
- Anyone who is interested in machine learning
Prerequisites
- General technical knowledge
- A Microsoft Azure account is recommended (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
The GitHub repository for this course is at https://github.com/cloudacademy/azureml-intro.
Machine learning is a hot topic these days. It’s constantly in the news. It’s transforming everything from cybersecurity to customer service to music, and its benefits are being explored in nearly every industry imaginable. All of this excitement makes machine learning sound almost magical. But if you look under the hood, it’s actually a rather simple concept.
At a very high level, here’s how it works. You feed lots of real-world data into a program and the program tries to make generalizations about the data. It then uses these generalizations to make predictions when it’s given new data.
For example, after looking at lots of emails that have been labeled as being either spam or not spam, it can then analyze a new email and predict whether it should go in the junk email folder or not.
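To make the "generalize, then predict" idea concrete, here's a minimal sketch in plain Python. The emails are invented for illustration, and the word-counting approach is just a toy stand-in for a real algorithm, so treat `train` and `predict` as hypothetical helpers rather than anything Azure Machine Learning provides.

```python
from collections import Counter

# Hypothetical labeled emails, invented for illustration.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on monday?", "not spam"),
]

def train(examples):
    """Generalize: count how often each word appears under each label."""
    counts = {"spam": Counter(), "not spam": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Predict the label whose training emails shared more words with this one."""
    scores = {label: sum(c[w] for w in text.split()) for label, c in counts.items()}
    return max(scores, key=scores.get)

model = train(training)
first_guess = predict(model, "claim your free prize")
second_guess = predict(model, "agenda for lunch")
print(first_guess, "/", second_guess)
```

The program never sees a rule like "free means spam". It only sees labeled examples, and the word counts are the generalization it draws from them.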
Quite often, machine learning is used for a task that doesn’t really sound like prediction, but it still is, in a way. For example, it could be used to look at a picture and say whether or not the picture contains a cat. That sounds more like identifying or classifying an object, but from the machine’s point of view, it’s a prediction because it doesn’t know for certain whether or not the picture contains a cat.
In this course, we’re going to use machine learning to predict the price of a given automobile. We’ll do that by feeding data on lots of different automobiles, including their prices, into a machine learning algorithm and getting it to make generalizations about how various features of automobiles affect their price. This is known as training a model.
The first step is to create an Azure Machine Learning workspace. This is just a place where we can put everything related to this project. In the Azure portal, search for machine learning. There it is. Then click Add. Let’s call our workspace mlcourse. For the resource group, create a new one, and call it mlcourserg.
Now click the Review + Create button and then the Create button. It’ll take a little while, so I’ll fast-forward. There, it’s done. Click the Go to resource button.
To train a model, we need to go to Experiments. We need to click here to launch the studio. The note above it suggests that this is going to change in the near future, so it may look different for you.
Now we’re in the studio. The drag-and-drop interface is called Designer, so click on that. Now click Easy-to-use prebuilt modules. Click here to rename it. Let’s call it Automobile price.
There’s a message over here saying we need to select a compute target. We have to specify that so it knows where to run the machine learning experiment once we’ve created it. We could do this later, but since it can take a while to spin up a compute target, let’s do it now.
Click Select compute target. We don’t have any existing ones, so select Create new. It suggests using this predefined configuration, which has 2 virtual CPUs, 7 gigs of memory, 8 gigs of storage, and 2 nodes. If you need to use a different configuration, then you can click here, but we’ll just use this predefined one. Let’s call it compute1. It will take a while to create, so click Save and let it spin up in the background.
Okay, here’s how the designer works. On the left, you have what are called modules. To use them, you drag them over to the canvas. For example, the first thing we need to do is specify what dataset we want to feed into our model. For most projects, you’d need to import your data from somewhere else. But, luckily for us, there’s already some sample data available. It’s under Datasets. Here it is. Some automobile data has already been imported into this custom module.
Now drag it over to the canvas and drop it. Here’s the description of what’s in the dataset. It says it contains “Prices of various automobiles against make, model, and technical specifications.” That’s definitely what we need.
Let’s take a look at the data. Go to the Outputs tab and click the Visualize icon here. It contains lots of columns of data. If you click on a column, it’ll give you some statistics about the data in that column. For example, if we click on make, it tells us that there are 22 unique values in this column. That means there are 22 different automobile brands, such as Toyota and Honda.
Let’s try horsepower. This one has a lot more statistics because the values in the column are numeric. The mean, or average, of the values is about 104. The minimum value is 48, and the maximum is 288.
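If you're curious what kind of calculation sits behind those column statistics, here's a rough sketch in plain Python. The rows are made-up stand-ins, not the real sample dataset, and `column_stats` is a hypothetical helper, not part of the designer.

```python
from statistics import mean

# Hypothetical rows standing in for the automobile dataset (not the real data).
rows = [
    {"make": "toyota", "horsepower": 62},
    {"make": "honda", "horsepower": 76},
    {"make": "toyota", "horsepower": 116},
    {"make": "bmw", "horsepower": 182},
]

def column_stats(rows, column):
    """Summarize one column: unique count always, numeric stats when possible."""
    values = [row[column] for row in rows]
    stats = {"unique": len(set(values))}
    if all(isinstance(v, (int, float)) for v in values):
        stats.update(mean=mean(values), min=min(values), max=max(values))
    return stats

make_stats = column_stats(rows, "make")
hp_stats = column_stats(rows, "horsepower")
print(make_stats)
print(hp_stats)
```

As in the designer, the text column only gets a unique-value count, while the numeric column also gets a mean, minimum, and maximum.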
Now that we’ve specified our data source, we need to say which machine learning algorithm to use on it. As you can see, there are quite a few to choose from. At the moment, the designer supports three types of algorithms: regression, classification, and clustering. I’ll give you a brief overview of what they mean.
Regression is used when you want to predict a number. For example, suppose you need to come up with an estimated selling price for a home based on the sale prices of other homes in the past. Let’s say this graph plots the sale price of various homes against the square footage of each house. You can see there’s a pattern here. Generally speaking, the bigger the square footage, the higher the sale price. But the relationship isn’t exactly the same in all cases, so you need to come up with a formula that comes closest to matching all of these relationships. The technique to do this is called regression.
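Here's a small sketch of that idea using ordinary least squares, which is one common way to find the line that comes closest to a set of points. The square footages and prices are invented for illustration, and `fit_line` is a hypothetical helper.

```python
# Hypothetical (square footage, sale price) pairs, invented for illustration.
homes = [(1000, 200_000), (1500, 290_000), (2000, 410_000), (2500, 500_000)]

def fit_line(points):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(homes)

# Use the fitted formula to estimate the price of a home we haven't seen.
estimate = slope * 1800 + intercept
print(f"Estimated price for 1,800 sq ft: {estimate:,.0f}")
```

None of the four homes sits exactly on the fitted line, but the line is the best overall compromise, and that's what lets it produce an estimate for a new square footage.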
If you only wanted to base your sale price estimates on square footage, then you wouldn’t need to use machine learning. But suppose you wanted to take dozens of factors into account, such as the home’s age, neighborhood, distance to the nearest school, etc. Then you’d have a multi-dimensional graph that would be hard to visualize. This is where machine learning would shine.
A classification algorithm is used when you need to classify data into two or more categories. For example, suppose you need to classify pictures of pets into cats, dogs, and birds. Some other examples are speech recognition (that is, listening to a spoken word and deciding which word it is) and classifying Twitter posts as happy, sad, angry, etc. As you can see, classification algorithms can be used for a wide variety of interesting problems.
Clustering algorithms are very different from the others. That’s because clustering is typically used for something called unsupervised learning. The other two types are examples of supervised learning, which is where you have labeled data. That is, each piece of training data has a label with the “answer”. In the house price example, the label would give the actual selling price of this particular home.
In unsupervised learning, the data is unlabeled, so the algorithm has to take a different approach to understanding the data. It looks for patterns. The clustering algorithm tries to group the data into different clusters. For example, it could look at purchase histories and then group your customers into different segments so you could market to each segment differently.
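To show what "grouping without labels" can look like, here's a tiny sketch of k-means, one well-known clustering technique, working on single numbers. The spending figures and starting centers are invented, and `k_means_1d` is a hypothetical helper, not the designer's clustering module.

```python
def k_means_1d(values, centers, rounds=10):
    """Tiny k-means on single numbers: assign to nearest center, then re-average."""
    centers = list(centers)
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical yearly spend per customer; note there are no labels anywhere.
spend = [20, 25, 30, 200, 220, 210, 900, 950]
centers, clusters = k_means_1d(spend, centers=[0, 300, 1000])
print(clusters)
```

Nobody told the algorithm which customers are low, medium, or high spenders. The three segments fall out of the pattern in the numbers themselves.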
OK, now back to our automobile example. This is a regression problem since we need to predict an automobile’s price, which is a number. So now we need to decide which regression algorithm to use. The simplest one is Linear Regression, but Decision Forest usually works better, so we’ll choose that one.
Now we’re ready to train a model using this algorithm and this data. Go into the Model Training section, and drag Train Model over to the canvas.
It has two circles at the top, which means that it needs to receive inputs from two other modules. If you hover your mouse over each input, it’ll tell you what it needs. On the left one, it says “Untrained model” because it needs to know which model you want to train. If you hover over the output on the algorithm module, it also says “Untrained model”, so you can feed it into the training module by clicking and dragging from one to the other. Don’t worry about the exclamation point. We’ll deal with that in a minute.
Now have a look at the other training input. It needs a dataset, which is, of course, the output from the Dataset module, so hook that up. Now we have the beginning of what’s called a pipeline. We’re creating a pipeline that data flows through to train and evaluate our machine learning model.
Now we need to deal with that exclamation point. It says, “A value is required.” It wants to know what column it should use as a label field. That is, it needs to know which column gives the correct “answer” for each instance of data.
Click Edit column. Then click in the field and select price since that’s what we want the model to predict.
All right, now the training module has what it needs. Next, we need to add modules to test and evaluate the trained model. You can find these in the Model Scoring & Evaluation section. The first one we need is the Score Model module. It has two inputs: a trained model and a dataset. So connect the trained model on the left and the dataset on the right.
It might seem like scoring the model would be enough, but we also need to evaluate the model. I’ll show you the difference between scoring and evaluating after we run the pipeline.
This module also takes two inputs. One is “Scored dataset” and so is the other one. That’s kind of weird, isn’t it? Why would it need two scored datasets? Well actually, it doesn’t. The second input is optional. You’d use it if you want to compare two different models to see which one performed better. Since we only have one model, just connect the Score Model to the left input.
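If it helps to see the wiring as code, here's a rough sketch of the same train, score, and evaluate flow in plain Python. Everything in it is an assumption made for illustration: the function names only mimic the designer modules, the data is made up, a one-feature linear fit stands in for Decision Forest, and mean absolute error is just one possible summary number.

```python
# Hypothetical stand-in data: horsepower is the only feature, price is the label.
dataset = [
    {"horsepower": 60, "price": 6000},
    {"horsepower": 100, "price": 10500},
    {"horsepower": 150, "price": 16000},
    {"horsepower": 200, "price": 21500},
]

def linear_algorithm(xs, ys):
    """The 'untrained model': a recipe that turns data into a predict function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)

def train_model(algorithm, rows, label_column, feature="horsepower"):
    """Takes an untrained model and a dataset; returns a trained model."""
    return algorithm([r[feature] for r in rows], [r[label_column] for r in rows])

def score_model(trained, rows, feature="horsepower"):
    """Adds a 'Scored Labels' column holding the prediction for each row."""
    return [{**r, "Scored Labels": trained(r[feature])} for r in rows]

def evaluate_model(scored, label_column):
    """Boils the scored dataset down to a single number (mean absolute error)."""
    errors = [abs(r[label_column] - r["Scored Labels"]) for r in scored]
    return sum(errors) / len(errors)

# Same wiring as the canvas: the dataset feeds both Train Model and Score Model.
trained = train_model(linear_algorithm, dataset, label_column="price")
scored = score_model(trained, dataset)
mae = evaluate_model(scored, label_column="price")
print(f"Mean absolute error: {mae:.1f}")
```

Each function's output is the next one's input, which is exactly what the connecting lines on the canvas express.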
Okay, now we’re ready to run the pipeline, so click the Submit button. Our compute target should be ready now, so we should be able to select it. Click the Submit button again, and we need to give our experiment a name. Call it auto-price. It also wants a run description. You’ll typically do a training run, then make a modification to the pipeline, then do another training run, and so on, so you’ll want to keep track of what you changed every time. It already set the run description to Automobile price, which is fine for our first run. Click the Submit button. It’ll take a while, so I’ll fast-forward.
All right, it’s done. To see what the scoring module did, click on it, then click the Visualize button. If you scroll all the way to the right, you’ll see a new column called Scored Labels. It shows the model’s prediction for the price of each of the automobiles. The actual price is in the column to the left, so you can compare the two columns to see how close it came to predicting the correct price.
It looks like its predictions were reasonably close, but it would be nice to have a way to summarize the accuracy of this model with one number, wouldn’t it? That’s where the Evaluate module comes in. Close this, click on Evaluate Model, then click Visualize.
It shows five different numbers that tell us how accurate the model’s predictions were for that dataset. The easiest one to understand is the “Coefficient of Determination”. It’s also known as R squared. This is normally a number between 0 and 1, although it can drop below 0 if a model fits worse than simply predicting the average every time. A 1 means that the model’s predictions perfectly fit the data, so the 0.98 that we got is almost perfect.
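For reference, here's how the coefficient of determination is usually calculated, sketched in plain Python. The actual and predicted prices are invented for illustration (they are not the values from this run), and `r_squared` is a hypothetical helper.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical actual prices and model predictions, invented for illustration.
actual = [10000, 15000, 20000, 25000]
predicted = [10500, 14500, 20500, 24500]
score = r_squared(actual, predicted)
print(round(score, 3))
```

The closer the predictions are to the actual prices, the smaller the residual sum of squares gets, and the closer the result moves toward 1.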
That’s great, but there’s a potential flaw with the way we set this up. I’ll explain why in the next lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).