Using Azure ML Studio
Machine learning is a notoriously complex subject, which usually requires a great deal of advanced math and software development skills. That’s why it’s so amazing that Azure Machine Learning Studio lets you train and deploy machine learning models without any coding, using a drag-and-drop interface. With this web-based software, you can create applications for predicting everything from customer churn rates, to image classifications, to compelling product recommendations.
In this course, you will learn the basic concepts of machine learning, and then follow hands-on examples of choosing an algorithm, running data through a model, and deploying a trained model as a predictive web service.
Learning Objectives
- Prepare data for use by an Azure Machine Learning Studio experiment
- Train a machine learning model in Azure Machine Learning Studio
- Deploy a trained model to make predictions
Intended Audience
- Anyone who is interested in machine learning
Prerequisites
- No mandatory prerequisites
- Azure account recommended (sign up for free trial at https://azure.microsoft.com/free if you don’t have an account)
This Course Includes
- 54 minutes of high-definition video
- Many hands-on demos
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).
We’re going to start with a classic set of data called the iris dataset. In 1936, a man named Ronald Fisher developed a statistical model to distinguish three different species of iris flowers from each other based on the lengths and widths of their petals and sepals. The three species are Iris setosa, Iris versicolor, and Iris virginica.
We’re going to build a machine learning model that can distinguish between two of these species. Why only two of the species instead of all three? Because Azure ML Studio includes a sample iris dataset that only has examples of two of the species in it: Iris setosa and Iris virginica. We could import the original dataset from another website, but it’ll be easier to just use the sample dataset while we’re getting started.
The idea is that we will feed a bunch of examples of irises to our model and tell it which species each one is. Then it will try to figure out rules for classifying them so that if we give it some new examples of irises without telling it which species they are, it should be able to classify them correctly.
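To make “figure out rules” a little more concrete, here’s a minimal Python sketch of the idea. It isn’t what ML Studio runs internally, and the petal lengths below are made-up values. The train_stump function is a hypothetical helper that tries each candidate cutoff on a single measurement and keeps the one that misclassifies the fewest labeled examples.

```python
def train_stump(values, labels):
    """Pick the cutoff on one feature that misclassifies the fewest training rows.

    Rule being learned: predict class 0 if value <= cutoff, otherwise class 1.
    """
    best_cutoff, best_errors = None, len(values) + 1
    for cutoff in sorted(set(values)):
        errors = sum(
            1
            for value, label in zip(values, labels)
            if (0 if value <= cutoff else 1) != label
        )
        if errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff, best_errors


# Made-up labeled examples: petal length and the species code (0 or 1).
petal_lengths = [1.4, 1.3, 1.9, 1.5, 6.0, 5.1, 4.8, 5.9]
species = [0, 0, 0, 0, 1, 1, 1, 1]

cutoff, errors = train_stump(petal_lengths, species)
print("learned cutoff:", cutoff, "training errors:", errors)
```

Once the cutoff has been learned, classifying a new flower only needs its petal length, not its label.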
To follow along and build this model yourself, go to studio.azureml.net. Then click “Sign up here”. The “Guest Workspace” option is nice because you don’t have to create an account. Unfortunately, it has some limitations that would prevent you from completing one of the examples later on in this course, so choose “Free Workspace” instead.
If you don’t already have an Azure subscription, then click here and it’ll take you through the signup process. I’ll assume that you’ve already created an Azure account. Now enter your Azure ID and password. It’ll take you to ML Studio.
The place to create a machine learning model is called an experiment. If you’re not already on the Experiments page, then click “Experiments” on the left. Then click “New” to create a new one. You can either start from scratch or use one of these templates. Click “Blank Experiment”.
Here’s how it works. On the left, you have what are called modules. To use them, you drag them over to the canvas. For example, the first thing we need to do is specify what dataset we want to feed into our model. Click on “Saved Datasets” and then on “Samples”.
There are a few dozen sample datasets here that we can use without having to import data from somewhere else. Scroll down until you see “Iris Two Class Data”. Then drag it onto the canvas.
You can have a look at the data in a couple of ways. If you right-click on the module, there’s a menu called “dataset”. Select “Visualize”. This will show you the first 100 rows of the dataset. In this case, there are only 100 rows altogether, so this is the entire dataset.
Each row represents one individual flower. There are columns for sepal length, sepal width, petal length, and petal width. There’s also a column called “Class”. This is known as the label column. It says which species each flower is. A zero means Iris setosa and a one means Iris virginica.
The other way to look at the dataset is to click the “View dataset” link in the Properties pane. This will actually download the dataset to your computer. This particular dataset is in ARFF format, which is basically just a text file, so if you want to look at it, you can open it with a text editor.
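If you’re curious what an ARFF file looks like inside, here’s a small Python sketch that reads one. An ARFF file has a header that declares each column with @ATTRIBUTE, then an @DATA line followed by comma-separated rows. The attribute names, column order, and the two rows below are assumptions for illustration, not a copy of the Azure sample file, and parse_arff is a hypothetical helper that only handles simple numeric files.

```python
# Illustrative ARFF text. Names, column order, and rows are assumptions.
SAMPLE_ARFF = """\
@RELATION iris_two_class

@ATTRIBUTE class NUMERIC
@ATTRIBUTE sepal_length NUMERIC
@ATTRIBUTE sepal_width NUMERIC
@ATTRIBUTE petal_length NUMERIC
@ATTRIBUTE petal_width NUMERIC

@DATA
0,5.1,3.5,1.4,0.2
1,6.3,3.3,6.0,2.5
"""


def parse_arff(text):
    """Return (column_names, rows) from a simple, all-numeric ARFF string."""
    columns, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([float(value) for value in line.split(",")])
        elif lower.startswith("@attribute"):
            columns.append(line.split()[1])  # second token is the column name
        elif lower.startswith("@data"):
            in_data = True  # everything after this line is a data row
    return columns, rows


columns, rows = parse_arff(SAMPLE_ARFF)
print(columns)
print(rows)
```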
Now that we’ve specified our data source, we need to say which machine learning algorithm to use on it. Click “Machine Learning”. Then “Initialize Model”.
ML Studio supports four types of algorithms: regression, classification, anomaly detection, and clustering. I’ll give you a brief overview of what they mean.
Regression is used when you want to predict a number. For example, suppose you need to come up with an estimated selling price for a home based on the sale prices of other homes in the past. Let’s say this graph plots the sale price of various homes against the square footage of each house. You can see there’s a pattern here. Generally speaking, the bigger the square footage, the higher the sale price. But the relationship isn’t exactly the same in all cases, so you need to come up with a formula that comes closest to matching all of these relationships. The technique to do this is called regression.
If you only wanted to base your sale price estimates on square footage, then you wouldn’t need to use machine learning. But suppose you wanted to take dozens of factors into account, such as the home’s age, neighborhood, distance to the nearest school, etc. Then you’d have a multi-dimensional graph that would be hard to visualize. This is where machine learning would shine.
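Here’s what the single-factor case looks like in code. This is a minimal Python sketch of ordinary least squares, which finds the straight line that comes closest to a set of points. The square footages and prices are invented numbers, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented past sales: square footage and sale price.
square_feet = [1000, 1500, 2000, 2500, 3000]
sale_price = [200_000, 270_000, 330_000, 410_000, 470_000]

slope, intercept = fit_line(square_feet, sale_price)

# Estimate the price of a home that wasn't in the data.
estimate = slope * 1800 + intercept
print("price per extra square foot:", slope)
print("estimated price for 1800 sq ft:", estimate)
```

With dozens of factors instead of one, the same idea still applies, but the formula has one coefficient per factor and is no longer something you can draw on a two-dimensional graph.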
A classification algorithm is used when you need to classify data into two or more categories. For example, suppose you need to classify irises into two different species. Hey, that sounds familiar. Some other examples are classifying email as normal or spam, speech recognition (that is, listening to a spoken word and deciding which word it is), and classifying Twitter posts as happy, sad, angry, etc. As you can see, classification algorithms can be used for a wide variety of interesting problems.
Anomaly detection is used to find data that’s unusual compared to “normal” data. For example, you could use it to flag potentially fraudulent credit card transactions or suspicious network activity. You might be wondering why you couldn’t use classification algorithms for this. After all, you want to classify data as either normal or unusual. The difference is that the unusual data points are typically so rare that it would be too difficult for a classification algorithm to learn what unusual data looks like. Anomaly detection algorithms take a different approach. They learn what normal data looks like and then flag anything that doesn’t look normal.
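The “learn what normal looks like” approach can be sketched in a few lines of Python. This is a deliberately simple stand-in (a distance-from-the-mean check), not the method ML Studio’s anomaly detection modules use, and the transaction amounts and the threshold of 3 standard deviations are assumptions for illustration.

```python
import statistics


def fit_normal(values):
    """Summarize normal data by its mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def is_anomaly(value, mean, stdev, threshold=3.0):
    """Flag a value that sits more than `threshold` standard deviations from the mean."""
    return abs(value - mean) > threshold * stdev


# Invented transaction amounts that we treat as normal behavior.
normal_amounts = [22.5, 18.0, 30.0, 25.0, 19.5, 27.0, 21.0, 24.0]
mean, stdev = fit_normal(normal_amounts)

# New transactions to check. Note that no example of fraud was needed for training.
new_amounts = [23.0, 950.0, 26.5]
flagged = [amount for amount in new_amounts if is_anomaly(amount, mean, stdev)]
print("flagged:", flagged)
```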
Clustering algorithms are very different from the other three types. That’s because clustering is typically used for something called unsupervised learning. The other three types are examples of supervised learning, which is where you have labeled data. That is, each piece of training data has a label with the “answer”. In the house price example, the label would give the actual selling price of this particular home.
In unsupervised learning, the data is unlabeled, so the algorithm has to take a different approach to understanding the data. It looks for patterns. The clustering algorithm tries to group the data into different clusters. For example, it could look at purchase histories and then group your customers into different segments so you could market to each segment differently.
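To show how grouping can work without labels, here’s a minimal Python sketch of k-means, one common clustering technique, on a single number per customer. The spending figures and starting centers are invented, and kmeans_1d is a hypothetical helper, not an ML Studio function.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means on one-dimensional data.

    Repeats two steps: assign each value to its nearest center,
    then move each center to the average of its assigned values.
    """
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Keep the old center if a cluster ended up empty.
        centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters


# Invented annual spend per customer. There are no labels here.
annual_spend = [120, 150, 130, 900, 950, 1000, 140, 980]

centers, clusters = kmeans_1d(annual_spend, centers=[100.0, 500.0])
print("segment centers:", centers)
print("segments:", clusters)
```

Nobody told the algorithm which customers were low or high spenders. The two segments come purely from how the numbers bunch together.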
OK, now back to our iris example. This is a classification problem since we need to classify each flower by its species. So open the Classification menu. Wow, that’s a lot of different algorithms. How do you know which one to choose? This is such a common question that Microsoft created a cheat sheet.
There’s a section for each of the algorithm types. We need classification, but it’s actually divided into two sections, two-class and multiclass. Our example is a two-class problem, so that narrows down our choices a little bit. There are still nine algorithms to choose from, though.
For each algorithm, it gives a very brief description of its characteristics. For example, for Two-class SVM, it says “greater than 100 features” and “linear model”. By features, it means the number of columns for each row of data. For example, square footage and age are features of a house. If your training data has a lot of features, then some algorithms could have difficulty with it. The cheat sheet says that the Two-class SVM is a good choice when you have more than 100 features.
When it says linear model, it means that it draws a straight line through the data. A linear model works well for relationships like square footage to house price, but not so well for relationships like time of day to coffee sales.
The algorithms on the left are all linear models. The ones on the right aren’t. You can always try different algorithms and compare the results to see which is best for a particular dataset, but if you’re short on time, I recommend using Decision Forest. It trains quickly and it has high accuracy, so it’s a good choice for most classification problems. In fact, it even works well for regression problems.
OK, so let’s choose Two-class Decision Forest and drag it onto the canvas. Then open the Train menu and drag “Train Model” over as well.
The Train Model module takes two inputs. If you hover your mouse over each input, it’ll tell you what it needs. On the left one, it says “Untrained model” because it needs to know which model you want to train. If you hover over the output on the algorithm module, it also says “Untrained model”, so you can feed it into the training module by clicking and dragging from one to the other. The arrow won’t connect unless you’re right over the circle and it shows the description. Don’t worry about the exclamation point. We’ll deal with that in a minute.
Now have a look at the other training input. It needs a dataset, which is, of course, the output from the Dataset module, so hook that up.
Now we need to deal with that exclamation point. It says, “Value required.” It wants to know what column it should use as a label field. That is, it needs to know which column gives the correct “answer” for each instance of data.
In the Properties pane, click “Launch column selector”. “Class” is the label column, so select it and then click the right-arrow button to move it to “Selected Columns”. Then click the checkmark.
Alright, now the training module has what it needs, so click the Run button. Now it’ll run the dataset through the untrained model to create a trained model.
There’s a spinner that says “Queued” because this training job is going to run on Azure infrastructure. The infrastructure for ML Studio is shared between all customers, so there’s a queue for all incoming jobs. It usually doesn’t take too long before the job runs, though, and this job is quite small, so it won’t take long to run.
Alright, it’s done. To see the result, right-click on the output and select “Visualize”. This shows what the trained model looks like. You don’t normally need to look at the details of a trained model, but when you’re getting started with machine learning, it’s helpful to understand what it’s actually doing.
The way Decision Forest works is it creates a bunch of decision trees, such as this one. Because this is such a simple example, the decision tree it came up with is the simplest one possible. It says that if the petal length is less than or equal to 1.9, then the flower is in Class 0 and if it’s greater than that, then it’s in Class 1.
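Written as code, that single-split tree is just one comparison. Here’s a sketch in Python. The function name is hypothetical, and ML Studio doesn’t expose the tree this way.

```python
def classify_by_petal_length(petal_length):
    """Mirror the one-split tree: Class 0 if petal length <= 1.9, otherwise Class 1."""
    return 0 if petal_length <= 1.9 else 1


# Made-up petal lengths to try the rule on.
for petal_length in [1.4, 1.9, 5.1]:
    print(petal_length, "-> Class", classify_by_petal_length(petal_length))
```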
Click on another tree and see what it does. This one is actually the same as the first one, so let’s try another one. OK, this one is different. It’s a little more complicated than the first one.
The algorithm is called Decision Forest because it generates a number of different trees. Then it runs each piece of data through all of the trees and goes with what the majority of them say is the right classification.
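The voting step can be sketched like this in Python. It’s a simplified illustration of majority voting, not ML Studio’s actual implementation. Only the 1.9 petal length split comes from the trained model we just looked at. The other two trees, their thresholds, and the sample flowers are invented.

```python
from collections import Counter


def tree_a(flower):
    # The split shown in the trained model.
    return 0 if flower["petal_length"] <= 1.9 else 1


def tree_b(flower):
    # Hypothetical tree with an invented threshold.
    return 0 if flower["petal_width"] <= 0.6 else 1


def tree_c(flower):
    # Hypothetical tree with an invented threshold.
    return 0 if flower["sepal_length"] <= 5.4 else 1


def forest_predict(trees, flower):
    """Run the flower through every tree and return the most common answer."""
    votes = Counter(tree(flower) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [tree_a, tree_b, tree_c]

# Invented flowers. For the first one, tree_c disagrees with the other two trees.
small_flower = {"petal_length": 1.5, "petal_width": 0.2, "sepal_length": 5.7}
large_flower = {"petal_length": 5.5, "petal_width": 2.0, "sepal_length": 6.5}

print(forest_predict(trees, small_flower))
print(forest_predict(trees, large_flower))
```

Using an odd number of trees with two classes means the vote can never end in a tie.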
To see how well this model performs on the dataset, we need to score it, so let’s get out of here and open the score menu. Then drag the “Score Model” module over. It has two inputs: a trained model and a dataset. So connect the trained model on the left and the dataset on the right. Now click “Run selected”.
OK, now visualize the output. This time it shows the dataset with two new columns: Scored Labels and Scored Probabilities. You can ignore the Scored Probabilities column in this example, but have a look at the Scored Labels. It shows how the Decision Forest classified each of these flowers. If you go down the list, you can see that it classified all of the flowers on this page correctly.
Normally, a dataset will have far more than 100 rows in it and the accuracy is almost never this high, so you can’t easily eyeball the results to see how well the model did. Instead, we need to open the Evaluate menu and drag “Evaluate Model” over.
It takes two inputs. One is “Scored dataset” and so is the other one. That’s kind of weird, isn’t it? Why would it need two scored datasets? Well actually, it doesn’t. The second input is optional. You’d use it if you want to compare two different models to see which one performed better. Since we only have one model, just connect Score Model to the left input. Then click “Run selected” again.
OK, now visualize the output. This graph is pretty weird because of the simplicity of this example, so let’s ignore it and scroll down. The key number here is accuracy. A 1.0 means that the model was 100% accurate on this dataset.
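Accuracy itself is a simple calculation: the number of rows where the scored label matches the actual class, divided by the total number of rows. Here’s a Python sketch with a hypothetical accuracy function and made-up labels.

```python
def accuracy(actual, predicted):
    """Fraction of rows where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Made-up labels: every prediction matches, like our iris result.
print(accuracy([0, 0, 1, 1], [0, 0, 1, 1]))

# Made-up labels: one of four predictions is wrong.
print(accuracy([0, 0, 1, 1], [0, 1, 1, 1]))
```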
That’s great, but there’s a potential flaw with the way we set this up. I’ll explain why in the next lesson. Let’s save this model because we’re going to modify it in the next lesson. First, change the title to “Iris Example” or whatever you want. Then click “Save”.
Alright, I’ll see you in the next lesson.
Azure Machine Learning Studio: https://studio.azureml.net
Algorithm Cheat Sheet: https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-cheat-sheet