Using Azure ML Studio
Machine learning is a notoriously complex subject, which usually requires a great deal of advanced math and software development skills. That’s why it’s so amazing that Azure Machine Learning Studio lets you train and deploy machine learning models without any coding, using a drag-and-drop interface. With this web-based software, you can create applications for predicting everything from customer churn rates, to image classifications, to compelling product recommendations.
In this course, you will learn the basic concepts of machine learning, and then follow hands-on examples of choosing an algorithm, running data through a model, and deploying a trained model as a predictive web service.
Learning Objectives
- Prepare data for use by an Azure Machine Learning Studio experiment
- Train a machine learning model in Azure Machine Learning Studio
- Deploy a trained model to make predictions
Intended Audience
- Anyone who is interested in machine learning
Prerequisites
- No mandatory prerequisites
- Azure account recommended (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
This Course Includes
- 54 minutes of high-definition video
- Many hands-on demos
In the last lesson, we built a model that had 100% accuracy, which seems great, but usually isn’t. The problem is that we used the same data to evaluate the model that we used to train it. Why is this a problem? Suppose you have these data points and you want to do a regression on them. If you were to do a linear regression, then it would look something like this. But suppose you tried to get a higher accuracy by doing a nonlinear regression and you ended up with this. This model would perfectly fit the training data, but if you were to run some new data points through the model, it would likely have lower accuracy than the linear model.
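To make the comparison concrete outside of the Studio interface, here is a minimal Python sketch. It is not part of the course demo, and the data points are made up for illustration: they roughly follow y = 2x + 1 with a little noise. A Lagrange interpolating polynomial stands in for the nonlinear regression, as an assumption, because it is guaranteed to pass through every training point.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """Lagrange polynomial that passes through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: training points are noisy, new points lie on y = 2x + 1.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1.2, 2.7, 5.3, 6.8, 9.4, 10.9]
new_x = [0.5, 1.5, 2.5, 3.5, 4.5]
new_y = [2.0, 4.0, 6.0, 8.0, 10.0]

line = fit_line(train_x, train_y)
curve = fit_interpolant(train_x, train_y)

print("training error (line, curve):", mse(line, train_x, train_y), mse(curve, train_x, train_y))
print("new-data error (line, curve):", mse(line, new_x, new_y), mse(curve, new_x, new_y))
```

With these numbers, the curve fits the training points exactly, yet its error on the new points is larger than the straight line's.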
This is called overfitting and it’s something you always have to watch out for in machine learning. In the iris example, we did a classification rather than a regression, so it might not be obvious what overfitting would look like in this case.
Suppose your model was a single decision tree rather than a decision forest. Then suppose that it was a very complex tree that fit each data point exactly. It would look something like this. Now if you were to run new data through this decision tree, it would likely have a much lower accuracy because the new data points probably wouldn’t match any of the data points that were used to train this model.
One critical way to catch overfitting is to evaluate the model using a separate test dataset. That way, the accuracy score will be much lower if there’s an overfitting problem, because the model is being scored on data it has never seen. The easiest way to come up with a test dataset is simply to split the original dataset into two pieces. You’d typically put 70-80% of the data in the training dataset and the other 20-30% in the test dataset.
Let’s do that with the iris data. If you don’t still have the iris experiment open, then click on Experiments and select it. Now open the “Data Transformation” menu and then the “Sample and Split” menu. Drag the “Split Data” module over. Let’s make some room for it here.
OK, now in the Properties pane, change the “Fraction of rows in the first output dataset” to 0.8 because we want to put 80% of the data into the training dataset. Also make sure that “Randomized split” is checked. Otherwise, it will simply put the first 80% of the rows in the training dataset, which would be a problem if the dataset is sorted in some way.
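If you want to see why the randomized option matters, here is a rough Python sketch of the same idea. The `split_data` function and its parameters are hypothetical stand-ins, not the Split Data module's actual implementation, and the 150 labels are a made-up dataset that happens to be sorted by species.

```python
import random


def split_data(rows, fraction=0.8, randomized=True, seed=0):
    """Return (first_output, second_output), with `fraction` of rows in the first."""
    rows = list(rows)
    if randomized:
        # Shuffle a copy so the caller's data is left untouched.
        random.Random(seed).shuffle(rows)
    cut = round(len(rows) * fraction)
    return rows[:cut], rows[cut:]


# Hypothetical sorted dataset: 50 rows per species, one species after another.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

# Without shuffling, the test set is just the last 20% of the rows.
train_sorted, test_sorted = split_data(labels, randomized=False)
print("unshuffled test species:", set(test_sorted))

# With shuffling, both outputs draw from the whole dataset.
train_random, test_random = split_data(labels, randomized=True)
print("shuffled sizes:", len(train_random), len(test_random))
print("shuffled test species:", set(test_random))
```

In the unshuffled case, the test set holds only the last species in the file, so the accuracy score would say nothing about the other two.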
Now delete the existing arrows by right-clicking them and selecting Delete. Alright, now connect the dataset to the Split Data module. Then connect the left output, which is 80% of the data, to the Train Model module, and connect the right output to the Score module. Now click Run.
Alright, it’s done. Let’s have a look at the evaluation. It’s still 100% accurate, so we probably don’t have an overfitting problem. One likely reason is that the iris dataset is straightforward and easy to model properly. Another potential reason, though, is that the Decision Forest algorithm is designed to avoid overfitting.
Each decision tree is created from a random subset of the data, so no single decision tree can perfectly fit all of the data points. Then the algorithm runs new data through all of the trees to come up with a classification, which makes overfitting less likely. It’s still possible, though, especially with complex data, which is why you should always use a separate dataset to test your model.
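Here is a toy Python sketch of that idea, heavily simplified. It is not how Azure's Decision Forest module is implemented. Each "tree" is only a single split on one feature, each random subset is drawn with replacement (an assumption), and the petal lengths are made-up values.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows):
    """Fit a one-split tree: pick the threshold with the fewest training errors."""
    values = sorted({x for x, _ in rows})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in rows if x < threshold]
        right = [y for x, y in rows if x >= threshold]
        left_label, right_label = majority(left), majority(right)
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # Every sampled row had the same feature value, so no split is possible.
        only_label = majority([y for _, y in rows])
        return lambda x: only_label
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x < threshold else right_label


def train_forest(rows, n_trees=25, seed=0):
    """Train each tree on its own random sample of the rows."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]


def predict_forest(trees, x):
    """Run x through every tree and return the majority vote."""
    return majority([tree(x) for tree in trees])


# Hypothetical (petal length, species) rows.
rows = [(1.3, "setosa"), (1.4, "setosa"), (1.5, "setosa"), (1.6, "setosa"), (1.7, "setosa"),
        (4.0, "versicolor"), (4.2, "versicolor"), (4.5, "versicolor"),
        (4.7, "versicolor"), (4.9, "versicolor")]

forest = train_forest(rows)
print(predict_forest(forest, 1.5), predict_forest(forest, 4.4))
```

Because every tree sees a slightly different sample, each one puts its split in a slightly different place, and the vote smooths out the quirks of any single tree.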
And that’s it for overfitting.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).