Introduction to Azure Machine Learning Workbench

Basic Data Preparation

The course is part of this learning path

Introduction to Azure Machine Learning

Overview
Difficulty: Intermediate
Duration: 1h 7m
Students: 178

Course Description

Azure Machine Learning Workbench is a front-end for a variety of tools and services, including the Azure Machine Learning Experimentation and Model Management services.

Workbench is a relatively open toolkit. First, you can use almost any Python-based machine learning framework, such as TensorFlow or scikit-learn. Second, you can train and deploy your models either on-premises or on Azure.

Workbench also includes a great data-preparation module. It has a drag-and-drop interface that makes it easy to use, but its features are surprisingly sophisticated.

In this course, you will learn how Workbench interacts with the Experimentation and Model Management services, and then you will follow hands-on examples of preparing data, training a model, and deploying a trained model as a predictive web service.

Learning Objectives

  • Prepare data for use by an Azure Machine Learning Workbench experiment.
  • Train a machine learning model using Azure Machine Learning Workbench.
  • Deploy a model trained in Azure Machine Learning Workbench to make predictions.

Intended Audience

  • Anyone interested in Azure’s machine learning services

Prerequisites

Resources

The GitHub repository for this course can be found here.

Transcript

For our first example, we’re going to start with our old friend, the iris dataset. You’ll recall that it contains three species of irises: Iris setosa, Iris versicolor, and Iris virginica. The goal is to develop a model that will predict which species an individual flower belongs to, based on the lengths and widths of its petals and sepals. This is a very simple problem for machine learning to solve, but it’s useful for seeing how to work with the various parts of ML Workbench.

Launch Workbench. First, we have to create a new project. Click the plus sign to do that. Call the project “Iris”. Then tell it which directory to put your projects in. This one’s fine. Then select the “Classifying Iris” template and click Create.

Since we created this project from a template, it already contains lots of files. Click on the Files icon. You can see that it has a bunch of Python scripts and a few other types of files. If we had used a blank template, there would have been only a few basic files.

The iris dataset is in the iris.csv file. Normally, you’d store your data somewhere externally, but since the iris dataset is so small, it’s okay to have it in the project directory. Click on it to have a look at it. The first four columns contain the sepal and petal lengths and widths. The last column is the label. It says which species each individual flower belongs to. If you scroll down, you can see that there are 50 rows for each of the 3 iris species.
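If you'd like to see that layout outside of Workbench, here's a minimal pandas sketch. It isn't part of the course project. The six sample rows and the "Iris-setosa" style of label are assumptions standing in for the real iris.csv, which has 50 rows per species.

```python
import io

import pandas as pd

# Stand-in for iris.csv (assumed layout): four measurements, then the label.
SAMPLE_CSV = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

# The file has no header row, so pandas numbers the columns 0 to 4.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None)

# Count how many rows belong to each species (column 4 is the label).
species_counts = df[4].value_counts()
print(species_counts)
```

Run against the full file, the same count should show 50 rows for each species.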

To make it easier to work with the data, we need to turn the csv file into a data source. Click on the Data icon. This template already has a data source that contains the data from the csv file, but let’s create a new data source anyway, so you can see how to do it.

Click the plus sign and “Add Data Source”. It supports four sources of data. Although only the first one says “File”, all of the first three types are actually files. The first one is mostly for simple file formats, such as csv files. The second one, Parquet, is a file format supported by Hadoop, Spark, and other software in the Hadoop ecosystem. The third option is Excel. And the fourth option, Database, is for Azure SQL Database and SQL Server.

Click on the first one since we’re using a csv file. Click Browse and File. Then go into the Workbench folder and then the Iris folder. That’s where the iris.csv file is. Click Next.

This is a delimited file, so you can leave it at that, but here are the other types you can choose.

The separator is a comma. We don’t need to skip any lines. The file encoding is UTF-8. And there aren’t any headers in the file. So we don’t have to change anything on this page. Click Next.
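For comparison, here's roughly how the same four settings would look if you read the file with pandas instead. This is only a sketch, not what Workbench does internally. The two sample rows are stand-ins for iris.csv, and the Column1 to Column5 names are an assumption based on the "Column5" name that shows up in the data preparation step.

```python
import io

import pandas as pd

# Stand-in bytes for iris.csv, encoded as UTF-8.
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
).encode("utf-8")

df = pd.read_csv(
    io.BytesIO(raw),
    sep=",",           # separator: comma
    skiprows=0,        # lines to skip: none
    encoding="utf-8",  # file encoding
    header=None,       # the file has no header row
    # Hypothetical column names following the "Column5" pattern.
    names=["Column1", "Column2", "Column3", "Column4", "Column5"],
)
print(df)
```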

This is where you can tell it the data type of each column. It already took a guess and got them all right. That is, the first four columns are numeric and the last column is a string. But what’s this Path column at the beginning? It contains the path of the data file for every row. This might be useful if we were using multiple data files, but we’re not, so we don’t need it. I’ll show you how to get rid of it in a minute. Click Next.
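You can do a similar check on the guessed types in pandas. This is a minimal sketch with two made-up sample rows and assumed column names. It simply labels every non-numeric column as a string, which is a simplification.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Two stand-in rows in the iris.csv layout (column names are assumed).
df = pd.DataFrame(
    [
        [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
        [7.0, 3.2, 4.7, 1.4, "Iris-versicolor"],
    ],
    columns=["Column1", "Column2", "Column3", "Column4", "Column5"],
)

# Report each column as numeric or string, based on the dtype pandas inferred.
kinds = {
    name: "numeric" if is_numeric_dtype(df[name]) else "string"
    for name in df.columns
}
print(kinds)
```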

The Sampling page lets you say whether you want to include the entire dataset or not. If you’re dealing with a large dataset, you might want to include only a portion of it while you’re developing your model, so it won’t take as long to run. The iris dataset is tiny, so we should include the entire dataset. However, we don’t have to explicitly tell it to do that because it’s currently set to take the first 10,000 rows, and since the iris dataset only has 150 rows, that will definitely cover it. Click Next.
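The "first 10,000 rows" idea maps onto the nrows parameter of pandas.read_csv. Here's a small sketch, assuming a synthetic 150-row file rather than the real measurements. Since the file is shorter than the limit, every row comes through.

```python
import io

import pandas as pd

# Build a 150-row stand-in CSV (synthetic values, not the real iris data).
rows = [f"{i / 10:.1f},label-{i % 3}" for i in range(150)]
csv_text = "\n".join(rows) + "\n"

# Read at most the first 10,000 rows; the file only has 150, so we get them all.
sample = pd.read_csv(io.StringIO(csv_text), header=None, nrows=10_000)
print(len(sample))

# With a smaller limit, only the top of the file is read.
preview = pd.read_csv(io.StringIO(csv_text), header=None, nrows=10)
print(len(preview))
```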

This is where you can say whether you want to include the Path column or not. Since we’re only using one data file, we won’t need this, so we’ll leave it at the default, which is to not include it. OK, click Finish.

Here’s the data in a nice table. On the right-hand side, it shows all of the options we chose. If you want to change any of them, you can just click on the arrow and select Edit.

Also notice that it named this data source “iris-1” because there’s already one called “iris”.

If you click on Metrics, it’ll give you some information about each column. First, it shows a histogram of all the values in the column, so you can see what the distribution looks like. It also tells you the maximum and minimum values, how many values are in the column, and lots of other statistical information.

One of the most important statistics is the “Number of missing values”. It says that there’s one missing value in each column. Let’s look at the data again to see what’s going on. I’ll scroll down to see what’s missing. Aha, there’s an extra row at the bottom that doesn’t have any data. We should get rid of it.
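Here's a rough pandas equivalent of those checks. It's only a sketch: the three data rows and the column names are assumptions, and the last row mimics the empty row at the bottom of the file.

```python
import pandas as pd

nan = float("nan")

# Three stand-in rows plus one empty row, like the one at the end of the data.
df = pd.DataFrame(
    [
        [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
        [7.0, 3.2, 4.7, 1.4, "Iris-versicolor"],
        [6.3, 3.3, 6.0, 2.5, "Iris-virginica"],
        [nan, nan, nan, nan, None],
    ],
    columns=["Column1", "Column2", "Column3", "Column4", "Column5"],
)

# Number of missing values in each column.
missing = df.isna().sum()
print(missing)

# Minimum, maximum, and count of non-missing values for one column.
col = df["Column1"]
print(col.min(), col.max(), col.count())
```

Every column reports one missing value here, which is the same clue that points to the empty row.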

To make changes like this to the data, click Prepare. It needs to know what data preparation file to use. In the dropdown, select “New Data Preparation Package”. Call it “iris-1” so it matches the name of the data source.

To remove the empty row, scroll down to it, then right-click on the null value in Column5, and in the Filter menu, select “is not null”. Now only rows where Column5 is not null get through the filter.
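In pandas, a comparable "is not null" filter is a boolean mask built with notna(). This is a sketch with made-up sample rows, not the code Workbench generates.

```python
import pandas as pd

# Two stand-in rows followed by an empty one (only two columns, for brevity).
df = pd.DataFrame(
    {
        "Column1": [5.1, 7.0, float("nan")],
        "Column5": ["Iris-setosa", "Iris-versicolor", None],
    }
)

# Keep only the rows where Column5 has a value.
kept = df[df["Column5"].notna()]
print(kept)
```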

There’s one more thing we should do while we’re here. We should rename the columns so we can tell what data they contain. You don’t actually need to do this to build and train a model, so why would you want to do it? Well, when you’re writing the code to train the model, it will be easier if you’ve named the columns. Also, when you deploy the trained model, you’ll want to have descriptive column names to make it easier to submit new data for predictions.

To rename a column, just double-click on the header. Rename them to “Sepal Length”, “Sepal Width”, “Petal Length”, “Petal Width”, and “Species”.
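The same renaming can be sketched in pandas with DataFrame.rename. The Column1 to Column4 names are assumed from the "Column5" pattern, and the single sample row is just a stand-in.

```python
import pandas as pd

# One stand-in row with the assumed default column names.
df = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2, "Iris-setosa"]],
    columns=["Column1", "Column2", "Column3", "Column4", "Column5"],
)

# Map each default name to a descriptive one.
renamed = df.rename(
    columns={
        "Column1": "Sepal Length",
        "Column2": "Sepal Width",
        "Column3": "Petal Length",
        "Column4": "Petal Width",
        "Column5": "Species",
    }
)
print(list(renamed.columns))
```

With descriptive names, code that trains the model can refer to renamed["Species"] instead of a positional column.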

Notice that everything you did is shown in the list of steps on the right. If you need to make any changes to a step, you can edit it, delete it, or move it to a different place in the list.

When you’re finished, you can turn this into a Python script, if you want. To do that, right-click on “iris-1” under “Data Preparations” on the left. Then select “Generate Data Access Code File”. This creates a new file called “iris-1.py”. In the next lesson, you’ll see how this code gets used. So if you’re ready, go to the next video.

About the Author

Students: 12835
Courses: 41
Learning paths: 20

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).