Setting Up Datastores and Datasets

Overview

  • Difficulty: Beginner
  • Duration: 50m
  • Students: 164
  • Rating: 4.6/5

Description

Machine learning is a notoriously complex subject that usually requires a great deal of advanced math and software development skills. That’s why it’s so amazing that Azure Machine Learning lets you train and deploy machine learning models without any coding, using a drag-and-drop interface. With this web-based software, you can create applications for predicting everything from customer churn rates to image classifications to compelling product recommendations.

In this course, you will learn the basic concepts of machine learning and then follow hands-on examples of choosing an algorithm, running data through a model, and deploying a trained model as a predictive web service.

Learning Objectives

  • Create an Azure Machine Learning workspace
  • Train a machine learning model using the drag-and-drop interface
  • Deploy a trained model to make predictions based on new data

Intended Audience

  • Anyone who is interested in machine learning

Prerequisites

  • General technical knowledge
  • A Microsoft Azure account is recommended (sign up for free trial at https://azure.microsoft.com/free if you don’t have an account)

Resources

The GitHub repository for this course is at https://github.com/cloudacademy/azureml-intro.



Transcript

When we built a pipeline to predict automobile prices, we were able to get the data from a sample dataset module, but that’s not something you’d do for a real project. Normally, you have to import data from another source, such as Azure Storage. 

I’m going to show you how we’d import the automobile dataset from its original source rather than from the sample dataset. Let’s have a look at it before we import it. There’s a link to it in the readme file in the GitHub repository for this course. This page gives an overview of the dataset. Click Data Folder to get to the actual data. It’s this one. This should look familiar, although it’s in a different format from the way we saw it in the Designer.

Now let’s import it. First, create a new pipeline draft. Select a compute target. This is the one we created before, so select it. 

Now, if you open the Data Input and Output section, you’ll see a module called Import Data. If you put your mouse pointer over it, you’ll see a description. It lets you load data from web URLs or from various Azure services, such as SQL Database or Blob storage.

The easiest way is to use a web URL. Drag the module over to the canvas. Change the data source to URL via HTTP. For the data source URL, you can copy and paste it from the GitHub repository.

First, it validates the data. Okay, it’s validated. Now click the Preview schema button to see the columns it found. The columns are just named Column1, Column2, etc. That’s because the original dataset didn’t include names for the columns. There’s also one called Path at the top. That’s the path to the dataset, which isn’t something we need to include in the dataset, so leave it unchecked.
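To illustrate why the columns come out as Column1, Column2, and so on, here is a minimal Python sketch of reading a headerless comma-delimited file and assigning default names. The sample rows and the `read_headerless` helper are hypothetical, and this is not how the Import Data module is implemented; it only mirrors the naming behavior described above.

```python
import csv
import io

# Hypothetical two-row sample with no header line, in the style of the automobile file.
SAMPLE = "3,?,alfa-romero,gas\n1,164,audi,gas\n"


def read_headerless(text):
    """Parse headerless CSV text and label the fields Column1, Column2, ..."""
    rows = list(csv.reader(io.StringIO(text)))
    names = [f"Column{i}" for i in range(1, len(rows[0]) + 1)]
    return [dict(zip(names, row)) for row in rows]


records = read_headerless(SAMPLE)
print(list(records[0]))  # ['Column1', 'Column2', 'Column3', 'Column4']
```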

It took a guess as to what type of data is in each column. It’s always a good idea to go through these guesses and make sure they’re correct because it does get the type wrong sometimes. For example, it thinks Column2 is a string. If we look at the original dataset, we can see that this column actually contains integers. It guessed that it was a string because the first value is a question mark, which isn’t a number. So, if we change the data type of Column2 to Integer, then it will interpret the values correctly, and it will treat the question marks as missing values (Not a Number). Some of the other data types are wrong, too, but we’re not going to use this data, so just leave them.
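The type-guessing problem can be sketched in a few lines of Python. This is a deliberately naive inference rule, not the Designer’s actual logic, and the sample values and helper names are hypothetical. It shows how a question mark pushes a numeric column toward a string guess, and how forcing a numeric type turns the question marks into Not a Number.

```python
import math

# Hypothetical values for the second column; "?" marks a missing entry.
column2 = ["?", "164", "?", "158"]


def guess_type(values):
    """Guess 'Integer' only if every value parses as an int, else 'String'."""
    try:
        for value in values:
            int(value)
        return "Integer"
    except ValueError:
        return "String"


def to_number(value, missing="?"):
    """Convert a field to a number, mapping the missing marker to NaN."""
    return math.nan if value == missing else int(value)


print(guess_type(column2))  # String
converted = [to_number(v) for v in column2]
print(converted[1], math.isnan(converted[0]))  # 164 True
```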

If we want to look at the data, we have to run the pipeline. This is different from how we looked at the data in the sample dataset we used before. With that one, we could look at the data without running the pipeline. Okay, it’s done. Now we can click Visualize. It looks the same as it did with the sample dataset except for the column names. By the way, it’s possible to change the column names by using the Edit Metadata module.

So that’s one way to import data from a web URL. Now I’ll show you how to import data from Azure Blob storage. It’s usually better to have your data in an Azure storage service than to access it over the web, especially if it’s a large dataset.

Let’s upload the automobile data into Blob storage and use that as an example again. Go back to the URL for the original dataset, then download it to your desktop. Then go to the Azure portal. Go to Storage accounts, and select the one that starts with mlcourse. It was created when we created the workspace.

Then click Containers and add a container. Let’s call it automobiles. And click Create. Now click Upload and select the file you downloaded before.

Okay, then go back to Azure Machine Learning studio. To import data from an Azure service, you’ll need to create a datastore. An Azure ML datastore doesn’t hold the data itself. It simply contains the connection information for the underlying storage service.

Click Datastores in the left-hand menu. There are already a few datastores here. These two were created when I set up the workspace, and this one was created when I added the sample automobile dataset module.

To create a new one, click New datastore. Call it automobiles. The datastore type is already set to Azure Blob Storage, which is correct, but let’s have a look at what else we could connect to. Some of the other options are different types of Azure Storage and relational databases.

For the storage account, select the one starting with mlcourse. For the blob container, select automobiles. For the authentication type, you can either use an account key or a shared access signature. Leave it set to Account key. To get the key, go back to the Azure portal, go to the storage account, and click Access keys. Copy one of the keys, and paste it into the Account key field. Then click Create. This registers the datastore in your workspace.
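The lesson does all of this in the studio UI, but the same registration can be sketched with the Azure ML Python SDK v1 (the `azureml-core` package). Treat this as a sketch under assumptions: the function name and its parameters are hypothetical, and the datastore and container names simply reuse the "automobiles" name chosen above.

```python
def register_automobiles_datastore(workspace, account_name, account_key):
    """Register the 'automobiles' blob container as a datastore (SDK v1 sketch)."""
    # Imported lazily so this file still loads where azureml-core is not installed.
    from azureml.core import Datastore

    return Datastore.register_azure_blob_container(
        workspace=workspace,
        datastore_name="automobiles",   # name shown in the Datastores list
        container_name="automobiles",   # blob container created in the portal
        account_name=account_name,      # the storage account starting with mlcourse
        account_key=account_key,        # one of the keys from Access keys
    )
```

The `workspace` argument would typically be an `azureml.core.Workspace` object, for example one loaded with `Workspace.from_config()`. Avoid hard-coding the account key in source files.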

Now to use a datastore, you need to create a dataset. An Azure Machine Learning dataset is simply a pointer to a specific file in a datastore or other data source. If you click Create dataset, you’ll see that there are a number of different types of data sources, such as web files. When we used the Import Data module earlier and pointed to the file on the University of California, Irvine website, we could have created a dataset here instead and pointed it to that file. I’ll explain later why it might have been better to do it this way.

Okay, select From datastore. Let’s call the dataset automobiles as well. There are two types of datasets: Tabular and File. A Tabular dataset is what we’ve been dealing with so far. When you select this option, it parses the file and turns it into a table with rows and columns. If you select File, then it will just reference the raw file or files. Select Tabular. And click Next.

Select the automobiles datastore and click Select datastore. Then click the Browse button, select the file, and click Save. Click Next. This is a preview of what the table will look like. If it doesn’t match what you’re expecting, then you can change the settings to get it to work. For example, if it weren’t a comma-delimited file, then we’d need to set the delimiter. The table looks fine, so click Next.
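To see why the delimiter setting matters, here is a small Python sketch using a hypothetical semicolon-delimited sample. Parsed with the default comma delimiter, each row collapses into a single field, which is the kind of mismatch the preview would reveal.

```python
import csv
import io

# Hypothetical sample that uses semicolons instead of commas.
SEMICOLON_TEXT = "3;?;alfa-romero\n1;164;audi\n"


def parse_table(text, delimiter=","):
    """Split delimited text into a list of rows using the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


wrong = parse_table(SEMICOLON_TEXT)                 # default comma delimiter
right = parse_table(SEMICOLON_TEXT, delimiter=";")  # correct delimiter
print(len(wrong[0]), len(right[0]))  # 1 3
```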

Now we can verify the data types for the columns. As you know, some of these are wrong, such as the second column, but we’ll just leave them as is since we won’t actually be using this dataset. Click Next. And click Create.
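For comparison with the wizard steps above, here is a hedged sketch of creating and registering the same Tabular dataset with the Azure ML Python SDK v1. The function name and the `file_name` parameter are hypothetical, and it assumes a datastore named "automobiles" is already registered in the workspace.

```python
def register_automobiles_dataset(workspace, file_name):
    """Create a Tabular dataset from a file in the datastore and register it."""
    # Imported lazily so this file still loads where azureml-core is not installed.
    from azureml.core import Dataset, Datastore

    datastore = Datastore.get(workspace, "automobiles")
    dataset = Dataset.Tabular.from_delimited_files(
        path=[(datastore, file_name)],  # file uploaded to the blob container
        header=False,                   # the file has no header row
    )
    # create_new_version=True adds a new version if the name is already registered.
    return dataset.register(
        workspace=workspace,
        name="automobiles",
        create_new_version=True,
    )
```

A File dataset would use `Dataset.File.from_files` instead, which references the raw files without parsing them into rows and columns.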

All right, so we’ve created a datastore and a dataset. Now how do we use the dataset? Go back to the Designer, and select the pipeline we were working on. Now open the Datasets section. There’s the automobiles dataset that we just created. Drag it over. Click on it, click on Outputs, and click Visualize. It’s more or less the same as what we saw before.

Okay, we’ve looked at a few different ways to import the same data, so which one should you use? The easiest way was to use the Import Data module and specify a web URL. But is that the best way? There are two potential issues with doing it that way. First, as I mentioned earlier, if it’s a large dataset, then it would be slow to load it over the web. Putting it in Azure Storage would give better performance. Second, creating an Azure ML dataset gives you additional benefits that the Import Data module doesn’t have.

The first benefit of a dataset is that you can create multiple versions of it. This is helpful for keeping track of which version of a dataset was used for a particular training run. The second is that you can monitor a dataset for data drift. What does that mean? When you deploy a trained model, it will receive requests to make predictions on new data. If the patterns in this new data start to drift away from the patterns in the data that was used to train the model, then the model’s predictions will get worse over time. When you monitor a dataset, you create a baseline dataset and then compare new data to it on a regular basis. You can set up alerts to tell you if it detects data drift between the baseline dataset and the new data.
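The idea behind drift monitoring can be illustrated with a very simple check: compare the mean of new data against the baseline, measured in baseline standard deviations. This is only a sketch of the concept. The metric, the threshold, and the price values are assumptions for illustration, and Azure’s dataset monitors use their own drift measures.

```python
import statistics


def mean_shift(baseline, new_data):
    """Distance between the two means, in baseline standard deviations."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    return abs(statistics.mean(new_data) - base_mean) / base_std


def has_drifted(baseline, new_data, threshold=1.0):
    """Flag drift when the mean moves more than `threshold` standard deviations."""
    return mean_shift(baseline, new_data) > threshold


# Hypothetical car prices: the training baseline and two batches of new data.
baseline_prices = [10000, 12000, 11000, 13000, 9000]
similar_prices = [10500, 11500, 12500, 9500, 11000]
shifted_prices = [18000, 21000, 19500, 22000, 20000]

print(has_drifted(baseline_prices, similar_prices))  # False
print(has_drifted(baseline_prices, shifted_prices))  # True
```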

So, in summary, there are real benefits to storing your data in an Azure service, such as Azure Storage, and to referencing that data by creating Azure Machine Learning datastores and datasets.

And that’s it for this lesson.

About the Author

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).