  5. Introduction to Azure Machine Learning Workbench

Advanced Data Preparation

The course is part of this learning path

Introduction to Azure Machine Learning

Contents

Introduction
Conclusion

Overview

Difficulty: Intermediate
Duration: 1h 7m
Students: 178

Course Description

Azure Machine Learning Workbench is a front-end for a variety of tools and services, including the Azure Machine Learning Experimentation and Model Management services.

Workbench is a relatively open toolkit. First, you can use almost any Python-based machine learning framework, such as TensorFlow or scikit-learn. Second, you can train and deploy your models either on-premises or on Azure.

Workbench also includes a great data-preparation module. It has a drag-and-drop interface that makes it easy to use, but its features are surprisingly sophisticated.

In this course, you will learn how Workbench interacts with the Experimentation and Model Management services, and then you will follow hands-on examples of preparing data, training a model, and deploying a trained model as a predictive web service.

Learning Objectives

  • Prepare data for use by an Azure Machine Learning Workbench experiment.
  • Train a machine learning model using Azure Machine Learning Workbench.
  • Deploy a model trained in Azure Machine Learning Workbench to make predictions.

Intended Audience

  • Anyone interested in Azure’s machine learning services

Prerequisites

Resources

The GitHub repository for this course can be found here.

Transcript

In the iris example, the data was quite simple and it was already in the project folder, so we didn’t need to do much preparation. In this lesson, we’re going to use a more complex set of data.

Suppose you work for Hubway, a bike-sharing service in the Boston area. Hubway has a fleet of bikes in docking stations around the city. People can buy a pass or a membership and then ride a bike from one station to another. They can take an unlimited number of trips, as long as each trip is less than 30 minutes.

You’ve been asked to build a machine learning model using a combination of historical weather and bike trip data. Hubway is hoping to deploy a model that will predict demand over the next 24 hours, in 2-hour windows. In this lesson, we’ll only be going through the data preparation phase and won’t be building a model.

Here’s an overview of what we’ll be doing. First, we’ll import the trip and weather data from external sources. Some of the data will be in multiple files, one for each month. We’ll remove unnecessary rows and columns, as well as data containing errors. Next, we’ll summarize the data into 2-hour windows in a new column for each dataset. Then we’ll merge the two datasets into one. Finally, we’ll create some new columns, such as the day of the week.

Alright, first let’s download the Hubway trip data. Here’s the link. To save you from having to download and extract too many files, let’s just download the January data for the years 2015 - 2017. Then unzip all of them.

Next, download the weather data. It’s all in one file. Let’s have a look at it. It has data from the beginning of 2015 to the end of February 2017. It contains the temperature, the humidity, and the wind speed. It also has a report type column, REPORTTYPE, although it’s misspelled in the file. I’ll tell you more about this column later.

Now let’s have a look at one of the Hubway files. It has a lot more columns. It contains information about each trip, such as the start and end times, and the start and end stations. It also includes information about each rider, such as year of birth and gender. Some of this information won’t be necessary for our model, so we’ll get rid of it later.

OK, open ML Workbench and create a new project. Call it “BikeShare”. This time, we’re going to start with a blank project.

Now click the Data icon and add a Data Source. Click File and Next. Click Browse, File, and then select the BostonWeather.csv file you downloaded. Click Next. Verify that it got all the parameters right. These look good, so I’ll click Next.

Here, it guessed that all of the columns contain strings except for the date column, which, not surprisingly, is set to Date. Everything is correct, so click Next.

This time, there are over 30,000 rows of data in the file, so we don’t want to accept the default of just the first 10,000 rows. Click New, then change the Sample Strategy to “Full File”. Click Apply, then click “Set as Active”. The star is next to “Full File” now. Click Next.

We don’t need to include the Path Column, so leave this and click Finish. Now let’s do some cleaning by removing the rows and columns we don’t want.

First, let’s look at the temperature, humidity, and wind columns. They’re all numbers, but for some reason, Workbench thought they were strings. That suggests that maybe there are some non-numeric values in these columns. Click the Metrics button to see if that’s the case. Yes, there are about 800 values in each of those three columns that aren’t numbers. That explains why Workbench made these columns strings instead of numeric. In fact, the most common value in the temperature column is “NA”. There are 791 rows with NA in that column, which is almost all of the values that aren’t numbers in that column.

We should get rid of the rows that have non-numeric values. Click Prepare and type “BikeShare” for the dataprep package name. Select the temperature, humidity, and wind columns by holding down the Control key on Windows or the Command key on a Mac. Right-click and select “Convert Field Type to Numeric”.

Notice that the three columns now have quite a bit of red in the “Data Quality Bar”. If you hover over their headers, you can see how many errors it found. In fact, they have the same number of errors as what we saw in the “Not a Number” column in the Metrics, which makes sense.

To remove the rows with errors in them, select the temperature column, right-click on the header, and select “Filter Column”. Leave “I Want To” on “Keep Rows”. Then for the condition, select “is not error”. You’d check the “Create Dataflow Containing the Filtered Out Rows?” box if you wanted to keep the filtered out rows and do something with them. We don’t need to do that, so don’t check the box. Click OK.

The red line disappeared from the Data Quality Bar. Now add the same filter to the humidity and wind columns. OK, all the errors are gone.
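Outside of Workbench, the same convert-then-filter step can be sketched in pandas. This is only an illustration; the column name and sample values below are assumptions, not the actual weather file.

```python
import pandas as pd

# Hypothetical sample mirroring the weather file: some readings are "NA".
weather = pd.DataFrame({
    "DATE": ["2015-01-01 00:54", "2015-01-01 01:54", "2015-01-01 02:54"],
    "HOURLYDRYBULBTEMPF": ["28", "NA", "27"],
})

# Convert the column to numeric; non-numeric values like "NA" become
# errors (NaN), just as Workbench flags them in the Data Quality Bar.
weather["HOURLYDRYBULBTEMPF"] = pd.to_numeric(
    weather["HOURLYDRYBULBTEMPF"], errors="coerce")

# Keep only the rows without conversion errors ("is not error" filter).
weather = weather.dropna(subset=["HOURLYDRYBULBTEMPF"])
```

In a real script you would apply the same conversion and filter to the humidity and wind columns as well.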

Now, remember this REPORTTYPE column? The records in this dataset came from different types of weather reports. Right-click the header and select “Value Counts” to see how many types there are. It says there are only four unique values in this column. And about 19,000 of them (which is about two-thirds of all of the records) are FM-15 reports. If you look at the data, you can see that there’s an FM-15 record every hour at 54 minutes past the hour. But the FM-12 records are only every three hours. To use consistent, hourly weather records, let’s keep the FM-15 reports and get rid of the rest.

First, let’s hide the Inspector pane. Then right-click on FM-15 in one of the records and go into the Filter menu. Select “equals”. We just filtered out all of the rows where the REPORTTYPE did not equal FM-15.

Since all of the values in this column are now the same, we don’t need it anymore. Right-click on the header and select “Remove columns”.

You might have noticed a potential problem with all of our filtering. We want one weather record per hour, but we filtered out rows that had errors in the three numeric columns, so there’s a chance that we don’t have data for every hour anymore. One way to check is to compare the number of rows we have now to the number of FM-15 rows that were in the original dataset. There are only 8 fewer records in our filtered dataset, so we didn’t remove very many. That’s because almost all of the rows that had errors were not FM-15 rows. I should point out, though, that the original dataset was already missing FM-15 records for 17 hours. If we wanted to be extremely thorough, we could insert reasonable numbers for the missing records, but we’re not going to do that here.

Now that we’ve cleaned the data, let’s summarize it into 2-hour time periods. Right-click the Date column and select “Derive Column by Example”. This is a powerful feature. It lets you give an example of what data in the new column should look like and Workbench will try to derive some rules for how to create all of the values in the new column.

Now type the date in the blank cell in the format of “Jan 1 2017”. We’re changing the format to avoid any confusion about which number is the month and which is the day. Then add a space and “12AM-2AM”, because that’s the 2-hour window that the first record falls into. When you hit the Enter key, it looks through the column to see if it can derive all of the values. It searches for edge cases to make sure it has the right rules. If it needs help, it will show this “Review next suggested row” link. Click on it.

It wants to make sure that it figured out the right rules. It figured this one out properly, and it also now says “No suggestions” up here, so if everything looks okay, click OK. Double-click on the header to rename the column. Call it “Date Hour Range”.
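“Derive Column by Example” infers a transformation rule from the sample you type. A plain-Python equivalent of the rule it would need to learn here might look like the following; the window format is taken from the example above, but the function itself is just an illustrative sketch:

```python
import pandas as pd

def date_hour_range(ts):
    """Format a timestamp as e.g. 'Jan 1 2017 12AM-2AM'."""
    start = (ts.hour // 2) * 2          # floor the hour to a 2-hour window
    end = (start + 2) % 24              # wrap 24 back around to 12AM

    def fmt(h):
        # 12-hour clock label without minutes, e.g. 0 -> "12AM", 14 -> "2PM".
        return f"{(h % 12) or 12}{'AM' if h < 12 else 'PM'}"

    # "Jan 1 2017" with no zero-padding on the day, per the typed example.
    date_part = f"{ts.strftime('%b')} {ts.day} {ts.year}"
    return f"{date_part} {fmt(start)}-{fmt(end)}"
```

For instance, the first weather record (12:54 AM on January 1, 2017) falls into the “12AM-2AM” window.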

Now that all of the data has a 2-hour window, we can average the 2 values in each 2-hour period. Before we do that, let’s get rid of the original Date column, since we don’t need it anymore.

OK, now go into the Transforms menu and select Summarize. We’re going to group the data by the hour range, so drag that column into the left-hand side. The data we’re going to summarize is in the temperature, humidity, and wind columns, so drag those into the right-hand side. We want to average the values, so select “Mean” in the Aggregate dropdown for each of the three columns. Then click OK.

Now there’s only one record for each 2-hour window and it contains the average of the two values that were there before for each of the three columns. OK, that’s it for the weather data, so let’s move on to the bike trip data.
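The Summarize transform with a Mean aggregate corresponds to a group-by-and-average. A minimal pandas sketch, with made-up temperatures and just one summarized column:

```python
import pandas as pd

# Two hourly readings per 2-hour window (values are invented).
weather = pd.DataFrame({
    "Date Hour Range": ["Jan 1 2017 12AM-2AM"] * 2 + ["Jan 1 2017 2AM-4AM"] * 2,
    "Temperature": [28.0, 30.0, 26.0, 24.0],
})

# Group by the window and average the readings (the Mean aggregate).
summary = weather.groupby("Date Hour Range", as_index=False).mean()
```

A full version would list the humidity and wind columns in the same aggregation.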

Click the plus sign and “Add Data Source”. They’re text files again. Click Browse and File. Select the three trip files. All of these parameters are correct except for this one. All three files have the same headers, so select this option. Otherwise, it would interpret the headers in the second and third files as data records, which would mess things up.

It did a good job of figuring out what type of data’s in each column. You might think the birth year should be a date instead of a string, but if you try to change it to Date, it’ll give an error, because a year alone isn’t enough to be considered a date. We could change it to numeric, but let’s just leave it as a string, because we’re not going to use this column anyway.

Just like with the weather data, we need to change this to “Full File” so it doesn’t take only the first 10,000 rows. Remember to click “Set as Active”.

This time we have multiple files, so should we include the Path column? Well, the only useful information in the filename is the year and month, but there are already dates in a couple of the other columns, so we don’t need anything from the filename. Click Finish.

Alright, let’s get to work. Click the Metrics button so we can see if there are any problems with the data. Have a look at the “Number of missing values” column. Wow, this is very clean data. There aren’t any missing values. Now let’s see how many “not a number”s there are. The only columns that have anything other than zero are the ones that aren’t numeric, so we’re good here too.

We don’t need all of these columns, though, so let’s remove some of them. Click the Prepare button. We want to add this to the dataprep package we just created for the weather data, so leave that selection as it is. It’s showing the weather data. To see the trip data, click on Dataflows. There it is. It named the dataflow after the first filename, which is misleading because we actually have three years’ worth of January data in it. To rename it, right-click and select Rename. I’ll call it “2015-2017 January hubway-tripdata”.

OK, which columns don’t we need? Well, our machine learning model will only be concerned with the start time and the start station id, so let’s select those and remove everything else. Right-click in the header of one of those two columns and select “Keep Columns”. All the other columns got deleted.

Now we need to create a column of 2-hour time windows again, so select the starttime column, right-click on it, and select “Derive Column by Example”. OK, now use the same format we used for the weather data, so type “Jan 1 2017”, and add “ 12AM-2AM”. Then click “Review next suggested row”. That one looks good. It’s taking quite a while to analyze the data, so I’m going to fast-forward every time.

Click it again. That one looks good too. That one’s fine too. Boy, it really wants to check a lot of them this time, for some reason. Alright, it’s finally saying “No suggestions”, so click OK.

Rename the column to “Date Hour Range”. Then remove the starttime column. Now we just need to summarize the data in each 2-hour period. Select Summarize from the Transforms menu. Then drag “Date Hour Range” and “start station id” into the left side, because we want to count the number of trips that started at each station during each 2-hour period. Now drag “start station id” into the right side. This time we want to add up the number of trips in each period, so leave it as Count. Then click OK.
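This Summarize with a Count aggregate is a group-by over two keys. A small pandas sketch with invented trips:

```python
import pandas as pd

# Hypothetical trips: window plus starting station for each trip.
trips = pd.DataFrame({
    "Date Hour Range": ["Jan 1 2017 8AM-10AM"] * 3 + ["Jan 1 2017 10AM-12PM"],
    "start station id": [5, 5, 7, 5],
})

# Count trips per station per 2-hour window (the Count aggregate).
counts = (trips.groupby(["Date Hour Range", "start station id"])
               .size().reset_index(name="Count"))
```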

Alright, we’re finished preparing both the weather and the trips data, so now we need to join them into one table. In the Transforms menu, select Join. It already selected the trip data for the left, so select the weather data for the right. This means that the joined dataset will have the trip columns first and then the weather columns. If you want them the other way around, then put the weather on the left and the trips on the right. Click Next.

Now we need to select the key column in each table. That is, it needs to know which column has the same values in the two tables, so it can join them together using that key. Select “Date Hour Range” in each of the tables. The original date columns in these tables used different date formats, so if we hadn’t created a Date Hour Range column with the same format in each table, then we wouldn’t have been able to join the two tables. Click Next.

The top pane shows you how many matching rows there are in the two tables, as well as how many rows in each table didn’t match. The bottom pane shows you what will be in the new table. It looks good, so click Finish.
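The Join transform is equivalent to an inner merge on the shared key. A pandas sketch with invented values:

```python
import pandas as pd

# Summarized trip counts (left side of the join).
trips = pd.DataFrame({
    "Date Hour Range": ["Jan 1 2017 12AM-2AM", "Jan 1 2017 2AM-4AM"],
    "Count": [12, 3],
})

# Summarized weather (right side of the join).
weather = pd.DataFrame({
    "Date Hour Range": ["Jan 1 2017 12AM-2AM", "Jan 1 2017 2AM-4AM"],
    "Temperature": [29.0, 25.0],
})

# Inner join on the shared key column, trip columns first.
joined = trips.merge(weather, on="Date Hour Range", how="inner")
```

Note one difference: pandas keeps a single copy of the key column when you merge on a shared name, whereas Workbench keeps one copy from each table.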

Notice that we now have two copies of the Date Hour Range column, one from each table. They’re exactly the same, so we don’t need both. Right-click on the second one and select “Remove Columns”.

Finally, let’s create some new features that will hopefully help our machine learning model. It would probably make more sense to have separate features for the month, day, year, and time period. It would also be very helpful to have a feature for the day of the week, because that has a major impact on how many bike trips people make.

To split the Date Hour Range column, right-click on it and select “Split Column by Example”. It took a guess at what we wanted and split it into four columns, but it put the wrong data in each of the four columns, so click “Advanced Mode”. We can tell it what delimiters to use to do the split, here. If we put in a space and click Apply, then it will split it into the month, day, year, and hour range, which is what we want. Click OK to accept it. The columns have generic names, so let’s rename them. Whoops, I should call this “Hour Range”.
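The space-delimited split in Advanced Mode corresponds to a string split into four columns. A pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"Date Hour Range": ["Jan 1 2017 12AM-2AM"]})

# Split on spaces into the Month, Day, Year, and Hour Range columns.
df[["Month", "Day", "Year", "Hour Range"]] = (
    df["Date Hour Range"].str.split(" ", expand=True))
```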

Now, to add a column for the day of the week, right-click on “Date Hour Range” and select “Derive Column by Example”. January 1st, 2017 was a Sunday, so enter “Sun” in the first blank cell. It was smart enough to figure out what we wanted, and it’s saying “No suggestions”, so click OK. Then rename the column as “Day of Week”.
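The day-of-week rule that derive-by-example learns from “Sun” amounts to parsing the date and formatting the abbreviated weekday name. A pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["Jan 1 2017", "Jan 2 2017"]})

# Parse the "Jan 1 2017" format and take the abbreviated weekday name.
df["Day of Week"] = (
    pd.to_datetime(df["Date"], format="%b %d %Y").dt.strftime("%a"))
```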

Finally, let’s remove the Date Hour Range column.

Now the data’s ready for a machine learning model. To generate sample code for using this dataflow, right-click on the BikeShare dataprep and select “Generate Data Access Code File”.

There’s just one more thing you have to do. The dataflow index is set to 0, which means this code will run the first dataflow, which is the BostonWeather dataflow. We need to run the “Join Result” dataflow, so change the index to 2 and save the file.

And that’s it for this lesson.

About the Author

Students: 12,835
Courses: 41
Learning paths: 20

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).