Training a Model from a File Dataset

Learn how to operate machine learning solutions at cloud scale using the Azure Machine Learning SDK. This course teaches you to leverage your existing knowledge of Python and machine learning to manage data ingestion, data preparation, model training, and model deployment in Microsoft Azure.

If you have any feedback related to this course, please contact us at

Learning Objectives

  • Create an Azure Machine Learning workspace using the SDK
  • Run experiments and train models using the SDK
  • Optimize and manage models using the SDK
  • Deploy and consume models using the SDK

Intended Audience

This course is designed for data scientists with existing knowledge of Python and machine learning frameworks, such as Scikit-Learn, PyTorch, and TensorFlow, who want to build and operate machine learning solutions in the cloud.

Prerequisites


  • Fundamental knowledge of Microsoft Azure
  • Experience writing Python code to work with data using libraries such as NumPy, Pandas, and Matplotlib
  • Understanding of data science, including how to prepare data and train machine learning models using common machine learning libraries, such as Scikit-Learn, PyTorch, or TensorFlow


The GitHub repo for this course, containing the code and datasets used, can be found here: 


We've seen how to train a model using training data in a tabular dataset. Now we're going to take a look at how to get this done using a file dataset. When we use a file dataset, the dataset input passed to the script represents a mount point containing file paths. How we read the data from these files depends on the kind of data in the files and what we want to do with it.

In the case of the diabetes CSV files, we can use the Python glob module to create a list of the files in the virtual mount point defined by the dataset, and read them all into pandas data frames that are concatenated into a single data frame.
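
As a rough sketch of that pattern, the helper below lists the CSV files in a folder with glob and concatenates them with pandas. The function name `load_csv_folder` is made up for illustration. In the training script, the folder path would come from the dataset input passed to the script rather than from a hard-coded path.

```python
import glob
import os

import pandas as pd


def load_csv_folder(data_path: str) -> pd.DataFrame:
    """Read every CSV file directly under data_path into one DataFrame.

    load_csv_folder is a hypothetical helper name, not part of any SDK.
    """
    # Sort the matches so the row order is the same on every run.
    all_files = sorted(glob.glob(os.path.join(data_path, "*.csv")))
    if not all_files:
        raise FileNotFoundError(f"No CSV files found in {data_path!r}")
    # One DataFrame per file, concatenated into a single frame
    # with a fresh 0..n-1 index.
    return pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
```

Sorting the file list is a design choice made here so that repeated runs see the rows in the same order; glob itself does not guarantee any ordering.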

Let's start by creating the folder for the experiment files. We then need to set up a script that trains the classification model by using the file dataset that is passed to it as an input. The first step in our script is to import the necessary libraries; note that we've imported glob, as I mentioned earlier. Then we set up the code that allows the script to receive the hyperparameter for regularization. We then get the run context, and then we load the diabetes dataset.
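
One way the script might receive the regularization hyperparameter is through a command-line argument parsed with argparse. This is only a sketch: the flag name `--regularization`, the destination `reg_rate`, and the default of 0.01 are assumptions for illustration, not values confirmed by the lecture.

```python
import argparse


def parse_hyperparameters(argv=None) -> float:
    """Return the regularization rate passed on the command line.

    The flag name and default value are assumptions for this sketch.
    When argv is None, argparse reads sys.argv, as it would in a script.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--regularization",
        type=float,
        dest="reg_rate",
        default=0.01,
        help="regularization rate for the classifier",
    )
    args = parser.parse_args(argv)
    return args.reg_rate
```

Called with `["--regularization", "0.1"]`, the function returns the parsed float; called with an empty list, it falls back to the default.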

Next, we separate features and labels, and then we split the data into a training set and a test set. We then train our classification model. We calculate the accuracy and log it, and we also calculate the area under the curve and log that.
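
The split, train, and score steps could look roughly like the sketch below. It assumes a scikit-learn logistic regression and a 70/30 split, neither of which is stated in the lecture, and it runs on synthetic data rather than the diabetes files. In the real script the two metrics would be logged to the run context instead of printed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def train_and_score(X, y, reg_rate=0.01):
    """Train a classifier and return (model, accuracy, auc) on a held-out set."""
    # Hold out 30% of the rows for testing (assumed split ratio).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=0
    )
    # In scikit-learn, C is the inverse of the regularization strength.
    model = LogisticRegression(C=1 / reg_rate).fit(X_train, y_train)
    # Accuracy: fraction of test rows predicted correctly.
    accuracy = float(np.mean(model.predict(X_test) == y_test))
    # AUC is computed from the predicted probability of the positive class.
    auc = float(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    return model, accuracy, auc


# Synthetic stand-in data: the label depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model, accuracy, auc = train_and_score(X, y, reg_rate=0.1)
print("Accuracy:", accuracy)
print("AUC:", auc)
```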

Next, we save our model and complete the run. We then need to change the way we pass the dataset to the estimator: it needs to define a mount point from which the script can read the files. For large volumes of data, you would generally use the as_mount method to stream the files directly from the dataset source. But when running on local compute, as in this example, we need to use the as_download option to download the dataset files into a local folder. Also, since the Dataset class is defined in the Azure ML Data Prep package, we need to include that in the experiment environment.

So let's import the relevant classes for setting up our experiment estimator. We need to set the script parameters and then get the diabetes training dataset. We then create our estimator, which is SKLearn. It requires the source directory, the entry script, which we've already created, our script parameters, a compute target, which is local, and then inputs, which allows us to pass the dataset object as an input.
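
A minimal sketch of that estimator setup, assuming version 1 of the Azure ML SDK (where the SKLearn estimator lives in `azureml.train.sklearn`) and an existing workspace object. The dataset name, input name, script name, and download folder below are all placeholders chosen for illustration; they are not confirmed by the lecture.

```python
def build_estimator(ws, experiment_folder, reg_rate=0.1):
    """Configure an SKLearn estimator that receives a file dataset as input.

    ws is an existing azureml.core.Workspace. All string names below are
    hypothetical placeholders for this sketch.
    """
    # Imported inside the function so this module can still be loaded
    # on machines where the Azure ML SDK is not installed.
    from azureml.core import Dataset
    from azureml.train.sklearn import SKLearn

    # Hyperparameter passed to the script on the command line.
    script_params = {"--regularization": reg_rate}

    # Look up the registered file dataset (placeholder name).
    diabetes_ds = Dataset.get_by_name(ws, name="diabetes file dataset")

    # as_download copies the files to the compute before the script runs;
    # for large data on remote compute, as_mount would stream them instead.
    dataset_input = diabetes_ds.as_named_input("diabetes").as_download(
        path_on_compute="diabetes_data"
    )

    return SKLearn(
        source_directory=experiment_folder,
        entry_script="diabetes_training.py",  # placeholder script name
        script_params=script_params,
        compute_target="local",
        inputs=[dataset_input],
        # The Data Prep package is needed to work with datasets in the run.
        pip_packages=["azureml-dataprep[pandas]"],
    )
```

Inside the training script, the named input would then be available from the run context under the same name given to `as_named_input`.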

We also need the Data Prep package, as mentioned earlier. Next, we create our experiment, run it, and show the run details while it is running. When the experiment has completed, we can view the Azure ML logs in the widget. We can also use them to verify that the file dataset was processed and the data files were downloaded.
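
Submitting the run and watching it from a notebook might look like the sketch below. It again assumes version 1 of the Azure ML SDK, an existing workspace, and an estimator that has already been configured; the experiment name is a placeholder.

```python
def run_experiment(ws, estimator, experiment_name="diabetes-training"):
    """Submit an estimator-based run and return its metrics and output files.

    ws is an existing azureml.core.Workspace; experiment_name is a
    hypothetical placeholder.
    """
    # Imported inside the function so this module can still be loaded
    # on machines where the Azure ML SDK is not installed.
    from azureml.core import Experiment
    from azureml.widgets import RunDetails

    # Create (or reuse) the experiment in the workspace.
    experiment = Experiment(workspace=ws, name=experiment_name)

    # Submit the estimator configuration as a new run.
    run = experiment.submit(config=estimator)

    # Show the notebook widget with live status and logs.
    RunDetails(run).show()

    # Block until the run has finished.
    run.wait_for_completion()

    # Logged metrics plus the names of the files the run produced,
    # which include the logs that show the dataset download.
    return run.get_metrics(), run.get_file_names()
```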

About the Author

Kofi is a digital technology specialist in a variety of business applications. He stays up to date on business trends and technology and is an early adopter of powerful and creative ideas.
His experience covers a wide range of topics including data science, machine learning, deep learning, reinforcement learning, DevOps, software engineering, cloud computing, business & technology strategy, design & delivery of flipped/social learning experiences, blended learning curriculum design and delivery, and training consultancy.