Working with and Creating Datasets
1h 23m

Learn how to operate machine learning solutions at cloud scale using the Azure Machine Learning SDK. This course teaches you to leverage your existing knowledge of Python and machine learning to manage data ingestion, data preparation, model training, and model deployment in Microsoft Azure.

If you have any feedback related to this course, please contact us at

Learning Objectives

  • Create an Azure Machine Learning workspace using the SDK
  • Run experiments and train models using the SDK
  • Optimize and manage models using the SDK
  • Deploy and consume models using the SDK

Intended Audience

This course is designed for data scientists with existing knowledge of Python and machine learning frameworks, such as Scikit-Learn, PyTorch, and TensorFlow, who want to build and operate machine learning solutions in the cloud.

Prerequisites

  • Fundamental knowledge of Microsoft Azure
  • Experience writing Python code to work with data using libraries such as NumPy, Pandas, and Matplotlib
  • Understanding of data science, including how to prepare data and train machine learning models using common machine learning libraries, such as Scikit-Learn, PyTorch, or TensorFlow


The GitHub repo for this course, containing the code and datasets used, can be found here: 


While we can read data directly from datastores, Azure ML provides an abstraction for data in the form of datasets. A dataset is a versioned reference to a specific set of data that you may want to use in an experiment. Datasets can be either tabular or file-based.

Let's create a dataset from the diabetes data we uploaded to the datastore and view the first 20 records. In this case, the data is in a structured format in a CSV file, so we'll use a tabular dataset. To do this, we get the default datastore and then create a tabular dataset from the path on the datastore.

Now, note that this might take a bit of time to run. We then take the first 20 rows of the tabular dataset, convert them to a data frame, and display them. So as you can see, it's quite easy to convert a tabular dataset to a Pandas data frame, enabling us to work with the data using common Python techniques.
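
The steps just described could be sketched roughly as follows, assuming the Azure ML Python SDK v1 (the azureml-core package). The function name preview_tabular_dataset and its parameters are illustrative and do not come from the course; the SDK import sits inside the function so the sketch can be loaded even where the SDK is not installed.

```python
def preview_tabular_dataset(workspace, path_on_datastore, rows=20):
    """Return the first `rows` records of delimited files as a pandas DataFrame.

    `workspace` is expected to be an azureml.core.Workspace.
    `path_on_datastore` is a path (wildcards allowed) relative to the
    root of the workspace's default datastore.
    """
    # Imported lazily (an assumption of this sketch, not a course requirement).
    from azureml.core import Dataset

    # Get the default datastore attached to the workspace.
    default_ds = workspace.get_default_datastore()

    # A (datastore, relative path) tuple tells the SDK where the CSV files live.
    tab_data_set = Dataset.Tabular.from_delimited_files(
        path=(default_ds, path_on_datastore)
    )

    # take() limits the number of records; to_pandas_dataframe() loads them.
    return tab_data_set.take(rows).to_pandas_dataframe()
```

With a workspace loaded through Workspace.from_config(), a call such as preview_tabular_dataset(ws, "diabetes-data/*.csv") would return the 20-row data frame. The folder name diabetes-data is a guess at where the files were uploaded, so substitute the path actually used on your datastore.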

If you're working on machine learning scenarios that require unstructured data, or you simply want to read the files yourself in your own code, you can create a file dataset for that. This creates a list of file paths in a virtual mount point, which you can use to read the data in the files. So here we create a file dataset from the path on the datastore, and then we get the files in the dataset. So we've got diabetes and diabetes2.
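
A minimal sketch of the file dataset step, again assuming the Azure ML Python SDK v1 (azureml-core). The function name list_file_dataset_paths is hypothetical, and the lazy import is only there so the sketch loads without the SDK installed.

```python
def list_file_dataset_paths(workspace, path_on_datastore):
    """Create a file dataset on the default datastore and list its file paths.

    `workspace` is expected to be an azureml.core.Workspace.
    `path_on_datastore` is a path (wildcards allowed) relative to the
    root of the workspace's default datastore.
    """
    # Imported lazily (an assumption of this sketch, not a course requirement).
    from azureml.core import Dataset

    # Get the default datastore attached to the workspace.
    default_ds = workspace.get_default_datastore()

    # A file dataset references the files themselves rather than parsing them.
    file_data_set = Dataset.File.from_files(path=(default_ds, path_on_datastore))

    # to_path() returns the path of each file the dataset refers to.
    return list(file_data_set.to_path())
```

Looping over the returned list and printing each entry should show one path per file, which for the data described here would be the two diabetes files. The exact path strings depend on where the files were uploaded.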

About the Author

Kofi is a digital technology specialist in a variety of business applications. He stays up to date on business trends and technology and is an early adopter of powerful and creative ideas.
His experience covers a wide range of topics including data science, machine learning, deep learning, reinforcement learning, DevOps, software engineering, cloud computing, business & technology strategy, design & delivery of flipped/social learning experiences, blended learning curriculum design and delivery, and training consultancy.