Creating a Dataset in PyTorch

Difficulty: Intermediate
Duration: 1h 1m
Description

This course introduces you to PyTorch and focuses on two main concepts: PyTorch tensors and the autograd module. We are going to get our hands dirty throughout the course, using a demo environment to explore the methodologies covered. We’ll look at the pros and cons of each method, and when they should be used.

Learning Objectives

  • Create a tensor in PyTorch
  • Understand when to use the autograd attribute
  • Create a dataset in PyTorch
  • Understand what backpropagation is and why it is important

Intended Audience

This course is intended for anyone interested in machine learning, and especially for data scientists and data engineers.

Prerequisites

To follow along with this course, you should have PyTorch version 1.5 or later.

Resources

The Python scripts used in this course can be found in the GitHub repo here: https://github.com/cloudacademy/ca-pytorch-101 

 

Transcript

Welcome back. In this lecture, we are going to look at how to create a Dataset in PyTorch, and we will do that using the Dataset class.

So first, we import torch. In PyTorch, a dataset is represented by a regular Python class that inherits from the PyTorch Dataset class. You can think of it as a kind of Python list of (features, target) tuples.

We can therefore proceed as follows: we import the Dataset class from torch.utils.data. We are going to define a standard Python class that we call CustomDataset, and that inherits from the PyTorch Dataset class. And typically, we distinguish three major methods inside this class.

First, we define the __init__ method: it takes whatever arguments are needed to build a list of tuples. So for example, in a Supervised Machine Learning problem, it may take two tensors - one for the features, denoted by X, another one for the target, denoted by y.

Then we have the __len__ method: this function simply returns the size of the whole dataset.

We can also add a docstring that specifies the objective of this method, namely, it returns the total number of samples available in our dataset.

Finally, we have the __getitem__ method: this function loads and returns a sample from the dataset at the given index idx that I have specified here. It must return a tuple (features, target) corresponding to the requested data point.

Hence it returns a tuple made of a set of features and the target at the index idx. Also, let’s specify the docstring that says it returns a sample of data at a precise index.

And that’s pretty much it! We have created a Python class made of three major elements: the __init__ method that stores the data, and two main methods: __len__, that returns the total number of samples available in our dataset, and __getitem__, that returns a sample of data at a precise index.
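The class described so far might look roughly like the sketch below. This is not the course's exact script: converting the inputs to tensors inside __init__, the float32 dtype, and the tiny hand-made `toy` data are assumptions made for illustration.

```python
import torch
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    """An in-memory dataset of (features, target) pairs."""

    def __init__(self, X, y):
        # Assumption: inputs are converted to tensors here; the course
        # script may instead store the arrays as they are.
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y)

    def __len__(self):
        """Return the total number of samples available in the dataset."""
        return len(self.X)

    def __getitem__(self, idx):
        """Return the sample of data at index idx as (features, target)."""
        return self.X[idx], self.y[idx]


# Hypothetical toy data: three samples with two features each.
toy = CustomDataset([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0, 1, 0])
n_samples = len(toy)          # calls __len__
features, target = toy[1]     # calls __getitem__
print(n_samples, features, target)
```

Because the class implements __len__ and __getitem__, it supports len() and square-bracket indexing just like a Python list.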

Ok, now let me proceed as follows. I am going to use scikit-learn to generate some toy data. In particular, I am going to use the datasets module, and I import the make_classification function. This function is pretty useful if one wants to create a supervised classification dataset. If you do not remember it, don’t worry: let’s call it via the help function. As you see, this function takes several arguments, such as the number of samples we wish to generate, and the number of features our dataset should have.

This function is going to return two arrays, namely X - the generated samples - and y - the labels.

We can create, for instance, a dataset with 1000 examples, and 5 features. This is assigned to the data and target variables.

We create a PyTorch dataset by passing the data to our custom class. So we call CustomDataset by passing the data as X and the target arrays as y. We store this into the variable custom_dataset.

We can check the length of the custom_dataset - and that is indeed 1000. Or we can access the custom dataset in position zero, and we get the 5 features and the target value for that example. To be sure that those are indeed the data in position zero, let’s check it. Cool, we are getting the same elements. Obviously the data does not contain the target, which is stored into the variable target in position zero.

We can check that we indeed have five features by accessing the custom dataset in position zero and then getting the first array and accessing its shape attribute. It should be five.
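As a rough sketch of these steps, the snippet below generates the toy data and wraps it in the custom class. The class body and the `random_state=0` seed are assumptions added so the example is self-contained and reproducible; they are not taken from the course script.

```python
import torch
from sklearn.datasets import make_classification
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    """Minimal (features, target) dataset; tensor conversion is an assumption."""

    def __init__(self, X, y):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# 1000 examples with 5 features; random_state is added for reproducibility.
data, target = make_classification(n_samples=1000, n_features=5, random_state=0)
custom_dataset = CustomDataset(X=data, y=target)

print(len(custom_dataset))           # size of the whole dataset
features, label = custom_dataset[0]  # first (features, target) tuple
print(features.shape)                # one entry per feature

# The first sample should match the raw arrays in position zero.
same_features = torch.allclose(
    features, torch.as_tensor(data[0], dtype=torch.float32)
)
same_label = label.item() == target[0]
print(same_features, same_label)
```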

This works nicely for binary classification, where each example belongs to one of just two possible classes - either zero or one. But what if we have multiple targets?

Well, to answer this question, let’s create a dataset that allows us to perform multi-label classification. To do so, we use the make_multilabel_classification function from the scikit-learn datasets module. Again, we can access help by putting a question mark before the function’s name.

Once again, we have the arguments for the number of samples and the number of features, but now we also have the number of classes and the average number of labels assigned to each sample.

Just to give you an example, a binary classification is a problem where you have to classify objects into two categories, for example, the classification of emails into spam or not spam, whether a patient is at risk of cancer or not, or whether the client is going to churn or not.

In multiclass classification, however, the number of classes is greater than two, and this is used to classify things like newspaper articles, images, and so on and so forth. In multi-label classification, each example can be assigned several of those classes at once.

So we create a multilabel classification dataset with number of samples equal to 1000, number of features equal to 5, and number of classes equal to 3.

Then, we proceed as before: we create a custom multi-label dataset by calling the CustomDataset class with the data and target we just created. We can see that, for the first example in our dataset, we have a set of features, and a set of classes (three) all with same values - in this case one.

We can also check the third example, namely the custom dataset mlb in position 2: we see that we have five features, and again three classes with the same labels but in this case they are all equal to zero.
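The multi-label case can be sketched in the same way. The variable names `data_mlb`, `target_mlb` and `custom_dataset_mlb`, the class body, and the `random_state=0` seed are assumptions for illustration; the label values you see depend on the seed, so they may differ from the ones shown in the video.

```python
import torch
from sklearn.datasets import make_multilabel_classification
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    """Minimal (features, target) dataset; tensor conversion is an assumption."""

    def __init__(self, X, y):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# 1000 examples, 5 features, 3 classes; each row of target_mlb is a
# 0/1 indicator vector with one entry per class.
data_mlb, target_mlb = make_multilabel_classification(
    n_samples=1000, n_features=5, n_classes=3, random_state=0
)
custom_dataset_mlb = CustomDataset(X=data_mlb, y=target_mlb)

features, labels = custom_dataset_mlb[2]  # third example
print(features.shape, labels.shape)
print(labels)
```

The same CustomDataset class works unchanged here: the only difference is that the target for each example is now a vector of three entries rather than a single value.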

The bottom line is that PyTorch Datasets are objects that have just one job: to return a single datapoint on request. The nice thing about it is that it allows you to build a PyTorch consistent dataset.

On the other hand, however, you need to loop over the index rows to get all possible elements inside a Dataset object. And this is not really efficient, especially with large datasets.

So far we have seen how to create a dataset in PyTorch, but now we want to understand how to load that dataset efficiently. Once we have created a dataset, we typically feed it to a model in a training phase. Hence, we have to prepare the data for the training of our model, and typically this is done using PyTorch DataLoaders.

Put in other terms, the Dataset class retrieves our dataset’s features and labels one sample at a time. When training a model, we typically want to pass samples in minibatches, reshuffle the data at every epoch to reduce model overfitting, and use Python’s multiprocessing to speed up data retrieval. Minibatches are especially useful when you want to perform gradient descent efficiently. To do so, we can use the DataLoader built-in PyTorch class, which is an iterable that abstracts this for us in an easy API.

So let me first close help: we do not need that. So first, we import DataLoader from torch.utils.data, and we initialize the DataLoader class with the following arguments. We specify the dataset, which is, in our case, the custom_dataset we used for binary classification, we specify the batch_size equal to eight, and we set shuffle to True. This argument is pretty useful during the model training phase since, when it is set to True, the data is reshuffled at every epoch, and this helps to reduce overfitting.

The DataLoader has many other arguments: you can inspect them using the help feature, and in particular, here you see that the docstring says that the argument shuffle is set to true so that the data are reshuffled at every epoch, whereas the batch size tells you how many samples per batch to load - and this is pretty important when we have memory constraints in our machine. There are many other arguments here: feel free to pause the video now and check all of them if you're curious.
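A minimal sketch of this DataLoader setup is shown below. The class body and the `random_state=0` seed are assumptions added to keep the example self-contained; only `dataset`, `batch_size` and `shuffle` are set, and every other DataLoader argument is left at its default.

```python
import torch
from sklearn.datasets import make_classification
from torch.utils.data import DataLoader, Dataset


class CustomDataset(Dataset):
    """Minimal (features, target) dataset; tensor conversion is an assumption."""

    def __init__(self, X, y):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


data, target = make_classification(n_samples=1000, n_features=5, random_state=0)
custom_dataset = CustomDataset(X=data, y=target)

# Batches of eight samples, reshuffled at the start of every epoch.
data_loader = DataLoader(custom_dataset, batch_size=8, shuffle=True)

n_batches = len(data_loader)  # number of minibatches per epoch
print(n_batches)

# One full pass (epoch) over the loader visits every sample exactly once.
n_seen = sum(len(y_batch) for X_batch, y_batch in data_loader)
print(n_seen)
```

With 1000 samples and a batch size of eight, the loader yields 125 batches per epoch, and no sample is left over.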

So we store this object into the data_loader variable, and in order to understand the importance of a data loader, take this as an example. We consider the custom_dataset in position zero, which returns the first element of our data made up of the tuple described in terms of features and target. But in this way, we are accessing the data by index, which might be a little sloppy with large datasets.

Instead, you can iterate through the dataset using the DataLoader. To do so, call the built-in Python iter function on the data_loader, and store this object into the variable data_iter.

This is an iterator of type _SingleProcessDataLoaderIter, and at each step of this iterative procedure, we are going to get eight samples at a time, instead of just one like we did with the dataset.

To check this, apply the next function to the data_iter object. Let’s store this into the variable data_02, and if we inspect this object, we get a list of two tensors, one for the feature and one for the target, so we store those from data_02. And it is easy to see that we have eight examples.
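The iteration steps just described can be sketched as follows. The class body, the `random_state=0` seed, and the names `features_batch` and `target_batch` are assumptions for illustration; the iterator type printed here is what a DataLoader returns with its default single-process setting.

```python
import torch
from sklearn.datasets import make_classification
from torch.utils.data import DataLoader, Dataset


class CustomDataset(Dataset):
    """Minimal (features, target) dataset; tensor conversion is an assumption."""

    def __init__(self, X, y):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


data, target = make_classification(n_samples=1000, n_features=5, random_state=0)
custom_dataset = CustomDataset(X=data, y=target)
data_loader = DataLoader(custom_dataset, batch_size=8, shuffle=True)

# Build an iterator over the loader and pull out the first minibatch.
data_iter = iter(data_loader)
print(type(data_iter).__name__)

data_02 = next(data_iter)
features_batch, target_batch = data_02  # one tensor per element of the tuple

print(features_batch.shape)  # eight rows, five features each
print(target_batch.shape)    # eight targets
```

In a training loop you would normally write `for features_batch, target_batch in data_loader:` rather than calling iter and next by hand; the explicit calls are used here only to inspect a single batch.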

DataLoaders are extremely useful especially when you have to perform tedious operations with tensors, and they should be used in order to speed up the model training phase when using PyTorch. 

That now concludes this lecture. In the next one, we are going to take a deeper look into the concept of backpropagation, and wrap up what we have seen on the autograd module with an example. So I’ll see you there.

About the Author
Andrea Giussani
Data Scientist

Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.

He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.