This course introduces you to PyTorch and focuses on two main concepts: PyTorch tensors and the autograd module. We are going to get our hands dirty throughout the course, using a demo environment to explore the methodologies covered. We’ll look at the pros and cons of each method, and when they should be used.
Learning Objectives
- Create a tensor in PyTorch
- Understand when to use the autograd package
- Create a dataset in PyTorch
- Understand what backpropagation is and why it is important
Intended Audience
This course is intended for anyone interested in machine learning, and especially for data scientists and data engineers.
Prerequisites
To follow along with this course, you should have PyTorch version 1.5 or later installed.
Resources
The Python scripts used in this course can be found in the GitHub repo here: https://github.com/cloudacademy/ca-pytorch-101
Welcome back! In this lecture, we are going to look at the autograd package in PyTorch. This package is what PyTorch uses to compute gradients, which makes it essential for model optimization, so it makes sense for us to take a look at it.
So essentially we have one question: what are gradients? To answer this, let me start with a simple example. Suppose we have an application that requires us to minimize some differentiable cost function, and suppose that this cost function is given by x squared minus five. Typically, when we want to minimize such differentiable functions, we have to compute the gradients. Gradients are nothing more than the partial derivatives of the cost function - and in our case this translates to computing the partial derivative of the loss with respect to the tensor x.
The objective of this lecture is to understand how to compute gradients in PyTorch.
We can visualize this simple operation as follows: we have some input data, which is used to compute the output of the loss function above. Without going deep into the mathematical details, it is important that you understand that such a minimization procedure has two main steps: a forward propagation step, where we compute the loss based on the inputs, and a backward propagation step, where we compute the gradient of the cost function with respect to the inputs - in our case with respect to the tensor x.
In a way, backpropagation is just a fancy name for the chain rule that you learned in high school during basic calculus class.
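To make this concrete for our loss, here is a short worked version of the chain rule. The intermediate symbol u is introduced here only for illustration; it does not appear in the lecture code.

```latex
\begin{align}
L(x) &= u - 5, \qquad u = x^2 \\
\frac{\partial L}{\partial x}
  &= \frac{\partial L}{\partial u} \cdot \frac{\partial u}{\partial x}
   = 1 \cdot 2x
   = 2x
\end{align}
```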
The gradients are therefore the partial derivatives of the loss, and PyTorch provides the autograd package, which essentially does all the steps for us. Hence, what you need to do is understand when to use it, and how to interpret the results. The dirty work is taken care of for you by the autograd module.
We import torch. The autograd package is available as torch.autograd, and it provides automatic differentiation for all operations on PyTorch tensors. It plays a key role in neural network training and, more simply, in the example described above.
But how does it work in practice?
Take a tensor made of five values. I am going to create this manually, so I call torch dot tensor and pass a list of five values. You can obviously generate them with the rand function we saw in the previous lecture, so don’t worry about the actual values I am inserting here.
I am also going to define a loss function, which is nothing more than x squared minus five. Let’s inspect the loss: it is made of 5 elements, consisting of the application of the loss function with input tensor x that we created manually.
We see that PyTorch is applying the loss function operations element-wise to tensor x, and so we can say that this is the forward propagation step, right?
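The forward step described above might look like the following minimal sketch. The five values are arbitrary placeholders chosen for illustration, since the lecture does not fix them.

```python
import torch

# Five hand-picked values; the exact numbers are an assumption for illustration.
x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])

# Forward propagation: the loss function is applied element-wise to x.
loss = x ** 2 - 5

print(loss)        # one loss value per element of x
print(loss.shape)  # five elements, same shape as x
```

At this point no gradient information is attached to the loss, because x was created with the default settings.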
Given the input, we get an output based on a loss function. Now, to tell PyTorch we wish to compute the gradients, we have to set the tensor argument requires_grad to True.
So copy and paste tensor x, and set the argument requires_grad to True - by default, it is set to False. By setting this attribute to True, every operation on the tensor is tracked by the PyTorch backend for us in the so-called computational graph.
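As a small sketch of this step, the snippet below creates the same kind of tensor with and without tracking. The values and the name y are illustrative assumptions.

```python
import torch

# Same kind of tensor as before, but now autograd tracks operations on it.
# Note: the values are floats, since only floating-point (or complex)
# tensors can require gradients.
x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0], requires_grad=True)
print(x.requires_grad)  # True

# For comparison, a tensor created with the defaults is not tracked.
y = torch.tensor([1.0, 2.0, 3.0])
print(y.requires_grad)  # False
```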
Here it’s worth giving you some technical details. By design, PyTorch uses a dynamic computational graph, which differs from the static graph that TensorFlow traditionally used in its 1.x releases. We are not going into the details of how the computational graph is built inside TensorFlow, but it is worth noting that, in PyTorch, the graph is dynamic because it is built on the fly: whenever we write an operation involving a tensor, it is executed immediately and recorded at that moment.
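One way to see what "dynamic" means is to let ordinary Python control flow decide which operations run. The sketch below is only an illustration: the function name forward and the branching rule are hypothetical and not part of the lecture.

```python
import torch

def forward(x):
    # Plain Python control flow chooses the operations at run time,
    # so the recorded graph can differ from one call to the next.
    if x.sum() > 0:
        return x ** 2 - 5
    return -x

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor([-1.0, -2.0], requires_grad=True)

out_a = forward(a)  # takes the first branch, last operation is a subtraction
out_b = forward(b)  # takes the second branch, last operation is a negation

# Each result remembers the operation that produced it.
print(out_a.grad_fn.name())
print(out_b.grad_fn.name())
```

The two outputs carry different gradient functions even though they came from the same Python function, because each graph was recorded while the code actually ran.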
We can get a better understanding of what's going on with the assistance of a Computational Graph. It helps us when thinking about mathematical expressions in intuitive ways. In our case, the Computation Graph looks like this one.
So basically, in this computational graph, the tensor x is first squared, and a subtraction operation then combines that result with the constant five to produce the loss.
Let’s compute the loss function again since now we have the autograd active.
If I now print the loss, you see that the tensor is associated with a gradient function, shown as grad_fn, and this is of type SubBackward0. What does it mean? It is the last operation stored in the computational graph - the sub stands for subtraction here. You can also access it directly through the grad_fn attribute.
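A minimal sketch of this inspection, again with placeholder values for x:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0], requires_grad=True)
loss = x ** 2 - 5

# The printed tensor now carries a grad_fn entry for the subtraction.
print(loss)

# The same node can be read directly from the attribute.
print(loss.grad_fn)

# The subtraction node points back at the nodes that feed it,
# here the power operation that squared x.
print(loss.grad_fn.next_functions)

# x itself was created by hand, so it is a leaf with no grad_fn.
print(x.grad_fn)  # None
```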
Now, to compute the gradients, all you have to do is call the backward method on the loss. Since in our case the loss is a non-scalar tensor, we must specify the gradient argument inside the backward call, which is simply a tensor with the same shape as the loss, and therefore as tensor x. For simplicity, we are going to define it as a tensor of ones.
So let us apply the backward method on the loss, and specify the gradient argument as torch dot ones, a tensor of five elements all equal to one.
With this operation, we allow the backward method to perform the vector-Jacobian product to get the gradients. If you are interested, you can get more info from the official PyTorch documentation.
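To see the vector-Jacobian product at work, the sketch below compares the result of backward with an explicitly built Jacobian. The vector v and the helper name loss_fn are hypothetical choices for illustration; the lecture itself uses a vector of ones.

```python
import torch

def loss_fn(x):
    # The same element-wise loss as in the lecture.
    return x ** 2 - 5

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0], requires_grad=True)

# A hypothetical vector with unequal entries, to make its effect visible.
v = torch.tensor([1.0, 0.5, 0.0, 2.0, 1.0])

# backward computes the product of v with the Jacobian of the loss.
loss = loss_fn(x)
loss.backward(gradient=v)

# Build the full 5 x 5 Jacobian explicitly, only for comparison.
J = torch.autograd.functional.jacobian(loss_fn, x.detach())

print(x.grad)
print(v @ J)  # same values as x.grad
```

Because the loss acts element-wise, the Jacobian is diagonal with 2 times x on the diagonal, so each gradient entry is simply scaled by the matching entry of v. With a vector of ones, the result reduces to 2 times x.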
Hence the backward method computes for us the gradient with respect to the tensor x, which is then accumulated into the grad attribute. This holds the partial derivative of the function with respect to the tensor. Let’s inspect this attribute. We have the partial derivative of the loss with respect to each of the tensor’s elements.
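Put together, the backward call and the inspection of the grad attribute might look like this sketch, with placeholder values for x:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0], requires_grad=True)
loss = x ** 2 - 5

# The loss has five elements, so backward needs a gradient argument
# of the same shape; a tensor of ones is the simplest choice.
loss.backward(gradient=torch.ones(5))

# The gradients are stored on the input tensor, not on the loss.
print(x.grad)  # 2 * x, element by element
```

Summing the loss to a scalar first and calling backward without arguments, as in (x ** 2 - 5).sum().backward(), would give the same gradients here.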
To check PyTorch is correctly computing the gradients, we can explicitly compute the algebraic derivative of the loss with respect to x.
This is a simple check, and you do not need to be a strong mathematician to understand it. Just keep in mind that PyTorch is automatically computing the partial derivative of the loss with respect to tensor x for us, and this is nothing more than two times x.
So if we explicitly compute 2 times x we get exactly the same results as the one we got with the backward method.
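This check can be written in a couple of lines. The name analytic is introduced here for illustration only.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0], requires_grad=True)
loss = x ** 2 - 5
loss.backward(gradient=torch.ones_like(x))

# The derivative of x squared minus five, worked out by hand, is 2 * x.
# detach() keeps this check itself out of the computational graph.
analytic = 2 * x.detach()

print(x.grad)
print(analytic)
print(torch.allclose(x.grad, analytic))  # True
```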
The autograd package keeps track of all operations on tensors, along with the resulting new tensors, in a directed acyclic graph (or DAG). In this DAG, leaves are the input tensors and roots are the output tensors, as you can see in the previous graph. This is a really nice feature, but tracking every operation costs memory and computation, so it is not something you want everywhere, especially when your input data or the structure of the computational graph is large. It is therefore good practice to prevent PyTorch from recording history in the DAG for operations that do not need gradients.
An example is when you are training a neural network. Typically, after a backpropagation step, you update the weights of the network, and this update should not itself be part of the gradient computation. So the question now is: how can we prevent PyTorch from keeping the history of gradients?
There are different ways, and in this lecture, we are going to look at two of them.
First, you can apply the detach method to a tensor. Detach creates a new tensor with the same values that does not require gradients. This means that when we compute a loss from the detached tensor, PyTorch will not create a gradient function to be stored inside the DAG.
I am going to create a two by two tensor using the rand method, and I store this into the variable a. I also set the argument requires_grad to True. In this way, I am explicitly telling PyTorch to keep track of the gradients in the DAG.
Then I create another tensor, say b, which is the detached version of a. By detaching, we are preventing PyTorch from keeping track of the operations on b inside the DAG, as you can see here by accessing the attribute requires_grad. When this is False, PyTorch will not create a gradient function to be stored inside the DAG.
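The detach steps just described can be sketched as follows. The values are random, so only the flags are worth looking at.

```python
import torch

# A 2 x 2 random tensor that autograd tracks.
a = torch.rand(2, 2, requires_grad=True)

# b holds the same values as a, but is cut off from the graph.
b = a.detach()

print(a.requires_grad)  # True
print(b.requires_grad)  # False

# Operations on a are recorded; operations on b are not.
print((a ** 2).grad_fn)  # a backward node for the power operation
print((b ** 2).grad_fn)  # None
```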
An alternative solution, which is pretty common in practice, is to wrap a tensor operation in the no_grad context manager.
Let’s create the tensor c, again with requires_grad set to True, and then wrap the operation on c in a with torch.no_grad() block if we do not want to store the gradient function inside the DAG.
For example, when we compute c squared inside the wrapper, we see that the requires_grad attribute of the result is False. There are many other ways to do that - the two I have just shown are the most popular in practice, and in my opinion the no_grad wrapper is the most common one.
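A minimal sketch of the no_grad wrapper, assuming c is created with requires_grad set to True so that the difference is visible. The names d and e are illustrative.

```python
import torch

c = torch.rand(2, 2, requires_grad=True)

# Inside the block, operations are executed but not recorded.
with torch.no_grad():
    d = c ** 2

print(d.requires_grad)  # False
print(d.grad_fn)        # None

# Outside the block, the same operation is tracked again.
e = c ** 2
print(e.requires_grad)  # True
```

Note that c itself still requires gradients after the block; only the results computed inside it are untracked.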
If you are enjoying this course, and you believe PyTorch will be useful to you when performing your daily tasks, then you might start reading the code of other projects, and I am pretty sure that sooner or later you will encounter the no_grad wrapper many times.
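One place where the no_grad wrapper commonly shows up is the update step of a hand-written gradient descent loop. The sketch below minimizes the lecture's cost function; summing the loss to a scalar, the learning rate of 0.1, and the 50 iterations are all assumptions made for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0], requires_grad=True)
learning_rate = 0.1  # hypothetical value

for step in range(50):
    # Forward step: sum the element-wise loss to get a scalar,
    # so backward needs no gradient argument.
    loss = (x ** 2 - 5).sum()

    # Backward step: fills x.grad with 2 * x.
    loss.backward()

    # The update itself must not be recorded in the graph.
    with torch.no_grad():
        x -= learning_rate * x.grad

    # Gradients accumulate across backward calls, so reset them.
    x.grad.zero_()

print(x)  # every element has moved close to 0, where the cost is smallest
```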
Ok, that concludes this lecture. We’ve seen the importance of the autograd package in PyTorch. In the next lecture, we are going to investigate how to create and load a dataset in PyTorch. See you there!
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.