# Backpropagation in PyTorch

###### PyTorch 101

This course introduces you to PyTorch and focuses on two main concepts: PyTorch tensors and the autograd module. We are going to get our hands dirty throughout the course, using a demo environment to explore the methodologies covered. We’ll look at the pros and cons of each method, and when they should be used.

### Learning Objectives

- Create a tensor in PyTorch
- Understand when to use the autograd attribute
- Create a dataset in PyTorch
- Understand what backpropagation is and why it is important

### Intended Audience

This course is intended for anyone interested in machine learning, and especially for data scientists and data engineers.

### Prerequisites

To follow along with this course, you should have PyTorch version 1.5 or later.

### Resources

The Python scripts used in this course can be found in the GitHub repo here: https://github.com/cloudacademy/ca-pytorch-101

Welcome back! In this lecture, we are going to look at a very technical concept that is important if you want to work with PyTorch, especially when training a neural network model: backward propagation, or backpropagation. To keep the explanation simple, let’s suppose we have a composition of two functions, described by w equal to g of f of x. Technically speaking, such a composition requires the image of f to be contained in the domain of g. When training a neural network, we want to minimize this kind of (loss) function.

We have seen that we can do this by computing gradients, namely by computing the derivative of the loss with respect to the tensor x. In our case, if we call y the output of f of x and we want to compute the derivative of the function w with respect to the tensor x, the chain rule turns this into the product of the derivative of w with respect to y and the derivative of y with respect to x.
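Written as a formula, and using the intermediate name y for the output of f, the chain rule described above reads:

$$
y = f(x), \qquad w = g(y), \qquad \frac{\partial w}{\partial x} = \frac{\partial w}{\partial y} \cdot \frac{\partial y}{\partial x}
$$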

In order to better understand this framework, let’s look at the following diagram. You have an input x, which is mapped into y through the function f; and then, if you apply the function g to the output tensor y, you get w. But this operation is nothing more than the composition of two functions, f and g, applied to the input x.

So we have two ways of defining the same operation. Earlier, we defined the derivative of w with respect to x as the product of two derivatives: the derivative of w with respect to y, where y in our case is equal to f of x, and the derivative of y with respect to x. In other words, we are going backward through the composition: from w to y, and then from y to x.

In a nutshell, we have two tensors, namely x and y, and w is nothing more than the application of a differentiable function to y. It should be easy to see that we have expressed this complicated concept in terms of a Computational Graph: nice, isn't it? We are not going into the mathematical details of such an operation, but it is important that we understand the concept of local gradients before moving on.

So, what is a local gradient? Well, if you remember from previous lectures, every time you compute a gradient you are implicitly performing an operation on the computational graph.

In our example, this translates into the multiplication of two tensors, which means that we can compute independently the partial derivative of w with respect to y and the partial derivative of y with respect to x. Those two quantities are called local gradients, and they play a key role in backpropagation. Put in simpler terms, a local gradient is the partial derivative of a single operation in the computational graph with respect to its own input.
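As a minimal sketch of the idea of local gradients, the snippet below picks two concrete functions, f(x) = x squared and g(y) = 3y. These functions and the input value 2.0 are assumptions made for illustration only; they do not come from the lecture. It uses `torch.autograd.grad` to compute each local gradient separately, and then checks that their product matches the gradient of w with respect to x.

```python
import torch

# Illustrative choices (not from the lecture): f(x) = x**2 and g(y) = 3*y.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2   # y = f(x)
w = 3 * y    # w = g(y)

# Local gradients, one per operation in the graph.
# retain_graph=True keeps the graph alive for the next call.
(dw_dy,) = torch.autograd.grad(w, y, retain_graph=True)  # derivative of 3*y is 3
(dy_dx,) = torch.autograd.grad(y, x, retain_graph=True)  # derivative of x**2 is 2*x

# Gradient of the whole composition, computed by autograd in one go.
(dw_dx,) = torch.autograd.grad(w, x)

print(dw_dy.item(), dy_dx.item(), dw_dx.item())  # 3.0 4.0 12.0
print(torch.isclose(dw_dy * dy_dx, dw_dx).item())  # True
```

The product of the two local gradients (3 times 4) equals the full gradient (12), which is exactly what the chain rule states.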

Ok, now let’s get our hands dirty, and let’s consider the following Computational Graph. This is the Computational Graph of an Ordinary Least Squares model. Why is that? Well, we have two inputs, a tensor x and a tensor w denoting the weights of our model, and the predicted value is obtained as a linear combination of those weights and x, which typically denotes the set of features. And then we define a loss function to evaluate the model performance. If you remember from basic statistics, in an ordinary least squares model, we minimize the loss function defined as the sum of squared errors. Here we are splitting the loss computation into two distinct steps: we first define the difference between the prediction and the actual value, and then we square that tensor to get the final loss.

So the question is: how do we minimize such a loss?

Well, assuming we have performed the necessary forward step - namely we have computed the loss from the input tensors - we then need to calculate the local gradients at each node inside the computational graph.

Notably, we are going to compute the partial derivative of the loss with respect to l, where l is the difference between y hat and y; we then compute the partial derivative of l with respect to y hat, and finally we compute the partial derivative of y hat with respect to w. Those partial derivatives are then used to get the partial derivative of the loss with respect to the input w via backpropagation - or better, by a simple application of the chain rule.
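To make the three local gradients concrete, here is a worked version of that chain rule for the scalar case, writing the prediction as y hat = w x, the difference as l = y hat - y, and the loss as l squared (the symbol names l and loss are used here only as labels for the nodes of the graph):

$$
\frac{\partial\, \text{loss}}{\partial w}
= \frac{\partial\, \text{loss}}{\partial l} \cdot \frac{\partial l}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w}
= 2l \cdot 1 \cdot x
= 2\,(wx - y)\,x
$$

With the scalar values used in the example of this lecture (x = 1, y = 2, w = 1), this gives 2 (1 - 2) 1 = -2.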

Please note that we do not need to compute the partial derivative of l with respect to y and the partial derivative of y hat with respect to x, because x and y are fixed tensors in a typical OLS application. Put in other terms, x and y are given, and so when minimizing a loss function, what changes is the set of parameters we wish to estimate - in our case the weights w - and not the features. That’s why we do not compute the local gradient of y hat with respect to x.
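In PyTorch terms, this distinction shows up in the `requires_grad` flag. The sketch below is only an illustration, using the same scalar values that appear in the example of this lecture: tensors created without `requires_grad=True` are treated as fixed data, so no gradient is stored for them after the backward pass.

```python
import torch

x = torch.tensor(1.0)                      # feature: given, no gradient tracked
y = torch.tensor(2.0)                      # target: given, no gradient tracked
w = torch.tensor(1.0, requires_grad=True)  # parameter we wish to estimate

loss = (w * x - y) ** 2
loss.backward()

print(x.requires_grad, w.requires_grad)  # False True
print(x.grad is None)                    # True: nothing was computed for x
print(w.grad is not None)                # True: the gradient for w is available
```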

This was the technical part. So let’s get our hands dirty and wrap up everything we have learnt on backpropagation using PyTorch here in this example.

So we import torch and we create two tensors, one for x and one for y. We are going to use scalars for simplicity here, say 1.0 for x and 2.0 for y.

Then, we define the parameter we wish to optimize - in our case denoted by w - and this is a torch tensor with value 1.0, and we set requires_grad equal to True since we want to compute the gradients. Remember, we previously said that during the minimization phase we have a necessary forward pass, so we have to define two quantities: first, y hat, which is the result of the multiplication of the weights with the tensor x, and then the loss function, which is the square of the difference between y hat and y.

We can print the output of the forward step by printing the loss we just computed. Does this result make sense? Well, yes. From an algebraic point of view, we see that x is 1 and w is 1 as well. Their product is therefore 1, right? Subtracting y, which is 2, gives -1, and the square of -1 is 1, which is the loss output.
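The forward step just described can be sketched as follows. The variable names `y_hat`, `l` and `loss` are chosen here to mirror the nodes of the computational graph; the actual course script may name them differently.

```python
import torch

x = torch.tensor(1.0)                      # feature
y = torch.tensor(2.0)                      # actual value
w = torch.tensor(1.0, requires_grad=True)  # weight to optimize

# Forward pass, split into the same steps as the computational graph.
y_hat = w * x   # prediction
l = y_hat - y   # difference between prediction and actual value
loss = l ** 2   # squared error

print(y_hat.item(), l.item(), loss.item())  # 1.0 -1.0 1.0
print(loss.requires_grad)                   # True: loss depends on w
```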

So everything is fine. But remember, a minimization process requires a forward step and then a backward step. To perform backpropagation, we apply the backward function on the loss. Please revisit Lecture 4 for details on this function if you do not remember it.

After the backward pass has computed the gradient of the loss with respect to w, we print that gradient, since we are interested in optimizing this set of weights. To do so, we print w dot grad, which is another tensor holding the gradient of the loss, a scalar value, with respect to w. This is -2.
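As a hedged sketch of this backward step, the snippet below repeats the forward pass, calls `backward()` on the loss, and compares `w.grad` with the value obtained by applying the chain rule by hand. The name `manual_grad` is introduced here purely for illustration.

```python
import torch

x = torch.tensor(1.0)
y = torch.tensor(2.0)
w = torch.tensor(1.0, requires_grad=True)

# Forward pass.
loss = (w * x - y) ** 2

# Backward pass: fills w.grad with d(loss)/dw.
loss.backward()

# Same quantity from the chain rule, 2 * (w*x - y) * x, computed by hand.
manual_grad = (2 * (w.detach() * x - y) * x).item()

print(w.grad.item())  # -2.0
print(manual_grad)    # -2.0
```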

After having updated the weights based on a learning rate, we continue the minimization phase with a new forward and backward step until a minimum is reached.

We continue updating the weights, but please remember that the update operation should not be part of the computational graph. But what does that mean? Simply put, it means that autograd must not track the update itself, so we wrap the update operation inside a torch.no_grad() block. In addition, because PyTorch accumulates gradients across backward calls, we have to set the gradient to zero after having updated the weights (at each epoch or step). Inside the no_grad block, we specify that the weights are updated by subtracting the product between a learning rate - here specified as 0.01 - and the gradient of w. We're not going to cover how to optimize the learning rate in this course, but please consider that this is a hyperparameter that has to be tuned. We print the updated weight at each step, as well as the gradient of w.

Practically speaking, since we are updating the weight at each step with a learning rate of 0.01, we expect the printed weight to be 1.02. And since we are setting the gradient to zero at each step, so that it does not accumulate, I expect the printed gradient to be -2, as we computed before in the first step.

A simple run confirms our expectations: the updated weight is 1.02 after one epoch, thanks to the fact that we have a learning rate of 0.01, and the gradient is -2.
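A single update step along these lines might look like the sketch below. It assumes a plain gradient descent update (subtracting the learning rate times the gradient), which is consistent with the values 1.02 and -2 quoted above; the exact layout of the course script may differ. The weight is rounded before printing because a 32-bit float cannot represent 1.02 exactly.

```python
import torch

x = torch.tensor(1.0)
y = torch.tensor(2.0)
w = torch.tensor(1.0, requires_grad=True)
learning_rate = 0.01

# Forward and backward step.
loss = (w * x - y) ** 2
loss.backward()

# The update must not be tracked by autograd, hence torch.no_grad().
with torch.no_grad():
    w -= learning_rate * w.grad

updated_w = round(w.item(), 4)
grad_w = w.grad.item()
print(updated_w, grad_w)  # 1.02 -2.0

# Reset the gradient so that the next backward pass does not accumulate into it.
w.grad.zero_()
```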

Obviously, in a Neural Network training phase, you perform several forward and backward passes. But the logic remains the same as the one described here: you have an update on the weights, based on the learning rate and on the gradient you just computed, and thanks to the no_grad block, the weight update is not part of the computational graph.
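To illustrate how the same logic repeats over several passes, here is a small loop built on the scalar example. The number of steps (200) is a hypothetical choice for this sketch and is not taken from the course. Since x is 1 and y is 2, the loss is minimized at w equal to 2, so the weight should drift from 1 toward 2.

```python
import torch

x = torch.tensor(1.0)
y = torch.tensor(2.0)
w = torch.tensor(1.0, requires_grad=True)
learning_rate = 0.01
n_steps = 200  # hypothetical number of steps, chosen for illustration

for step in range(n_steps):
    # Forward step: compute the loss.
    loss = (w * x - y) ** 2
    # Backward step: compute d(loss)/dw.
    loss.backward()
    # Update outside the computational graph.
    with torch.no_grad():
        w -= learning_rate * w.grad
    # Reset the gradient before the next step.
    w.grad.zero_()

print(w.item(), loss.item())  # w close to 2, loss close to 0
```

With such a small learning rate the weight approaches 2 slowly, which is one reason the learning rate is treated as a hyperparameter to tune.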

That concludes the Lecture on Backpropagation, where we put into practice what we learnt about the autograd package in PyTorch, with a real application related to Ordinary Least Squares. Thanks for watching!

Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.

He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.