Matrix Notation
1h 45m

Learn about the importance of gradient descent and backpropagation, under the umbrella of Data and Machine Learning, from Cloud Academy.

From the internals of a neural network to solving problems with one, this course covers the essentials needed to succeed in machine learning.

Learning Objectives

  • Understand the importance of gradient descent and backpropagation
  • Be able to build your own neural network by the end of the course




Hello, and welcome to this video on matrix notation. In this video, we will introduce a matrix notation that greatly simplifies what we wrote for back-propagation, and we will rewrite back-propagation using matrices. In the last video, we calculated the back-propagation formulas for the fully connected network, and they looked quite complicated. In this video, we will arrange all the weights in a big matrix called big W, and all the deltas at layer l in a vector that we will call delta l. Doing this, the sum at each node becomes just a matrix multiplication, and we're left with a much simpler set of formulas.
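To make the idea concrete, here is a minimal sketch in Python with NumPy. It is not taken from the video: the layer sizes, the random values, and the index convention (W[j, k] is the weight from node j in layer l to node k in layer l+1) are all assumptions chosen for illustration. It compares the node-by-node sum over the next layer's deltas with a single matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)

n_l, n_next = 3, 2                        # hypothetical sizes of layer l and layer l+1
W = rng.normal(size=(n_l, n_next))        # assumed convention: W[j, k] links node j (layer l) to node k (layer l+1)
delta_next = rng.normal(size=n_next)      # made-up deltas at layer l+1

# Node-by-node form: one explicit sum per node of layer l.
s_loop = np.zeros(n_l)
for j in range(n_l):
    for k in range(n_next):
        s_loop[j] += W[j, k] * delta_next[k]

# Matrix form: the same sums written as one matrix-vector product.
s_matrix = W @ delta_next

print(np.allclose(s_loop, s_matrix))
```

With the opposite storage convention (rows indexed by the receiving node), the same product would be written with the transpose of W.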

Let's see them. In this notation, the deltas of the last layer, big L, are the element-wise product of the gradient of the cost with respect to the last-layer activation, a big L, and the derivative of the activation function at the last layer. The circled dot indicates the element-wise product of two vectors, which is also called the Hadamard product. The deltas in the inner layers are calculated with a recursive formula using the deltas of the next layer. So, the delta at layer l is equal to the product of the weight matrix connecting layer l to layer l+1 with the deltas at layer l+1, multiplied element-wise by the derivative of the activation function at layer l.
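Since the slides are not part of this transcript, the two formulas described above can be written out as follows. The symbols are assumptions: J is the cost, z at layer l is the input sum, sigma is the activation function, and the weight matrix W at layer l+1 has one row per node of layer l and one column per node of layer l+1, so that the forward pass reads as in the first line. With the opposite storage convention, W would appear transposed in the recursion.

```latex
\begin{align}
\mathbf{z}^{(l+1)} &= \left(W^{(l+1)}\right)^{\top} \mathbf{a}^{(l)} + \mathbf{b}^{(l+1)},
  \qquad \mathbf{a}^{(l+1)} = \sigma\!\left(\mathbf{z}^{(l+1)}\right) \\
\boldsymbol{\delta}^{(L)} &= \nabla_{\mathbf{a}^{(L)}} J \odot \sigma'\!\left(\mathbf{z}^{(L)}\right) \\
\boldsymbol{\delta}^{(l)} &= \left(W^{(l+1)} \boldsymbol{\delta}^{(l+1)}\right) \odot \sigma'\!\left(\mathbf{z}^{(l)}\right)
\end{align}
```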

Finally, the corrections to each weight and bias are obtained from the deltas. So, we can summarize the back-propagation algorithm in the following steps. First, we perform forward propagation: we calculate the input sum and the activation of each neuron, proceeding from input to output. Second, we calculate the error signal of the final layer, big L, by obtaining the gradient of the cost function with respect to the outputs of the network. This expression will depend on the training data and training labels, as well as on the chosen cost function, but it is well-defined for given training data and a given cost.

Third, we calculate the error signals of the neurons in each layer, going backwards from output to input, using the recursive formula for the deltas. Fourth, we calculate the derivative of the cost function with respect to the weights, using the deltas. This will be a matrix with the same shape as the weight matrix. Fifth, we calculate the derivative of the cost function with respect to the biases, also using the deltas. This will be a column vector at each layer. Finally, we update each weight and each bias according to the update rule, subtracting the gradient, scaled by the learning rate, from the current value.
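The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the code used in the course: it assumes sigmoid activations in every layer, a quadratic cost on a single training example, column vectors for activations and biases, and weight matrices with one row per input node and one column per output node. The function names, layer sizes, and learning rate are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, weights, biases):
    """Step 1: input sums z and activations a of every layer, input to output."""
    activations, zs = [x], []
    for W, b in zip(weights, biases):
        z = W.T @ activations[-1] + b      # assumed convention: W is (n_in, n_out)
        zs.append(z)
        activations.append(sigmoid(z))
    return zs, activations

def cost(x, y, weights, biases):
    """Assumed quadratic cost: 0.5 * ||a_L - y||^2 for one example."""
    _, activations = forward(x, weights, biases)
    return 0.5 * float(np.sum((activations[-1] - y) ** 2))

def backprop(x, y, weights, biases):
    zs, activations = forward(x, weights, biases)
    # Step 2: error signal of the last layer (gradient of quadratic cost is a_L - y).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    for l in reversed(range(len(weights))):
        # Steps 4 and 5: gradients with respect to weights and biases, from the deltas.
        grads_W[l] = activations[l] @ delta.T   # same shape as weights[l]
        grads_b[l] = delta                      # column vector
        if l > 0:
            # Step 3: recursive formula, moving one layer back toward the input.
            delta = (weights[l] @ delta) * sigmoid_prime(zs[l - 1])
    return grads_W, grads_b

def update(weights, biases, grads_W, grads_b, learning_rate=0.1):
    """Step 6: subtract the gradient, scaled by the learning rate."""
    new_weights = [W - learning_rate * gW for W, gW in zip(weights, grads_W)]
    new_biases = [b - learning_rate * gb for b, gb in zip(biases, grads_b)]
    return new_weights, new_biases

# Hypothetical network with 3 inputs, 4 hidden nodes, and 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=(n, 1)) for n in sizes[1:]]
x = rng.normal(size=(3, 1))
y = rng.normal(size=(2, 1))

grads_W, grads_b = backprop(x, y, weights, biases)
new_weights, new_biases = update(weights, biases, grads_W, grads_b)
print([g.shape for g in grads_W], [g.shape for g in grads_b])
```

A quick way to gain confidence in a sketch like this is to compare one entry of `grads_W` with a finite-difference estimate of the cost, obtained by nudging the corresponding weight up and down by a small amount.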

So, congratulations, you've now gone through the back-propagation algorithm, and hopefully you see that it's just a series of matrix multiplications. The bigger the network, the bigger your matrices will be, and so the larger the matrix products you'll have to compute. We'll come back to this in a few sections when we talk about GPUs. For now, again, congratulate yourselves, and rest assured that you've gone through one of the hardest parts of understanding neural networks. They will hold no more mysteries for you. Thank you for watching, and see you in the next video.

About the Author

I am a Data Science consultant and trainer. With Catalit, I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends, I train people in machine learning, deep learning, and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator, and I was Chief Data Officer and co-founder at Spire, a Y Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at the University of Padua and Université de Paris VI and graduated from the Singularity University summer program in 2011.