Gradient Descent
1h 45m

Learn about the importance of gradient descent and backpropagation, under the umbrella of Data and Machine Learning, from Cloud Academy.

From the internals of a neural net to solving problems with neural networks, this course covers the essentials needed to succeed in machine learning.

Learning Objectives

  • Understand the importance of gradient descent and backpropagation
  • Be able to build your own neural network by the end of the course




Hello, and welcome to this video on gradient descent. In this video, we will learn about the different types of gradient descent and about the concept of a batch, in particular the fact that we need to choose a batch size. How do backpropagation and gradient descent work in practice in deep learning? As we've seen, the gradient is calculated from the cost function evaluated on the training data. Here, x and y indicate a pair of training features and labels. In principle, we could feed the training data to the cost function one point at a time and, for each pair of features and labels, calculate the cost and the gradient and update the weights accordingly. So, one point goes in, we do forward propagation and backpropagation, and we update the weights. This is called stochastic gradient descent.
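The one-point-at-a-time update described above can be sketched as follows. This is a minimal illustration, not the course's own code: it assumes a linear model with a squared-error cost, and the function name `sgd_epoch`, the toy data, and the learning rate are all hypothetical choices.

```python
import numpy as np

def sgd_epoch(w, b, X, y, lr=0.05, rng=None):
    """One epoch of stochastic gradient descent.

    Assumed model (not from the video): y_hat = x @ w + b,
    with cost (y_hat - y) ** 2 for each training pair.
    """
    rng = np.random.default_rng() if rng is None else rng
    for i in rng.permutation(len(X)):   # visit every point once, in random order
        error = X[i] @ w + b - y[i]     # forward pass for a single point
        grad_w = 2.0 * error * X[i]     # gradient of the cost w.r.t. w
        grad_b = 2.0 * error            # gradient of the cost w.r.t. b
        w = w - lr * grad_w             # one weight update per training point
        b = b - lr * grad_b
    return w, b

# Hypothetical toy data generated from y = 3x + 1, with no noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 3.0 * X[:, 0] + 1.0

w, b = np.zeros(1), 0.0
for epoch in range(50):
    w, b = sgd_epoch(w, b, X, y, rng=rng)
```

With 200 training points, each epoch here performs 200 separate weight updates, one per point.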

Once our model has seen each training data point once, we say that an epoch has completed, and we start again from the first training pair in the following epoch. Stochastic gradient descent gives a very noisy estimate of the gradient, because a single training data point is used to estimate it. We can improve the estimate by averaging the gradients over the training data before we update the weights. This is how normal gradient descent works. In normal, or batch, gradient descent, we first calculate the gradient for all training pairs, and then we average the gradients to update the weights.
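As a rough sketch of the batch variant, the gradient can be averaged over the whole training set before a single update is made. The linear model, squared-error cost, function name `batch_gd_epoch`, toy data, and learning rate below are assumptions made for illustration only.

```python
import numpy as np

def batch_gd_epoch(w, b, X, y, lr=0.1):
    """One epoch of batch gradient descent: exactly one weight update.

    Assumed model (not from the video): y_hat = X @ w + b,
    with a squared-error cost averaged over all training pairs.
    """
    errors = X @ w + b - y                   # forward pass for all points at once
    grad_w = 2.0 * (X.T @ errors) / len(X)   # average of the per-point gradients
    grad_b = 2.0 * errors.mean()
    return w - lr * grad_w, b - lr * grad_b

# Hypothetical toy data generated from y = 3x + 1, with no noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 3.0 * X[:, 0] + 1.0

w, b = np.zeros(1), 0.0
for epoch in range(500):
    w, b = batch_gd_epoch(w, b, X, y)
```

Each call touches all 200 points but changes the weights only once, which is why this sketch needs many more epochs than a per-point scheme would.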

While more accurate, this method is also not optimal, since a single update requires calculating the gradients for all the training data, and we basically end up doing one weight update per epoch. A compromise solution is called mini-batch gradient descent. In this case, we still average the gradients, but only over a small number of points sampled from the training set.
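A mini-batch version might look like the sketch below. As before, the linear model, squared-error cost, function name `minibatch_gd_epoch`, toy data, and hyperparameters are illustrative assumptions rather than anything shown in the video.

```python
import numpy as np

def minibatch_gd_epoch(w, b, X, y, batch_size=32, lr=0.1, rng=None):
    """One epoch of mini-batch gradient descent.

    Assumed model (not from the video): y_hat = X @ w + b,
    with a squared-error cost averaged over each mini-batch.
    Returns the updated parameters and the number of updates made.
    """
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(X))            # shuffle once per epoch
    n_updates = 0
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]  # the last batch may be smaller
        errors = X[idx] @ w + b - y[idx]
        grad_w = 2.0 * (X[idx].T @ errors) / len(idx)  # average over the batch
        grad_b = 2.0 * errors.mean()
        w = w - lr * grad_w                    # one update per mini-batch
        b = b - lr * grad_b
        n_updates += 1
    return w, b, n_updates

# Hypothetical toy data generated from y = 3x + 1, with no noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 3.0 * X[:, 0] + 1.0

w, b = np.zeros(1), 0.0
for epoch in range(100):
    w, b, n_updates = minibatch_gd_epoch(w, b, X, y, rng=rng)
```

With 200 points and a batch size of 32, each epoch performs 7 updates: six full batches and a final batch of 8 points.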

It's common to take a power of two for the batch size, so, for example, you could take 16 points, 32 points, 64 points, et cetera. This method gives us the best of both approaches. By averaging over a few points, we get a better, less noisy estimate of the gradient, but we also do many updates per epoch, which speeds up training. In conclusion, in this video we've seen that there are three ways of doing gradient descent: normal gradient descent, stochastic gradient descent, and mini-batch gradient descent. For mini-batch, we've learned that there is a choice to be made, which is the size of the batch we use for each weight update. Thank you for watching, and see you in the next video.

About the Author

I am a Data Science consultant and trainer. With Catalit I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends I train people in machine learning, deep learning and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator and I was Chief Data Officer and co-founder at Spire, a Y-Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at University of Padua and Université de Paris VI and graduated from Singularity University summer program of 2011.