Duration: 1h 45m

Learn about the importance of gradient descent and backpropagation, under the umbrella of Data and Machine Learning, from Cloud Academy.

From the internals of a neural net to solving problems with neural networks, this course expertly covers the essentials needed to succeed in machine learning.

Learning Objectives

  • Understand the importance of gradient descent and backpropagation
  • Be able to build your own neural network by the end of the course




Hello and welcome to this video on optimizers. In this video, we will review the most commonly used minimization algorithms. These are called optimizers. In the last video we introduced the exponentially weighted moving average. Let's see how it's applied to neural networks. As you know, we use an optimizer to find the minimum value of the cost function, that is, to perform back-propagation. The only optimization algorithm we've met so far is stochastic gradient descent, or SGD. SGD needs only one hyper-parameter, the learning rate. Once we know that, we proceed in a loop by sampling a mini-batch from the training set, computing the gradients, and updating the weights by subtracting the gradient times the learning rate. Notice that the weights here are indicated with the letter theta, but nothing changes from what we've learned before. 
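The SGD update rule described above can be sketched in a few lines of NumPy. This is a minimal illustration on a toy one-dimensional cost function, theta squared; the function, step count, and learning rate are chosen for the example and are not from the course:

```python
import numpy as np

def sgd_update(theta, gradient, learning_rate=0.1):
    """One SGD step: subtract the gradient scaled by the learning rate."""
    return theta - learning_rate * gradient

# Toy cost: f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([5.0])
for _ in range(100):
    grad = 2 * theta                      # gradient at the current theta
    theta = sgd_update(theta, grad)
# theta has converged close to the minimum at 0
```

In a real network, `grad` would come from back-propagation on a mini-batch rather than a closed-form derivative.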

A first improvement of SGD is to add momentum. Momentum means that we accumulate the gradient corrections in a variable called velocity, which basically serves as a smoothed version of the gradient. Notice how similar the highlighted equation is to the exponentially weighted moving average that we've just encountered. Momentum is like saying: if I'm going down in a certain direction, then I should keep going more or less in that direction, correcting a little using the new gradients, but avoiding abrupt jumps from one update to the next. 

Nesterov momentum is similar to momentum because it also accumulates gradient updates in a velocity variable; however, those gradients are calculated using an interim update of the parameters. In other words, instead of calculating the gradients with the current value of the parameters, like in the momentum algorithm, we first perform a temporary update of the parameters using the velocity that we had previously calculated. Then we calculate the gradients at this interim point, and finally we update the parameters for real using this freshly calculated version of the gradients. SGD and SGD plus momentum keep the learning rate constant. We can improve them by allowing the learning rate to adapt to the size of the gradient. 
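The interim-update idea can be made concrete with a short sketch. Here `grad_fn` stands in for whatever computes the gradients (in practice, back-propagation); all names and constants are illustrative:

```python
import numpy as np

def nesterov_update(theta, velocity, grad_fn, learning_rate=0.1, beta=0.9):
    """Nesterov momentum: evaluate the gradient at a look-ahead (interim) point."""
    lookahead = theta + beta * velocity       # temporary update using the velocity
    grad = grad_fn(lookahead)                 # gradient at the interim point
    velocity = beta * velocity - learning_rate * grad
    theta = theta + velocity                  # the real parameter update
    return theta, velocity

grad_fn = lambda t: 2 * t                     # gradient of the toy cost theta^2
theta = np.array([5.0])
velocity = np.zeros_like(theta)
for _ in range(300):
    theta, velocity = nesterov_update(theta, velocity, grad_fn)
```

The only difference from plain momentum is where the gradient is evaluated: at the look-ahead point rather than at the current parameters.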

For example, the AdaGrad algorithm accumulates the square of the gradient into a variable and computes the update with an inverse function of that square. The result is to keep the size of the parameter updates stable, regardless of the size of the gradient itself. RMSProp is also adaptive, but it lets you choose the fraction of squared gradients to accumulate, using an EWMA decay in the accumulation formula. Finally, the Adam algorithm uses an EWMA for both the gradient and the square of the gradient. In summary, in this lecture, we've seen some of the most popular optimization algorithms. 
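As a sketch of the adaptive idea, here is the Adam update on the same toy cost. The hyper-parameter values (`beta1`, `beta2`, `eps`) are the common defaults, not figures from the course; note the two EWMAs, one for the gradient and one for its square, each with a bias correction:

```python
import numpy as np

def adam_update(theta, m, v, gradient, t, learning_rate=0.1,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: EWMA of the gradient (m) and of the squared gradient (v)."""
    m = beta1 * m + (1 - beta1) * gradient        # EWMA of the gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2   # EWMA of the squared gradient
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 501):                           # Adam's step counter starts at 1
    grad = 2 * theta
    theta, m, v = adam_update(theta, m, v, grad, t)
```

Dividing by the root of the squared-gradient average is what keeps the parameter updates at a stable size whether the raw gradients are large or small.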

Now, you're probably wondering how to choose the best one, and, unfortunately, there is no best one. Each of them performs better in some conditions. What is true, though, is that a good choice of the hyper-parameters is key for an algorithm to perform well. I encourage you to familiarize yourself with one algorithm and understand the effects of changing its hyper-parameters. In section nine, we will also learn how to automate the search for the best hyper-parameters. In conclusion, in this video, we've reviewed the most important optimizers and their differences, we've learned that some of them have an adaptive learning rate, and we've understood that it's important to choose your hyper-parameters carefully when you switch to a more complex optimizer. Thank you for watching and see you in the next video.

About the Author

I am a Data Science consultant and trainer. With Catalit I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends I train people in machine learning, deep learning and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator, and I was Chief Data Officer and co-founder at Spire, a Y-Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at the University of Padua and Université de Paris VI and graduated from the Singularity University summer program of 2011.