LSTM and GRU
Recurrent Neural Networks
From the internals of a neural net to solving problems with neural networks, this course covers the essentials needed to succeed in machine learning.
This course moves on from cloud computing power and covers Recurrent Neural Networks. Learn how to use recurrent neural networks to train more complex models.
Understand how models are built to handle data that comes in sequences. Examples include unstructured text, music, and even movies.
This course consists of 9 lectures with 2 accompanying exercises.
- Understand how recurrent neural network models are built
- Learn the various applications of recurrent neural networks
- It is recommended to complete the Introduction to Data and Machine Learning course before starting.
Hello and welcome to this video on Long Short-Term Memory networks and Gated Recurrent Units, LSTMs and GRUs for short. In this video, we will talk about two types of recurrent neural network that largely avoid the problem of vanishing gradients. As you've seen, the vanilla implementation of a recurrent neural network, where the output is fed back into the input, suffers from the fundamental problem of vanishing gradients, so it is not able to capture long-term dependencies in a sequence. This is a problem because we would like our RNN to be able to analyze text and answer questions, which involves keeping track of long sequences of words. A brilliant scheme to solve this problem was proposed in the late '90s, and it's called the Long Short-Term Memory network.
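To make the vanishing gradient problem concrete, here is a minimal sketch, assuming a scalar vanilla RNN of the form h_t = tanh(w * h_{t-1} + u * x_t). The function name, the weights w and u, and the input values are illustrative choices, not part of the course material. The sketch multiplies the per-step derivatives together, which is what backpropagation through time does, and shows how quickly the product shrinks as the sequence gets longer.

```python
import math

def gradient_through_time(w, inputs, h0=0.0, u=1.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w*h_{t-1} + u*x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + u * x)
        # Chain rule: each step contributes a factor w * tanh'(.) = w * (1 - h^2),
        # whose absolute value is at most |w|.
        grad *= w * (1.0 - h * h)
    return grad

# Same recurrent weight, two sequence lengths (hypothetical values).
short_grad = gradient_through_time(0.9, [0.5] * 5)
long_grad = gradient_through_time(0.9, [0.5] * 50)
print(abs(short_grad), abs(long_grad))  # the second number is far smaller
```

Because every factor has magnitude below one here, the influence of an early step on a late output decays geometrically with the distance between them.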
This network is organized in cells, each of which includes several operations. Let's look at them in detail. The first difference from the vanilla RNN is the presence of an internal state variable. This is passed from one cell to the next and it's modified by operations called gates. The first gate is called the forget gate. It's a sigmoid layer that takes the output at time t minus one and the current input at time t, concatenates them into a single tensor, and then applies a linear transformation followed by a sigmoid. Because of the sigmoid, the output of this gate is a number between zero and one. This number multiplies the internal state, and this is why the gate is called a forget gate. If f sub t is zero, the previous internal state will be completely forgotten, while if it's one, it will be passed through unaltered.
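The forget gate described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the names forget_gate, W_f and b_f, the sizes (two state entries, one input feature), and the numeric values are all hypothetical. The weights are set to zero so that the bias alone decides the gate value, which makes the two extreme cases easy to see.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(W_f, b_f, h_prev, x_t):
    """f_t = sigmoid(W_f . [h_prev, x_t] + b_f), one value per state entry."""
    concat = h_prev + x_t  # list concatenation of previous output and input
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_f, b_f)
    ]

# Hypothetical sizes: 2 state entries, 1 input feature.
h_prev, x_t = [0.5, -0.5], [1.0]
c_prev = [3.0, -2.0]
W_f = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # zero weights, so the bias decides

f_t = forget_gate(W_f, [20.0, -20.0], h_prev, x_t)
kept = [f * c for f, c in zip(f_t, c_prev)]
print(f_t)   # first entry close to 1, second entry close to 0
print(kept)  # first state entry kept, second almost erased
```

In a trained network the weights are not zero, so the gate value depends on the previous output and the current input rather than on the bias alone.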
The second gate is the input gate. The input gate takes the previous output and the new input and passes them through another sigmoid layer, very similar to the forget gate. Like in the previous case, this gate returns a value between zero and one. The value of the input gate is multiplied with the output of the candidate layer. This layer applies a hyperbolic tangent to the mix of input and previous output, returning a candidate vector to be added to the internal state. For example, if we are building a language model, this gate would control which new relevant features to include in the internal state. The internal state is updated with this rule: the previous state is multiplied by the forget gate and then added to the fraction of the new candidate allowed by the input gate. Finally, we have an output gate.
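The update rule for the internal state can be written as c_t = f_t * c_{t-1} + i_t * candidate_t, applied entry by entry. A minimal sketch follows, assuming the gate values and the candidate have already been computed. The function name update_state and the numbers are illustrative only.

```python
def update_state(c_prev, f_t, i_t, candidate):
    """Elementwise LSTM state update: keep a fraction f_t of the old state
    and add a fraction i_t of the new candidate."""
    return [f * c + i * g for c, f, i, g in zip(c_prev, f_t, i_t, candidate)]

c_prev = [2.0, -1.0]
f_t = [1.0, 0.0]        # keep the first entry, forget the second
i_t = [0.0, 1.0]        # ignore the first candidate, accept the second
candidate = [0.5, 0.5]  # output of the tanh candidate layer

c_t = update_state(c_prev, f_t, i_t, candidate)
print(c_t)  # [2.0, 0.5]
```

The first entry survives untouched because its forget gate is one and its input gate is zero. The second entry is replaced by the candidate because the gate values are reversed.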
This gate controls how much of the internal state is passed through to the output, and it works in a similar way to the other gates. So let's recap how the LSTM works. It has three gates and they all work in the same way. They take the previous output and the current input, apply a linear transformation, and then a sigmoid activation function. Since these three gates have independent weights and biases, the network will learn how much of the previous internal state to keep, how much of the new candidate to let in, and how much of the internal state to send out to the output. The other three components of the LSTM unit are the internal state, which is passed from one iteration to the next as on a conveyor belt, the tanh layer, which generates the candidate to add to the internal state, and the tanh transformation of the internal state before it is multiplied by the output gate. Pretty simple. This formulation of a recurrent neural network is great because it largely avoids the vanishing gradient problem.
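Putting the recap together, one step of an LSTM cell can be sketched as below. This is a plain-Python illustration under stated assumptions, not a library implementation: the names lstm_step, linear, and the parameter keys W_f, W_i, W_o, W_c (with matching biases) are hypothetical, and the weights are small constants rather than learned values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(W, b, v):
    """Matrix-vector product plus bias, on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM step: returns the new output h_t and internal state c_t."""
    v = h_prev + x_t  # concatenate previous output and current input
    f = [sigmoid(z) for z in linear(params["W_f"], params["b_f"], v)]    # forget gate
    i = [sigmoid(z) for z in linear(params["W_i"], params["b_i"], v)]    # input gate
    o = [sigmoid(z) for z in linear(params["W_o"], params["b_o"], v)]    # output gate
    g = [math.tanh(z) for z in linear(params["W_c"], params["b_c"], v)]  # candidate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Hypothetical sizes and constant weights, for illustration only.
hidden, n_in = 2, 1

def constant_matrix(value):
    return [[value] * (hidden + n_in) for _ in range(hidden)]

params = {
    "W_f": constant_matrix(0.1), "b_f": [1.0] * hidden,
    "W_i": constant_matrix(0.2), "b_i": [0.0] * hidden,
    "W_o": constant_matrix(0.3), "b_o": [0.0] * hidden,
    "W_c": constant_matrix(0.4), "b_c": [0.0] * hidden,
}

h, c = [0.0] * hidden, [0.0] * hidden
for x in [0.5, -0.2, 0.8]:  # a short input sequence
    h, c = lstm_step(params, h, c, [x])
print(h, c)
```

Both h and c are carried from one step to the next, which is the "two separate variables" that the GRU later merges into one.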
And therefore, it can be used to approach more complex problems, like question answering for example. Finally, we mention a simpler version of the LSTM called GRU, or Gated Recurrent Unit. This unit is similar to the LSTM, but it simplifies it in several ways. First, it doesn't pass along two separate variables, the internal state and the output. It only passes the output to the next iteration. Second, it only has two gates instead of three. The first gate controls the mixing of the previous output with the current input, and the mix is fed to a tanh layer to produce a candidate output. The other gate controls the mixing of the previous output with this new candidate output. Look closely at the last formula in the last line. Doesn't it look familiar? Yes, it's the EWMA again. It's the Exponentially Weighted Moving Average. The GRU applies the EWMA to filter the raw output, h tilde sub t, with a fraction z sub t that is learned from the training set. In conclusion, in this video, we've introduced the LSTM, the Long Short-Term Memory network, and explained how it can learn to selectively remember and forget past information. We've also presented the GRU, which is a simpler version of the same type of unit. Thank you for watching and see you in the next video.
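The GRU step, including the EWMA-style final line, can be sketched as follows. Several things here are assumptions: the names gru_step, W_r, W_z, W_h, the gate names "reset" and "update", and in particular the convention h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t. Some references and libraries swap the roles of z_t and 1 - z_t, so check the formula used by your framework.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(W, b, v):
    """Matrix-vector product plus bias, on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def gru_step(params, h_prev, x_t):
    """One GRU step: only the output h is passed to the next iteration."""
    v = h_prev + x_t  # concatenate previous output and current input
    r = [sigmoid(a) for a in linear(params["W_r"], params["b_r"], v)]  # reset gate
    z = [sigmoid(a) for a in linear(params["W_z"], params["b_z"], v)]  # update gate
    # The reset gate scales the previous output before it is mixed with the input.
    gated = [rk * hk for rk, hk in zip(r, h_prev)] + x_t
    h_tilde = [math.tanh(a) for a in linear(params["W_h"], params["b_h"], gated)]
    # EWMA-style blend of the old output and the candidate (assumed convention).
    return [(1.0 - zk) * hk + zk * htk for zk, hk, htk in zip(z, h_prev, h_tilde)]

# Hypothetical sizes: 2 outputs, 1 input feature, all parameters zero.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
params = {
    "W_r": zeros, "b_r": [0.0, 0.0],
    "W_z": zeros, "b_z": [0.0, 0.0],
    "W_h": zeros, "b_h": [0.0, 0.0],
}

# With zero parameters z = 0.5 and h_tilde = 0, so the output is halved.
h = gru_step(params, [1.0, -2.0], [0.3])
print(h)  # [0.5, -1.0]
```

With a constant z this last line is exactly an exponentially weighted moving average of the candidates. The difference in the GRU is that z is recomputed at every step from the previous output and the current input.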
About the Author
I am a Data Science consultant and trainer. With Catalit I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends I train people in machine learning, deep learning and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator and I was Chief Data Officer and co-founder at Spire, a Y-Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at University of Padua and Université de Paris VI and graduated from Singularity University summer program of 2011.