A Comparison between RNN, LSTM, and GRU

A note on the basic designs of recurrent neural networks

Rice Yang
5 min read · Dec 7, 2022
Source: https://www.pinterest.com/pin/642114859350713349/

This article explains the basics of the RNN and its more advanced variants, LSTM and GRU, and compares these designs. At the end of the article, we discuss which design to choose for a practical application.

Source: RNN, LSTM & GRU

Recurrent Neural Network: RNN

First of all, the RNN is a kind of neural network that accepts sequence input of variable length, e.g., a soundtrack, a video, or an article. The length of its output can also vary. The defining feature of the RNN is its hidden vector, which is not a learned parameter but a state that is passed along the data sequence. The hidden vector lets the RNN combine past information with the current input to predict future information.

An RNN is similar to an ordinary feed-forward neural network. The critical difference is that the RNN takes a hidden vector as an additional input and updates it iteratively through the timesteps of the sequence.

Source: RNN, LSTM & GRU

The symbol h(t) denotes the hidden vector at timestep t. The update formula for h(t) can be understood as a linear combination of the input x(t) and the hidden vector from the previous timestep h(t-1), passed through a sigmoid activation function. Likewise, the output y(t) is a linear transformation of h(t) followed by a sigmoid function.

For the LSTM and GRU that follow, the output is computed the same way as in the RNN: by a classic MLP layer, that is, a linear transformation followed by a nonlinear operation, applied to the hidden vector.
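The RNN update described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the figure's exact formulation: the weight names (W_x, W_h, W_y), the bias terms, and the layer sizes are assumptions chosen for the example, and the sigmoid follows the description in the text (tanh is another common choice for the hidden update).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_step(x_t, h_prev, W_x, W_h, b_h, W_y, b_y):
    """One vanilla RNN timestep (hypothetical parameter names)."""
    # New hidden vector: linear combination of x(t) and h(t-1), then sigmoid.
    h_t = sigmoid(W_x @ x_t + W_h @ h_prev + b_h)
    # Output: linear transformation of h(t), then sigmoid.
    y_t = sigmoid(W_y @ h_t + b_y)
    return h_t, y_t

# Illustrative sizes and random weights, not taken from the article.
rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_y = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_y = np.zeros(output_size)

# The same weights are reused at every timestep; only h changes.
sequence = rng.normal(size=(5, input_size))
h = np.zeros(hidden_size)
outputs = []
for x_t in sequence:
    h, y = rnn_step(x_t, h, W_x, W_h, b_h, W_y, b_y)
    outputs.append(y)
outputs = np.stack(outputs)
```

Because the loop simply runs once per element, the same function handles sequences of any length.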

Long Short-Term Memory: LSTM

Source: RNN, LSTM & GRU

The LSTM is more complex than the RNN. First of all, it introduces the cell vector C to mitigate the vanishing gradient issue.

The cell vector can be understood as another hidden vector with a different purpose. As the 5th row shows, C(t) is updated as a gate-weighted linear combination of the cell vector from the previous timestep and a candidate computed at the current timestep. Because this update is additive and C(t-1) does not pass through a sigmoid, the derivative of C(t) with respect to C(t-1) is simply a gate value, so the gradient can propagate without being repeatedly shrunk by the sigmoid.

Once the cell vector has addressed the vanishing gradient issue, the hidden vector can be updated more easily from the cell vector, according to the 6th formula above.
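The gradient argument can be written out explicitly. The equations below use the standard LSTM formulation as an assumption, since the figure is not reproduced here: C_t stands for C(t), \tilde{C}_t is the candidate cell vector, a_t is the pre-activation of the RNN hidden update, and the gates are treated as fixed when differentiating (only the direct path from C(t-1) to C(t) is considered).

```latex
\[
\begin{aligned}
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
\frac{\partial C_t}{\partial C_{t-1}} &= \operatorname{diag}(f_t) \\
\text{vanilla RNN, for comparison:}\quad
\frac{\partial h_t}{\partial h_{t-1}} &= \operatorname{diag}\!\big(\sigma'(a_t)\big)\, W_h,
\qquad \sigma'(a) \le \tfrac{1}{4}
\end{aligned}
\]
```

In the RNN every step multiplies the gradient by a sigmoid derivative of at most 1/4, whereas the cell path multiplies it only by the forget gate, which the network can learn to keep close to 1.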

Note that f(t), i(t), and o(t) stand for the forget gate, input gate, and output gate, respectively. A gate is a weight vector whose elements lie in the range [0, 1]. It works like a physical gate, deciding which elements should be kept and which should not. All the gates are calculated from the hidden vector h(t-1) and the input data x(t), followed by a sigmoid activation.

The name LSTM comes from a theory of human memory: long-term and short-term memory. The cell vector acts as long-term memory, since its gradient passes through the data sequence more easily without vanishing. The hidden vector acts as short-term memory and is extracted from the long-term memory. Which parts of memory are remembered or forgotten is decided by the gates, which are generated from the previous short-term memory and the current input data.
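The gate mechanics can be sketched as a single LSTM timestep in NumPy. This follows the standard formulation and is only an assumption about what the figure shows. The dictionary keys ("f", "i", "o", "c"), the helper init_params, and the sizes are hypothetical names introduced for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(keys, input_size, hidden_size, rng):
    """Hypothetical helper: one (W, U, b) triple per gate or candidate."""
    W = {k: rng.normal(scale=0.1, size=(hidden_size, input_size)) for k in keys}
    U = {k: rng.normal(scale=0.1, size=(hidden_size, hidden_size)) for k in keys}
    b = {k: np.zeros(hidden_size) for k in keys}
    return W, U, b

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep in the standard formulation."""
    # Gates: computed from x(t) and h(t-1), squashed into (0, 1) by a sigmoid.
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    # Candidate cell content for the current timestep.
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    # Cell update: gate-weighted sum of old cell and new candidate.
    c_t = f_t * c_prev + i_t * c_tilde
    # Hidden vector: read out of the cell through the output gate.
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W, U, b = init_params("fioc", input_size, hidden_size, rng)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Note that c_prev reaches c_t only through an elementwise multiplication by f_t, which is the additive path that helps the gradient.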

Gated Recurrent Unit: GRU

We have explained the meaning of a gate in the LSTM. The GRU uses fewer gates and less “memory” to improve on the computational efficiency of the LSTM.

Source: RNN, LSTM & GRU

The GRU inherits this benefit from the LSTM. It updates the hidden vector as a linear interpolation between the hidden vector from the previous timestep and a candidate computed at the current timestep. That helps gradient propagation and mitigates the vanishing gradient.

The symbols z(t) and r(t) stand for the update gate and the reset gate. The reset gate decides which part of the previous memory should be reset, and the update gate decides the linear interpolation ratio between previous and current memory.
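The roles of the two gates can be sketched as one GRU timestep in NumPy. This is a sketch under stated assumptions: the dictionary keys ("z", "r", "h") and sizes are hypothetical, and the interpolation convention used here, with z(t) weighting the new candidate, is one of two common conventions (some references swap the roles of z and 1 - z).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU timestep (assumed convention: z weights the candidate)."""
    z_t = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])  # update gate
    r_t = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])  # reset gate
    # The reset gate masks the previous memory before it enters the candidate.
    h_tilde = np.tanh(W["h"] @ x_t + U["h"] @ (r_t * h_prev) + b["h"])
    # The update gate sets the interpolation ratio between old and new memory.
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde
    return h_t

# Illustrative sizes and random weights, not taken from the article.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
keys = ("z", "r", "h")
W = {k: rng.normal(scale=0.1, size=(hidden_size, input_size)) for k in keys}
U = {k: rng.normal(scale=0.1, size=(hidden_size, hidden_size)) for k in keys}
b = {k: np.zeros(hidden_size) for k in keys}

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = gru_step(x_t, h, W, U, b)
```

Compared with the LSTM, there is no separate cell vector and only two gates, so each step needs three weight blocks instead of four.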

In Practice

Compared with the LSTM, the GRU is more efficient: it not only has fewer parameters and a lower computation cost, but also mitigates the vanishing gradient issue. However, when accuracy or another quality metric is your top priority, the LSTM usually brings better results since it uses more parameters and computation.

On the other hand, if your task is simple and your data sequences are short enough, the vanilla RNN may be a feasible choice.
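To make the parameter comparison concrete, the recurrent part of each design can be counted directly. This is a rough sketch that assumes one input matrix, one recurrent matrix, and one bias vector per weight block, and ignores the output layer. Real libraries may count differently (for example, by keeping two bias vectors per block). The function name and the sizes 128 and 256 are hypothetical.

```python
def recurrent_param_count(input_size, hidden_size, num_blocks):
    """Parameters in the recurrent part, under the assumptions stated above."""
    per_block = (
        hidden_size * input_size     # input-to-hidden matrix
        + hidden_size * hidden_size  # hidden-to-hidden matrix
        + hidden_size                # bias vector
    )
    return num_blocks * per_block

# Weight blocks per design: the RNN has only the hidden update, the GRU has
# two gates plus a candidate, and the LSTM has three gates plus a candidate.
BLOCKS = {"RNN": 1, "GRU": 3, "LSTM": 4}

counts = {
    name: recurrent_param_count(input_size=128, hidden_size=256, num_blocks=n)
    for name, n in BLOCKS.items()
}
for name, count in counts.items():
    print(f"{name}: {count} parameters")
```

Under these assumptions the GRU always has three quarters of the LSTM's recurrent parameters, and the vanilla RNN one quarter, regardless of the layer sizes.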
