Recurrent Neural Network
A recurrent neural network (RNN) is a family of neural network architectures designed to process sequential data, in which connections between nodes form directed cycles that allow information to persist across time steps. Unlike feedforward networks, which treat each input independently, RNNs maintain an internal hidden state that is updated at each time step, enabling the network to incorporate information from earlier in a sequence when processing later elements. This property makes RNNs naturally suited to tasks such as language modelling, speech recognition, machine translation, and time series forecasting.
Architecture
At each time step t, an RNN receives an input vector x_t and a hidden state h_(t-1) from the previous step. It produces a new hidden state h_t through a learned transformation:
h_t = f(W_h · h_(t-1) + W_x · x_t + b)
where W_h and W_x are weight matrices, b is a bias, and f is typically a non-linear activation function such as tanh or ReLU. The hidden state acts as the network's "memory" — a compact summary of the sequence seen so far. An output y_t is produced at each step by a separate learned projection, though many architectures only use the final step's output or pool across all outputs.
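The recurrence above can be illustrated with a minimal sketch in plain Python. It assumes tanh as the activation f; the helper names (`matvec`, `rnn_step`, `rnn_forward`) and all parameter values are made up for illustration and do not come from any particular library.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def rnn_step(W_h, W_x, b, h_prev, x_t):
    """One step of the recurrence: h_t = tanh(W_h · h_(t-1) + W_x · x_t + b)."""
    rec = matvec(W_h, h_prev)
    inp = matvec(W_x, x_t)
    return [math.tanh(r + i + b_k) for r, i, b_k in zip(rec, inp, b)]

def rnn_forward(W_h, W_x, b, xs, h0):
    """Apply the step to every element of the sequence, keeping each hidden state."""
    hs = []
    h = h0
    for x_t in xs:
        h = rnn_step(W_h, W_x, b, h, x_t)
        hs.append(h)
    return hs

# Illustrative (made-up) parameters: 2 hidden units, 1 input feature.
W_h = [[0.5, -0.3], [0.2, 0.1]]
W_x = [[1.0], [-1.0]]
b = [0.0, 0.0]
xs = [[1.0], [0.0], [-1.0]]  # a sequence of three time steps
hs = rnn_forward(W_h, W_x, b, xs, h0=[0.0, 0.0])
```

Here `hs[-1]` plays the role of the final hidden state, which a separate output projection could consume; the same weights are reused at every step, which is what distinguishes the recurrence from a stack of independent layers.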
Training RNNs uses backpropagation through time (BPTT), an extension of standard backpropagation that unrolls the network's computation across all time steps and then computes gradients backwards through that unrolled graph.
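As a rough illustration of BPTT, the following sketch unrolls a scalar RNN (one hidden unit, one input feature) and accumulates gradients while walking backwards through the stored hidden states. The squared-error loss on the final state, the function names, and the parameter values are all assumptions chosen for brevity, not part of any standard API.

```python
import math

def forward(w_h, w_x, b, xs, h0=0.0):
    """Unroll the scalar RNN and keep every hidden state (hs[0] is h0)."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_h * hs[-1] + w_x * x + b))
    return hs

def loss(w_h, w_x, b, xs, target):
    """Assumed loss: squared error on the final hidden state only."""
    return 0.5 * (forward(w_h, w_x, b, xs)[-1] - target) ** 2

def bptt(w_h, w_x, b, xs, target):
    """Return (dL/dw_h, dL/dw_x, dL/db) by backpropagating through time."""
    hs = forward(w_h, w_x, b, xs)
    g_wh = g_wx = g_b = 0.0
    dh = hs[-1] - target                # dL/dh_T
    for t in range(len(xs), 0, -1):     # walk from the last step to the first
        da = dh * (1.0 - hs[t] ** 2)    # back through tanh at step t
        g_wh += da * hs[t - 1]          # shared weights: contributions add up
        g_wx += da * xs[t - 1]
        g_b += da
        dh = da * w_h                   # pass the gradient on to h_(t-1)
    return g_wh, g_wx, g_b

# Illustrative (made-up) values.
w_h, w_x, b = 0.8, 0.5, 0.1
xs = [1.0, -0.5, 0.3, 0.7]
target = 0.2
g_wh, g_wx, g_b = bptt(w_h, w_x, b, xs, target)
```

Because the same parameters are used at every time step, each step contributes a term to the same gradient, which is why the accumulators are summed inside the loop.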
Vanishing Gradient Problem
A fundamental limitation of vanilla RNNs is their difficulty in learning long-range dependencies. During BPTT, the gradient is multiplied at every step by the Jacobian of the hidden-state update, which is the hidden-state weight matrix W_h scaled by the derivative of the activation function. If the norm of this Jacobian stays below one (for example, when the largest singular value of W_h is smaller than one and the activation's derivative is at most one, as with tanh), gradients shrink exponentially as they travel back in time. This is the vanishing gradient problem: the network effectively ignores information from many steps earlier. It limits practical vanilla RNNs to relatively short sequences.
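The effect can be seen numerically in the simplest possible case. The sketch below assumes a scalar tanh RNN with a recurrent weight of 0.5 and a constant input, both made-up values, and tracks how strongly the final hidden state depends on each earlier hidden state. The function name is hypothetical.

```python
import math

def gradient_wrt_earlier_states(w_h, xs, h0=0.0):
    """Return |d h_T / d h_(T-j)| for j = 0, 1, ..., T in a scalar tanh RNN."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_h * hs[-1] + x))
    grads = [1.0]  # d h_T / d h_T
    for t in range(len(xs), 0, -1):
        # Chain rule: d h_t / d h_(t-1) = w_h * (1 - h_t^2)
        grads.append(grads[-1] * w_h * (1.0 - hs[t] ** 2))
    return [abs(g) for g in grads]

# Illustrative (made-up) setting: recurrent weight 0.5, 50 identical inputs.
grads = gradient_wrt_earlier_states(0.5, [0.1] * 50)
for j in (1, 10, 25, 50):
    print(f"{j:2d} steps back: {grads[j]:.3e}")
```

Each step back multiplies the gradient by a factor no larger than 0.5 in magnitude, so after 50 steps the sensitivity is bounded by 0.5 raised to the 50th power, far too small to drive learning.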
Long Short-Term Memory (LSTM)
The Long Short-Term Memory architecture, proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997, is the most widely deployed solution to the vanishing gradient problem. An LSTM unit replaces the simple hidden state with two state vectors: a hidden state h_t and a cell state C_t. The cell state acts as a long-term memory channel that can carry information across many time steps with minimal transformation.
Three gating mechanisms control information flow:
- Input gate: Decides what new information to store in the cell state.
- Forget gate: Decides what existing information to discard from the cell state.
- Output gate: Decides what part of the cell state to expose as the hidden state output.
These gates are themselves learned sigmoid functions, allowing the network to learn when to remember and when to forget. LSTMs proved effective for sequences of hundreds to thousands of steps and dominated sequence modelling tasks throughout the 2010s.
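A single LSTM step, in one commonly used formulation, might be sketched as follows. The parameter dictionary layout, the helper names, and the numeric values are assumptions for illustration; real libraries typically fuse the four weight matrices and may order the gates differently.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W · v + b for a matrix given as a list of rows."""
    return [sum(w * v_j for w, v_j in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM step; returns the new hidden state h_t and cell state C_t."""
    z = h_prev + x_t  # list concatenation: [h_(t-1), x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]    # input gate
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]    # output gate
    g = [math.tanh(a) for a in affine(params["W_g"], params["b_g"], z)]  # candidate values
    # Cell state: keep part of the old memory, add part of the candidate.
    c_t = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # Hidden state: expose a gated view of the cell state.
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t

# Made-up parameters for 1 hidden unit and 1 input feature. The biases are
# chosen so the forget gate is almost fully open and the input gate almost closed.
params = {
    "W_f": [[0.0, 0.0]], "b_f": [20.0],
    "W_i": [[0.0, 0.0]], "b_i": [-20.0],
    "W_o": [[0.0, 0.0]], "b_o": [0.0],
    "W_g": [[0.0, 1.0]], "b_g": [0.0],
}
h_t, c_t = lstm_step(params, h_prev=[0.0], c_prev=[0.7], x_t=[1.0])
```

With these hand-picked biases the cell state passes through the step almost unchanged, which illustrates how the additive cell update can carry information across many steps without repeated squashing.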
Gated Recurrent Unit (GRU)
The Gated Recurrent Unit, introduced by Cho et al. in 2014, is a simplified variant of the LSTM that merges the cell and hidden states and uses only two gates (reset and update). GRUs have fewer parameters than LSTMs of the same hidden size, are typically cheaper to train, and often achieve comparable performance on many tasks. They have become a common alternative when computational efficiency is a concern.
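A GRU step can be sketched in the same style. The version below follows the convention in which the update gate z weights the previous hidden state; some libraries swap the roles of z and 1 - z. The parameter names and values are assumptions for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W · v + b for a matrix given as a list of rows."""
    return [sum(w * v_j for w, v_j in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def gru_step(params, h_prev, x_t):
    """One GRU step; returns the new hidden state."""
    concat = h_prev + x_t  # [h_(t-1), x_t]
    r = [sigmoid(a) for a in affine(params["W_r"], params["b_r"], concat)]  # reset gate
    z = [sigmoid(a) for a in affine(params["W_z"], params["b_z"], concat)]  # update gate
    # The reset gate scales the old state before it enters the candidate.
    gated = [r_k * h_k for r_k, h_k in zip(r, h_prev)] + x_t
    cand = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], gated)]
    # Interpolate between the old state and the candidate.
    return [z_k * h_k + (1.0 - z_k) * c_k for z_k, h_k, c_k in zip(z, h_prev, cand)]

def make_params(update_bias):
    """Made-up parameters for 1 hidden unit and 1 input feature."""
    return {
        "W_r": [[0.0, 0.0]], "b_r": [0.0],
        "W_z": [[0.0, 0.0]], "b_z": [update_bias],
        "W_c": [[0.0, 1.0]], "b_c": [0.0],
    }

h_keep = gru_step(make_params(20.0), h_prev=[0.4], x_t=[0.5])   # z close to 1
h_new = gru_step(make_params(-20.0), h_prev=[0.4], x_t=[0.5])   # z close to 0
```

With the update gate saturated near one the old state is copied forward, and with it near zero the state is overwritten by the candidate, so a single gate covers the roles that the LSTM splits between its forget and input gates.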
Bidirectional RNNs
Standard RNNs process sequences left to right, so the hidden state at time t only incorporates information from positions 1 through t. Bidirectional RNNs run two separate recurrent layers — one forward, one backward — and concatenate their outputs. This allows each position's representation to incorporate context from both before and after it, which is beneficial for tasks like named entity recognition or sentiment analysis where full-sentence context matters.
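The two-pass idea can be sketched with a scalar tanh RNN standing in for each direction. The function names and weights are hypothetical, and a real bidirectional layer would use vector-valued states with separately learned parameters for each direction.

```python
import math

def run_rnn(w_h, w_x, xs, h0=0.0):
    """Scalar tanh RNN; returns the hidden state at every position."""
    hs, h = [], h0
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        hs.append(h)
    return hs

def bidirectional(fwd_params, bwd_params, xs):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(*fwd_params, xs)
    # Run over the reversed sequence, then flip back to the original order.
    backward = run_rnn(*bwd_params, xs[::-1])[::-1]
    return [(f, b) for f, b in zip(forward, backward)]

# Made-up weights (w_h, w_x) for each direction.
fwd, bwd = (0.5, 1.0), (0.5, 1.0)
out_a = bidirectional(fwd, bwd, [1.0, 0.0, 0.0])
out_b = bidirectional(fwd, bwd, [1.0, 0.0, 5.0])  # only the last input differs
```

Comparing `out_a[0]` with `out_b[0]` shows the point of the construction: the forward component at the first position is identical in both runs, while the backward component changes because it has already seen the last input.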
Transition to Transformers
From approximately 2017 onwards, transformer architectures — which use self-attention rather than recurrence — have largely supplanted RNNs for natural language processing tasks where training data is abundant. Transformers process entire sequences in parallel during training (unlike RNNs, which are inherently sequential), and they scale more effectively with compute and data. However, RNNs retain advantages in streaming and online inference scenarios where inputs arrive one at a time and full-sequence parallelism is not possible. Architectures such as Mamba (2023) revisit state-space models as efficient alternatives that combine RNN-like sequential processing with improved long-range dependency handling.
| Architecture | Long-range dependencies | Training parallelism | Parameters |
|---|---|---|---|
| Vanilla RNN | Poor | Sequential | Low |
| LSTM | Good | Sequential | Moderate |
| GRU | Good | Sequential | Low–Moderate |
| Transformer | Excellent | Parallel | High |
References
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
- Cho, K., van Merrienboer, B., Gulcehre, C., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
- Gu, A., & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.