Long Short-Term Memory (LSTM)

Long Short-Term Memory is a recurrent neural network architecture designed to learn long-range dependencies in sequential data by using gating mechanisms to control information flow.

5 min read | Last updated May 2026 | Foundations

Long Short-Term Memory (LSTM) is a specialised recurrent neural network (RNN) architecture introduced by Sepp Hochreiter and Jürgen Schmidhuber in their 1997 paper published in the journal Neural Computation. LSTMs were designed to overcome a fundamental limitation of standard RNNs: the vanishing gradient problem, in which error signals shrink exponentially as they are propagated back through time, making it extremely difficult for standard RNNs to learn dependencies spanning long sequences. By introducing a set of gating mechanisms that regulate information flow, LSTMs became one of the most influential architectures in deep learning for sequential data throughout the 2000s and 2010s.

Architecture

The defining innovation of an LSTM cell is its explicit memory cell state: a pathway that allows information to persist over many time steps with minimal modification, because it is changed only by element-wise scaling and addition. Unlike standard RNN hidden states, which are overwritten at every step, the LSTM cell state is designed to carry relevant information forward while discarding what is no longer needed.

Each LSTM unit contains three gates that control this process:

The forget gate decides what portion of the previous cell state should be discarded. It was not part of the 1997 design; Gers et al. added it in 2000. It takes the previous hidden state and the current input, passes them through a sigmoid activation, and produces a value between 0 and 1 for each dimension of the cell state. A value near 0 means "forget", while a value near 1 means "keep".

The input gate determines what new information should be written into the cell state. It combines a sigmoid layer (which selects which values to update) with a tanh layer (which creates candidate values). The two are multiplied element-wise, and the result is added to whatever remains of the previous cell state after the forget gate has been applied.

The output gate controls what portion of the cell state is exposed as the hidden state at the current time step. The cell state is passed through a tanh function and multiplied by the output gate to produce the final hidden state, which is passed to the next time step and used for any downstream predictions.
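The three gate descriptions above can be sketched as a single time step in NumPy. This is a minimal illustration, not a reference implementation: the function name `lstm_step`, the stacked weight matrix `W`, the row ordering (forget, input, candidate, output), and the random initialisation are all assumptions made for the example, and real libraries may lay out their parameters differently.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (hypothetical helper for illustration).

    W has shape (4 * hidden, hidden + inputs); b has shape (4 * hidden,).
    Rows are stacked in the assumed order: forget, input, candidate, output.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # forget gate: what to keep of c_prev
    i = sigmoid(z[1 * hidden:2 * hidden])   # input gate: which values to update
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate values to write
    o = sigmoid(z[3 * hidden:4 * hidden])   # output gate: what to expose
    c_t = f * c_prev + i * g                # new cell state
    h_t = o * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

# Run five steps on random inputs (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
inputs, hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):
    h, c = lstm_step(x_t, h, c, W, b)
```

Stacking the four affine maps into one matrix is a common efficiency choice; keeping four separate matrices would compute the same thing.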

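Together, these three gates let LSTMs selectively remember and forget information across sequences of hundreds of time steps, a capability that standard RNNs lack in practice. Hochreiter and Schmidhuber (1997) report bridging time lags of more than 1,000 steps on synthetic tasks, although how far this carries over depends on the task.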
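A rough scalar comparison may help show why the cell state preserves information over long spans. The sketch below multiplies together the per-step gradient factors over 100 steps for a one-unit vanilla RNN and for the direct cell-state path of an LSTM. The weight `w = 0.9`, the constant forget gate `f = 0.99`, and the starting value are assumptions chosen for illustration, and the LSTM figure ignores the indirect gradient paths that run through the gates themselves.

```python
import numpy as np

T = 100  # number of time steps to backpropagate through

# Vanilla RNN (scalar): h_t = tanh(w * h_{t-1}),
# so d h_t / d h_{t-1} = w * (1 - h_t**2).
w = 0.9
h = 0.5
rnn_grad = 1.0
for _ in range(T):
    h = np.tanh(w * h)
    rnn_grad *= w * (1.0 - h ** 2)

# LSTM cell state (scalar): c_t = f * c_{t-1} + i * g,
# so the direct path gives d c_t / d c_{t-1} = f.
f = 0.99  # assumed constant forget-gate value
lstm_grad = f ** T

print(f"RNN gradient factor after {T} steps:  {rnn_grad:.3e}")
print(f"LSTM cell-state factor after {T} steps: {lstm_grad:.3e}")
```

Under these assumptions the RNN factor falls below 3e-5, while the cell-state factor stays at roughly 0.37. A forget gate that is well below 1 would make the LSTM factor shrink as well, which is why the gate values are learned rather than fixed.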

Gated Recurrent Unit

A notable variant of the LSTM is the Gated Recurrent Unit (GRU), introduced by Cho et al. in 2014. The GRU simplifies the LSTM by merging the forget and input gates into a single update gate and eliminating the separate cell state; a second gate, the reset gate, controls how much of the previous hidden state feeds into the candidate state. With three weight blocks instead of four, a GRU has roughly 25% fewer parameters than an LSTM of the same hidden size and can train faster, while achieving comparable performance on many tasks. The choice between LSTM and GRU is often empirical and task-dependent.
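For comparison with the LSTM, a GRU step can be sketched in a few lines. Besides the update gate `z`, a GRU uses a reset gate `r` that scales the previous hidden state before the candidate is computed. The helper names `gru_step` and `param_count`, the parameter layout, and the sizes are assumptions for illustration. The sketch uses the convention in which `z` weights the old state; some references swap the roles of `z` and `1 - z`. The parameter count assumes one bias vector per block, which is not true of every library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU time step; each W has shape (hidden, hidden + inputs)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx + bz)   # update gate (plays the role of forget + input)
    r = sigmoid(Wr @ hx + br)   # reset gate
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)
    # No separate cell state: the hidden state is interpolated directly.
    return z * h_prev + (1.0 - z) * h_cand

def param_count(inputs, hidden, blocks):
    """Weights plus biases for `blocks` affine maps (4 for LSTM, 3 for GRU)."""
    return blocks * (hidden * (hidden + inputs) + hidden)

rng = np.random.default_rng(0)
inputs, hidden = 3, 4
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(hidden, hidden + inputs)) for _ in range(3))
bz, br, bh = np.zeros(hidden), np.zeros(hidden), np.zeros(hidden)
h = np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):
    h = gru_step(x_t, h, Wz, Wr, Wh, bz, br, bh)

lstm_params = param_count(inputs, hidden, blocks=4)  # 4 * (4 * 7 + 4) = 128
gru_params = param_count(inputs, hidden, blocks=3)   # 3 * (4 * 7 + 4) = 96
```

With these toy sizes the GRU needs 96 parameters against 128 for the LSTM, a ratio of three to four that holds for any input and hidden size under the same counting assumptions.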

Applications

LSTMs achieved state-of-the-art results across a wide range of tasks involving sequential or temporal data.

Natural language processing: LSTMs became the dominant architecture for machine translation, sentiment analysis, named entity recognition, and language modelling during the mid-2010s, before being largely superseded by Transformer-based models after 2017.

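Speech recognition: LSTMs are used in acoustic models, which map acoustic features extracted from the speech waveform to phoneme sequences. LSTM acoustic models were central to the neural speech recognition system Google deployed for voice search in 2015. Deep bidirectional LSTMs, which process sequences in both directions, are common where the whole utterance is available before decoding.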

Time series forecasting: In finance, manufacturing, and energy, LSTMs model patterns in historical data to predict future values such as stock prices, electricity demand, or equipment sensor readings.

Anomaly detection: LSTMs learn the expected pattern of a time series and flag deviations, making them useful for fraud detection, network intrusion detection, and industrial predictive maintenance.

Healthcare: LSTMs are applied to electronic health records, analysing sequences of clinical events — diagnoses, medications, lab results — to predict patient outcomes such as hospital readmission or disease progression.
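Two of the applications above, time series forecasting and anomaly detection, share some simple data handling that can be sketched independently of any particular model. The helpers `make_windows` and `flag_anomalies` are hypothetical names, the threshold `k = 3.0` is an assumed default rather than a recommended value, and the predictions here are a placeholder array standing in for the output of a trained LSTM.

```python
import numpy as np

def make_windows(series, lookback, horizon=1):
    """Slice a 1-D series into (inputs, targets) pairs for supervised training."""
    series = np.asarray(series, dtype=float)
    n = len(series) - lookback - horizon + 1
    if n <= 0:
        raise ValueError("series is too short for the requested lookback and horizon")
    X = np.stack([series[i:i + lookback] for i in range(n)])
    y = np.stack([series[i + lookback:i + lookback + horizon] for i in range(n)])
    return X, y

def flag_anomalies(actual, predicted, k=3.0):
    """Mark points whose residual lies more than k standard deviations from the mean residual."""
    residuals = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return np.abs(residuals - residuals.mean()) > k * residuals.std()

# Forecasting: each window of 3 past values is paired with the next value.
series = np.arange(10.0)
X, y = make_windows(series, lookback=3, horizon=1)
# X[0] is [0, 1, 2] and y[0] is [3]; there are 7 windows in total.

# Anomaly detection: one spike in an otherwise flat signal.
actual = np.zeros(50)
actual[20] = 10.0
predicted = np.zeros(50)  # placeholder for model predictions
flags = flag_anomalies(actual, predicted)
```

In this toy run only index 20 is flagged. In practice the threshold is usually calibrated on residuals from data known to be normal, since a large outlier inflates the standard deviation it is compared against.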

Relationship to Transformers

From roughly 2018 onwards, architectures based on the Transformer, which was introduced in 2017 and popularised by BERT and GPT, largely displaced LSTMs in NLP tasks. Transformers process entire sequences in parallel using self-attention rather than sequentially step by step, which makes far more efficient use of modern GPU hardware during training. They also scale more effectively with data and model size.
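The contrast with step-by-step recurrence can be seen in a minimal sketch of single-head scaled dot-product self-attention. Every position is compared with every other position in one matrix product, so there is no loop over time steps. The function name `self_attention`, the projection matrices, and the sizes are assumptions for illustration; masking, multiple heads, and positional encodings are deliberately left out.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (T, T): all pairs at once
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 6, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

The `(T, T)` score matrix is what allows parallel computation across the sequence, and it is also why the memory and compute cost of standard self-attention grows quadratically with sequence length.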

However, LSTMs have not disappeared. An LSTM carries a fixed-size state from one step to the next, so its memory and compute per step stay constant, whereas the cost of standard self-attention grows with sequence length. This keeps LSTMs competitive or preferred in settings where sequences are extremely long, memory-constrained hardware is used, or real-time streaming inference is required. LSTMs also underpin many production systems that predate the Transformer era and remain in service for reasons of operational continuity.

References

  1. Hochreiter, S., and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780.
  2. Cho, K., et al. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078.
  3. Greff, K., et al. (2017). LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222-2232.
  4. Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10), 2451-2471.