Encoder-Decoder Architecture
A neural network design pattern that compresses an input sequence into an internal representation using an encoder, and then generates an output sequence from that representation using a decoder, foundational to machine translation, summarisation, and many other sequence-to-sequence tasks.
The encoder-decoder architecture is a neural-network design pattern in which one network (the encoder) compresses an input sequence into an intermediate representation, and a second network (the decoder) generates an output sequence by conditioning on that representation. The pattern was introduced in 2014 by Sutskever, Vinyals, and Le, and independently by Cho and colleagues, in the context of recurrent neural networks for machine translation. It remains one of the most influential abstractions in deep learning and underpins systems for translation, summarisation, speech recognition, optical character recognition, image captioning, and structured-data generation.
The basic idea
In the original sequence-to-sequence formulation, the encoder reads a source sequence one token at a time, updating its hidden state with a recurrent unit such as an LSTM or a GRU. After consuming the full input, the encoder's final hidden state serves as a fixed-length vector representation of the entire input, in effect a compressed summary. The decoder is initialised from this representation and generates the target sequence one token at a time, with each token conditioned on the encoder representation and on the tokens generated so far.
This abstraction cleanly separates the act of understanding the input from the act of producing the output, and it can handle source and target sequences of different lengths and even different modalities — for example, audio in and text out, or pixels in and text out.
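The recurrent formulation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published models: it uses a plain tanh recurrent unit rather than an LSTM or GRU, the weights are random and untrained, and the names `encode`, `decode`, `BOS`, `EOS`, and the toy sizes are hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 6, 8   # toy sizes, chosen only for illustration
BOS, EOS = 0, 1        # hypothetical start-of-sequence / end-of-sequence ids

# Untrained, randomly initialised parameters (assumption: a real model learns these).
embed = rng.normal(0.0, 0.5, (VOCAB, HIDDEN))
W_enc = rng.normal(0.0, 0.5, (HIDDEN, HIDDEN))
U_enc = rng.normal(0.0, 0.5, (HIDDEN, HIDDEN))
W_dec = rng.normal(0.0, 0.5, (HIDDEN, HIDDEN))
U_dec = rng.normal(0.0, 0.5, (HIDDEN, HIDDEN))
W_out = rng.normal(0.0, 0.5, (HIDDEN, VOCAB))


def encode(src_ids):
    """Read the source one token at a time; return the final hidden state."""
    h = np.zeros(HIDDEN)
    for tok in src_ids:
        h = np.tanh(embed[tok] @ W_enc + h @ U_enc)
    return h  # fixed-length vector, regardless of len(src_ids)


def decode(context, max_len=10):
    """Greedy decoding: start from the encoder vector, emit one token per step."""
    h, tok, out = context, BOS, []
    for _ in range(max_len):
        # Each step depends on the previous token and the running hidden state.
        h = np.tanh(embed[tok] @ W_dec + h @ U_dec)
        tok = int(np.argmax(h @ W_out))
        if tok == EOS:
            break
        out.append(tok)
    return out


context = encode([2, 3, 4, 5, 3])   # source of length 5
target = decode(context)            # target length is decided by the decoder
```

Because the weights are untrained, the generated token ids are meaningless; the point is the data flow. The source length never appears in the shape of `context`, which is why source and target lengths can differ.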
The attention extension
The original recurrent encoder-decoder suffered from a fundamental bottleneck: every input sequence, however long, had to be compressed into a single fixed-length vector, and translation quality degraded on long sentences as that vector saturated. In 2015, Bahdanau and colleagues introduced the attention mechanism, which allowed the decoder to attend dynamically to any position in the encoder's sequence of hidden states rather than relying on the final state alone. Attention turned the encoder output from a single vector into a set of contextualised vectors, with the decoder choosing which positions to focus on at each generation step.
The combination of encoder, decoder, and attention quickly became the standard architecture for neural machine translation and remained dominant until 2017.
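A single decoder step of attention over the encoder's hidden states might look like the sketch below. It assumes the additive scoring form (a small feed-forward scorer followed by a softmax); the parameter names `W_s`, `W_h`, and `v`, the function name `additive_attention`, and the toy dimensions are illustrative assumptions, and the weights are random rather than learned.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def additive_attention(dec_state, enc_states, W_s, W_h, v):
    """Return a context vector and attention weights for one decoder step.

    dec_state:  (hidden,)          current decoder hidden state
    enc_states: (src_len, hidden)  one hidden state per source position
    """
    # Score every source position against the current decoder state.
    scores = np.tanh(dec_state @ W_s + enc_states @ W_h) @ v   # (src_len,)
    weights = softmax(scores)                                  # sums to 1
    # The context is a weighted average of all encoder states,
    # not just the final one.
    context = weights @ enc_states                             # (hidden,)
    return context, weights


rng = np.random.default_rng(0)
hidden, attn_dim, src_len = 8, 5, 4        # toy sizes for illustration
enc_states = rng.normal(size=(src_len, hidden))
dec_state = rng.normal(size=hidden)
W_s = rng.normal(size=(hidden, attn_dim))  # hypothetical parameter names
W_h = rng.normal(size=(hidden, attn_dim))
v = rng.normal(size=attn_dim)

context, weights = additive_attention(dec_state, enc_states, W_s, W_h, v)
```

The decoder recomputes `weights` at every generation step, so the effective summary of the input changes as the output is produced instead of being fixed once by the encoder.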
Transformer encoder-decoder
In 2017, Vaswani and colleagues replaced the recurrent units of the encoder and decoder with stacks of self-attention layers, producing the transformer encoder-decoder. The encoder applies bidirectional self-attention across all input tokens at every layer, and the decoder applies causal (masked) self-attention plus cross-attention onto the encoder output. Because self-attention has no recurrence across positions, the transformer can process all input tokens in parallel, and during training it can process all target positions in parallel as well, which substantially improves training throughput. Generation at inference time still proceeds one token at a time.
Transformer encoder-decoder models such as T5 (Text-to-Text Transfer Transformer), BART, mBART, MarianMT, M2M-100, and NLLB-200 are the workhorses of modern machine translation, document summarisation, and text-to-text reformatting. OpenAI's Whisper speech-recognition model uses an encoder-decoder design in which the encoder processes mel-spectrograms of audio and the decoder produces text.
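The three attention patterns of the transformer encoder-decoder (bidirectional self-attention in the encoder, causal self-attention in the decoder, and cross-attention from decoder to encoder) can be illustrated with one shared function. This sketch assumes a single head of scaled dot-product attention and omits the learned query, key, and value projections, residual connections, layer normalisation, and feed-forward sublayers of a real transformer layer; the `attention` helper and the toy shapes are assumptions for illustration.

```python
import numpy as np


def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask[i, j] = True means i may attend to j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)   # blocked positions get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8
src = rng.normal(size=(5, d))   # 5 source token vectors
tgt = rng.normal(size=(3, d))   # 3 target token vectors

# Encoder: every source position attends to every other source position.
enc_out, enc_w = attention(src, src, src)

# Decoder self-attention: position i may only see positions <= i.
causal = np.tril(np.ones((3, 3), dtype=bool))
dec_self, dec_w = attention(tgt, tgt, tgt, mask=causal)

# Cross-attention: decoder queries, encoder keys and values.
dec_cross, cross_w = attention(dec_self, enc_out, enc_out)
```

All rows of each weight matrix are computed in one matrix product, which is the parallelism that recurrent encoders lack. The causal mask is what lets training score every target position at once without letting a position see later tokens.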
Encoder-only and decoder-only variants
The encoder-decoder pattern is one of three transformer configurations. Encoder-only models such as BERT and RoBERTa drop the decoder and produce contextual embeddings, optimised for understanding tasks. Decoder-only models such as GPT, Llama, Mistral, and Claude drop the encoder and treat every task as next-token prediction; this design now dominates large language modelling because it scales well and trains efficiently on web-scale text. Encoder-decoder models remain a natural fit for tasks with a strong asymmetry between input and output (translation, summarisation, transcription) and for tasks where the output benefits from rich bidirectional input context.
| Variant | Examples | Best at |
|---------|----------|---------|
| Encoder-only | BERT, RoBERTa | Classification, embedding |
| Decoder-only | GPT, Llama, Claude | Generation, chat, code |
| Encoder-decoder | T5, BART, Whisper, NLLB | Translation, summarisation, speech |
Cross-modal applications
The encoder-decoder pattern generalises beyond text. Image-captioning models pair a convolutional or vision-transformer encoder with a text decoder. Optical character recognition models such as TrOCR and Donut use a vision encoder and a text decoder. Speech-recognition systems such as Whisper and OWSM follow the same shape. Multimodal models that translate between modalities, including text-to-speech, text-to-image with a diffusion decoder, and protein-sequence-to-structure models such as AlphaFold, can likewise be viewed as instances of the encoder-decoder abstraction.
References
- Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. NeurIPS 2014.
- Cho, K., et al. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP 2014.
- Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017.
- NLLB Team, Meta AI. (2022). No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv:2207.04672.