Sequence-to-Sequence Model

A neural network architecture composed of an encoder that processes an input sequence into a fixed representation and a decoder that generates an output sequence from that representation, forming the foundation for machine translation, summarisation, and dialogue systems.

7 min read · Last updated June 2026 · Foundations

A Sequence-to-Sequence (Seq2Seq) model is a neural network architecture that takes a sequence of inputs — such as a sentence in French — and produces a corresponding output sequence of potentially different length and structure — such as its translation in English. Seq2Seq models are built on the encoder-decoder paradigm: an encoder network processes the entire input sequence and compresses its meaning into an internal representation, and a decoder network generates the output sequence from that representation, one token at a time.

Introduced in its modern neural form by Sutskever, Vinyals, and Le at Google in 2014, Seq2Seq fundamentally changed the approach to machine translation and subsequently became the foundational architecture for text summarisation, dialogue systems, speech recognition, code generation, and many other tasks that involve transforming one sequence into another. While the original implementation used recurrent neural networks (RNNs), the Seq2Seq principle was later unified with attention mechanisms and, ultimately, realised in its most powerful form as the Transformer architecture.

Architecture

Encoder

The encoder processes the input sequence token by token, updating its internal hidden state at each step. For a sequence of n tokens, the encoder runs for n steps and produces a final hidden state — sometimes called the context vector — that is intended to capture the meaning of the entire input sequence. In early RNN-based Seq2Seq models, this was a single fixed-length vector regardless of input length.

The encoder produces either:

  • A single context vector (original Seq2Seq formulation)
  • A sequence of hidden states, one per input token (used with attention mechanisms)
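The two encoder outputs listed above can be illustrated with a minimal sketch. It assumes a plain tanh recurrent cell rather than the LSTM used in the original paper, and the weight matrices `W_h` and `W_x` are hypothetical toy values, not trained parameters. The point is only that the number of hidden states grows with the input while the context vector keeps a fixed size.

```python
import math

def rnn_step(h_prev, x, W_h, W_x):
    """One simple recurrent update: h = tanh(W_h @ h_prev + W_x @ x)."""
    return [
        math.tanh(
            sum(w * h for w, h in zip(W_h[i], h_prev))
            + sum(w * v for w, v in zip(W_x[i], x))
        )
        for i in range(len(h_prev))
    ]

def encode(inputs, W_h, W_x, hidden_size):
    """Run the encoder over every input vector, one step per token.

    Returns (states, context): all per-token hidden states, plus the
    final hidden state, which serves as the context vector.
    """
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = rnn_step(h, x, W_h, W_x)
        states.append(h)
    return states, h

# Hypothetical toy weights, chosen only for illustration.
W_h = [[0.5, -0.1], [0.2, 0.3]]
W_x = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]

short_input = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 2 tokens
long_input = short_input * 5                      # 10 tokens

short_states, short_context = encode(short_input, W_h, W_x, hidden_size=2)
long_states, long_context = encode(long_input, W_h, W_x, hidden_size=2)

print(len(short_states), len(short_context))  # 2 2
print(len(long_states), len(long_context))    # 10 2
```

An attention-based decoder would consume `states`; the original formulation passes only `context` forward.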

Decoder

The decoder generates the output sequence autoregressively: at each step, it takes its previous hidden state, the context from the encoder, and the token it produced at the previous step, and predicts the next output token. Generation continues until the decoder produces a special end-of-sequence token. During training, the ground-truth previous token is usually fed in instead of the model's own prediction, a technique known as teacher forcing.
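The generation loop described above might look roughly like the following sketch. The lookup table `TOY_NEXT_TOKEN` and the function `toy_step` are hypothetical stand-ins for a trained decoder network, and the loop simply picks the most probable token at each step.

```python
SOS, EOS = "<sos>", "<eos>"

# Hypothetical lookup table standing in for a trained decoder network.
TOY_NEXT_TOKEN = {
    SOS: {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.8, EOS: 0.2},
    "cat": {"sat": 0.7, EOS: 0.3},
    "sat": {EOS: 1.0},
}

def toy_step(state, prev_token):
    """Hypothetical decoder step: returns (next-token probabilities, new state).

    `state` stands in for the decoder hidden state. Here it only records
    the tokens consumed so far and does not influence the prediction.
    """
    return TOY_NEXT_TOKEN[prev_token], state + (prev_token,)

def greedy_decode(step_fn, context, max_len=10):
    """Generate tokens one at a time, feeding each prediction back in."""
    state = context      # decoder state initialised from the encoder context
    prev_token = SOS
    output = []
    for _ in range(max_len):
        probs, state = step_fn(state, prev_token)
        prev_token = max(probs, key=probs.get)  # most probable next token
        if prev_token == EOS:                   # stop at end-of-sequence
            break
        output.append(prev_token)
    return output

print(greedy_decode(toy_step, context=()))  # ['the', 'cat', 'sat']
```

The `max_len` cap is a common safeguard in case the end-of-sequence token is never produced.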

At inference time, two decoding strategies are common: greedy decoding, which always selects the highest-probability token at each step, and beam search, which maintains multiple candidate sequences (beams) simultaneously and selects the highest-probability complete sequence at the end. Beam search generally produces higher-quality outputs at the cost of additional computation.
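To see why beam search can beat greedy decoding, consider the simplified sketch below. `TOY_MODEL` is a hypothetical table of next-token probabilities keyed by the prefix generated so far. The search omits refinements found in practical systems, such as length normalisation, and a beam width of 1 reduces it to greedy decoding.

```python
import math

EOS = "<eos>"

# Hypothetical next-token distributions, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "x": 0.1},
    ("a", "x"): {EOS: 1.0},
    ("a", "y"): {EOS: 1.0},
    ("b", "z"): {EOS: 1.0},
    ("b", "x"): {EOS: 1.0},
}

def beam_search(model, beam_width, max_len=10):
    """Return (tokens, log_prob) of the best complete sequence found."""
    beams = [((), 0.0)]  # each beam is (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, p in model[prefix].items():
                cand = (prefix + (token,), score + math.log(p))
                # Completed sequences are set aside; the rest keep expanding.
                (finished if token == EOS else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the best partial sequences
    return max(finished or beams, key=lambda c: c[1])

greedy_tokens, greedy_score = beam_search(TOY_MODEL, beam_width=1)
beam_tokens, beam_score = beam_search(TOY_MODEL, beam_width=2)

print(greedy_tokens[0], round(math.exp(greedy_score), 2))  # a 0.3
print(beam_tokens, round(math.exp(beam_score), 2))         # ('b', 'z', '<eos>') 0.36
```

Greedy decoding commits to "a" because it is the most probable first token, yet the best complete sequence starts with "b". Keeping two beams alive is enough to recover it in this toy case.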

The Context Bottleneck Problem

The original Seq2Seq formulation compressed the entire input sequence into a single fixed-length context vector. For long sequences, this created a bottleneck: everything the decoder needed had to fit into one vector of fixed size, so information was lost and performance degraded significantly as sentences grew longer.

This limitation motivated the development of the attention mechanism by Bahdanau et al. (2015). Attention allows the decoder, at each generation step, to consult the full sequence of encoder hidden states and selectively focus on the parts of the input most relevant to the current output token, rather than relying solely on the single context vector. Attention-augmented Seq2Seq models significantly outperformed the original formulation on long sequences and became the de facto standard.
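A minimal sketch of one attention step follows, using additive scoring in the style of Bahdanau et al. The projection matrices `W_s` and `W_h` and the vector `v` would be learned in a real model; the values used here are hypothetical. Each encoder state is scored against the current decoder state, the scores are normalised with a softmax, and the context is the resulting weighted average of encoder states.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """Score each encoder state as v . tanh(W_s s + W_h h), then average."""
    projected_state = matvec(W_s, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)  # one weight per input position, summing to 1
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Hypothetical toy parameters and states, for illustration only.
W_s = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
decoder_state = [1.0, 0.0]

attn_weights, attn_context = additive_attention(
    decoder_state, encoder_states, W_s, W_h, v
)
print(attn_weights)
print(attn_context)
```

Because the weights are recomputed at every decoding step, the context vector changes from one output token to the next instead of staying fixed.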

Evolution into Transformers

The Transformer architecture (Vaswani et al., 2017) can be understood as a highly parallelised generalisation of the attention-augmented Seq2Seq model. Rather than processing sequences step by step with RNNs, the Transformer encoder processes all input tokens simultaneously using self-attention, and the Transformer decoder generates outputs with masked self-attention (attending only to previously generated tokens) plus cross-attention to encoder outputs. This enabled massively parallel training on GPUs and scaling to much larger datasets and model sizes.
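The masking used in the decoder can be sketched as below. As a simplifying assumption, the same toy vectors are used directly as queries, keys, and values, whereas a real Transformer applies learned projections and multiple heads. Positions after the current one receive a score of negative infinity, so their softmax weight is exactly zero.

```python
import math

def softmax(scores):
    """Numerically stable softmax; a score of -inf gets weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_self_attention(queries, keys, values):
    """Scaled dot-product attention with a causal mask.

    Position i may attend only to positions j <= i.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > i:
                scores.append(float("-inf"))  # mask out future positions
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        output = [
            sum(w * val[d] for w, val in zip(weights, values))
            for d in range(len(values[0]))
        ]
        all_weights.append(weights)
        outputs.append(output)
    return outputs, all_weights

# Hypothetical toy token vectors, reused as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = masked_self_attention(x, x, x)

for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower triangular: the first position attends only to itself, the second to the first two, and so on. Since no row depends on later positions, all rows can be computed in parallel during training.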

Most modern large language models that perform translation, summarisation, or question answering descend from the Seq2Seq principle, enhanced by the Transformer's parallelism and scalability. Encoder-decoder models such as T5 and BART keep the two-part structure, while the decoder-only GPT family drops the separate encoder and treats the input and output as one continuous sequence.

Applications

Seq2Seq models underpin a wide range of production NLP systems:

Machine translation: Google Translate, DeepL, and Microsoft Translator use Transformer-based Seq2Seq architectures. Neural machine translation (NMT) based on Seq2Seq began displacing the previous generation of phrase-based statistical machine translation in production systems from 2016 onward.

Text summarisation: Seq2Seq models produce abstractive summaries by encoding a document and decoding a shorter, rephrased version, rather than merely extracting sentences.

Dialogue and chatbot systems: Conversational AI systems that generate responses given a conversation history use Seq2Seq architectures, with the conversation context as input and the response as output.

Code generation: Systems such as GitHub Copilot use Seq2Seq or decoder-only Transformer variants to translate natural language descriptions (input sequence) into programming language code (output sequence).

Speech recognition (ASR): End-to-end ASR systems, including OpenAI's Whisper, use encoder-decoder architectures where audio features are encoded and decoded into text.

Optical character recognition: Modern OCR systems use Seq2Seq to convert image features (encoded from a CNN backbone) into character sequences.

Comparison of Seq2Seq Variants

| Variant | Encoder | Decoder | Attention |
|---|---|---|---|
| Original Seq2Seq (2014) | LSTM | LSTM | None |
| Attention Seq2Seq (2015) | Bi-LSTM | LSTM | Bahdanau attention |
| Transformer-based (2017+) | Multi-head self-attention | Masked multi-head attention | Cross-attention |
| T5 / BART | Transformer encoder | Transformer decoder | Cross-attention |

References

  1. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.
  2. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. Proceedings of ICLR 2015.
  3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
  4. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1-67.
  5. Analytics Vidhya. (2024). Sequence-to-Sequence models for language translation. analyticsvidhya.com.