Transformer Architecture

A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel, forming the foundation of modern large language models and multimodal AI systems.

7 min read | Last updated May 2026 | Foundations

The transformer architecture is a neural network design that processes sequences of data using a mechanism called self-attention, which lets the model weigh the relevance of each element in a sequence against every other element at once. Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., a team drawn largely from Google Brain and Google Research, transformers replaced recurrent architectures as the dominant paradigm in natural language processing and subsequently expanded into vision, audio, and multimodal applications.[^1]

Background and Motivation

Prior to transformers, sequential data was processed primarily using recurrent neural networks (RNNs) and their variants, including long short-term memory (LSTM) networks. These architectures processed tokens one at a time in a fixed order, which introduced two significant limitations: they could not parallelise computation across a sequence during training, and they struggled to retain information across very long sequences due to the vanishing gradient problem.

Transformers addressed both limitations. By replacing sequential recurrence with self-attention, the architecture lets every position in a sequence attend to every other position in a single operation. Computation across positions can therefore run in parallel during training, and any two tokens are directly connected regardless of how far apart they sit in the input, which makes long-range dependencies easier to capture.

Core Components

Self-Attention

The self-attention mechanism is the fundamental operation of a transformer. For each token in an input sequence, the mechanism computes three vectors, a query (Q), a key (K), and a value (V), each derived by multiplying the token's embedding by a learned weight matrix. The attention score between two tokens is the dot product of the query vector of one token and the key vector of the other, divided by the square root of the key dimension to stabilise gradients. The scores for each query are passed through a softmax function to produce weights that sum to one, and these weights are used to compute a weighted sum of the value vectors.

The result for each token is a contextualised representation that reflects not just the token itself but its relationships with every other token in the sequence.
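The steps described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names `softmax` and `self_attention` are chosen here for clarity, and the Q, K, and V values are toy numbers rather than the output of learned projections.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Dot product of this query with every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)])
    return outputs


# Toy Q, K, V for three tokens (illustrative numbers, not learned projections).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
contextualised = self_attention(Q, K, V)
```

Each row of `contextualised` is a blend of all three value vectors, weighted by how strongly that token's query matches each key.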

Multi-Head Attention

Rather than computing a single attention function, transformers apply the attention mechanism multiple times in parallel using different learned projections. Each parallel instance is called an "attention head," and each head can learn to attend to a different type of relationship, such as syntactic structure, semantic similarity, or coreference. The outputs from all heads are concatenated and linearly projected to produce the final attention output. The base model in the original paper used eight attention heads.
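The following sketch shows one way the project, attend, concatenate, and project-again flow could look. It is an assumption-laden toy: the weight matrices are random stand-ins for learned parameters, the sizes (`d_model = 8`, two heads) are chosen only to keep the example small, and helper names such as `random_matrix` are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(len(v[0]))])
    return out


def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (random here, trained in practice)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, heads, w_out):
    """Run every head on x, concatenate per token, then project with w_out."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for (w_q, w_k, w_v) in heads
    ]
    # Concatenate the head outputs token by token.
    concatenated = [sum((h[t] for h in head_outputs), []) for t in range(len(x))]
    return matmul(concatenated, w_out)


rng = random.Random(0)
d_model, num_heads = 8, 2  # toy sizes for illustration only
d_head = d_model // num_heads
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(num_heads)]
w_out = random_matrix(num_heads * d_head, d_model, rng)
x = random_matrix(5, d_model, rng)  # five token embeddings
y = multi_head_attention(x, heads, w_out)
```

Splitting the model width across heads keeps the total cost close to that of a single full-width head while letting each head use its own projections.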

Feed-Forward Sub-layers

After each multi-head attention operation, a position-wise feed-forward network is applied independently to each token. This sub-layer consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. It allows the model to perform further transformation of each token's representation after the attention step has aggregated contextual information.
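A minimal sketch of such a sub-layer is shown below, assuming hand-picked toy weights in place of learned ones. The names `linear` and `feed_forward` are illustrative, and the hidden width of 3 is arbitrary.

```python
import math


def relu(x):
    return max(0.0, x)


def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weights, bias):
    """Compute vec @ weights + bias for one token vector."""
    return [sum(v * w for v, w in zip(vec, col)) + b for col, b in zip(zip(*weights), bias)]


def feed_forward(tokens, w1, b1, w2, b2, activation=relu):
    """Apply the same two-layer network to every token independently."""
    return [linear([activation(h) for h in linear(tok, w1, b1)], w2, b2) for tok in tokens]


# Hypothetical toy weights: model width 2, hidden width 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]
tokens = [[1.0, 2.0], [3.0, -1.0]]
out = feed_forward(tokens, w1, b1, w2, b2)
```

Because the network sees one token at a time, processing a token alone gives the same result as processing it alongside others; any mixing of information between tokens happens in the attention step.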

Positional Encoding

Unlike RNNs, transformers have no inherent notion of position in a sequence. Positional encodings are added to token embeddings before they enter the model to inject information about each token's position. The original paper used sinusoidal functions of different frequencies, though later models adopted learned positional embeddings or more sophisticated schemes such as rotary positional embeddings (RoPE).
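The sinusoidal scheme can be sketched as follows. The base constant of 10000 comes from the original paper rather than from the description above, and the function names are chosen for this example.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model table; even columns use sine, odd use cosine."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair of columns shares one frequency.
            angle = pos / (base ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add the positional table element-wise to the token embeddings."""
    table = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, table)]


# Three all-zero "embeddings" of width 4, so the output is the encoding itself.
encoded = add_positions([[0.0] * 4 for _ in range(3)])
```

Lower-numbered columns oscillate quickly with position and higher-numbered columns slowly, so each position receives a distinct pattern across the embedding width.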

Encoder and Decoder Stacks

The original transformer was designed for sequence-to-sequence tasks such as machine translation and consisted of two stacks. The encoder processes the input sequence and produces a set of contextualised representations. The decoder generates the output sequence autoregressively, attending to both its own previous outputs and the encoder's representations via cross-attention. Many subsequent models use only one of the two stacks: BERT-style models use only the encoder, while GPT-style models use only the decoder.
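In decoder stacks, attending only to previous outputs is commonly enforced with a causal mask, which hides later positions before the softmax is taken. The sketch below illustrates the idea on a toy score matrix; the function name and the numbers are assumptions made for this example.

```python
import math


def causal_attention_weights(scores):
    """Softmax each row of a square score matrix, hiding future positions."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may attend only to positions 0..i.
        visible = row[: i + 1]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


# Toy scores for three positions; large values above the diagonal are ignored.
scores = [[0.0, 5.0, 5.0], [1.0, 1.0, 9.0], [2.0, 2.0, 2.0]]
weights = causal_attention_weights(scores)
```

The first position can only attend to itself, so its weight row is entirely concentrated on position 0 no matter how large the masked scores are.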

Variants and Descendants

Following the original paper, a large family of transformer-based models emerged:

| Model | Year | Architecture | Key Innovation |
|-------|------|--------------|----------------|
| BERT | 2018 | Encoder-only | Bidirectional pre-training via masked language modelling |
| GPT-2 | 2019 | Decoder-only | Large-scale autoregressive language modelling |
| T5 | 2019 | Encoder-decoder | Unified text-to-text framework |
| GPT-3 | 2020 | Decoder-only | 175 billion parameters, few-shot in-context learning |
| ViT | 2020 | Encoder-only | Applying transformer to image patches |
| GPT-4 | 2023 | Decoder-only | Multimodal inputs, improved reasoning |
| Stable Diffusion 3 | 2024 | Hybrid | Replaced U-Net with transformer (DiT) |

Transformers have expanded beyond text into image generation (via diffusion transformers), video synthesis, protein structure prediction (AlphaFold 2), speech recognition (Whisper), and code generation.

Computational Considerations

The self-attention operation has quadratic complexity with respect to sequence length: computing attention scores for a sequence of length n requires n² comparisons, so doubling the sequence length quadruples the work. For short sequences this is manageable, but for very long contexts, such as entire documents or high-resolution images, it becomes computationally expensive. Research into efficient attention has reduced this cost in practice: sparse and linear attention variants such as Longformer lower the number of comparisons, while FlashAttention keeps attention exact but cuts the memory traffic it requires.
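A short calculation makes the growth concrete. The figure of 2 bytes per score is an assumption (16-bit storage) used only to give a sense of scale, and it covers a single score matrix for one head in one layer.

```python
def attention_score_count(seq_len):
    """Number of query-key comparisons for one attention head."""
    return seq_len * seq_len


def score_matrix_bytes(seq_len, bytes_per_score=2):
    """Memory for one full score matrix, assuming 2-byte (16-bit) scores."""
    return attention_score_count(seq_len) * bytes_per_score


for n in (1_000, 10_000, 100_000):
    megabytes = score_matrix_bytes(n) / 1_000_000
    print(f"n={n:>7,}: {attention_score_count(n):>15,} scores, {megabytes:,.0f} MB")
```

A tenfold increase in sequence length produces a hundredfold increase in both the number of scores and the memory needed to hold them.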

Hardware acceleration, particularly on graphics processing units (GPUs) and tensor processing units (TPUs), has enabled the training of transformer models with hundreds of billions of parameters. The FlashAttention algorithm, introduced in 2022, reorders attention computations to reduce memory reads and writes, yielding speedups of roughly two to four times and enabling training on longer sequences.[^2]
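One ingredient of this kind of reordering is computing the softmax incrementally, block by block, so that the full row of scores never has to be held at once. The sketch below illustrates that idea for a single query in plain Python. It is a simplified illustration of the running-softmax technique, not the FlashAttention GPU kernel, and all names and numbers are chosen for this example.

```python
import math


def attention_full(q, keys, values):
    """Reference: materialise every score for one query, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total for j in range(len(values[0]))]


def attention_blockwise(q, keys, values, block_size=2):
    """Same result, but keys and values are visited one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + sum(e * v[j] for e, v in zip(exps, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Toy data: one query, five keys and values of width 2.
q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0], [4.0, 4.0]]
full = attention_full(q, keys, values)
tiled = attention_blockwise(q, keys, values, block_size=2)
```

Both functions return the same output up to floating-point rounding, which is why this style of computation counts as exact attention rather than an approximation.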

Applications

Transformers underpin virtually all state-of-the-art systems across multiple domains. In natural language processing, they power chatbots, machine translation, text summarisation, and question answering. In computer vision, vision transformers (ViTs) match or exceed convolutional networks on image classification benchmarks. In generative AI, diffusion transformers generate photorealistic images and video. In biology, transformer-based models predict protein structures and generate novel drug candidates.

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
  2. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35.
  3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
  4. MDEC. (2025). Malaysia Digital Leaders Programme: AI Adoption Pathways. Malaysia Digital Economy Corporation.