
Attention Mechanism

A neural network technique that enables models to dynamically weight the relevance of different parts of an input sequence when producing each output element, forming the core of transformer architectures.

6 min read | Last updated May 2026 | Foundations

The attention mechanism is a fundamental building block in modern deep learning that allows a neural network to selectively concentrate on the most relevant portions of its input when generating each element of its output. Unlike earlier sequence-to-sequence architectures that compressed an entire input sequence into a single fixed-length vector, attention enables models to maintain and dynamically query a richer representation of the input at every step.

The concept was first formalised in 2015 by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in the context of neural machine translation.[^1] Their model learned to align source-language words with target-language words without any explicit alignment labels, a capability that substantially improved translation quality, particularly on long sentences. Two years later, the landmark paper "Attention Is All You Need" by Vaswani et al. (2017) showed that attention alone, without recurrence or convolution, was sufficient to achieve state-of-the-art results on translation benchmarks, giving rise to the transformer architecture that now underpins nearly every large language model in use today.[^2]

How Attention Works

At its core, attention computes a weighted sum over a set of values, where the weights are determined by a compatibility function between a query and a set of keys. In practice, each token in an input sequence is projected into three distinct vectors: a query (Q), a key (K), and a value (V).

The query represents what the current token is "looking for." The keys represent what each token in the sequence "has to offer." The dot product of the query with each key produces a raw attention score, which is then divided by the square root of the key dimension so that large scores do not push the softmax function into regions with near-zero gradients. A softmax operation then normalises the scaled scores into a probability distribution, and the resulting weights are used to form a weighted sum of the value vectors. The output is a context-aware representation of the current token that incorporates information from the entire sequence.[^3]
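The effect of the scaling step can be illustrated numerically. The sketch below is not taken from any particular model: it assumes NumPy, random unit-variance vectors, and an illustrative key dimension of 512. It compares the softmax weights obtained from raw dot products with those obtained after dividing by the square root of the key dimension.

```python
import numpy as np

def softmax(x):
    # Subtracting the maximum improves numerical stability
    # without changing the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512                           # illustrative key dimension (assumption)
q = rng.standard_normal(d_k)        # one query vector
K = rng.standard_normal((8, d_k))   # eight key vectors

raw_scores = K @ q                          # spread grows roughly like sqrt(d_k)
scaled_scores = raw_scores / np.sqrt(d_k)   # spread brought back to about 1

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print("largest weight without scaling:", raw_weights.max())
print("largest weight with scaling:   ", scaled_weights.max())
```

Without scaling, the softmax tends to place almost all of its weight on a single key, which is the saturated regime where gradients become very small. With scaling, the weights are spread more evenly across the keys.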

Formally, scaled dot-product attention is defined as:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

where dₖ is the dimensionality of the key vectors.
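A minimal NumPy sketch of this formula follows. The function name, the matrix shapes, and the random inputs are illustrative assumptions, and the code omits masking, batching, and dropout that production implementations normally include.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_q, n_k) compatibility scores
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 queries, d_k = 4
K = rng.standard_normal((5, 4))   # 5 keys
V = rng.standard_normal((5, 2))   # 5 values, d_v = 2

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Each row of `weights` is a probability distribution over the five keys, and each row of `output` is the corresponding weighted average of the value vectors.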

Self-Attention and Cross-Attention

When the queries, keys, and values all derive from the same sequence—a sentence attending to itself—the operation is called self-attention. Self-attention allows every token to gather contextual information from every other token in the same sequence, regardless of distance. This property overcomes the long-range dependency problem that plagued recurrent networks, where information from early positions could be "forgotten" by the time the model reached later positions.

Cross-attention, by contrast, allows one sequence to attend to a different sequence. In the encoder-decoder transformer used for translation, the decoder queries the encoder's output, so each generated target word can directly access any source word's representation. Cross-attention is also central to text-to-image diffusion models, where a visual latent representation attends to a text prompt.
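The difference between the two variants lies only in where the queries, keys, and values come from. The sketch below assumes NumPy and random stand-ins for encoder and decoder states. For brevity it reuses one hypothetical set of projection matrices for both calls, whereas real models learn separate weights for each attention layer.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention with a row-wise softmax.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d_model = 8
# Hypothetical projection matrices, shared here only to keep the example short.
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

source = rng.standard_normal((6, d_model))   # stand-in for encoder output, 6 tokens
target = rng.standard_normal((4, d_model))   # stand-in for decoder states, 4 tokens

# Self-attention: queries, keys and values all come from the same sequence.
self_out = attention(source @ W_q, source @ W_k, source @ W_v)

# Cross-attention: queries come from the target; keys and values from the source.
cross_out = attention(target @ W_q, source @ W_k, source @ W_v)

print(self_out.shape, cross_out.shape)
```

The output always has one row per query, so self-attention over the source returns six rows while cross-attention from the target returns four.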

Multi-Head Attention

A single attention operation captures one type of relationship between tokens. Multi-head attention runs several attention operations in parallel, each with independently learned Q, K, and V projection matrices. The outputs of all heads are concatenated and linearly projected back to the model's hidden dimension. This allows the model to jointly attend to information from different representation subspaces—for instance, one head may capture syntactic dependencies while another captures semantic similarities.[^2]
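The split, attend, concatenate, and project steps can be sketched as follows. This is an illustrative NumPy version with assumed names and sizes. It stores the per-head projections as column slices of single `d_model × d_model` matrices, a common implementation choice that is equivalent to keeping a separate projection matrix per head.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads   # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)      # (num_heads, n, n)
    heads = softmax(scores, axis=-1) @ V                     # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # concatenate the heads
    return concat @ W_o                                      # final linear projection

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 16, 4
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (0.1 * rng.standard_normal((d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)
```

Each head here works in a 4-dimensional subspace (16 / 4), and the output has the same shape as the input, which is what lets attention layers be stacked.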

The number of attention heads is a key architectural hyperparameter. The largest GPT-3 model (175 billion parameters), for example, uses 96 attention heads in each of its 96 transformer layers, while the much smaller BERT-base uses 12 heads in each of its 12 layers.

Computational Complexity

The primary limitation of standard self-attention is its quadratic time and memory complexity with respect to sequence length: computing attention scores requires comparing every pair of tokens, resulting in O(n²) operations for a sequence of length n. This poses challenges for very long documents or high-resolution images. Research into efficient attention variants—including sparse attention, linear attention, and sliding-window attention (as used in Longformer and Mistral)—has sought to reduce this complexity while preserving the expressive power of full attention.[^4]
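To make the saving concrete, the sketch below counts how many query-key pairs are scored under full attention and under a symmetric sliding window. The sequence length and window size are arbitrary illustrative values. The dense boolean mask is used only for counting: an efficient implementation would avoid materialising an n × n array, and decoder-only models typically use a causal (one-sided) window rather than the symmetric one shown here.

```python
import numpy as np

def sliding_window_mask(n, window):
    """Boolean (n, n) mask that is True where |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n, window = 1024, 64   # illustrative values (assumptions)

full_pairs = n * n                          # every token attends to every token
mask = sliding_window_mask(n, window)
window_pairs = int(mask.sum())              # only nearby pairs are scored

print("full attention pairs:   ", full_pairs)
print("sliding-window pairs:   ", window_pairs)
print("fraction of full cost:  ", window_pairs / full_pairs)
```

For a fixed window size the number of scored pairs grows linearly with n rather than quadratically, which is the source of the efficiency gain.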

Applications Beyond Language

Although attention mechanisms are most associated with language models, they have proven highly effective in other domains. Vision Transformers (ViT) apply self-attention directly to sequences of image patches, achieving competitive performance on image classification benchmarks. In speech recognition, attention aligns acoustic frames with textual tokens. Graph neural networks use attention to aggregate neighbourhood information selectively, and time series forecasting models employ attention to identify the most informative historical time steps.

References

  1. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015. arXiv:1409.0473.
  2. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. arXiv:1706.03762.
  3. IBM. (2024). What is an attention mechanism? IBM Think. https://www.ibm.com/think/topics/attention-mechanism
  4. Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv:2004.05150.
  5. Agmo Group. (2024). Agmo Sovereign AI: Merdeka LLM. https://www.agmo.group/agmo-sovereign-ai-merdeka-llm/