Attention Mechanism

A neural network technique that enables models to dynamically weight the relevance of different parts of an input sequence when producing each output element, forming the core of transformer architectures.

6 min read · Last updated May 2026 · Foundations

The attention mechanism is a fundamental building block in modern deep learning that allows a neural network to selectively concentrate on the most relevant portions of its input when generating each element of its output. Unlike earlier sequence-to-sequence architectures that compressed an entire input sequence into a single fixed-length vector, attention enables models to maintain and dynamically query a richer representation of the input at every step.

The concept was first formalised in 2015 by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in the context of neural machine translation.[^1] Their model learned to align source-language words with target-language words without any explicit alignment labels, a capability that dramatically improved translation quality. Two years later, the landmark paper "Attention Is All You Need" by Vaswani et al. (2017) showed that attention alone, without recurrence or convolution, was sufficient to achieve state-of-the-art results on translation benchmarks, giving rise to the transformer architecture that now underpins virtually every large language model in use today.[^2]

How Attention Works

At its core, attention computes a weighted sum over a set of values, where the weights are determined by a compatibility function between a query and a set of keys. In practice, each token in an input sequence is projected into three distinct vectors: a query (Q), a key (K), and a value (V).

The query represents what the current token is "looking for." The keys represent what each token in the sequence, including the current one, "has to offer." The dot product of the query with each key produces a raw attention score. Each score is then divided by the square root of the key dimension, which prevents excessively large values from pushing the softmax function into regions with near-zero gradients. A softmax operation normalises the scaled scores into a probability distribution, and the resulting weights are used to form a weighted sum of the value vectors. The output is a context-aware representation of the current token that incorporates information from the entire sequence.[^3]
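The effect of the scaling step can be seen in a small numerical sketch. The key dimension (64) and the three raw scores below are hypothetical values chosen only for illustration; they do not come from any particular model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64  # assumed key dimension, for illustration only

# Hypothetical raw dot products between one query and three keys.
raw_scores = [16.0, 8.0, 0.0]

# Without scaling, almost all of the weight lands on the first key.
unscaled = softmax(raw_scores)

# Dividing by sqrt(d_k) = 8 shrinks the scores to [2.0, 1.0, 0.0],
# so the softmax output is far less peaked.
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print([round(w, 4) for w in unscaled])
print([round(w, 4) for w in scaled])
```

In the unscaled case the distribution is close to one-hot, which is the saturated regime where softmax gradients become very small. After scaling, the second and third keys still receive a meaningful share of the weight.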

Formally, scaled dot-product attention is defined as:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

where dₖ is the dimensionality of the key vectors.
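As a minimal sketch of this formula, the function below computes scaled dot-product attention with plain Python lists. The function name and the toy Q, K, and V values are assumptions made for illustration. Production implementations use batched matrix operations on accelerators rather than Python loops.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights); one output vector and one weight row per query."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Raw score per key: dot(q, k) / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy, hand-picked vectors: 2 queries, 3 key/value pairs, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs, weights = scaled_dot_product_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1. The first query aligns equally well with the first and third keys, so those two receive equal weight and the second key receives less.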

Self-Attention and Cross-Attention

When the queries, keys, and values all derive from the same sequence—a sentence attending to itself—the operation is called self-attention. Self-attention allows every token to gather contextual information from every other token in the same sequence, regardless of distance. This property overcomes the long-range dependency problem that plagued recurrent networks, where information from early positions could be "forgotten" by the time the model reached later positions.

Cross-attention, by contrast, allows one sequence to attend to a different sequence. In the encoder-decoder transformer used for translation, the decoder queries the encoder's output, so each generated target word can directly access any source word's representation. Cross-attention is also central to text-to-image diffusion models, where a visual latent representation attends to a text prompt.
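The difference between the two modes comes down to where the queries, keys, and values originate. The sketch below reuses one attention function for both. The `source` and `target` vectors are hypothetical, and the learned Q, K, and V projections are deliberately omitted (treated as identity) to keep the example short.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors; returns one output per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Hypothetical token representations (projections omitted for brevity).
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # e.g. 3 encoder-side tokens
target = [[0.5, 0.5], [1.0, -1.0]]             # e.g. 2 decoder-side tokens

# Self-attention: queries, keys, and values all come from the same sequence.
self_out = attention(source, source, source)

# Cross-attention: queries come from one sequence, keys and values from another.
cross_out = attention(target, source, source)

print(len(self_out), len(cross_out))
```

The number of output vectors always follows the query sequence: three for self-attention over `source`, two for cross-attention from `target`. Every output is a weighted mix of the `source` values.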

Multi-Head Attention

A single attention operation captures one type of relationship between tokens. Multi-head attention runs several attention operations in parallel, each with independently learned Q, K, and V projection matrices. The outputs of all heads are concatenated and linearly projected back to the model's hidden dimension. This allows the model to jointly attend to information from different representation subspaces—for instance, one head may capture syntactic dependencies while another captures semantic similarities.[^2]

The number of attention heads is a key architectural hyperparameter. The largest GPT-3 model, for example, uses 96 attention heads in each of its 96 transformer layers, while smaller models such as BERT-base use 12 heads in each of their 12 layers.
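The split-attend-concatenate pattern can be sketched as follows. This is a simplification, not a full implementation: the learned per-head Q, K, and V projections and the final output projection are omitted (treated as identity), so each head simply works on its own slice of the hidden vector. The function names and the toy input are assumptions for illustration.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def split_heads(vectors, num_heads):
    """Split each d_model-dimensional vector into num_heads equal slices."""
    d_model = len(vectors[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [[v[h * d_head:(h + 1) * d_head] for v in vectors]
            for h in range(num_heads)]

def multi_head_self_attention(vectors, num_heads):
    """Run self-attention independently per head, then concatenate per token."""
    heads = split_heads(vectors, num_heads)
    head_outputs = [attention(h, h, h) for h in heads]
    return [[x for h in range(num_heads) for x in head_outputs[h][t]]
            for t in range(len(vectors))]

# Toy input: 3 tokens with hidden size 4, processed by 2 heads of size 2.
tokens = [[1.0, 0.0, 0.0, 2.0],
          [0.0, 1.0, 2.0, 0.0],
          [1.0, 1.0, 1.0, 1.0]]
out = multi_head_self_attention(tokens, num_heads=2)
print(len(out), len(out[0]))
```

The output has the same shape as the input (3 tokens, hidden size 4). Each head computes its own attention weights from its own slice, which is what lets different heads specialise in different relationships.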

Computational Complexity

The primary limitation of standard self-attention is its quadratic time and memory complexity with respect to sequence length: computing attention scores requires comparing every pair of tokens, resulting in O(n²) operations for a sequence of length n. This poses challenges for very long documents or high-resolution images. Research into efficient attention variants—including sparse attention, linear attention, and sliding-window attention (as used in Longformer and Mistral)—has sought to reduce this complexity while preserving the expressive power of full attention.[^4]
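A rough way to see the difference in growth is to count how many query-key pairs are scored. The sketch below compares full attention with a simplified symmetric sliding window. The window size of 128 and both function names are assumptions for illustration. Real variants differ in detail, for example by combining local windows with a few global tokens or by restricting the window to earlier positions.

```python
def full_attention_pairs(n):
    """Pairs scored by full self-attention: every token against every token."""
    return n * n

def sliding_window_pairs(n, window):
    """Pairs scored when each token attends only to itself and to
    neighbours at most `window` positions away on either side."""
    total = 0
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n - 1, i + window)
        total += hi - lo + 1
    return total

for n in (1_000, 2_000, 4_000):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=128))
```

Doubling the sequence length quadruples the pair count for full attention but only roughly doubles it for the windowed variant, because the cost of the latter grows linearly in n for a fixed window.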

Applications Beyond Language

Although attention mechanisms are most associated with language models, they have proven highly effective in other domains. Vision Transformers (ViT) apply self-attention directly to sequences of image patches, achieving competitive performance on image classification benchmarks. In speech recognition, attention aligns acoustic frames with textual tokens. Graph neural networks use attention to aggregate neighbourhood information selectively, and time series forecasting models employ attention to identify the most informative historical time steps.

Malaysian Context — Attention in Local AI Development

Malaysia's growing AI ecosystem has begun to engage directly with transformer and attention-based models for local language tasks. The Merdeka LLM initiative, developed by Agmo Group in collaboration with Malaysian government-linked bodies, applies transformer self-attention to Bahasa Malaysia text, addressing the relative scarcity of high-quality Malay training data compared with English.[^5] The model is intended to support sovereign AI objectives under Malaysia's MyDIGITAL blueprint, which emphasises developing domestic AI capabilities rather than relying entirely on foreign models.

MDEC (Malaysia Digital Economy Corporation) has funded several research collaborations between Malaysian universities—including Universiti Malaya (UM) and Universiti Teknologi Malaysia (UTM)—and industry partners to fine-tune transformer-based models on local datasets covering legal documents, medical records in Malay, and multilingual code-switching text common in Malaysian communication. These projects require a firm grounding in attention mechanics to adapt pre-trained models to the nuances of Malaysian English, Bahasa Malaysia, and Mandarin dialects.

Financial institutions such as Maybank and CIMB deploy attention-based NLP models in their customer service and document processing pipelines. Sentiment analysis of customer feedback, automatic summarisation of loan agreements, and intent detection in chatbot interactions all rely on the contextual representations produced by transformer self-attention. Bank Negara Malaysia (BNM) has noted in its Financial Technology Regulatory Sandbox guidance that explainability requirements for AI in credit decisioning create demand for attention visualisation techniques, which allow compliance teams to see which parts of an applicant's financial history the model weighted most heavily.

For students and professionals entering the AI field, HRD Corp (Human Resource Development Corporation) has approved several transformer-focused training programmes offered by local upskilling providers, recognising attention mechanisms as a core competency for Malaysian AI practitioners.

References

[^1]: Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015. arXiv:1409.0473.
[^2]: Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. arXiv:1706.03762.
[^3]: IBM. (2024). What is an attention mechanism? IBM Think. https://www.ibm.com/think/topics/attention-mechanism
[^4]: Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv:2004.05150.
[^5]: Agmo Group. (2024). Agmo Sovereign AI: Merdeka LLM. https://www.agmo.group/agmo-sovereign-ai-merdeka-llm/

Tags: attention, transformer, self-attention, deep-learning

Type: Neural network component
Introduced: 2015 (Bahdanau et al.); popularised 2017 (Vaswani et al.)
Key use: Sequence modelling, language understanding, image recognition
Variants: Self-attention, cross-attention, multi-head attention
Related: Transformer architecture, BERT, GPT, Large Language Models
