Transformer Architecture

A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel, forming the foundation of modern large language models and multimodal AI systems.

7 min read | Last updated May 2026 | Foundations

Transformer architecture is a neural network design that processes sequences of data using a mechanism called self-attention, allowing the model to weigh the relevance of each element in a sequence relative to every other element simultaneously. Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google, transformers replaced recurrent architectures as the dominant paradigm in natural language processing and subsequently expanded into vision, audio, and multimodal applications.[^1]

Background and Motivation

Prior to transformers, sequential data was processed primarily using recurrent neural networks (RNNs) and their variants, including long short-term memory (LSTM) networks. These architectures processed tokens one at a time in a fixed order, which introduced two significant limitations: they could not parallelise computation across a sequence during training, and they struggled to retain information across very long sequences due to the vanishing gradient problem.

Transformers addressed both limitations. By replacing sequential recurrence with self-attention, the architecture allows every position in a sequence to attend to every other position in a single operation, enabling full parallelisation and capturing long-range dependencies regardless of their distance in the input.

Core Components

Self-Attention

The self-attention mechanism is the fundamental operation of a transformer. For each token in an input sequence, the mechanism computes three vectors — a query (Q), a key (K), and a value (V) — derived by multiplying the token's embedding by learned weight matrices. The attention score between two tokens is computed as the dot product of the query vector of one token and the key vector of another, scaled by the square root of the key dimension to stabilise gradients. These scores are passed through a softmax function to produce a probability distribution, which is then used to compute a weighted sum of the value vectors.

The result for each token is a contextualised representation that reflects not just the token itself but its relationships with every other token in the sequence.
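The steps described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a production implementation: the helper names (`matmul`, `softmax`, `self_attention`) and the toy inputs are assumptions chosen for this example, and the identity projection matrices stand in for weights that a real model would learn.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Return (outputs, weights) for embeddings x of shape [n_tokens][d_model]."""
    # Project each token embedding into query, key, and value vectors.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Dot product of every query with every key, scaled by sqrt(d_k).
    k_transposed = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_transposed)]
    # Softmax turns each row of scores into a probability distribution.
    weights = [softmax(row) for row in scores]
    # Each output is a weighted sum of the value vectors.
    return matmul(weights, v), weights

# Toy input: 3 tokens, d_model = 2, identity projections (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, identity, identity, identity)
```

With these toy values, the first token's query matches the keys of tokens 0 and 2 equally well, so `weights[0]` gives those two positions the same weight and the second position a smaller one.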

Multi-Head Attention

Rather than computing a single attention function, transformers apply the attention mechanism multiple times in parallel using different learned projections. Each parallel instance is called an "attention head," and each head can learn to attend to different types of relationships — syntactic structure, semantic similarity, coreference, and so on. The outputs from all heads are concatenated and linearly projected to produce the final attention output. The original paper used eight attention heads.
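As a rough sketch of the concatenate-and-project step, the example below runs two heads over the same input and joins their outputs. The function names, the two-head setup, and the hand-picked projection matrices (each head simply reads half of the embedding) are assumptions made for illustration; trained models learn these projections.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    outputs = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Weighted sum of value vectors, one output feature at a time.
        outputs.append([sum(e / total * v_row[j] for e, v_row in zip(exps, v))
                        for j in range(len(v[0]))])
    return outputs

def multi_head_attention(x, heads, w_o):
    """heads is a list of (w_q, w_k, w_v) triples, one per attention head."""
    per_head = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                for w_q, w_k, w_v in heads]
    # Concatenate the head outputs for each token along the feature axis.
    concat = [[value for head in per_head for value in head[t]] for t in range(len(x))]
    # Final linear projection back to d_model.
    return matmul(concat, w_o)

# Hypothetical toy setup: d_model = 4 split across 2 heads of width 2.
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half,) * 3, (second_half,) * 3]
w_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(x, heads, w_o)
```

Because each head works in a smaller subspace (here width 2 instead of 4), the total cost stays close to that of a single full-width attention operation.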

Feed-Forward Sub-layers

After each multi-head attention operation, a position-wise feed-forward network is applied independently to each token. This sub-layer consists of two linear transformations with a non-linear activation function (typically ReLU or GELU) in between. It allows the model to perform further transformation of each token's representation after the attention step has aggregated contextual information.
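The "position-wise" behaviour can be seen in a short sketch: the same two-layer network is applied to every token separately, so a token's output does not depend on its neighbours. The names `linear` and `feed_forward` and the tiny hand-picked weights are assumptions for this example only; they are chosen so the result is easy to check by hand, not because real models use them.

```python
def relu(vector):
    """Element-wise ReLU activation."""
    return [max(0.0, value) for value in vector]

def linear(vector, weight, bias):
    """Compute vector @ weight + bias, with weight given as [d_in][d_out]."""
    return [sum(a * w for a, w in zip(vector, column)) + b
            for column, b in zip(zip(*weight), bias)]

def feed_forward(tokens, w1, b1, w2, b2):
    """Apply linear -> ReLU -> linear to each token independently."""
    return [linear(relu(linear(token, w1, b1)), w2, b2) for token in tokens]

# Hypothetical weights: d_model = 2 expanded to a hidden width of 4, then back to 2.
# They are picked so that the network reproduces its input, which makes it easy to verify.
w1 = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b2 = [0.0, 0.0]

tokens = [[2.0, -3.0], [0.5, 4.0]]
transformed = feed_forward(tokens, w1, b1, w2, b2)
```

Processing the tokens one at a time or together gives identical results, which is what allows this sub-layer to run fully in parallel across the sequence.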

Positional Encoding

Unlike RNNs, transformers have no inherent notion of position in a sequence. Positional encodings are added to token embeddings before they enter the model to inject information about each token's position. The original paper used sinusoidal functions of different frequencies, though later models adopted learned positional embeddings or schemes such as rotary positional embeddings (RoPE), which rotate the query and key vectors inside attention instead of adding a vector to the embedding.
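The sinusoidal scheme from the original paper can be written out directly: even feature indices use a sine and odd indices use a cosine, with the wavelength growing geometrically across the feature dimension. The function names below are assumptions for this sketch.

```python
import math

def sinusoidal_encoding(n_positions, d_model):
    """Build an [n_positions][d_model] table of sinusoidal positional encodings."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(d_model):
            # Features 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the positional encoding to each token embedding, element by element."""
    encoding = sinusoidal_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, encoding)]

table = sinusoidal_encoding(n_positions=8, d_model=4)
# Position 0 always encodes as alternating sin(0) = 0.0 and cos(0) = 1.0.
```

Because the encoding is a fixed function of position, it needs no training and can in principle be evaluated for positions longer than any sequence seen during training.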

Encoder and Decoder Stacks

The original transformer was designed for sequence-to-sequence tasks such as machine translation and consisted of two stacks. The encoder processes the input sequence and produces a set of contextualised representations. The decoder generates the output sequence autoregressively, attending to both its own previous outputs and the encoder's representations via cross-attention. Many subsequent models use only one of the two stacks: BERT-style models use only the encoder, while GPT-style models use only the decoder.
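Autoregressive decoding relies on a causal mask: position i may attend only to positions 0 through i. The sketch below applies that rule to a matrix of raw attention scores. The function name is an assumption for this example, and it drops the hidden positions before the softmax, which gives the same weights as the common practice of setting masked scores to negative infinity.

```python
import math

def causal_attention_weights(scores):
    """Softmax each row of an n x n score matrix while hiding positions j > i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]          # token i sees itself and earlier tokens only
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)  # future positions get zero weight
        weights.append([e / total for e in exps] + hidden)
    return weights

# With equal scores everywhere, each token spreads its attention evenly over the
# positions it is allowed to see.
uniform_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
masked = causal_attention_weights(uniform_scores)
# masked[0] is [1.0, 0.0, 0.0] and masked[1] is [0.5, 0.5, 0.0]
```

Encoder-only models such as BERT omit this mask, so every token can attend in both directions.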

Variants and Descendants

Following the original paper, a large family of transformer-based models emerged:

| Model | Year | Architecture | Key Innovation |
|-------|------|--------------|----------------|
| BERT | 2018 | Encoder-only | Bidirectional pre-training via masked language modelling |
| GPT-2 | 2019 | Decoder-only | Large-scale autoregressive language modelling |
| T5 | 2019 | Encoder-decoder | Unified text-to-text framework |
| GPT-3 | 2020 | Decoder-only | 175 billion parameters, few-shot in-context learning |
| ViT | 2020 | Encoder-only | Applying transformer to image patches |
| GPT-4 | 2023 | Decoder-only | Multimodal inputs, improved reasoning |
| Stable Diffusion 3 | 2024 | Hybrid | Replaced U-Net with transformer (DiT) |

Transformers have expanded beyond text into image generation (via diffusion transformers), video synthesis, protein structure prediction (AlphaFold 2), speech recognition (Whisper), and code generation.

Computational Considerations

The self-attention operation has quadratic complexity with respect to sequence length: a sequence of length n requires n² pairwise attention scores per head, so both the compute and the memory for the score matrix grow with n². For short sequences this is manageable, but for very long contexts, such as entire documents or high-resolution images, it becomes computationally expensive. Research into efficient attention has reduced this cost in practice: sparse and linear attention variants (for example Longformer) lower the asymptotic complexity, while FlashAttention keeps attention exact and quadratic in compute but avoids materialising the full score matrix in slow memory.
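A quick back-of-the-envelope calculation shows how fast the score matrix grows. The helper names, the head count, and the choice of 2 bytes per score (a 16-bit float) are assumptions for illustration, not figures for any particular model.

```python
def attention_score_count(n_tokens, n_heads=1):
    """Number of pairwise attention scores for one layer."""
    return n_heads * n_tokens * n_tokens

def score_matrix_bytes(n_tokens, n_heads=1, bytes_per_score=2):
    """Memory needed to store every score if the full matrix is materialised."""
    return attention_score_count(n_tokens, n_heads) * bytes_per_score

# Doubling the sequence length quadruples the number of scores.
short = attention_score_count(1024)
longer = attention_score_count(2048)

# Hypothetical layer: 4,096 tokens, 8 heads, 16-bit scores.
layer_bytes = score_matrix_bytes(4096, n_heads=8, bytes_per_score=2)
layer_mib = layer_bytes / (1024 * 1024)
```

Under these assumptions a single layer's score matrices already take 256 MiB at 4,096 tokens, and sixteen times that at 16,384 tokens.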

Hardware acceleration, particularly on graphics processing units (GPUs) and tensor processing units (TPUs), has enabled the training of transformer models with hundreds of billions of parameters. The FlashAttention algorithm, introduced in 2022, reorders attention computations to reduce memory reads and writes, yielding reported speedups of roughly two to four times and enabling training on longer sequences.[^2]
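One ingredient that makes this reordering possible is a streaming ("online") softmax: keys and values are visited block by block while a running maximum, a running normaliser, and a running weighted sum are kept, so the full row of scores never has to be held at once. The sketch below shows only that numerical idea for a single query in plain Python. It is not the FlashAttention GPU kernel, and the function names and block sizes are assumptions for illustration.

```python
import math

def attention_one_query(scores, values):
    """Reference version: softmax over all scores, then a weighted sum of values."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]

def attention_one_query_blockwise(scores, values, block_size):
    """Same result, computed one block of keys and values at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    accumulator = [0.0] * len(values[0])
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale earlier partial results so they share the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        accumulator = [a * scale for a in accumulator]
        for score, value in zip(block_scores, block_values):
            weight = math.exp(score - new_max)
            running_sum += weight
            accumulator = [a + weight * v for a, v in zip(accumulator, value)]
        running_max = new_max
    return [a / running_sum for a in accumulator]

# Illustrative scores and value vectors for one query over five keys.
scores = [0.5, -1.0, 2.0, 0.0, 3.5]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, -0.5]]
reference = attention_one_query(scores, values)
blockwise = attention_one_query_blockwise(scores, values, block_size=2)
```

Both functions return the same vector up to floating-point rounding, which is why this style of computation is described as exact attention rather than an approximation.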

Applications

Transformers underpin many state-of-the-art systems across multiple domains. In natural language processing, they power chatbots, machine translation, text summarisation, and question answering. In computer vision, vision transformers (ViTs) match or exceed convolutional networks on image classification benchmarks. In generative AI, diffusion transformers generate photorealistic images and video. In biology, transformer-based models predict protein structures and generate novel drug candidates.

Malaysian Context — Transformer Adoption in Research and Industry

Malaysia's AI ecosystem has engaged with transformer-based models primarily through cloud platforms and applied research. The Malaysia Digital Economy Corporation (MDEC), under the Ministry of Digital, has promoted the adoption of cloud-based AI services — including transformer-backed APIs from Google Cloud, Microsoft Azure, and Amazon Web Services — through its Digital Leaders programme and SME digitalisation grants under the Madani Digital initiative.

In the financial sector, Maybank's partnership with Microsoft, announced in 2024 and valued at RM1 billion, involves deploying Azure OpenAI services, which are built on GPT-series transformer models, across the group's operations. CIMB has similarly deployed transformer-based generative AI for customer-facing chatbots, reporting 94% accuracy in handling routine banking enquiries.

Malaysian universities have contributed to transformer research, particularly Universiti Malaya, Universiti Teknologi Malaysia (UTM), and Universiti Putra Malaysia (UPM), with faculty publishing on transformer applications in Malay-language NLP, medical record processing, and remote sensing. The existence of BERTa — a transformer model pre-trained on Malay-language corpora — reflects growing local investment in adapting the architecture for the national language.

For enterprises, HRD Corp (formerly HRDF) has listed transformer and large language model literacy as part of its approved training programmes, enabling employers to claim subsidies for upskilling staff in AI technologies. The National AI Office (NAIO), established in December 2024 under the Prime Minister's Department, includes transformer-based foundation models within its scope for AI governance and national capacity building.

References

[^1]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
[^2]: Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35.
[^3]: Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
[^4]: MDEC. (2025). Malaysia Digital Leaders Programme: AI Adoption Pathways. Malaysia Digital Economy Corporation.

Tags: transformer, attention mechanism, deep learning, neural network

Type	Neural network architecture
Introduced	2017 (Vaswani et al.)
Key innovation	Self-attention mechanism
Key use	Language models, image generation, speech recognition
Related	Attention mechanism, BERT, GPT, encoder-decoder
