Grouped-Query Attention (GQA)

An attention variant that partitions query heads into groups sharing a single set of keys and values, reducing memory bandwidth during inference while retaining most of the quality of full multi-head attention.

Last updated: July 2026 | Category: Foundations

Grouped-Query Attention, abbreviated GQA, is a variant of the attention mechanism used in transformer models that reduces the memory and bandwidth cost of inference while preserving most of the accuracy of standard attention. It was introduced by Google researchers in the May 2023 paper "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." GQA partitions the query heads of a transformer into several groups, with each group sharing a single set of key and value projections, striking a balance between two earlier designs.

The problem GQA solves

In standard multi-head attention, each attention head has its own query, key and value projections. During autoregressive generation, the keys and values for all previously generated tokens are stored in the key-value cache (KV cache) so they do not have to be recomputed at each step. The size of this cache grows with the number of layers, the number of key-value heads, the head dimension, the sequence length and the batch size. Reading it from memory at every generation step becomes the dominant bottleneck for large models producing long outputs, because decoding is limited by memory bandwidth rather than raw computation.
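The scaling described above can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: the function name `kv_cache_bytes` and the example dimensions (80 layers, 64 heads of dimension 128, 16-bit values) are assumptions chosen for the arithmetic, not figures taken from a specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Estimate the KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model: 80 layers, 64 key-value heads of dimension 128, 16-bit values.
size = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128,
                      seq_len=4096, batch_size=1)
print(f"{size / 2**30:.1f} GiB")  # prints "10.0 GiB"
```

Under these assumptions a single 4,096-token sequence already needs about 10 GiB of cache, and the figure scales linearly with batch size and sequence length.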

Multi-query attention, an earlier proposal, addresses this by having all query heads share one single set of keys and values. This shrinks the KV cache dramatically and speeds up generation, but the aggressive sharing can reduce model quality and make training less stable.

The GQA approach

Grouped-query attention interpolates between the two extremes. Instead of one shared key-value set for every head, or a separate set per head, the query heads are divided into a small number of equally sized groups, and each group shares one key-value set. If the number of groups equals the number of heads, GQA reduces to ordinary multi-head attention; if there is only one group, it reduces to multi-query attention. A common configuration uses eight key-value groups: Llama 2 70B, for example, pairs 64 query heads with 8 key-value heads. This captures most of the memory savings while keeping accuracy close to full multi-head attention.
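The grouping can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it assumes the queries, keys and values have already been projected, omits the causal mask and batching, and assumes consecutive query heads form a group. The function name `grouped_query_attention` is hypothetical.

```python
import numpy as np


def grouped_query_attention(q, k, v):
    """Unmasked grouped-query attention for a single sequence.

    q:    (num_q_heads, seq_len, head_dim)
    k, v: (num_kv_heads, seq_len, head_dim)
    num_q_heads must be divisible by num_kv_heads.
    """
    num_q_heads, _, head_dim = q.shape
    num_kv_heads = k.shape[0]
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads

    # Repeat each key-value head so that every query head in a group
    # attends over the same keys and values.
    k_shared = np.repeat(k, group_size, axis=0)
    v_shared = np.repeat(v, group_size, axis=0)

    scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ v_shared


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))          # 8 query heads
for num_kv_heads in (8, 2, 1):               # multi-head, grouped, multi-query
    k = rng.standard_normal((num_kv_heads, 5, 16))
    v = rng.standard_normal((num_kv_heads, 5, 16))
    out = grouped_query_attention(q, k, v)
    print(num_kv_heads, out.shape)
```

The output shape is the same in all three cases; only the number of key-value heads that must be cached changes. Real implementations usually avoid materialising the repeated keys and values and instead index or broadcast them inside the attention kernel.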

The table summarises the three schemes.

| Scheme | Key-value sets | KV cache size | Quality |
| --- | --- | --- | --- |
| Multi-head attention | One per head | Largest | Highest |
| Grouped-query attention | One per group | Moderate | Near multi-head |
| Multi-query attention | One shared | Smallest | Slightly reduced |

An appealing practical detail is that a GQA model can be created by "uptraining" an existing multi-head checkpoint. The key and value projection heads within each group are mean-pooled into a single head, and the model is then trained for a short additional period (about 5% of the original pre-training compute in the paper) to recover quality, avoiding the cost of training from scratch.
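The conversion step of uptraining can be sketched as follows. This covers only the averaging of projection weights, not the additional training that follows. The weight layout `(num_heads, d_model, head_dim)` and the function name `mean_pool_kv_heads` are assumptions made for illustration; real checkpoints often store the heads fused into a single matrix that must be reshaped first.

```python
import numpy as np


def mean_pool_kv_heads(proj, num_groups):
    """Average per-head key or value projection weights within each group.

    proj:    (num_heads, d_model, head_dim) weights of a multi-head checkpoint
    returns: (num_groups, d_model, head_dim) weights for the grouped model
    """
    num_heads, d_model, head_dim = proj.shape
    if num_heads % num_groups != 0:
        raise ValueError("num_heads must be divisible by num_groups")
    group_size = num_heads // num_groups
    # Consecutive heads are assumed to belong to the same group.
    grouped = proj.reshape(num_groups, group_size, d_model, head_dim)
    return grouped.mean(axis=1)


# Hypothetical key projection with 8 heads, pooled into 2 groups of 4 heads.
w_k = np.arange(8 * 4 * 2, dtype=float).reshape(8, 4, 2)
pooled = mean_pool_kv_heads(w_k, num_groups=2)
print(pooled.shape)  # prints "(2, 4, 2)"
```

Setting `num_groups=1` gives the multi-query case, where all heads are averaged into one.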

Impact and adoption

GQA has become a standard component of large language model design. Meta adopted it for the larger Llama 2 models in July 2023 and retained it across the Llama 3 family released in 2024. Mistral, Qwen and many other open-weight models use GQA as well. Its widespread use reflects a broader trend in which architectural choices are increasingly driven by the economics of inference serving, since a smaller KV cache allows longer context windows, larger batch sizes and lower serving cost per request.
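The link between cache size and context length can be illustrated by asking how many tokens fit in a fixed memory budget under each scheme. The numbers below are hypothetical (a 16 GiB cache budget, 80 layers, head dimension 128, 16-bit values), and `max_cached_tokens` is an illustrative helper rather than part of any serving framework.

```python
def max_cached_tokens(budget_bytes, num_layers, num_kv_heads, head_dim,
                      bytes_per_value=2):
    """Number of tokens whose keys and values fit in a memory budget."""
    # Factor of 2: one key vector and one value vector per head per layer.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return budget_bytes // bytes_per_token


budget = 16 * 2**30  # hypothetical 16 GiB reserved for the KV cache
schemes = [("multi-head", 64), ("grouped-query", 8), ("multi-query", 1)]
for name, kv_heads in schemes:
    tokens = max_cached_tokens(budget, num_layers=80,
                               num_kv_heads=kv_heads, head_dim=128)
    print(f"{name}: {tokens} tokens")
```

With these assumed dimensions, moving from 64 to 8 key-value heads lets roughly eight times as many tokens share the same budget, which a serving system can spend on longer contexts, larger batches, or both.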

Because GQA reduces the memory footprint of each request, it interacts closely with serving systems that use paged KV cache management and continuous batching to maximise the number of concurrent users a single accelerator can support.

Malaysian Context — Efficient Inference for Local Deployment

Inference efficiency techniques such as grouped-query attention are directly relevant to Malaysia's push to run large language models on domestic infrastructure. The YTL AI Cloud in Kulai, Johor, powered by Nvidia Grace Blackwell GPUs, hosts workloads including ILMU, Malaysia's sovereign large language model. Serving such models cost-effectively to banks, government agencies and enterprises depends on techniques that shrink the KV cache and raise throughput, and GQA is one of the most important.

For Malaysian financial institutions such as Maybank and CIMB that are exploring on-premise or private-cloud AI assistants, memory-efficient attention allows longer customer conversations and larger document contexts to be handled on a fixed hardware budget. Keeping these workloads in-house in turn supports compliance with Bank Negara Malaysia guidance on keeping sensitive data within controlled environments.

Malaysian AI engineers trained through HRD Corp initiatives and university programmes routinely encounter GQA when fine-tuning and deploying open-weight models such as Llama and Mistral. Understanding it is essential for teams at MIMOS, local technology firms and cloud service providers who must optimise the cost of running generative AI services for the Southeast Asian market.

References

Ainslie, J. et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245.
Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150.
IBM. (2024). What is grouped query attention (GQA)? IBM Think Topics.

Tags: attention, transformers, inference optimization, large language models

Type: Attention mechanism variant
Introduced: May 2023 (Google Research)
Purpose: Efficient LLM inference
Sits between: Multi-head and multi-query attention
Adopted in: Llama 2, Llama 3, Mistral, and others
Related: KV cache, Attention mechanism