What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

Mixture of Experts

Mixture of Experts (MoE) is a machine learning architecture in which a model routes each input to a small subset of specialised sub-networks called experts, enabling large model capacity at a fraction of the compute cost.

6 min read | Last updated June 2026 | Foundations

Mixture of Experts (MoE) is a machine learning architecture in which a model is composed of multiple specialised sub-networks, each called an expert, together with a gating network that selects which experts process a given input. Instead of routing every input through every parameter in the model, only a small fraction of experts are activated per token or per sample, allowing the overall parameter count to grow substantially without a proportional increase in computation.

Historical Background

The concept of mixing specialised experts dates to work by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton in 1991. Their original formulation used a gating network to compute a weighted average over competing networks trained on different regions of the input space. For decades, MoE remained a relatively niche research technique applied to comparatively small models.

Interest was revived when Google researchers applied sparse MoE layers to large neural language models. The Sparsely-Gated Mixture-of-Experts layer (2017), originally inserted between stacked LSTM layers, demonstrated that MoE could be applied to language modelling at scale, offering orders-of-magnitude increases in parameter count with manageable inference costs. The Switch Transformer (2021) and the subsequent GLaM model carried the approach into transformers, establishing MoE as a serious approach for large language models.

Architecture

In a modern MoE transformer, the feed-forward network (FFN) sublayer inside each transformer block is replaced by a set of N expert FFNs. A lightweight router network — typically a learned linear projection followed by a softmax — assigns a score to each expert for a given token. The top-K experts by score are selected and their outputs are combined, usually via a weighted sum proportional to the gating scores.
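The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the function names (`route_top_k`, `moe_layer`), the toy experts, and the router weights are all hypothetical, and renormalising the selected gate scores so they sum to 1 is one common choice rather than a universal rule.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are the softmax gate scores renormalised over the selected
    experts (an assumption; some models skip the renormalisation).
    """
    gates = softmax(router_logits)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in top)
    return [(i, gates[i] / norm) for i in top]

def moe_layer(token, router_weights, experts, k):
    """Apply a toy MoE layer to a single token vector."""
    # Router: a linear projection producing one logit per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    output = [0.0] * len(token)
    for idx, weight in route_top_k(logits, k):
        expert_out = experts[idx](token)  # only the selected experts run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Four hypothetical "experts": each simply scales its input by a constant.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
# Hypothetical router weights: one row per expert, one column per input dim.
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

token = [0.5, 1.5]
print(route_top_k([0.5, 1.5, 2.0, -2.0], k=2))
print(moe_layer(token, router_weights, experts, k=2))
```

With top-2 routing over four experts, two of the four expert functions are never called for this token, which is where the compute saving comes from.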

A model described as having N experts with top-K routing activates K of N experts per token. DeepSeek-V3, for example, has 671 billion total parameters but activates only 37 billion per token, routing each token to 8 of 256 routed experts in each MoE layer (plus one shared expert that is always active). Kimi K2, released in 2025, extends this to 1 trillion total parameters with 384 experts and top-8 routing.
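As a quick check on these figures, the short calculation below compares the share of parameters that are active per token with the share of experts that are selected. The helper name `fraction` is hypothetical; the numbers are the ones quoted in the text.

```python
def fraction(part, whole):
    """Return part / whole as a float."""
    return part / whole

# DeepSeek-V3: 37 billion active out of 671 billion total parameters.
deepseek_active_share = fraction(37e9, 671e9)
print(f"DeepSeek-V3 active parameters: {deepseek_active_share:.1%}")  # 5.5%

# Kimi K2: 8 experts selected out of 384 per MoE layer.
kimi_expert_share = fraction(8, 384)
print(f"Kimi K2 experts selected per token: {kimi_expert_share:.1%}")  # 2.1%
```

The active-parameter share of a model is generally higher than its K/N expert ratio, because the attention sublayers, embeddings, and any shared experts run for every token regardless of routing.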

The attention sublayers in MoE transformers typically remain dense; sparsity applies to the FFN sublayers only. This hybrid design preserves the global attention capacity that makes transformers effective at long-range reasoning while achieving compute savings through sparse expert routing.

Load Balancing

A persistent challenge in MoE training is expert collapse, where the router learns to send most tokens to a small number of experts, leaving the remainder underutilised. Modern MoE implementations address this through auxiliary load-balancing losses that penalise unequal expert utilisation, or through techniques such as expert-choice routing, where each expert selects its own top-K tokens rather than each token selecting its own top-K experts.
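One widely cited form of auxiliary load-balancing loss is the one used in the Switch Transformer: N times the sum over experts of f_i (the fraction of tokens dispatched to expert i) multiplied by P_i (the mean router probability assigned to expert i). The sketch below assumes top-1 routing and omits the scaling coefficient that is normally applied before adding the term to the training loss; the function name and the toy inputs are hypothetical.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss: N * sum_i(f_i * P_i).

    router_probs: one list of per-expert probabilities for each token.
    assignments:  the expert index each token was dispatched to (top-1).
    """
    num_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to each expert.
    f = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / num_tokens
    # P_i: mean router probability mass given to each expert.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced case: four tokens spread evenly over four experts.
balanced = load_balancing_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    num_experts=4,
)

# Collapsed case: the router sends every token to expert 0.
collapsed = load_balancing_loss(
    router_probs=[[0.97, 0.01, 0.01, 0.01]] * 4,
    assignments=[0, 0, 0, 0],
    num_experts=4,
)

print(balanced, collapsed)
```

The loss reaches its minimum of 1.0 under perfectly uniform routing and grows as tokens concentrate on fewer experts, so minimising it pushes the router away from collapse.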

Efficiency and Scaling

The appeal of MoE is that it decouples model capacity from compute. A dense model with P parameters uses all P parameters for every token during both training and inference. An MoE model with the same nominal parameter count uses only a fraction of them per forward pass, resulting in lower floating-point operations per token while retaining access to the full knowledge capacity of all experts.
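The decoupling can be made concrete with a rough FLOP count for a single FFN sublayer. This is a back-of-the-envelope sketch under stated assumptions: a two-matrix FFN without biases, two FLOPs per weight per token (one multiply and one add), and hypothetical dimensions (`d_model=1024`, `d_ff=4096`, 8 experts, top-2 routing) chosen only for illustration.

```python
def ffn_params(d_model, d_ff):
    """Weights in one FFN: up-projection plus down-projection, no biases."""
    return 2 * d_model * d_ff

def ffn_flops_per_token(d_model, d_ff):
    """Forward FLOPs for one FFN, counting a multiply-add as 2 FLOPs."""
    return 2 * ffn_params(d_model, d_ff)

def moe_params(d_model, d_ff, num_experts):
    """All expert FFNs plus a linear router (d_model x num_experts)."""
    return num_experts * ffn_params(d_model, d_ff) + d_model * num_experts

def moe_flops_per_token(d_model, d_ff, num_experts, top_k):
    """Only top_k experts run per token; the router always runs."""
    router_flops = 2 * d_model * num_experts
    return top_k * ffn_flops_per_token(d_model, d_ff) + router_flops

# Hypothetical layer sizes, for illustration only.
d_model, d_ff, num_experts, top_k = 1024, 4096, 8, 2

total_params = moe_params(d_model, d_ff, num_experts)
moe_flops = moe_flops_per_token(d_model, d_ff, num_experts, top_k)
# A dense layer holding the same number of weights touches all of them.
dense_flops_same_params = 2 * total_params

print(f"MoE layer parameters:        {total_params:,}")
print(f"MoE FLOPs per token:         {moe_flops:,}")
print(f"Dense FLOPs, same params:    {dense_flops_same_params:,}")
print(f"MoE / dense FLOP ratio:      {moe_flops / dense_flops_same_params:.2f}")
```

Under these assumptions the MoE layer performs roughly K/N of the per-token work of a dense layer with the same parameter count, plus a small router cost.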

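In practice, MoE models often achieve better perplexity per training FLOP than dense models trained with the same compute budget. The DeepSeek-V3 technical report puts the cost of its final training run at approximately USD 5.6 million (2.788 million H800 GPU-hours at an assumed rental price of USD 2 per GPU-hour), a figure that excludes prior research and ablation experiments and is significantly less than would be required for a comparably capable dense model. By 2025, MoE architectures accounted for the majority of open-source frontier model releases.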

The trade-off is infrastructure complexity. Every expert must be held in GPU memory, even those not active for a given batch, so an MoE model needs far more memory than a dense model with the same per-token compute. At high batch sizes tokens spread across many experts and the per-token savings accumulate, but at very low batch sizes most loaded experts sit idle and the memory overhead can outweigh the compute savings. Serving MoE models efficiently requires careful attention to expert parallelism, routing overhead, and communication costs in distributed inference clusters.
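A rough estimate shows the size of the gap between what must be loaded and what is used per token. The sketch below uses the DeepSeek-V3 parameter counts quoted earlier and assumes 1 byte per parameter (8-bit weights); that precision is an assumption for illustration, and the estimate ignores the KV cache, activations, and framework overhead. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

BYTES_PER_PARAM = 1  # assumption: 8-bit weights

# All experts must be resident, so the full parameter count is loaded.
loaded_gb = weight_memory_gb(671e9, BYTES_PER_PARAM)
# Only the active parameters are read for any single token.
active_gb = weight_memory_gb(37e9, BYTES_PER_PARAM)

print(f"Weights that must be loaded: {loaded_gb:.0f} GB")
print(f"Weights used per token:      {active_gb:.0f} GB")
print(f"Loaded-to-active ratio:      {loaded_gb / active_gb:.1f}x")
```

A footprint of this size exceeds the memory of any single current GPU, which is why expert parallelism across multiple devices is the usual serving strategy.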

Comparison with Dense Models

| Property | Dense Model | MoE Model |
|---|---|---|
| Parameters used per token | All | Top-K experts only |
| Training compute per token | High | Lower |
| Inference memory | Proportional to active params | Full model must be loaded |
| Expert specialisation | None | Implicit, emergent |
| Implementation complexity | Low | Higher |

Applications

MoE architectures are used primarily in large language models, where scaling model capacity is beneficial but compute budgets are constrained. They have also been applied to multimodal models combining vision and language, and to mixture-of-modality routing where different experts handle different input types. Research in 2025 explored applying MoE principles to diffusion models and to the attention sublayer itself through sparse attention routing.

Malaysian Context — MoE and AI Infrastructure

Malaysia's national AI strategy, as articulated in the Malaysia National Artificial Intelligence Roadmap 2021-2025 and the Malaysia Digital Economy Blueprint (MyDIGITAL), emphasises building local AI capability while participating in the global model ecosystem. MoE architectures are directly relevant to these goals because they lower the per-token inference cost of frontier models, making deployment more economically viable for local cloud providers and enterprises.

MDEC and the National AI Office Malaysia have highlighted AI infrastructure as a priority investment area. Several hyperscaler data centres established in Malaysia — including facilities operated by Microsoft, Google, and AWS — run inference workloads on MoE models for enterprise customers in the region. Malaysian banks such as Maybank and CIMB, and telecommunications companies including TM and Maxis, access frontier AI capabilities through these providers and benefit indirectly from the efficiency gains that MoE enables.

Local AI companies and startups building on top of open-source MoE models such as Mixtral, DeepSeek-V3, and Kimi K2 can leverage their high capability-to-cost ratio, which is particularly important in a market where inference budgets are more constrained than in larger economies. HRD Corp and university AI programmes such as those at Universiti Malaya and UTM have begun including MoE concepts in advanced AI curricula as the architecture moves from research novelty to industry standard.

The Malaysian government's investment attraction efforts under Malaysia Digital (formerly MSC Malaysia) have drawn AI chip distributors and inference-as-a-service providers, meaning that enterprises in Malaysia increasingly have access to the specialised GPU infrastructure required to deploy MoE models efficiently. Petronas, AirAsia, and Grab Malaysia are among the larger organisations that have evaluated foundation model deployment for internal applications where MoE efficiency directly affects operational cost.

References

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
Shazeer, N. et al. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. Proceedings of ICLR 2017.
Fedus, W., Zoph, B., and Shazeer, N. (2022). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39.
DeepSeek-AI. (2024). DeepSeek-V3 technical report. arXiv:2412.19437.
NVIDIA. (2025). Mixture of Experts powers the most intelligent frontier AI models. NVIDIA Blog.

Tags: mixture-of-experts, moe, transformer, sparse-model, scaling

Type	Neural network architecture
Key concept	Sparse conditional computation
Introduced	1991 (Jacobs et al.); popularised in LLMs 2022–2025
Notable models	Mixtral 8x7B, GPT-4 (widely reported, not officially confirmed), DeepSeek-V3, Kimi K2
Related	Transformer architecture, Scaling laws, Inference