Mixture of Experts
Mixture of Experts (MoE) is a machine learning architecture in which a model is composed of multiple specialised sub-networks, each called an expert, together with a gating network that selects which experts process a given input. Instead of routing every input through every parameter in the model, only a small fraction of experts are activated per token or per sample, allowing the overall parameter count to grow substantially without a proportional increase in computation.
Historical Background
The concept of mixing specialised experts dates to work by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton in 1991. Their original formulation used a gating network to compute a weighted average over competing networks trained on different regions of the input space. For decades, MoE remained a relatively niche research technique applied to comparatively small models.
Interest was revived when Google researchers applied sparse MoE layers to large neural language models. The Sparsely-Gated Mixture-of-Experts layer (2017), which was inserted between stacked LSTM layers rather than into a transformer, demonstrated that MoE could be applied to language modelling and machine translation at scale, increasing parameter count by orders of magnitude at manageable computational cost. The Switch Transformer (2021) and the subsequent GLaM model applied the approach to transformers at larger scale, establishing MoE as a serious approach for large language models.
Architecture
In a modern MoE transformer, the feed-forward network (FFN) sublayer inside each transformer block is replaced by a set of N expert FFNs. A lightweight router network (typically a learned linear projection followed by a softmax) assigns a score to each expert for a given token. The top-K experts by score are selected and their outputs are combined, usually as a weighted sum in which each expert's weight is its gating score.
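The routing step described above can be sketched for a single token as follows. This is a minimal illustration rather than any particular model's implementation: the toy dimensions, the random weights, the helper names (`make_expert`, `moe_forward`), and the choice to renormalise the gating scores over the selected experts are all assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_expert(d_model, d_hidden, rng):
    """Build a toy two-layer ReLU FFN expert with random weights (hypothetical)."""
    w_in = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def expert(x):
        hidden = [max(0.0, h) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)

    return expert


def moe_forward(x, router_weights, experts, k):
    """Route one token vector x to its top-k experts and mix their outputs."""
    gate_probs = softmax(matvec(router_weights, x))  # one score per expert
    ranked = sorted(range(len(experts)), key=lambda i: gate_probs[i], reverse=True)
    top = ranked[:k]
    # Assumption: renormalise the scores over the selected experts only.
    norm = sum(gate_probs[i] for i in top)
    output = [0.0] * len(x)
    for i in top:  # only the selected experts are evaluated
        weight = gate_probs[i] / norm
        for j, value in enumerate(experts[i](x)):
            output[j] += weight * value
    return output, top


rng = random.Random(0)
d_model, d_hidden, n_experts, k = 8, 16, 4, 2
experts = [make_expert(d_model, d_hidden, rng) for _ in range(n_experts)]
router_weights = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(n_experts)]
token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

output, chosen = moe_forward(token, router_weights, experts, k)
print("experts used:", chosen)
print("output dimension:", len(output))
```

Only `k` of the `n_experts` expert functions are called for the token, which is where the compute saving comes from; the router itself is a single small matrix-vector product.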
A model described as having N experts with top-K routing activates K of N experts per token. DeepSeek-V3, for example, has 671 billion total parameters but activates only 37 billion per token, routing each token to 8 of the 256 routed experts in each MoE layer alongside one shared expert that is always active. Kimi K2, released in 2025, extends this to 1 trillion total parameters with 384 experts and top-8 routing.
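The figures above can be turned into two simple ratios. The short sketch below uses only the numbers quoted in this section; the function name `fraction` is chosen for illustration.

```python
def fraction(part, whole):
    """Return part / whole as a float."""
    return part / whole


# DeepSeek-V3: share of all parameters that is active for one token.
deepseek_active = fraction(37e9, 671e9)
print(f"DeepSeek-V3 active parameters per token: {deepseek_active:.1%}")  # 5.5%

# Kimi K2: share of experts selected per token in an MoE layer.
kimi_expert_share = fraction(8, 384)
print(f"Kimi K2 experts selected per token: {kimi_expert_share:.1%}")  # 2.1%
```

The two ratios measure different things. The share of experts selected is generally smaller than the share of parameters that is active, because attention layers, embeddings, and any other non-expert parameters are used for every token.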
The attention sublayers in MoE transformers typically remain dense; sparsity applies to the FFN sublayers only. This hybrid design preserves the global attention capacity that makes transformers effective at long-range reasoning while achieving compute savings through sparse expert routing.
Load Balancing
A persistent challenge in MoE training is expert collapse, where the router learns to send most tokens to a small number of experts, leaving the remainder underutilised. Modern MoE implementations address this through auxiliary load-balancing losses that penalise unequal expert utilisation, or through techniques such as expert-choice routing, where each expert selects its own top-K tokens rather than each token selecting its own top-K experts.
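Both remedies can be sketched in a few lines. The auxiliary loss below follows the form used in the Switch Transformer paper, N · Σ f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i; the scaling coefficient applied in training is omitted, top-1 assignment is assumed, and the function names and toy probabilities are hypothetical.

```python
def load_balancing_loss(router_probs, assignments, n_experts):
    """Switch-style auxiliary loss: n_experts * sum_i f_i * P_i.

    router_probs: one list of per-expert probabilities per token.
    assignments:  the expert index each token was sent to (top-1 assumed).
    """
    n_tokens = len(assignments)
    frac_tokens = [0.0] * n_experts  # f_i: share of tokens sent to expert i
    mean_prob = [0.0] * n_experts    # P_i: mean router probability of expert i
    for probs, chosen in zip(router_probs, assignments):
        frac_tokens[chosen] += 1.0 / n_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


def expert_choice(router_probs, capacity):
    """Expert-choice routing: each expert takes its `capacity` best-scoring tokens."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    picks = []
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: router_probs[t][e], reverse=True)
        picks.append(ranked[:capacity])
    return picks


# Four tokens, four experts: each token prefers a different expert.
balanced_probs = [[0.7 if i == t else 0.1 for i in range(4)] for t in range(4)]
balanced = load_balancing_loss(balanced_probs, [0, 1, 2, 3], 4)

# Same router confidence, but every token is sent to expert 0.
collapsed_probs = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]
collapsed = load_balancing_loss(collapsed_probs, [0, 0, 0, 0], 4)

print(f"balanced loss:  {balanced:.2f}")   # 1.00
print(f"collapsed loss: {collapsed:.2f}")  # 2.80

picks = expert_choice(balanced_probs, capacity=1)
print("tokens chosen by each expert:", picks)
```

The loss is smallest when tokens are spread evenly and grows as routing concentrates on a few experts. Expert-choice routing sidesteps the problem by construction, since every expert receives exactly `capacity` tokens, at the price that a given token may be picked by several experts or by none.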
Efficiency and Scaling
The appeal of MoE is that it decouples model capacity from compute. A dense model with P parameters uses all P parameters for every token during both training and inference. An MoE model with the same nominal parameter count uses only a fraction of them per forward pass, resulting in lower floating-point operations per token while retaining access to the full knowledge capacity of all experts.
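A rough per-token operation count makes this decoupling concrete. The sketch below assumes that multiplying a vector by a d × h weight matrix costs about 2 · d · h floating-point operations, that each FFN consists of two such matrices, and that the dense baseline has a single FFN as wide as all experts combined; the layer sizes are hypothetical and attention is ignored.

```python
def ffn_flops_per_token(d_model, d_hidden):
    """Approximate FLOPs for one two-matrix FFN applied to one token."""
    return 2 * 2 * d_model * d_hidden


def moe_flops_per_token(d_model, d_hidden, n_experts, k):
    """Approximate FLOPs for a top-k MoE layer: router plus k expert FFNs."""
    router = 2 * d_model * n_experts
    return router + k * ffn_flops_per_token(d_model, d_hidden)


# Hypothetical layer: 64 experts of hidden width 2048, top-2 routing.
d_model, d_hidden, n_experts, k = 4096, 2048, 64, 2

# Dense layer with the same FFN parameter count as all experts together.
dense = ffn_flops_per_token(d_model, n_experts * d_hidden)
sparse = moe_flops_per_token(d_model, d_hidden, n_experts, k)

ratio = sparse / dense
print(f"MoE / dense FFN FLOPs per token: {ratio:.2%}")  # 3.15%
```

Under these assumptions the ratio is close to K/N (here 2/64), with the small remainder coming from the router projection.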
In practice, MoE models often achieve better perplexity per training FLOP than comparably sized dense models. DeepSeek-V3 was trained at a reported cost of approximately USD 5.6 million for the final training run (2.788 million H800 GPU-hours priced at an assumed USD 2 per GPU-hour, excluding prior research and ablation experiments), significantly less than would be required for a comparably capable dense model. By 2025, MoE architectures accounted for the majority of open-source frontier model releases.
The trade-off is infrastructure complexity. MoE models require more GPU memory to load all experts, even those not active for a given batch. At high batch sizes the per-token savings accumulate, but at very low batch sizes the memory overhead can outweigh the compute savings. Serving MoE models efficiently requires careful attention to expert parallelism, routing overhead, and communication costs in distributed inference clusters.
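The memory side of this trade-off can be estimated with simple arithmetic. The sketch below uses the DeepSeek-V3 and Kimi K2 figures quoted earlier and makes several simplifying assumptions that are not from any model's documentation: weights are stored at 2 bytes per parameter, the KV cache and activations are ignored, and each token picks its K experts uniformly and independently of other tokens.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed to hold the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def expected_distinct_experts(n_experts, k, batch_tokens):
    """Expected number of experts hit by a batch under uniform top-k routing."""
    miss_probability = (1.0 - k / n_experts) ** batch_tokens
    return n_experts * (1.0 - miss_probability)


# All weights must be resident, even though few are used per token.
resident = weight_memory_gb(671e9, 2)
active = weight_memory_gb(37e9, 2)
print(f"weights resident in memory: {resident:.0f} GB")  # 1342 GB
print(f"weights used per token:     {active:.0f} GB")    # 74 GB

# How many of 384 experts a layer touches as the batch grows (top-8 routing).
for batch in (1, 16, 256):
    touched = expected_distinct_experts(384, 8, batch)
    print(f"batch of {batch:>3} tokens touches about {touched:.0f} experts")
```

With a single token only K experts do useful work while every expert occupies memory; as the batch grows, nearly all experts are exercised and the cost of holding them is spread over more tokens.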
Comparison with Dense Models
| Property | Dense Model | MoE Model |
|---|---|---|
| Parameters used per token | All | Top-K experts only, plus non-expert parameters |
| Training compute per token | High | Lower at equal total parameter count |
| Inference memory | Proportional to total parameters, all of which are active | Full model must be loaded, including inactive experts |
| Expert specialisation | None | Implicit, emergent |
| Implementation complexity | Low | Higher |
Applications
MoE architectures are used primarily in large language models, where scaling model capacity is beneficial but compute budgets are constrained. They have also been applied to multimodal models combining vision and language, and to mixture-of-modality routing where different experts handle different input types. Research in 2025 explored applying MoE principles to diffusion models and to the attention sublayer itself through sparse attention routing.
References
- Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
- Shazeer, N. et al. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. Proceedings of ICLR 2017.
- Fedus, W., Zoph, B., and Shazeer, N. (2022). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39. Preprint arXiv:2101.03961 (2021).
- DeepSeek-AI. (2024). DeepSeek-V3 technical report. arXiv:2412.19437.
- NVIDIA. (2025). Mixture of Experts powers the most intelligent frontier AI models. NVIDIA Blog.