DeepSpeed

DeepSpeed is an open-source deep learning optimisation library developed by Microsoft that enables efficient distributed training and inference of large-scale neural networks through memory and compute optimisations.

6 min read | Last updated June 2026 | Infrastructure

DeepSpeed is an open-source deep learning optimisation library developed by Microsoft Research that makes the training and inference of large-scale neural networks faster, more memory-efficient, and more accessible. First released in 2020 and now distributed under the Apache 2.0 licence, DeepSpeed has become a standard tool in the large language model training ecosystem. Researchers and practitioners at academic institutions, AI laboratories, and cloud providers use it to train models with hundreds of billions of parameters, which would otherwise not fit on the hardware available to them.

The library is built on PyTorch and introduces a suite of system-level optimisations, including memory partitioning and offloading, communication compression, and hardware-aware scheduling, that together substantially reduce the GPU memory footprint and communication overhead of distributed training. DeepSpeed is the training framework behind several landmark models, including Megatron-Turing NLG (530 billion parameters), BLOOM-176B, and Microsoft's own Phi series.

Core Innovations

ZeRO: Zero Redundancy Optimizer

The most influential contribution of DeepSpeed is ZeRO (Zero Redundancy Optimizer), a memory optimisation strategy for data-parallel distributed training. In standard data-parallel training, each GPU holds a complete copy of the model parameters, gradients, and optimiser states (such as the momentum and variance estimates maintained by Adam). For large models, this redundancy is enormously wasteful: training a 175 billion parameter model with mixed-precision Adam requires roughly 2.8 TB of memory for parameters, gradients, and optimiser states (about 16 bytes per parameter), and standard data parallelism would need all of it on every GPU, far exceeding the memory of any single device.
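The 2.8 TB figure can be reproduced from the per-parameter byte counts used in the ZeRO paper for mixed-precision Adam: 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes of fp32 optimiser state. The sketch below is only that arithmetic; the function name is illustrative and activation memory is ignored.

```python
def model_state_bytes(num_params: int) -> int:
    """Bytes of model state for mixed-precision Adam with no partitioning."""
    fp16_params = 2 * num_params
    fp16_grads = 2 * num_params
    # fp32 master weights + momentum + variance, 4 bytes each
    fp32_optimizer = 12 * num_params
    return fp16_params + fp16_grads + fp32_optimizer


total = model_state_bytes(175_000_000_000)
print(f"{total / 1e12:.1f} TB")  # 2.8 TB
```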

ZeRO eliminates this redundancy by partitioning optimiser states, gradients, and parameters across data-parallel workers. It has three stages of progressively more aggressive partitioning:

Stage 1 (ZeRO-1) partitions optimiser states across GPUs, reducing optimiser state memory by a factor equal to the number of GPUs.

Stage 2 (ZeRO-2) additionally partitions gradients, reducing gradient memory by the same factor.

Stage 3 (ZeRO-3) partitions the model parameters themselves, so each GPU holds only a shard of the full model. Parameters are gathered on demand during the forward and backward passes, then discarded.

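ZeRO-3 makes it possible to train models whose total memory footprint is far larger than the memory of a single GPU, because the model state held by each GPU shrinks in proportion to the number of GPUs. In exchange for some additional communication, training throughput can still scale close to linearly with the number of GPUs.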
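The effect of each stage on per-GPU memory can be estimated with a few lines of arithmetic. The sketch below assumes the mixed-precision Adam byte counts from the ZeRO paper (2 bytes of fp16 weights, 2 bytes of fp16 gradients, 12 bytes of optimiser state per parameter) and counts model state only, not activations or buffers. The function name and the 7.5B-parameter, 64-GPU example are illustrative choices.

```python
K = 12  # assumed optimiser-state bytes per parameter (mixed-precision Adam)


def per_gpu_bytes(num_params: int, num_gpus: int, stage: int) -> float:
    """Approximate model-state bytes held by one GPU at a given ZeRO stage (0 = none)."""
    params = 2 * num_params
    grads = 2 * num_params
    opt = K * num_params
    if stage >= 1:
        opt /= num_gpus      # stage 1: shard optimiser states
    if stage >= 2:
        grads /= num_gpus    # stage 2: also shard gradients
    if stage >= 3:
        params /= num_gpus   # stage 3: also shard parameters
    return params + grads + opt


for stage in range(4):
    gb = per_gpu_bytes(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
# stage 0: 120.0 GB per GPU
# stage 1: 31.4 GB per GPU
# stage 2: 16.6 GB per GPU
# stage 3: 1.9 GB per GPU
```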

ZeRO-Infinity

ZeRO-Infinity extends the partitioning approach to heterogeneous memory: CPU RAM and NVMe SSDs, which are far cheaper and more abundant than GPU VRAM. By offloading parameters and optimiser states that are not currently needed to CPU memory or disk, ZeRO-Infinity makes it possible to train models with trillions of parameters on GPU clusters whose aggregate GPU memory alone would be too small. Data is moved between GPU, CPU, and disk on demand using optimised NVMe I/O pipelines.
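In practice, offloading is switched on through the DeepSpeed configuration. The sketch below builds such a configuration as a Python dictionary and prints it as JSON. The `zero_optimization`, `offload_optimizer`, and `offload_param` keys follow the DeepSpeed configuration schema; the batch size, the bf16 setting, and the `/local_nvme` path are placeholder assumptions that would need to match the actual cluster.

```python
import json

NVME_PATH = "/local_nvme"  # hypothetical mount point of a fast local SSD

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "bf16": {"enabled": True},            # assumes hardware with bf16 support
    "zero_optimization": {
        "stage": 3,  # parameter partitioning is required for parameter offload
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
    },
}

print(json.dumps(ds_config, indent=2))
```

Setting `device` to `"cpu"` instead of `"nvme"` keeps the offloaded state in host RAM, which is usually the first thing to try before moving to disk.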

3D Parallelism

DeepSpeed supports three orthogonal forms of parallelism that can be combined into a three-dimensional parallel training strategy:

Data parallelism replicates the model across GPUs and distributes different mini-batches to each replica, aggregating gradients at the end of each step.

Pipeline parallelism partitions the model's layers across GPUs, with each GPU handling a subset of layers. Data flows through the pipeline in micro-batches, keeping all GPUs active most of the time.

Tensor parallelism splits individual layer weight matrices across multiple GPUs, parallelising the matrix multiplications within each layer. This is particularly effective for transformer attention and feedforward layers.

Combining all three forms of parallelism allows DeepSpeed to efficiently scale training to thousands of GPUs with good hardware utilisation.
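One way to picture the combination is as a three-dimensional grid of GPUs, where every global rank has a data, pipeline, and tensor coordinate. The sketch below is a minimal illustration of such a mapping, not DeepSpeed's own topology code. The ordering (tensor index varying fastest, so tensor-parallel peers sit on adjacent ranks) is an assumption chosen because tensor parallelism is the most communication-intensive dimension.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    data: int    # which model replica
    pipe: int    # which pipeline stage (group of layers)
    tensor: int  # which slice of each layer's weight matrices


def rank_to_coord(rank: int, dp: int, pp: int, tp: int) -> Coord:
    """Map a global rank to grid coordinates, tensor index varying fastest."""
    if not 0 <= rank < dp * pp * tp:
        raise ValueError("rank outside the dp * pp * tp grid")
    return Coord(
        data=rank // (tp * pp),
        pipe=(rank // tp) % pp,
        tensor=rank % tp,
    )


# 4 replicas x 4 pipeline stages x 8-way tensor parallelism = 128 GPUs
print(rank_to_coord(37, dp=4, pp=4, tp=8))  # Coord(data=1, pipe=0, tensor=5)
```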

DeepSpeed-MoE

DeepSpeed includes specialised support for Mixture-of-Experts (MoE) architectures, which achieve large effective model capacity at reduced inference cost by activating only a subset of expert subnetworks for each input token. DeepSpeed-MoE provides efficient routing, expert load balancing, and memory management for MoE models, enabling training and serving of very large sparse models.
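The routing idea can be shown with a generic top-k gate in plain Python. This is a sketch of the general MoE technique rather than DeepSpeed-MoE's API: the router scores every expert, only the k highest-scoring experts run, and their outputs are mixed using renormalised gate weights. The scalar "experts" and all function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(token, router_logits, experts, k=2):
    """Run only the selected experts and combine their outputs."""
    return sum(w * experts[i](token) for i, w in top_k_route(router_logits, k))


# Four toy experts; only two of them are evaluated for this token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))
print(moe_forward(3.0, [0.1, 2.0, -1.0, 1.5], experts, k=2))
```

With k fixed, compute per token stays roughly constant as more experts are added, which is where the capacity-versus-cost advantage comes from.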

DeepSpeed FastGen

First released in late 2023, with a technical report following in January 2024, DeepSpeed FastGen is a high-throughput inference serving framework. It implements Dynamic SplitFuse, a scheduling strategy that splits long prompts into chunks processed over several forward passes and fuses short prompts and ongoing generations into the same pass, so that each pass runs at a consistent size. This improves GPU utilisation and reduces latency, particularly for heterogeneous request batches. The FastGen authors report up to 2.3x higher effective throughput than vLLM on workloads with mixed prompt lengths.
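The scheduling idea can be sketched with a toy model that has a fixed token budget per forward pass. This is a simplified illustration and not FastGen's implementation: requests that are already generating each contribute one token, and the remaining budget is filled with prompt chunks, so a long prompt is split across passes while short ones are fused into a single pass. The `Request` class, the `schedule` function, and the budget value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_left: int  # prompt tokens not yet processed
    decode_left: int  # tokens still to be generated


def schedule(requests, token_budget):
    """Return one dict per forward pass: request name -> tokens processed.

    Note: mutates the Request objects as tokens are consumed.
    """
    if token_budget < 1:
        raise ValueError("token_budget must be at least 1")
    passes = []
    while any(r.prompt_left or r.decode_left for r in requests):
        budget = token_budget
        step = {}
        # 1) Requests past their prompt each generate one token.
        for r in requests:
            if budget and r.prompt_left == 0 and r.decode_left > 0:
                step[r.name] = 1
                r.decode_left -= 1
                budget -= 1
        # 2) Fill what is left with prompt chunks (split if too long).
        for r in requests:
            if budget and r.prompt_left > 0:
                chunk = min(r.prompt_left, budget)
                step[r.name] = chunk
                r.prompt_left -= chunk
                budget -= chunk
        passes.append(step)
    return passes


reqs = [Request("A", prompt_left=10, decode_left=2),
        Request("B", prompt_left=3, decode_left=2)]
for i, step in enumerate(schedule(reqs, token_budget=8), start=1):
    print(i, step)
# 1 {'A': 8}
# 2 {'A': 2, 'B': 3}
# 3 {'A': 1, 'B': 1}
# 4 {'A': 1, 'B': 1}
```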

Impact and Adoption

DeepSpeed has been adopted by research groups and companies worldwide. BLOOM-176B, an open-access 176 billion parameter multilingual language model, was trained with the Megatron-DeepSpeed framework by the BigScience workshop, which was coordinated by Hugging Face. Megatron-Turing NLG (MT-NLG), developed jointly by Microsoft and NVIDIA, was at 530 billion parameters the largest monolithic transformer language model trained at the time of its release in 2021. The Phi series, Microsoft's family of small but capable language models, also uses DeepSpeed for training.

DeepSpeed is available as an Azure Machine Learning managed offering, and integration guides exist for Azure Databricks, allowing organisations to use DeepSpeed without managing their own GPU cluster infrastructure.

Malaysian Context — DeepSpeed Adoption in Malaysian AI Infrastructure

As Malaysian AI research groups and companies begin training language models on Malaysian corpora, DeepSpeed has emerged as a key tool for managing GPU memory and training costs. Universiti Malaya, Universiti Sains Malaysia, and Universiti Teknologi Malaysia (UTM) have access to NVIDIA GPU clusters through the MyDigital Blueprint's National AI Infrastructure programme and through research grants from MOSTI (Ministry of Science, Technology and Innovation). These clusters increasingly rely on DeepSpeed to train mid-size language models (3B to 13B parameters) that require more GPU memory than a single card can provide.

Mesolitica, a Malaysian AI company building Bahasa Malaysia language models, has used PyTorch-based distributed training frameworks including DeepSpeed to train models on Malaysian text corpora drawn from Malay news, social media, and government documents. The memory efficiency of ZeRO optimisation is practically important for local teams that lack the GPU budgets of large international laboratories.

MDEC's AI Cloud programme, which provides subsidised cloud compute access to Malaysian AI startups and SMEs, includes guidance on using DeepSpeed for efficient training on platforms including AWS (available through the Malaysia-focused data centre regions) and Azure. This lowers the barrier for Malaysian companies to experiment with large model training.

Industrial AI applications in Malaysia, particularly in the semiconductor and electronics manufacturing sector in Penang, are beginning to fine-tune large models for quality inspection, equipment diagnostics, and process optimisation. DeepSpeed can be combined with parameter-efficient fine-tuning methods such as LoRA, which makes it relevant for companies with limited GPU resources that need to adapt foundation models to proprietary manufacturing data.

HRD Corp-registered training providers and university computer science departments have begun incorporating distributed training and DeepSpeed into advanced AI engineering curricula, responding to industry demand for engineers who can scale model training beyond a single GPU.

References

Rajbhandari, S., et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC20.
Rajbhandari, S., et al. (2021). ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. SC21.
Microsoft Research. (2022). DeepSpeed: Advancing MoE inference and training to power next-generation AI scale. microsoft.com/research.
Holmes, C., et al. (2024). DeepSpeed FastGen: High-throughput Text Generation for LLMs. arXiv:2401.08671.
BigScience Workshop. (2022). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv:2211.05100.

Tags: DeepSpeed, distributed training, ZeRO, Microsoft, MLOps, large model training

Developed by	Microsoft (Microsoft Research and Azure AI)
First released	2020
Licence	Apache 2.0 (open source)
Language	Python (PyTorch-based)
Key feature	ZeRO memory optimisation, 3D parallelism
Related	PyTorch, GPU cluster, model serving, MLOps, CUDA