DeepSpeed

DeepSpeed is an open-source deep learning optimisation library developed by Microsoft Research that makes the training and inference of large-scale neural networks faster, more memory-efficient, and more accessible. First released in 2020 and now distributed under the Apache 2.0 licence, DeepSpeed has become a standard tool in the large language model training ecosystem. Researchers and practitioners at academic institutions, AI laboratories, and cloud providers use it to train models with hundreds of billions of parameters that would otherwise be infeasible on available hardware.

The library is built on PyTorch and introduces a suite of system-level optimisations, including novel memory management strategies, communication compression, and hardware-aware scheduling, that together reduce the GPU memory footprint and communication overhead of distributed training. DeepSpeed was used to train several landmark models, including Megatron-Turing NLG (530 billion parameters, trained with DeepSpeed together with NVIDIA's Megatron-LM), BLOOM-176B, and Microsoft's own Phi series.

Core Innovations

ZeRO: Zero Redundancy Optimizer

The most influential contribution of DeepSpeed is ZeRO (Zero Redundancy Optimizer), a memory optimisation strategy for data-parallel distributed training. In standard data-parallel training, each GPU holds a complete copy of the model parameters, gradients, and optimiser states (such as the momentum and variance estimates maintained by Adam). For large models, this redundancy is enormously wasteful. With mixed-precision Adam, these model states take about 16 bytes per parameter, so training a 175 billion parameter model would require roughly 2.8 TB of GPU memory for parameters, gradients, and optimiser states on every replica, far exceeding the memory of any single GPU.
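The 2.8 TB figure can be reproduced with simple arithmetic. The sketch below assumes mixed-precision Adam and the per-parameter byte counts used in the ZeRO paper (2 bytes for fp16 parameters, 2 bytes for fp16 gradients, 12 bytes for fp32 optimiser state). The constant and function names are illustrative and not part of DeepSpeed.

```python
# Back-of-the-envelope model-state memory for mixed-precision Adam.
# Byte counts per parameter are an assumption taken from the ZeRO paper's accounting.
BYTES_FP16_PARAMS = 2   # fp16 copy of the weights
BYTES_FP16_GRADS = 2    # fp16 gradients
BYTES_OPTIMIZER = 12    # fp32 master weights + momentum + variance (4 bytes each)


def model_state_bytes(num_params: int) -> int:
    """Bytes of model state held by one non-partitioned data-parallel replica."""
    per_param = BYTES_FP16_PARAMS + BYTES_FP16_GRADS + BYTES_OPTIMIZER
    return num_params * per_param


total = model_state_bytes(175_000_000_000)
print(f"{total / 1e12:.1f} TB")  # prints "2.8 TB"
```

Activations, temporary buffers, and memory fragmentation come on top of this total and are not counted here.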

ZeRO eliminates this redundancy by partitioning parameters, gradients, and optimiser states across data-parallel workers. ZeRO has three stages of progressively aggressive partitioning:

Stage 1 (ZeRO-1) partitions optimiser states across GPUs, reducing optimiser state memory by a factor equal to the number of GPUs.

Stage 2 (ZeRO-2) additionally partitions gradients, reducing gradient memory proportionally.

Stage 3 (ZeRO-3) partitions model parameters themselves, so each GPU holds only a shard of the full model. Parameters are gathered on-demand during the forward and backward pass, then discarded.

ZeRO-3 enables training models whose states are orders of magnitude larger than the memory of a single GPU, while maintaining near-linear scaling of training throughput with the number of GPUs.
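The effect of each stage on per-GPU memory can be estimated with a short calculation. The sketch below follows the accounting in the ZeRO paper, assuming mixed-precision Adam (2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and k = 12 bytes of optimiser state per parameter). The function name is illustrative, and activations and buffers are ignored.

```python
def zero_bytes_per_gpu(num_params: int, num_gpus: int, stage: int, k: int = 12) -> float:
    """Approximate model-state bytes per GPU for a given ZeRO stage.

    Stage 0 means plain data parallelism with no partitioning.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * num_params   # fp16 parameters
    grads = 2 * num_params    # fp16 gradients
    optim = k * num_params    # optimiser state (fp32 weights, momentum, variance)
    if stage >= 1:
        optim /= num_gpus     # stage 1: shard optimiser state
    if stage >= 2:
        grads /= num_gpus     # stage 2: also shard gradients
    if stage >= 3:
        params /= num_gpus    # stage 3: also shard parameters
    return params + grads + optim


# Example: a 7.5 billion parameter model on 64 GPUs.
for stage in range(4):
    gb = zero_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For this example the estimate falls from 120 GB per GPU without partitioning to under 2 GB with stage 3, which is why the later stages matter most for very large models.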

ZeRO-Infinity

ZeRO-Infinity extends the partitioning approach to heterogeneous memory: CPU RAM and NVMe SSDs, which are far cheaper and more abundant than GPU VRAM. By offloading infrequently accessed parameters and optimiser states to CPU or disk, ZeRO-Infinity lets commodity GPU clusters train models with trillions of parameters, a scale that would otherwise require specialised hardware. Data is moved between GPU, CPU, and disk in a demand-driven fashion using optimised NVMe I/O pipelines.
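In practice, offloading is switched on through the DeepSpeed configuration. The sketch below builds a minimal configuration as a Python dictionary that enables ZeRO stage 3 and offloads both parameters and optimiser states to NVMe. The mount point, batch size, and precision setting are placeholder assumptions, and the set of supported keys may differ between DeepSpeed versions.

```python
import json

# Hypothetical NVMe mount point; replace with a real local SSD path.
NVME_PATH = "/local_nvme"

# Minimal sketch of a DeepSpeed config with ZeRO stage 3 and NVMe offload.
# Values are placeholders, not tuned recommendations.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Keep optimiser states on NVMe instead of GPU memory.
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
        # Keep parameter shards on NVMe and fetch them on demand.
        "offload_param": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
    },
}

# Serialise to the JSON form that DeepSpeed config files use.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

A configuration like this is typically saved to a JSON file or passed directly when the training engine is initialised. Setting "device" to "cpu" and dropping "nvme_path" gives CPU offload instead.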

3D Parallelism

DeepSpeed supports three orthogonal forms of parallelism that can be combined into a three-dimensional parallel training strategy:

Data parallelism replicates the model across GPUs and distributes different mini-batches to each replica, aggregating gradients at the end of each step.

Pipeline parallelism partitions the model's layers across GPUs, with each GPU handling a subset of layers. Data flows through the pipeline in micro-batches, keeping all GPUs active most of the time.

Tensor parallelism splits individual layer weight matrices across multiple GPUs, parallelising the matrix multiplications within each layer. This is particularly effective for transformer attention and feedforward layers.

Combining all three forms of parallelism allows DeepSpeed to efficiently scale training to thousands of GPUs with good hardware utilisation.
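One way to picture the three dimensions is as a grid in which every GPU has a data, pipeline, and tensor coordinate, and the total GPU count is the product of the three degrees. The sketch below maps a global rank to such coordinates. The ordering (tensor index varies fastest, then pipeline, then data) is an assumption chosen for illustration and is not necessarily the layout DeepSpeed uses internally.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    data: int    # which model replica this GPU belongs to
    pipe: int    # which pipeline stage (group of layers) it holds
    tensor: int  # which slice of each layer's weight matrices it holds


def rank_to_coord(rank: int, dp: int, pp: int, tp: int) -> Coord:
    """Map a global rank to (data, pipe, tensor) coordinates.

    Assumed layout: tensor index varies fastest, then pipe, then data.
    """
    world_size = dp * pp * tp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    tensor = rank % tp
    pipe = (rank // tp) % pp
    data = rank // (tp * pp)
    return Coord(data, pipe, tensor)


# 16 GPUs arranged as 2 replicas x 4 pipeline stages x 2 tensor slices.
print(rank_to_coord(13, dp=2, pp=4, tp=2))  # Coord(data=1, pipe=2, tensor=1)
```

Placing tensor-parallel peers on adjacent ranks is a common choice because tensor parallelism communicates most often and benefits from the fastest links, usually those within a single node.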

DeepSpeed-MoE

DeepSpeed includes specialised support for Mixture-of-Experts (MoE) architectures, which achieve large effective model capacity at reduced inference cost by activating only a subset of expert subnetworks for each input token. DeepSpeed-MoE provides efficient routing, expert load balancing, and memory management for MoE models, enabling training and serving of very large sparse models.
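The idea of activating only a subset of experts per token can be illustrated with generic top-k routing. The sketch below is a plain-Python illustration of the general technique, not DeepSpeed-MoE's implementation: it turns a token's router scores into probabilities, keeps the k best experts, and renormalises their weights. The function name and example scores are hypothetical.

```python
import math


def top_k_route(router_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are softmax probabilities renormalised over the selected experts.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts for this token.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# One token scored against four experts; only two experts are activated.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print(routes)
```

The token's output is then the weighted sum of the selected experts' outputs, so compute per token scales with k rather than with the total number of experts.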

DeepSpeed FastGen

Introduced in late 2023, with an accompanying paper in 2024, DeepSpeed FastGen is a high-throughput inference serving framework. It implements Dynamic SplitFuse, a scheduling strategy that splits long prompts into chunks processed over several forward passes and composes those chunks with ongoing token generation, so that each forward pass handles a roughly constant number of tokens. This keeps GPU utilisation high and latency predictable, particularly for heterogeneous request batches. Its authors report up to 2.3x higher effective throughput than vLLM on workloads with mixed prompt lengths.
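The scheduling idea can be sketched as a toy planner with a fixed token budget per forward pass: sequences that are already generating each take one token, and the remaining room is filled with chunks of pending prompts. This is a simplified illustration under those assumptions, not FastGen's actual scheduler. The function name, request identifiers, and budget are hypothetical.

```python
def split_fuse_schedule(prompts: dict[str, int], decoding: int, budget: int) -> list[dict]:
    """Plan forward passes that each process at most `budget` tokens.

    prompts:  request id -> number of prompt tokens still to process
    decoding: number of sequences already generating (one token per pass each)
    """
    if budget <= decoding + len(prompts):
        raise ValueError("budget must leave room for prompt tokens in every pass")
    remaining = dict(prompts)
    passes = []
    while remaining:
        room = budget - decoding  # tokens left after one token per decoding sequence
        chunks = []
        finished = 0
        for req in list(remaining):
            if room == 0:
                break
            take = min(room, remaining[req])  # split long prompts to fit the budget
            chunks.append((req, take))
            room -= take
            remaining[req] -= take
            if remaining[req] == 0:
                del remaining[req]
                finished += 1
        passes.append({"decode_tokens": decoding, "prompt_chunks": chunks})
        decoding += finished  # fully processed prompts start generating next pass
    return passes


# A 700-token prompt and a 40-token prompt arrive while 8 sequences are generating.
plan = split_fuse_schedule({"a": 700, "b": 40}, decoding=8, budget=256)
for step in plan:
    print(step)
```

In this example the long prompt is spread over three passes and the short prompt is fused into the last one, so no pass exceeds 256 tokens and generation for the existing sequences is never stalled by a long prefill.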

Impact and Adoption

DeepSpeed has been adopted by research groups and companies worldwide. BLOOM-176B, the first publicly available 176 billion parameter multilingual language model, was trained using DeepSpeed in collaboration with Hugging Face and BigScience. Megatron-Turing NLG (MT-NLG), developed jointly by Microsoft and NVIDIA, was the largest dense transformer language model at the time of its release, at 530 billion parameters. The Phi series, Microsoft's family of small but capable language models, also uses DeepSpeed for training.

DeepSpeed is available as an Azure Machine Learning managed offering, and integration guides exist for Azure Databricks, allowing organisations to use DeepSpeed without managing their own GPU cluster infrastructure.

References

  1. Rajbhandari, S., et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC20.
  2. Rajbhandari, S., et al. (2021). ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. SC21.
  3. Microsoft Research. (2022). DeepSpeed: Advancing MoE inference and training to power next-generation AI scale. microsoft.com/research.
  4. Holmes, C., et al. (2024). DeepSpeed FastGen: High-throughput Text Generation for LLMs. arXiv:2401.08671.
  5. BigScience Workshop. (2022). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv:2211.05100.