AIWiki
Malaysia

GPU Cluster

A GPU cluster is a networked group of servers, each containing one or more graphics processing units, purpose-built to accelerate parallel computation workloads such as deep learning training and large-scale AI inference.

6 min readLast updated June 2026Infrastructure

A GPU cluster is an ensemble of interconnected compute nodes, each equipped with one or more graphics processing units (GPUs), designed to execute massively parallel computational tasks. Originally developed for graphics rendering, GPU clusters became the dominant infrastructure for training large deep learning models from approximately 2012 onwards, and today underpin virtually all large-scale AI research and production deployments.

Architecture

A GPU cluster consists of multiple nodes connected over a high-speed fabric. Each node typically contains one to eight GPUs alongside CPUs, high-bandwidth memory (HBM), local storage, and network interface cards. The nodes are linked by specialised interconnects that enable rapid exchange of gradients and activations during distributed training.

NVIDIA's NVLink and NVSwitch technologies provide high-bandwidth, low-latency GPU-to-GPU communication within a single node, while InfiniBand and Ethernet (with RDMA — Remote Direct Memory Access) connect nodes across the cluster. InfiniBand, with bandwidth up to 400 Gb/s per link in the HDR and NDR generations, is the preferred fabric for the largest training clusters.

Storage infrastructure in GPU clusters must match the I/O throughput of training workloads. High-performance parallel file systems such as IBM Spectrum Scale (GPFS) and Lustre are commonly deployed, alongside object storage tiers for dataset archiving.

Role in AI Training

Training large deep learning models requires enormous computational throughput that no single GPU can provide. Distributed training across a GPU cluster is achieved through two primary parallelism strategies.

Data parallelism divides the training dataset across nodes; each node maintains a copy of the model and processes a different minibatch. Gradients are synchronised across nodes after each forward-backward pass, typically using the AllReduce collective communication operation. This approach scales well when the model fits within a single GPU's memory.

Model parallelism (including tensor parallelism and pipeline parallelism) partitions the model itself across GPUs. This is necessary for models too large for a single device's memory — a category that includes most modern large language models. Frameworks such as NVIDIA's Megatron-LM and DeepSpeed implement sophisticated combinations of data, tensor, and pipeline parallelism to train models with hundreds of billions of parameters.

Hardware Generations

GPU hardware has undergone rapid generational advancement. NVIDIA's H100 (Hopper architecture, 2022) and H200 (2024) GPUs deliver up to 7.8 terabytes per second of memory bandwidth and support the FP8 numerical format that accelerates transformer training. AMD's MI300X accelerator, released in 2023, offers 192 GB of unified HBM3 memory per device, making it competitive for memory-bound workloads.

The largest publicly disclosed training clusters as of 2025 exceeded 100,000 GPUs. Meta has described infrastructure equivalent to approximately 600,000 H100-equivalent GPUs; Microsoft and OpenAI operate similarly scaled systems. The GPU-as-a-service market was valued at approximately USD 4.96 billion in 2025 and is projected to reach USD 31.89 billion by 2034, reflecting the sustained demand for compute.

Cloud and On-Premises Deployments

GPU clusters are available through major cloud providers — Amazon Web Services (via EC2 P4 and P5 instances), Google Cloud (via A3 and A3 Mega instances), Microsoft Azure (via NDv5 series), and Oracle Cloud Infrastructure — as well as GPU-focused cloud providers such as Lambda Labs, CoreWeave, and Together AI. Organisations with sustained, high-volume workloads sometimes invest in dedicated on-premises hardware to reduce long-term costs.

The choice between cloud and on-premises deployment depends on workload predictability, capital expenditure tolerance, data residency requirements, and the lead time for hardware procurement. A GPU cluster purpose-built for a specific architecture can be significantly more efficient than general-purpose cloud instances for large, sustained training programmes.

Challenges

GPU clusters present several operational challenges. Fault tolerance is critical: in a 1,000-node training run, the probability of at least one hardware failure over a 24-hour period is non-trivial. Checkpoint-based recovery — periodically saving model weights to persistent storage — is standard practice, and frameworks such as PyTorch Elastic Training support dynamic reconfiguration of cluster membership.

Power and cooling are major engineering constraints. A single H100 GPU has a thermal design power of 700 watts; a cluster of 10,000 such GPUs requires approximately 7 megawatts of sustained power delivery and commensurate cooling capacity. Data centre design for AI workloads therefore differs substantially from conventional enterprise computing, requiring liquid cooling, high-density power distribution, and specialised networking infrastructure.

Network bandwidth between nodes is frequently the bottleneck in distributed training. The AllReduce gradient synchronisation operation at the end of each training step requires all-to-all communication across the cluster; insufficient network bandwidth degrades scaling efficiency. High-performance clusters therefore invest heavily in low-latency, high-bandwidth fabrics.

References

  1. Shoeybi, M., et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053.
  2. Clarifai. (2024). What Are GPU Clusters and How They Accelerate AI Workloads. clarifai.com.
  3. Scale Computing. (2024). GPU Cluster Explained: Architecture, Nodes and Use Cases. scalecomputing.com.
  4. Epoch AI. (2025). Data on GPU clusters. epoch.ai.
  5. Cyfuture AI. (2026). Top 10 GPU Cluster Services for AI Training and Machine Learning. cyfuture.ai.