- Type: Neural network training optimisation
- Formats: FP16 or BF16 combined with FP32
- Benefits: Faster training, lower memory use
- Key techniques: Loss scaling, FP32 master weights
- Hardware: GPUs and TPUs with tensor cores
- Related: Quantisation, GPU cluster, model compression
Mixed precision training is a technique for training neural networks that carries out most arithmetic in a lower-precision numerical format, such as 16-bit floating point, while retaining higher precision where it is needed for numerical stability. By reducing the number of bits used to represent numbers during training, it lowers memory consumption and increases throughput on modern accelerators, allowing larger models and batch sizes to be trained faster without a meaningful loss in final accuracy. It has become standard practice in training large deep learning models.
Numerical Formats
Neural networks were traditionally trained using 32-bit floating point, known as FP32 or single precision, which has 8 exponent bits and 23 mantissa bits and so offers both a wide range and fine resolution. Mixed precision introduces 16-bit formats for the bulk of the work. Two are common. FP16, or half precision, has 5 exponent bits and 10 mantissa bits; its largest finite value is 65,504, so large values overflow and very small ones underflow to zero. BF16, or bfloat16, keeps the 8 exponent bits of FP32 but only 7 mantissa bits, giving up resolution in exchange for range and thereby avoiding many of FP16's stability problems. The term mixed precision reflects that training deliberately combines these lower-precision formats with FP32 rather than using one format throughout.
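The range and resolution trade-off between the two 16-bit formats can be illustrated with a minimal sketch using only Python's standard library. The helper names `to_fp16` and `to_bf16` are hypothetical, and the BF16 conversion is an emulation (rounding away the low 16 bits of the FP32 encoding), not a call into any accelerator library.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value (IEEE 754 binary16)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Magnitude is beyond the largest finite FP16 value (65504).
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the FP32 encoding.

    Assumes x is finite and within FP32 range.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits being dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A tiny value, a value just above 1, and a value above the FP16 maximum.
for value in (1e-8, 1.001, 70000.0):
    print(f"{value!r:>10}  fp16={to_fp16(value)!r}  bf16={to_bf16(value)!r}")
```

In this sketch the tiny value vanishes in FP16 but survives in BF16, the value just above 1 keeps its offset in FP16 but collapses to 1.0 in BF16, and the large value overflows only in FP16.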
How It Works
The central idea is to use low precision for the operations that dominate compute, chiefly the matrix multiplications in forward and backward passes, while using FP32 for operations sensitive to rounding error. A typical implementation keeps a high-precision master copy of the model weights in FP32. During each step, the weights are cast to 16-bit for the forward and backward computations, which run quickly on hardware designed for them, and the resulting gradients are then applied to the FP32 master weights. This preserves the small weight updates that would be lost if accumulated directly in 16-bit.
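The effect of the FP32 master copy can be seen in a toy sketch that applies many small updates to a single weight. Everything here is an illustrative assumption: the learning rate, the constant gradient, the step count, and the helper names `to_fp16`, `to_fp32` and `train` are hypothetical, and rounding is emulated with Python's `struct` module instead of real 16-bit hardware.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


LEARNING_RATE = 1e-3  # hypothetical value
GRADIENT = 0.1        # hypothetical constant gradient
STEPS = 1000


def train(master_in_fp32: bool) -> float:
    """Apply STEPS small updates and return the stored weight."""
    store = to_fp32 if master_in_fp32 else to_fp16
    weight = store(1.0)
    for _ in range(STEPS):
        # Stand-in for a gradient produced by a 16-bit backward pass.
        grad16 = to_fp16(GRADIENT)
        # The update is rounded to the precision of the stored weight.
        weight = store(weight - LEARNING_RATE * grad16)
    return weight


print("weight kept in FP16 only:", train(master_in_fp32=False))
print("FP32 master weight      :", train(master_in_fp32=True))
```

With these numbers each update is smaller than half the gap between adjacent FP16 values near 1.0, so the FP16-only weight never moves, while the FP32 master weight accumulates the updates and ends close to 0.9.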
A further complication arises with FP16, whose limited range can cause small gradient values to underflow to zero. To counter this, mixed precision uses loss scaling: the loss is multiplied by a large factor before backpropagation so that gradients are shifted into the representable range, and the gradients are divided by the same factor before the weights are updated. Dynamic loss scaling adjusts this factor automatically during training, typically lowering it when gradients overflow and raising it again after a run of stable steps. BF16, with its wider range, usually needs little or no loss scaling, which is one reason it is favoured on hardware that supports it.
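Loss scaling can be sketched in a few lines of standard-library Python. The `DynamicLossScaler` class below is a hypothetical helper, not any framework's API; its initial scale, growth interval and halve-or-double policy are assumptions chosen for illustration. For brevity the gradient itself is multiplied by the scale, which by the chain rule is equivalent to scaling the loss before backpropagation.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to FP16; values beyond the FP16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


class DynamicLossScaler:
    """Hypothetical scaler: halve on overflow, double after clean steps."""

    def __init__(self, init_scale: float = 2.0 ** 16, growth_interval: int = 3):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def unscale(self, scaled_grads):
        """Return unscaled gradients, or None if the step must be skipped."""
        if not all(math.isfinite(g) for g in scaled_grads):
            # Overflow: shrink the scale and skip this weight update.
            self.scale /= 2.0
            self.clean_steps = 0
            return None
        # Divide the scale back out in higher precision.
        grads = [g / self.scale for g in scaled_grads]
        self.clean_steps += 1
        if self.clean_steps >= self.growth_interval:
            self.scale *= 2.0
            self.clean_steps = 0
        return grads


true_grad = 1e-8                      # too small for FP16
unscaled_fp16 = to_fp16(true_grad)    # underflows without scaling

scaler = DynamicLossScaler()
scaled_fp16 = to_fp16(true_grad * scaler.scale)
recovered = scaler.unscale([scaled_fp16])[0]
print("without scaling:", unscaled_fp16, " with scaling:", recovered)

# A large gradient overflows once scaled, so the step is skipped.
skipped = scaler.unscale([to_fp16(2.0 * scaler.scale)])
print("overflow step skipped:", skipped is None, " new scale:", scaler.scale)
```

Skipping the update on overflow, instead of applying infinite or NaN gradients, is what makes it safe to keep the scale as large as the gradients allow.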
Hardware Acceleration
The practical gains of mixed precision come from specialised hardware. Modern NVIDIA GPUs include tensor cores that perform 16-bit matrix operations at much higher throughput than 32-bit ones, and Google's tensor processing units are built around BF16. On such hardware, mixed precision can roughly double training speed, or better, and roughly halve the memory used to store activations, enabling larger models and batch sizes to fit in a given amount of memory. The saving on weights is smaller when an FP32 master copy is kept alongside the 16-bit working copy. Deep learning frameworks provide automatic mixed precision utilities that apply these techniques with minimal changes to model code.
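The memory effect can be estimated with simple arithmetic. The sketch below is a rough illustration under stated assumptions: the parameter and activation counts are hypothetical, gradients and optimizer state are ignored, and the function name `training_memory_gib` is invented for this example. Whether weight memory shrinks depends on whether an FP32 master copy is kept next to the 16-bit working copy.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def training_memory_gib(n_params, n_activations, compute_dtype, keep_fp32_master):
    """Estimate weight and activation memory in GiB for one configuration."""
    weights = n_params * BYTES_PER_ELEMENT[compute_dtype]
    if keep_fp32_master and compute_dtype != "fp32":
        # The master copy is stored in addition to the 16-bit working copy.
        weights += n_params * BYTES_PER_ELEMENT["fp32"]
    activations = n_activations * BYTES_PER_ELEMENT[compute_dtype]
    return {"weights": weights / 2**30, "activations": activations / 2**30}


# Hypothetical model: 2**30 parameters and 4 * 2**30 stored activation values.
N_PARAMS = 2**30
N_ACTIVATIONS = 4 * 2**30

full = training_memory_gib(N_PARAMS, N_ACTIVATIONS, "fp32", keep_fp32_master=False)
mixed = training_memory_gib(N_PARAMS, N_ACTIVATIONS, "bf16", keep_fp32_master=True)
print("FP32 only      :", full)
print("mixed precision:", mixed)
```

Under these assumptions activation memory is halved, which is usually the dominant term at large batch sizes, while weight storage grows because both copies are held.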
Significance
Mixed precision training is one of several efficiency techniques, alongside quantisation, distributed training, and memory-saving methods, that together made the training of very large models economically feasible. Whereas quantisation usually refers to compressing a model for inference, mixed precision targets the training phase, and the two are complementary. As model sizes grew, the memory and compute savings from lower-precision arithmetic became essential rather than optional, and research has continued to push toward even lower precision, including 8-bit formats, for parts of the training pipeline.
Mixed precision training matters for Malaysian organisations because it directly reduces the cost of building and adapting AI models on limited hardware budgets. Compute is a significant constraint for local research groups and companies, so techniques that let a given GPU train larger models or process more data per unit of time improve what is achievable. Teams developing Malaysian language models such as MaLLaM and ILMU, and research institutions including MIMOS and universities, rely on such efficiency methods to make fine-tuning and pretraining feasible within available resources.
The technique also shapes the economics of Malaysia's growing AI data-centre sector. Investments by operators such as YTL and the arrival of large hyperscaler and GPU-cloud capacity in the country, particularly in Johor, are increasing access to modern accelerators with tensor cores, on which mixed precision delivers its benefits. Efficient use of that hardware supports Malaysia's ambition, under the MyDIGITAL blueprint and the National AI Office, to host meaningful AI training domestically rather than depending entirely on overseas compute.
Skills to apply these methods are part of the national capability agenda. MDEC and HRD Corp fund machine learning engineering training, and familiarity with mixed precision, distributed training, and memory optimisation is expected of practitioners building serious models. For Malaysian firms in banking, telecommunications, and manufacturing that fine-tune models on proprietary data, keeping training efficient and, where required for PDPA compliance, on domestic infrastructure makes techniques like mixed precision a practical necessity.