Gaussian Process

A non-parametric Bayesian model that defines a distribution over functions, widely used in regression, optimisation, and uncertainty quantification.

6 min read | Last updated June 2026 | Foundations

A Gaussian process (GP) is a stochastic process in which any finite collection of its random variables has a joint multivariate Gaussian distribution. Treated as a prior over functions, a GP provides a principled non-parametric Bayesian framework for regression and classification in which every prediction comes with an uncertainty estimate, one that is well calibrated when the modelling assumptions hold. A GP is fully specified by a mean function, which encodes prior beliefs about the average behaviour of the unknown function, and a covariance function, or kernel, which encodes assumptions about smoothness, periodicity, and length-scale. GPs are foundational tools in geostatistics, where they appear under the name kriging, and in modern Bayesian machine learning.

Mathematical formulation

A function f(x) is said to be drawn from a Gaussian process if, for any finite set of input points x_1 through x_N, the random vector of values f(x_1) through f(x_N) follows a multivariate Gaussian distribution. The distribution is determined by a mean function m(x) and a covariance kernel k(x, x') that returns the covariance between f(x) and f(x'). Conditioning the prior on observed training data, under a Gaussian noise model, yields a posterior that is again Gaussian and that gives, in closed form, the predictive mean and variance at any new input. With noise-free observations the posterior mean interpolates the data exactly; with noisy observations it smooths them. The predictive variance shrinks near observed points and grows back towards the prior variance in regions far from the data.
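The closed-form conditioning described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes a zero mean function, a squared-exponential kernel, one-dimensional inputs, and fixed hyperparameters. The function names, the toy data, and the noise variance are all illustrative choices.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-2):
    """Predictive mean and variance of the latent function at x_test.

    Assumes a zero-mean GP prior and i.i.d. Gaussian observation noise.
    """
    K = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = sq_exp_kernel(x_train, x_test)      # train/test covariances
    K_ss = sq_exp_kernel(x_test, x_test)      # test/test covariances
    L = np.linalg.cholesky(K)                 # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v * v, axis=0)
    return mean, var

# Toy data (illustrative): a few noise-free samples of sin(x).
x_train = np.array([-2.0, -1.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.array([0.0, 6.0])   # one point at the data, one far away
mean, var = gp_posterior(x_train, y_train, x_test)
```

In this sketch the variance at the first test input, which coincides with a training point, is small, while the variance at the distant input stays close to the prior signal variance of 1.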

Covariance kernels

The kernel is the central modelling choice in a GP. The most common kernel is the squared exponential, also called the radial basis function (RBF) kernel, which produces infinitely differentiable sample paths and is parameterised by a length-scale and a signal variance. The Matérn family generalises the squared exponential and controls function smoothness through a parameter, usually written ν: ν = 1/2 gives the exponential kernel, with rough, non-differentiable sample paths, and the squared exponential is recovered in the limit as ν tends to infinity. Periodic kernels capture repeating structure, linear kernels recover Bayesian linear regression, and composite kernels formed by sums and products of valid kernels allow rich functional forms. Automatic relevance determination (ARD) assigns a separate length-scale to each input dimension, providing a soft form of feature selection: a dimension with a very large learned length-scale has little influence on the prediction.
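As a rough sketch of the kernels named above, the following NumPy functions implement the squared exponential, the Matérn kernel with ν = 3/2, a periodic kernel, and a linear kernel, and then combine them by product and sum. The function names, parameter names, and hyperparameter values are illustrative assumptions. ARD is obtained here simply by passing one length-scale per input dimension.

```python
import numpy as np

def _scaled_dist(X1, X2, length_scale):
    """Euclidean distance after dividing each dimension by its length-scale.

    length_scale may be a scalar or an array with one entry per dimension (ARD).
    """
    A, B = X1 / length_scale, X2 / length_scale
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.sqrt(np.maximum(sq, 0.0))   # clip tiny negatives from round-off

def squared_exponential(X1, X2, length_scale=1.0, signal_var=1.0):
    r = _scaled_dist(X1, X2, length_scale)
    return signal_var * np.exp(-0.5 * r**2)

def matern32(X1, X2, length_scale=1.0, signal_var=1.0):
    """Matern kernel with smoothness nu = 3/2."""
    r = _scaled_dist(X1, X2, length_scale)
    return signal_var * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def periodic(X1, X2, period=1.0, length_scale=1.0, signal_var=1.0):
    """Periodic kernel for 1-D inputs stored as arrays of shape (n, 1)."""
    d = np.abs(X1[:, None, 0] - X2[None, :, 0])
    return signal_var * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length_scale**2)

def linear(X1, X2, signal_var=1.0):
    return signal_var * X1 @ X2.T

# Composite kernel: a locally periodic component plus a linear trend.
X = np.linspace(0.0, 3.0, 7)[:, None]
K = squared_exponential(X, X, length_scale=2.0) * periodic(X, X, period=1.0) \
    + 0.1 * linear(X, X)

# ARD: a huge length-scale on the second dimension makes it nearly irrelevant.
X2d = np.array([[0.0, 0.0], [0.0, 5.0]])
K_ard = squared_exponential(X2d, X2d, length_scale=np.array([1.0, 1e6]))
```

Because sums and products of valid kernels are themselves valid kernels, the composite matrix K remains symmetric and positive semi-definite.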

Inference and scalability

Exact GP inference requires factorising (in practice via a Cholesky decomposition rather than an explicit inverse) an N by N kernel matrix, where N is the number of training observations, giving cubic time and quadratic memory complexity. On typical hardware this becomes prohibitive somewhere between a few thousand and a few tens of thousands of points. A large body of approximate methods has been developed to extend GPs to larger datasets, including sparse approximations based on inducing points (such as the FITC and VFE methods), variational inference, structured kernel interpolation, and stochastic variational GPs that admit minibatch training. Deep kernel learning combines GPs with neural network feature extractors, and libraries such as GPyTorch and GPJax provide GPU-accelerated implementations.
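To illustrate the inducing-point idea, the sketch below computes the predictive mean of the subset-of-regressors (also called DTC) approximation, a simpler relative of the FITC and VFE methods mentioned above rather than an implementation of either. With M inducing points the dominant cost is O(N M^2) instead of O(N^3). The kernel, the toy data, the inducing-point locations, and the jitter value are all assumptions made for the example.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel on 1-D inputs with unit signal variance."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def sparse_gp_mean(x_train, y_train, x_inducing, x_test, noise_var=0.01, jitter=1e-8):
    """Predictive mean of the subset-of-regressors / DTC approximation.

    Only an M x M system is solved, where M = len(x_inducing).
    """
    K_uu = sq_exp_kernel(x_inducing, x_inducing) + jitter * np.eye(len(x_inducing))
    K_uf = sq_exp_kernel(x_inducing, x_train)   # M x N
    K_su = sq_exp_kernel(x_test, x_inducing)    # T x M
    A = K_uu + (K_uf @ K_uf.T) / noise_var      # M x M
    return K_su @ np.linalg.solve(A, K_uf @ y_train) / noise_var

def exact_gp_mean(x_train, y_train, x_test, noise_var=0.01):
    """Exact zero-mean GP predictive mean, for comparison (O(N^3))."""
    K = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    return sq_exp_kernel(x_test, x_train) @ np.linalg.solve(K, y_train)

# Toy problem (illustrative): 200 noisy samples, 15 evenly spaced inducing points.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 5.0, size=200))
y_train = np.sin(2.0 * x_train) + 0.1 * rng.standard_normal(200)
x_inducing = np.linspace(0.0, 5.0, 15)
x_test = np.linspace(0.5, 4.5, 9)

approx = sparse_gp_mean(x_train, y_train, x_inducing, x_test)
exact = exact_gp_mean(x_train, y_train, x_test)
max_gap = np.max(np.abs(approx - exact))   # how far the sparse mean is from the exact one
```

When the inducing points coincide with the training inputs, this approximate mean reduces to the exact GP mean (up to the jitter term), which is a useful sanity check. Practical methods additionally optimise the inducing locations and approximate the predictive variance.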

Bayesian optimisation

One of the most influential applications of GPs is Bayesian optimisation, in which a GP surrogate of an expensive black-box objective function is updated as evaluations are observed, and an acquisition function, such as expected improvement or upper confidence bound, selects the next query point by trading exploration against exploitation. Bayesian optimisation has become a standard method for hyperparameter tuning of machine learning models and is also applied to materials discovery, A/B test design, and experimental optimisation in chemistry and biology. Frameworks such as BoTorch, GPyOpt, and Vizier implement GP-based Bayesian optimisation at scale.
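The two acquisition functions named above can be written directly in terms of the GP predictive mean and standard deviation at each candidate point. The sketch below assumes a maximisation problem and uses the standard closed form of expected improvement. The function names, the exploration parameters xi and beta, and the candidate values are illustrative and are not taken from any particular framework.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)   # element-wise error function without SciPy

def norm_pdf(z):
    return np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def expected_improvement(mean, std, best_so_far, xi=0.0):
    """Expected improvement over best_so_far when maximising the objective."""
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    improvement = mean - best_so_far - xi
    safe_std = np.maximum(std, 1e-12)          # avoid division by zero
    z = improvement / safe_std
    ei = improvement * norm_cdf(z) + safe_std * norm_pdf(z)
    # With zero predictive uncertainty, EI is just the positive part of the gain.
    return np.where(std > 0, ei, np.maximum(improvement, 0.0))

def upper_confidence_bound(mean, std, beta=2.0):
    """Optimistic estimate: predictive mean plus beta standard deviations."""
    return np.asarray(mean, dtype=float) + beta * np.asarray(std, dtype=float)

# Three hypothetical candidates: slightly worse but fairly certain,
# better with no uncertainty, and much worse but very uncertain.
mean = np.array([0.3, 0.6, 0.0])
std = np.array([0.1, 0.0, 1.0])
ei = expected_improvement(mean, std, best_so_far=0.4)
ucb = upper_confidence_bound(mean, std)
next_index = int(np.argmax(ei))
```

In this example both criteria prefer the third candidate, whose predicted value is the lowest but whose uncertainty is the largest, which shows how exploration can outweigh exploitation.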

Other applications

In geostatistics, GPs known as kriging models are used to interpolate spatial fields such as mineral concentrations, rainfall, and temperature. In robotics, GPs are used for inverse dynamics modelling and trajectory optimisation. In aerospace and engineering, GPs serve as surrogate models for expensive computer simulations. In epidemiology and finance, GPs provide flexible models for time series with uncertainty estimates. GP classification is less tractable than regression because a non-Gaussian likelihood makes the posterior intractable, so it relies on approximations such as the Laplace approximation, expectation propagation, or variational inference; it is nonetheless used in Bayesian deep learning research and in active learning settings.

Relationship to neural networks

Gaussian processes have a deep connection to neural networks. Radford Neal showed in the 1990s that a single-hidden-layer neural network with independent, suitably scaled Gaussian-distributed weights converges to a GP as the number of hidden units tends to infinity. This observation was later extended to infinitely wide deep networks under the neural network Gaussian process (NNGP) theory, and the neural tangent kernel (NTK) theory gives a related kernel description of how such networks evolve under gradient descent training. These results give GPs an important theoretical role in understanding the behaviour of large neural networks and in providing tractable analogues for analysis.
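The infinite-width correspondence can be checked numerically in a simple case. The sketch below assumes a one-hidden-layer ReLU network without biases, standard normal input weights, and output weights scaled by one over the square root of the width. For that specific setup the limiting GP covariance has a known closed form, the degree-1 arc-cosine kernel. These choices, and all names in the code, are assumptions made for illustration and are not the exact setting of Neal's original analysis.

```python
import numpy as np

def relu_limit_kernel(x1, x2):
    """Closed form of E[relu(w.x1) * relu(w.x2)] for w ~ N(0, I).

    This is half the degree-1 arc-cosine kernel and equals the covariance
    of the network outputs under the scaling used below.
    """
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    cos_theta = np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return n1 * n2 * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)

def sample_network_outputs(x1, x2, width, n_networks, rng):
    """Outputs at x1 and x2 of many independent random ReLU networks."""
    d = x1.shape[0]
    W = rng.standard_normal((n_networks, width, d))               # input weights
    v = rng.standard_normal((n_networks, width)) / np.sqrt(width) # scaled output weights
    h1 = np.maximum(W @ x1, 0.0)   # hidden activations, shape (n_networks, width)
    h2 = np.maximum(W @ x2, 0.0)
    return np.sum(v * h1, axis=1), np.sum(v * h2, axis=1)

rng = np.random.default_rng(0)
x1 = np.array([1.0, 0.0])
x2 = np.array([0.6, 0.8])
f1, f2 = sample_network_outputs(x1, x2, width=64, n_networks=20000, rng=rng)

empirical = np.mean(f1 * f2)          # outputs have zero mean by construction
analytic = relu_limit_kernel(x1, x2)
```

Under this scaling the output covariance matches the kernel at any width; what changes as the width grows is that the joint distribution of the outputs approaches a Gaussian, which is the content of the limit theorem.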

References

  1. Rasmussen, C. E., and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
  2. Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer.
  3. Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. NeurIPS.
  4. Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian Processes for Big Data. UAI.