Bayesian Inference
Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as additional evidence is observed. Unlike frequentist inference, which treats parameters as fixed unknown quantities and data as random, Bayesian inference treats parameters as random variables with their own probability distributions. This perspective provides a coherent framework for reasoning under uncertainty, combining prior knowledge with observed data to produce updated beliefs known as posterior distributions.
Theoretical Foundation
The central equation of Bayesian inference is Bayes' theorem, which expresses the conditional probability of a hypothesis H given evidence E as the product of the likelihood of the evidence under the hypothesis and the prior probability of the hypothesis, divided by the total probability of the evidence. In compact form: posterior is proportional to likelihood times prior. The denominator, called the marginal likelihood or evidence, normalises the result so that it integrates to one over the parameter space.
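Written out, the statement above takes the following form. The symbols θ (parameter) and y (observed data) in the second expression are notation assumed here for illustration; the text itself uses only H and E.

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)},
\qquad
p(\theta \mid y)
  = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta')\, p(\theta')\, d\theta'}
  \;\propto\; p(y \mid \theta)\, p(\theta)
```

The integral in the denominator is the marginal likelihood; it does not depend on θ, which is why the proportional form is sufficient for many computations.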
The prior distribution encodes beliefs about parameters before observing data. It may be informative — incorporating expert knowledge or previous experiments — or weakly informative, providing only loose constraints. The likelihood function describes how probable the observed data are under each candidate value of the parameter. The posterior distribution, obtained by combining prior and likelihood, summarises updated beliefs and serves as the basis for prediction, decision-making, and further analysis.
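The interplay of prior, likelihood, and posterior can be sketched with a grid approximation for a single success probability. This is a minimal illustration, not a general-purpose method: the binomial data (7 successes in 10 trials), the flat prior, and the function names are all assumptions chosen for the example.

```python
def grid_posterior(successes, trials, prior_fn, n_grid=1001):
    """Approximate the posterior over a success probability on a uniform grid."""
    thetas = [i / (n_grid - 1) for i in range(n_grid)]
    unnormalised = []
    for theta in thetas:
        # Binomial likelihood up to a constant factor, which cancels below.
        likelihood = theta ** successes * (1.0 - theta) ** (trials - successes)
        unnormalised.append(likelihood * prior_fn(theta))
    # Dividing by the sum plays the role of the marginal likelihood.
    total = sum(unnormalised)
    weights = [u / total for u in unnormalised]
    return thetas, weights

def flat_prior(theta):
    """Hypothetical weakly informative prior: uniform on [0, 1]."""
    return 1.0

# Illustrative data: 7 successes in 10 trials.
thetas, weights = grid_posterior(successes=7, trials=10, prior_fn=flat_prior)
posterior_mean = sum(t * w for t, w in zip(thetas, weights))
```

Swapping `flat_prior` for a function that concentrates mass near a particular value shows how an informative prior pulls the posterior towards it when data are scarce.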
Computational Approaches
Closed-form posteriors are available mainly for a limited family of conjugate prior–likelihood pairs, such as the beta-binomial and normal-normal models, in which the posterior belongs to the same distributional family as the prior. For most practical problems, the posterior must be approximated numerically.
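As an illustration of a conjugate pair, the beta-binomial update reduces to adding observed counts to the prior parameters. The sketch below assumes a Beta(2, 2) prior and 7 successes with 3 failures; these numbers and the function names are chosen purely for the example.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after binomial data under a Beta(alpha, beta) prior."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Illustrative prior Beta(2, 2) updated with 7 successes and 3 failures.
alpha_post, beta_post = beta_binomial_update(2.0, 2.0, successes=7, failures=3)
posterior_mean = beta_mean(alpha_post, beta_post)
```

No integration is needed here because the normalising constant of the Beta family is known in closed form.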
Markov Chain Monte Carlo
Markov chain Monte Carlo (MCMC) methods, including the Metropolis–Hastings algorithm and Gibbs sampling, construct a Markov chain whose stationary distribution is the target posterior. Samples drawn from the chain after a burn-in period approximate the posterior, allowing computation of expectations, credible intervals, and predictive distributions. Hamiltonian Monte Carlo (HMC) and its adaptive variant, the No-U-Turn Sampler (NUTS) used in the Stan probabilistic programming language, exploit gradient information for more efficient exploration of high-dimensional posteriors.
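A random-walk Metropolis sampler, the simplest special case of Metropolis–Hastings, can be sketched in a few lines. The target below is an assumed example (the unnormalised posterior for 7 successes in 10 trials under a flat prior); the step size, burn-in length, and seed are illustrative choices rather than recommendations.

```python
import math
import random

def log_unnormalised_posterior(theta):
    """Log of likelihood times prior for 7 successes in 10 trials, flat prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior mass outside (0, 1)
    return 7 * math.log(theta) + 3 * math.log(1.0 - theta)

def metropolis_hastings(log_target, start, n_samples, step=0.2, burn_in=1000, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for i in range(burn_in + n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Symmetric proposal, so the acceptance ratio is the target ratio.
        # 1 - random() lies in (0, 1], which keeps the logarithm finite.
        log_u = math.log(1.0 - rng.random())
        if log_u < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            samples.append(current)  # keep only draws after burn-in
    return samples

samples = metropolis_hastings(log_unnormalised_posterior, start=0.5, n_samples=20000)
posterior_mean = sum(samples) / len(samples)
```

Only the unnormalised log-posterior is required, since the marginal likelihood cancels in the acceptance ratio. Gradient-based samplers such as HMC and NUTS replace the random-walk proposal with trajectories informed by the gradient of this function.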
Variational Inference
Variational inference reformulates posterior approximation as an optimisation problem. A simpler distribution from a chosen family, often a factorised Gaussian, is fitted to the true posterior by minimising the Kullback–Leibler divergence from the approximation to the posterior, which is equivalent to maximising the evidence lower bound (ELBO). Variational methods generally scale better than MCMC to large datasets and high-dimensional models, but they provide an approximation limited by the chosen family rather than asymptotically exact samples.
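The optimisation target can be made explicit with a standard identity. The notation is assumed here for illustration: q(θ) is the approximating distribution, y the data, and the divergence is taken in the direction KL(q ‖ p), which is the usual choice in variational inference.

```latex
\log p(y)
  = \underbrace{\mathbb{E}_{q(\theta)}\!\left[\log p(y, \theta) - \log q(\theta)\right]}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\!\left(q(\theta) \,\|\, p(\theta \mid y)\right)
```

Because log p(y) does not depend on q, maximising the first term (the evidence lower bound) is equivalent to minimising the divergence, and the first term can be evaluated without knowing the normalised posterior.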
Laplace Approximation
The Laplace approximation fits a Gaussian distribution centred at the posterior mode, using the negative Hessian of the log-posterior at the mode as the precision matrix. It is computationally cheap and often used as an initial approximation or within larger inference pipelines.
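For a one-dimensional case the approximation can be worked out analytically. The sketch below assumes a Beta(8, 4) posterior purely as an example, since its mode and second derivative are available in closed form; the function name is hypothetical.

```python
import math

def laplace_beta(alpha, beta):
    """Gaussian (mean, sd) approximating a Beta(alpha, beta) density at its mode."""
    if alpha <= 1.0 or beta <= 1.0:
        raise ValueError("an interior mode requires alpha > 1 and beta > 1")
    mode = (alpha - 1.0) / (alpha + beta - 2.0)
    # Negative second derivative of the log-density, evaluated at the mode.
    precision = (alpha - 1.0) / mode ** 2 + (beta - 1.0) / (1.0 - mode) ** 2
    return mode, math.sqrt(1.0 / precision)

# Illustrative posterior: Beta(8, 4), i.e. 7 successes in 10 trials with a flat prior.
mode, sd = laplace_beta(8.0, 4.0)
```

The resulting Gaussian is centred at 0.7 and is symmetric, whereas the Beta(8, 4) density is skewed; the gap between the two is the price of the cheap approximation.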
Applications in Machine Learning
Bayesian methods underpin a wide range of machine learning techniques. Gaussian processes provide a non-parametric Bayesian framework for regression and classification, returning uncertainty estimates over predictions that are well calibrated when the modelling assumptions hold. Bayesian neural networks place prior distributions over network weights and approximate the posterior to capture predictive uncertainty, which is useful in safety-critical applications such as medical diagnosis and autonomous driving. Bayesian optimisation uses a probabilistic surrogate model to guide search over expensive black-box functions and is widely applied to hyperparameter tuning of deep learning models.
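Gaussian process regression gives a compact picture of predictive uncertainty. The following is a minimal sketch using only the standard library; the squared-exponential kernel, its hyperparameters, the noise level, the toy data, and the small linear solver are all assumptions made for illustration, and a practical implementation would rely on a numerical library and a Cholesky factorisation.

```python
import math

def rbf(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return variance * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gp_predict(x_train, y_train, x_new, noise=1e-2):
    """Posterior mean and variance of the latent function at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    k_star = [rbf(x_new, a) for a in x_train]
    alpha = solve(K, y_train)   # (K + noise * I)^{-1} y
    v = solve(K, k_star)        # (K + noise * I)^{-1} k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var

# Toy data: three noisy-free evaluations of sin(x), treated as noisy observations.
x_train = [-1.0, 0.0, 1.0]
y_train = [math.sin(x) for x in x_train]
mean_near, var_near = gp_predict(x_train, y_train, 1.0)
mean_far, var_far = gp_predict(x_train, y_train, 10.0)
```

Near the training inputs the predictive variance is small, while far from them it returns to the prior variance and the mean reverts to zero. Bayesian optimisation exploits exactly this behaviour when deciding where to evaluate an expensive function next.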
In probabilistic programming, languages such as Stan, PyMC, NumPyro, and Edward allow practitioners to specify generative models in code and perform inference automatically. These tools have made Bayesian methods accessible to a broader community of data scientists and engineers.
Bayesian vs Frequentist Perspectives
The choice between Bayesian and frequentist approaches has been the subject of long-standing debate in statistics. Frequentist methods rely on long-run frequency interpretations of probability and avoid placing prior distributions on parameters, while Bayesian methods are explicit about prior beliefs and produce probabilistic statements about parameters directly. In practice, the two approaches often yield similar conclusions for well-identified problems with abundant data, but Bayesian methods are particularly valuable when data are scarce, when uncertainty quantification is critical, or when external knowledge must be incorporated formally.
References
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman and Hall/CRC.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Hoffman, M. D., and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.
- Bank Negara Malaysia. (2019). Policy Document on Model Risk Management. Kuala Lumpur: BNM.
- Carpenter, B. et al. (2017). Stan: A Probabilistic Programming Language. Journal of Statistical Software, 76(1).