
Variational Autoencoder

A variational autoencoder is a generative neural network that learns a probabilistic latent representation of data, enabling smooth sampling and reconstruction of new examples.

5 min read · Last updated May 2026 · Foundations

A variational autoencoder (VAE) is a generative deep learning model that learns a continuous, probabilistic latent representation of input data and uses it to reconstruct inputs or synthesise new samples. Introduced in 2013 by Diederik P. Kingma and Max Welling, VAEs combine ideas from variational inference, Bayesian statistics, and [[neural-network]] architectures. They are widely used for image generation, anomaly detection, representation learning, drug discovery, and as components inside larger systems such as latent diffusion pipelines.

Background

Classical autoencoders are neural networks trained to compress an input into a low-dimensional code and then reconstruct it from that code. They are useful for dimensionality reduction and denoising but lack probabilistic semantics: nothing constrains how codes are distributed, so decoding an arbitrary point in the latent space rarely yields a plausible sample. The VAE formulates the encoder and decoder probabilistically: the encoder outputs the parameters of a distribution over latent variables, and the decoder defines a likelihood over reconstructed data given a latent sample. Training optimises the evidence lower bound (ELBO), a tractable surrogate for the marginal likelihood.

Mathematical formulation

Given input x, the VAE assumes a generative process in which a latent code z is drawn from a prior p(z), typically a standard normal N(0, I), and the data are generated from a conditional distribution p_theta(x | z) parameterised by a neural network (the decoder). An approximate posterior q_phi(z | x), the encoder, is also parameterised by a neural network. Training maximises the ELBO:

ELBO(x) = E_{q_phi(z | x)} [ log p_theta(x | z) ] - KL( q_phi(z | x) || p(z) )

The first term is a reconstruction likelihood that rewards faithful decoding, and the second is a Kullback–Leibler divergence that regularises the latent distribution toward the prior. To make backpropagation feasible through the stochastic sampling step, the reparameterisation trick rewrites z = mu + sigma * epsilon, where epsilon is drawn from a fixed N(0, I) distribution and mu and sigma are produced by the encoder.
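The two ELBO terms and the reparameterised sample can be sketched in a few lines of plain Python. This is a minimal illustration of the forward computation only, under several assumptions that are not in the text above: the encoder outputs a mean and a log-variance for a diagonal Gaussian, the decoder likelihood is an independent Bernoulli, and `toy_decode` is a hypothetical stand-in for a decoder network. In practice an automatic-differentiation framework would compute gradients through the same expressions.

```python
import math
import random


def reparameterise(mu, log_var, rng):
    """Draw z = mu + sigma * epsilon element-wise, with epsilon ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def bernoulli_log_likelihood(x, probs, eps=1e-7):
    """log p(x | z) for binary x under independent Bernoulli outputs (assumed likelihood)."""
    return sum(xi * math.log(p + eps) + (1.0 - xi) * math.log(1.0 - p + eps)
               for xi, p in zip(x, probs))


def elbo_estimate(x, mu, log_var, decode, rng):
    """Single-sample Monte Carlo estimate of the ELBO for one input x."""
    z = reparameterise(mu, log_var, rng)
    return bernoulli_log_likelihood(x, decode(z)) - kl_to_standard_normal(mu, log_var)


def toy_decode(z):
    """Hypothetical decoder: maps z to the means of two Bernoulli outputs."""
    p = 1.0 / (1.0 + math.exp(-sum(z)))
    return [p, 1.0 - p]


if __name__ == "__main__":
    rng = random.Random(0)
    # mu and log_var would normally come from the encoder; these values are made up.
    print(elbo_estimate([1.0, 0.0], [0.3, -0.2], [-1.0, -0.5], toy_decode, rng))
```

Because the randomness enters only through epsilon, z is a deterministic function of mu and log_var for a fixed draw, which is what lets gradients flow back to the encoder parameters.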

Variants

| Variant | Modification | Typical purpose |
| --- | --- | --- |
| Beta-VAE | Weights the KL term by a hyperparameter beta | Encourages disentangled latent factors |
| Conditional VAE (CVAE) | Conditions on an auxiliary label or context | Class-conditional generation |
| VQ-VAE | Replaces continuous latents with a learned codebook | Discrete codes; used in audio and image tokenisation |
| Hierarchical VAE | Stacks multiple latent layers | Captures structure at multiple scales |
| Disentangled VAE | Additional structural priors on latent dimensions | Interpretability research |
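As a small illustration of the Beta-VAE row, the objective differs from the standard ELBO only in the weight on the KL term. The sketch below assumes the reconstruction log-likelihood and KL divergence have already been computed for an input; the function name and the default `beta=4.0` are illustrative choices, not values taken from this article.

```python
def beta_vae_loss(reconstruction_log_likelihood, kl_divergence, beta=4.0):
    """Negative beta-weighted ELBO, to be minimised.

    beta = 1 recovers the standard VAE objective; beta > 1 penalises
    deviation from the prior more strongly.
    """
    return -(reconstruction_log_likelihood - beta * kl_divergence)


if __name__ == "__main__":
    # Made-up values for one input: log-likelihood of -10 nats, KL of 2 nats.
    print(beta_vae_loss(-10.0, 2.0, beta=1.0))
    print(beta_vae_loss(-10.0, 2.0, beta=4.0))
```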

VQ-VAE, introduced by van den Oord and colleagues at DeepMind, has been particularly influential because it produces discrete tokens that can be modelled with autoregressive transformers, an approach used in audio codecs and image tokenisers.
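The quantisation step that turns continuous encoder outputs into discrete tokens can be sketched as a nearest-neighbour lookup in the codebook. This is a simplified illustration: the helper names and the tiny codebook are hypothetical, and the parts of VQ-VAE training that make the lookup learnable (the straight-through gradient estimator and the codebook and commitment losses) are omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantise(vector, codebook):
    """Return (index, entry) of the codebook vector nearest to `vector`."""
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(vector, codebook[k]))
    return index, codebook[index]


def tokenise(encoder_outputs, codebook):
    """Map a sequence of encoder output vectors to discrete token indices."""
    return [quantise(v, codebook)[0] for v in encoder_outputs]


if __name__ == "__main__":
    # Hypothetical 3-entry codebook of 2-dimensional vectors.
    codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
    outputs = [[0.1, -0.1], [-0.8, 1.9], [0.9, 1.2]]
    print(tokenise(outputs, codebook))
```

The resulting index sequence is what a downstream autoregressive model would be trained on, in place of the raw signal.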

Comparison with other generative models

VAEs sit alongside [[generative-adversarial-network]] (GANs), [[diffusion-model]] systems, normalising flows, and autoregressive models. Each family has trade-offs. GANs often produce sharper images but are harder to train and lack an explicit likelihood. Diffusion models typically achieve the strongest image quality at large scale but require many sampling steps. VAEs are comparatively stable to train and provide a principled likelihood-based framework, but unconditional VAE samples can appear blurry compared with GAN or diffusion outputs. VAEs are therefore frequently used as building blocks: many modern image diffusion systems, notably latent diffusion models, use a VAE-style autoencoder to encode images into a compact latent space in which the diffusion process then operates.

Applications

VAEs have been applied to image and audio generation, molecular design (where they help search the space of possible drug-like molecules), recommendation systems, time-series anomaly detection, and as compressors in lossy media codecs. In speech and audio, VQ-VAE-style models tokenise raw waveforms or spectrograms for downstream transformer modelling. In scientific machine learning, VAEs are used to learn low-dimensional representations of physical simulations and microscopy data.

Limitations

Researchers have documented several pathologies in VAE training, including posterior collapse (where the latent variable is ignored), the trade-off between reconstruction quality and prior-matching, and difficulty modelling sharp high-frequency detail. Variants such as Beta-VAE, free-bits training, and hybrid VAE/GAN approaches mitigate but do not fully resolve these issues. Despite their limitations, VAEs remain a foundational tool in the generative modelling toolkit and a frequent starting point for research and education.
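Of the mitigations mentioned above, free-bits training is simple enough to sketch. In one common formulation, the KL term is computed per latent dimension, averaged over the minibatch, and clamped from below at a threshold so that the optimiser gains nothing by pushing a dimension's KL under that value. The code below assumes a diagonal Gaussian posterior and a standard normal prior; the function names and the threshold of 0.5 nats are illustrative assumptions.

```python
import math


def per_dimension_kl(mu, log_var):
    """KL( N(mu_j, sigma_j^2) || N(0, 1) ) for each latent dimension j."""
    return [0.5 * (m * m + math.exp(lv) - 1.0 - lv)
            for m, lv in zip(mu, log_var)]


def free_bits_kl(batch_mu, batch_log_var, free_bits=0.5):
    """Sum over dimensions of max(free_bits, minibatch-mean KL of that dimension)."""
    n = len(batch_mu)
    mean_kl = [0.0] * len(batch_mu[0])
    for mu, log_var in zip(batch_mu, batch_log_var):
        for j, kl in enumerate(per_dimension_kl(mu, log_var)):
            mean_kl[j] += kl / n
    return sum(max(free_bits, kl) for kl in mean_kl)


if __name__ == "__main__":
    # Two made-up examples with a 2-dimensional latent; the first dimension
    # has collapsed to the prior (KL of zero) and is clamped to the threshold.
    batch_mu = [[0.0, 2.0], [0.0, 2.0]]
    batch_log_var = [[0.0, 0.0], [0.0, 0.0]]
    print(free_bits_kl(batch_mu, batch_log_var))
```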

References

  1. Kingma, D. P. and Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv:1312.6114.
  2. Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML.
  3. van den Oord, A., Vinyals, O., and Kavukcuoglu, K. (2017). Neural Discrete Representation Learning. NeurIPS.
  4. Kingma, D. P. and Welling, M. (2019). An Introduction to Variational Autoencoders. Foundations and Trends in Machine Learning.