AIWiki
Malaysia

Diffusion Model

A class of generative AI models that learn to reverse a gradual noise-addition process, enabling the generation of high-quality images, audio, and video from random noise guided by text or other conditioning signals.

7 min read | Last updated May 2026 | Foundations

Diffusion models are a family of generative machine learning models that learn to synthesise realistic data (most notably images, but also audio, video, and three-dimensional structures) by training a neural network to reverse a gradual noise corruption process. Starting from pure random noise, the trained model progressively removes noise over many steps until a coherent data sample emerges. By conditioning this denoising process on a text description, an image, or other inputs, diffusion models can generate outputs that closely match the specified conditions. Since 2022, diffusion models have supplanted earlier generative approaches such as generative adversarial networks (GANs) as the dominant approach to high-fidelity image generation.[^1]

The Forward Diffusion Process

Training a diffusion model begins with data, typically images from a large corpus, and the definition of a forward process that gradually corrupts that data over T timesteps. At each timestep t, a small amount of Gaussian noise is added to the image according to a predetermined noise schedule, which specifies how much noise is added at each step. After a sufficiently large number of steps (T = 1,000 is a common choice), the original image has been corrupted into what is effectively pure Gaussian noise, statistically indistinguishable from random static.
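The forward process can be sketched in a few lines of NumPy. The snippet assumes the linear noise schedule used by Ho et al. (2020), with per-step variances rising from 1e-4 to 0.02 over T = 1,000 steps; the function names `linear_beta_schedule` and `q_sample` are illustrative and do not come from any library. It relies on the closed-form property that a noisy sample at step t can be drawn directly from the clean image, without iterating through the intermediate steps.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T (assumed linear schedule)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form; t is a 0-based step index."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal variance kept up to step t

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))  # stand-in for a normalised image
x_early, _ = q_sample(x0, 10, alpha_bar, rng)   # still close to x0
x_final, _ = q_sample(x0, 999, alpha_bar, rng)  # essentially pure noise
```

Under this schedule the cumulative product `alpha_bar` falls below 1e-4 by the final step, which is why the fully corrupted image carries almost no trace of the original.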

The forward process is not learned; it is a fixed mathematical operation. It exists to generate training examples: pairs of a noisy image at timestep t and the corresponding noise that was added to reach that state.
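A single training step built from such a pair might look like the sketch below. It assumes a linear noise schedule and the mean-squared-error noise-prediction objective from Ho et al. (2020). `toy_model` is a hypothetical placeholder that always predicts zero noise; a real denoiser would be a neural network whose parameters are updated to reduce this loss.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def make_training_pair(x0, rng):
    """Pick a random timestep, corrupt x0, and return (x_t, t, eps)."""
    t = int(rng.integers(0, T))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, t, eps

def toy_model(x_t, t):
    """Hypothetical stand-in for a denoising network: predicts zero noise."""
    return np.zeros_like(x_t)

def noise_prediction_loss(model, x0, rng):
    """Mean-squared error between the true and the predicted noise."""
    x_t, t, eps = make_training_pair(x0, rng)
    eps_hat = model(x_t, t)
    return float(np.mean((eps - eps_hat) ** 2))

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 16, 16))
loss = noise_prediction_loss(toy_model, x0, rng)
```

Because the placeholder predicts zero, its loss is simply the mean squared value of unit-variance noise, close to 1. A trained network would be expected to score well below that.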

The Reverse Denoising Process

The neural network in a diffusion model is trained to predict the noise component in a noisy image at any given timestep, or equivalently to predict the original clean image. Given a noisy image at timestep t, the network's noise estimate is used to compute a slightly cleaner image for timestep t-1. Repeating this denoising operation across all timesteps, from t=T (pure noise) down to t=0, produces a complete, clean generated sample.
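The sampling loop can be sketched as follows, using the ancestral sampling rule from Ho et al. (2020) with an assumed linear schedule. Since no trained network is available here, the example substitutes a hypothetical `oracle` predictor that already knows the clean target; it exists only to show that the update rule recovers the target when the noise estimates are exact.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def ddpm_sample(predict_noise, shape, rng):
    """Ancestral sampling: start from noise, denoise from t = T-1 down to 0."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = predict_noise(x, t)
        # Mean of the reverse step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        # Fresh noise is added at every step except the last one.
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * z
    return x

# Hypothetical "oracle" predictor that already knows the clean target.
target = np.full((2, 4, 4), 0.5)

def oracle(x_t, t):
    return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

sample = ddpm_sample(oracle, target.shape, np.random.default_rng(0))
```

In practice `predict_noise` would be the trained network, and the result would be a new sample from the learned data distribution rather than a copy of a known target.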

The neural network architecture used for this denoising task has evolved. Early diffusion models used a U-Net, a convolutional architecture with skip connections between encoder and decoder layers, which proved effective at capturing multi-scale image structure. Stable Diffusion 3 (2024) and subsequent models replaced the U-Net with a Diffusion Transformer (DiT), applying self-attention across the spatial tokens of a noisy latent representation.[^2]
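To illustrate what "spatial tokens" means, the sketch below splits a latent tensor into non-overlapping patches and flattens each one into a token, which is the form a transformer's self-attention layers operate on. The function name `patchify`, the patch size of 2, and the tensor dimensions are assumptions chosen for illustration; real models also apply a learned linear projection and positional information, which are omitted here.

```python
import numpy as np

def patchify(latent, patch=2):
    """Split a (C, H, W) latent into (H/patch * W/patch) flattened patch tokens."""
    c, h, w = latent.shape
    assert h % patch == 0 and w % patch == 0
    x = latent.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)            # (h', w', C, patch, patch)
    return x.reshape((h // patch) * (w // patch), c * patch * patch)

latent = np.arange(4 * 8 * 8, dtype=np.float32).reshape(4, 8, 8)
tokens = patchify(latent)   # 16 tokens, each of length 16
```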

Latent Diffusion Models

Running the denoising process directly in pixel space is computationally expensive for high-resolution images. Latent diffusion models (LDMs), introduced by Rombach et al. in 2022 and commercialised as Stable Diffusion, address this by performing the diffusion process in a compressed latent space rather than in pixel space.[^3]

A separately trained autoencoder (typically a variational autoencoder, or VAE) provides an encoder that compresses the input image into a lower-dimensional latent representation and a decoder that maps latents back to pixels. The diffusion process is applied to this latent representation, and the decoder reconstructs the final image from the denoised latent. Operating in latent space reduces computational requirements by roughly an order of magnitude, enabling generation at a fraction of the cost of pixel-space diffusion and making the technique practical on consumer GPUs.
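The size of the saving can be illustrated with simple arithmetic. The sketch below assumes the configuration used by the first Stable Diffusion releases, a spatial downsampling factor of 8 and 4 latent channels; other models use different values, and the helper names are illustrative.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent tensor for an RGB image (assumed SD v1-style VAE)."""
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, downsample=8, latent_channels=4):
    """How many times fewer values the denoiser processes in latent space."""
    pixels = 3 * height * width
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return pixels / (c * h * w)

shape = latent_shape(512, 512)        # (4, 64, 64)
ratio = compression_ratio(512, 512)   # 48.0
```

Under these assumptions a 512x512 RGB image is represented by 48 times fewer values in latent space. The ratio counts tensor elements only and is not a direct measure of compute or memory savings.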

Text Conditioning

Text-to-image generation is achieved by conditioning the denoising process on a text embedding. The text prompt is encoded using a pre-trained language model or CLIP text encoder, producing a sequence of embedding vectors that represents the text semantics. These embeddings are injected into the denoising network at each timestep, typically via cross-attention layers, allowing the network to generate images that align with the specified description.
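A single-head cross-attention layer can be sketched in NumPy as follows. Queries come from the image (or latent) tokens, while keys and values come from the text embeddings, so each image position can draw on the parts of the prompt most relevant to it. All dimensions and weight matrices here are random placeholders chosen for illustration; in a real model the projections are learned and attention is usually multi-headed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: image tokens query the text tokens."""
    q = image_tokens @ Wq                 # (N_img, d)
    k = text_tokens @ Wk                  # (N_txt, d)
    v = text_tokens @ Wv                  # (N_txt, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N_img, N_txt)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_img, d_txt, d = 32, 48, 16
image_tokens = rng.standard_normal((64, d_img))   # e.g. an 8x8 latent grid
text_tokens = rng.standard_normal((77, d_txt))    # 77 is CLIP's context length
Wq = rng.standard_normal((d_img, d)) * 0.1        # random stand-ins for learned weights
Wk = rng.standard_normal((d_txt, d)) * 0.1
Wv = rng.standard_normal((d_txt, d)) * 0.1
out, weights = cross_attention(image_tokens, text_tokens, Wq, Wk, Wv)
```

Each row of `weights` sums to 1 and describes how strongly one image token attends to each text token.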

Classifier-free guidance (CFG), introduced by Ho and Salimans in 2022, is a technique that amplifies text conditioning without a separate classifier network.[^4] During training the text condition is randomly dropped, so the same model learns to denoise both with and without it. At inference time the model makes a conditional and an unconditional prediction, and the final output is extrapolated from the unconditional prediction towards, and beyond, the conditional one by a guidance scale parameter. Higher guidance scales produce images that more closely match the text description at the cost of some diversity.
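The guidance step itself is a one-line combination of two noise predictions. The sketch below uses the convention common in open-source implementations, where a scale of 1 reproduces the conditional prediction and larger values push past it; Ho and Salimans parameterise the same idea with a weight w, equivalent to a scale of 1 + w. The input arrays and the scale of 7.5 are arbitrary illustrative values.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0, 2.0])
eps_cond = np.array([1.0, 1.0, 0.0])

plain = classifier_free_guidance(eps_uncond, eps_cond, 1.0)    # equals eps_cond
guided = classifier_free_guidance(eps_uncond, eps_cond, 7.5)   # pushed past eps_cond
```

Where the two predictions agree (the middle element), guidance has no effect. Where they differ, the difference is amplified by the guidance scale.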

Notable Implementations

Stable Diffusion is an open-source latent diffusion model developed by Stability AI in collaboration with CompVis and Runway ML, released in August 2022. Its open-source nature enabled a large ecosystem of fine-tuned models, custom LoRA adapters, and derivative tools. Stable Diffusion 3 (2024) introduced the Multimodal Diffusion Transformer (MMDiT) architecture and flow matching training, substantially improving text rendering and compositional accuracy.

DALL-E 3 (2023), developed by OpenAI, is tightly integrated with ChatGPT and offers high caption-following fidelity, meaning the generated image closely matches the description. It uses a modified latent diffusion architecture conditioned on GPT-4-generated extended captions rather than the raw user prompt.

Midjourney is a commercial image generation service with proprietary architecture, known for aesthetically stylised outputs and a strong community of creative users.

Sora (2024), developed by OpenAI, extends diffusion principles to video generation, operating on spacetime patches of compressed video representations to produce coherent videos up to a minute in length from text descriptions.

Beyond images, diffusion models have been applied to audio generation (Stable Audio, AudioLDM), protein structure generation (RFdiffusion from David Baker's lab), and 3D object generation.

References

  1. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems, 33.
  2. Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., & Rombach, R. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. Stability AI Technical Report.
  3. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.
  4. Ho, J., & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.