Stable Diffusion

Stable Diffusion is an open-source latent diffusion model developed by Stability AI that generates high-quality images from text prompts, running efficiently on consumer-grade hardware.

Last updated: May 2026. Category: Models.

Stable Diffusion is a latent diffusion model first released publicly in August 2022. It was developed by Stability AI in collaboration with researchers at Ludwig Maximilian University of Munich (LMU Munich) and Runway. It is one of the most widely adopted open-source generative AI models and can produce photorealistic and artistic images from natural language text prompts. Unlike many contemporary image generation systems, which are available only through cloud APIs, Stable Diffusion can run on consumer-grade graphics processing units (GPUs), making it accessible to individual developers, artists, and researchers worldwide.

By some estimates, around 80 percent of all AI-generated images produced by 2024 came from Stable Diffusion-based tools or derivatives, a figure that reflects the model's dominant position in the open-source image generation ecosystem.

Architecture

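Stable Diffusion is built on the latent diffusion model (LDM) architecture, which performs the diffusion process in a compressed latent space rather than directly in pixel space. Because the latent representation contains far fewer values than the full-resolution image, each denoising step is much cheaper to compute. This design choice is the key technical innovation that makes the model computationally practical.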
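To make the saving concrete, the short sketch below compares the number of values in a pixel-space image with the number in its latent. It assumes the configuration commonly cited for Stable Diffusion 1.x (a 512x512 RGB image, a VAE that downsamples each side by a factor of 8, and 4 latent channels); the function names are illustrative, and other versions of the model use different sizes.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the latent for an image of this size.

    The defaults (factor 8, 4 channels) are assumptions matching the
    commonly cited Stable Diffusion 1.x configuration.
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)


def num_values(shape):
    """Total number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixel_shape = (3, 512, 512)          # RGB image in pixel space
lat_shape = latent_shape(512, 512)   # latent under the assumptions above

compression = num_values(pixel_shape) / num_values(lat_shape)
print(lat_shape, compression)        # (4, 64, 64) 48.0
```

Under these assumptions the U-Net operates on roughly 48 times fewer values than it would in pixel space.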

The model consists of three primary components. The variational autoencoder (VAE) compresses input images into a lower-dimensional latent representation and reconstructs images from latent codes. The U-Net, a convolutional architecture with attention layers, learns to iteratively denoise latent representations during both training and inference. The text encoder, typically a CLIP (Contrastive Language-Image Pre-training) model developed by OpenAI, converts text prompts into numerical embeddings that guide the denoising process.

Forward and Reverse Diffusion

During training, the forward diffusion process gradually adds Gaussian noise to image latents over a fixed number of timesteps until the latent is indistinguishable from pure random noise. The model then learns to reverse this process: given a noisy latent and its timestep, it predicts the noise that was added, so that the noise can be removed step by step to recover the original latent. This approach is inspired by principles from non-equilibrium thermodynamics and has strong theoretical connections to score-based generative models.
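The forward process described above has a convenient closed form: a noisy latent at any timestep can be sampled directly from the clean latent. The following is a minimal NumPy sketch of that idea, not Stable Diffusion's actual training code. The linear noise schedule and its endpoints follow the original DDPM formulation and are an assumption here, and the random array merely stands in for a real image latent.

```python
import numpy as np


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (assumed values, as in the DDPM formulation)."""
    return np.linspace(beta_start, beta_end, num_steps)


def add_noise(x0, t, alphas_cumprod, rng):
    """Sample x_t from x_0: sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise


def correlation(a, b):
    """Pearson correlation between two arrays, flattened."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])


rng = np.random.default_rng(0)
betas = linear_beta_schedule()
alphas_cumprod = np.cumprod(1.0 - betas)   # fraction of signal kept up to step t

x0 = rng.standard_normal((4, 64, 64))      # stand-in for an image latent
x_early, _ = add_noise(x0, 10, alphas_cumprod, rng)
x_late, _ = add_noise(x0, 999, alphas_cumprod, rng)

# Early steps stay close to the original; the last step is almost pure noise.
print(correlation(x_early, x0), correlation(x_late, x0))
```

During training, the returned `noise` serves as the regression target: the network is shown `x_t` and `t` and is penalised for the difference between its prediction and the noise that was actually added.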

During inference, generation begins from a random noise tensor. The U-Net iteratively denoises this tensor over a series of sampling steps (typically 20 to 50), guided by the text embedding, until a coherent image latent emerges. The VAE decoder then maps this latent back to pixel space to produce the final image.
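The inference loop can be sketched in a few lines. The example below uses a deterministic DDIM-style update, which is one of several samplers in common use and is not necessarily the one a given tool applies by default. The `oracle` function is a hypothetical stand-in for the trained, text-conditioned U-Net: it "predicts" noise by already knowing the target latent, which makes the loop easy to check but is not how a real model works. The noise schedule is likewise an assumption.

```python
import numpy as np


def ddim_sample(predict_noise, shape, alphas_cumprod, num_steps=25, seed=0):
    """Run a deterministic DDIM-style denoising loop starting from pure noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    # Evenly spaced subset of the training timesteps, noisiest first.
    timesteps = np.linspace(len(alphas_cumprod) - 1, 0, num_steps).round().astype(int)
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        # After the final step no noise should remain, so a_prev is 1.
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = predict_noise(x, t)
        # Estimate the clean latent, then re-noise it to the next (lower) level.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x


# Assumed linear schedule with 1000 training timesteps.
alphas_cumprod = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

# Hypothetical "perfect" predictor that knows the latent it should recover.
target = np.random.default_rng(1).standard_normal((4, 8, 8))


def oracle(x, t):
    a_t = alphas_cumprod[t]
    return (x - np.sqrt(a_t) * target) / np.sqrt(1.0 - a_t)


result = ddim_sample(oracle, target.shape, alphas_cumprod, num_steps=25)
print(np.allclose(result, target, atol=1e-6))
```

In the real model the noise predictor also receives the text embedding, and the resulting latent is passed through the VAE decoder to obtain pixels.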

Versions and Variants

Multiple versions of the model have been released since 2022. Stable Diffusion 1.5, released in October 2022, became the most widely used base model due to its balance of quality and speed. Stable Diffusion 2.0 and 2.1 improved resolution handling and replaced the text encoder with OpenCLIP. Stable Diffusion XL (SDXL), released in 2023, introduced a two-stage architecture, a base model followed by an optional refiner, that produces higher-quality images at 1024x1024 resolution. Stable Diffusion 3 and 3.5, released in 2024, adopted a multimodal diffusion transformer architecture with improved text rendering and composition.

Capabilities and Use Cases

The model supports several primary use cases. Text-to-image generation creates images from a written prompt describing the desired content. Image-to-image translation transforms an existing image guided by a prompt, blending the source image structure with the prompt's semantic content. Inpainting allows selective regeneration of masked regions within an image. ControlNet, a popular extension, enables precise spatial control over generated images using edge maps, depth maps, or human pose skeletons.
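Two of these modes can be illustrated with simple arithmetic. The sketch below is a simplified illustration rather than the code of any particular tool. It assumes image-to-image is controlled by a `strength` value between 0 and 1 that decides how many of the sampling steps are actually run on the noised source latent, and that inpainting composites the generated latent with the original using a binary mask. Both helper names are hypothetical.

```python
import numpy as np


def steps_to_run(num_inference_steps, strength):
    """Number of denoising steps applied in image-to-image.

    strength = 0 leaves the source untouched; strength = 1 discards it
    and runs the full schedule, as in text-to-image.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


def blend_with_mask(generated, original, mask):
    """Keep the original where mask == 0 and the generated content where mask == 1."""
    return mask * generated + (1.0 - mask) * original


print(steps_to_run(40, 0.5))             # 20

original = np.zeros((4, 8, 8))           # stand-in latent of the source image
generated = np.ones((4, 8, 8))           # stand-in latent produced by denoising
mask = np.zeros((1, 8, 8))
mask[:, :, 4:] = 1.0                     # regenerate only the right half

blended = blend_with_mask(generated, original, mask)
print(blended[0, 0])                     # [0. 0. 0. 0. 1. 1. 1. 1.]
```

Lower strength values preserve more of the source image's structure, while higher values give the prompt more influence.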

Fine-tuning methods including DreamBooth and textual inversion allow users to train the model on a small number of custom images, enabling it to generate novel depictions of specific subjects, styles, or objects.

Open-Source Ecosystem

The open-source release of Stable Diffusion's weights in 2022 catalysed a large ecosystem of community tools and applications. The Stable Diffusion web UI by AUTOMATIC1111 became the dominant local interface, offering extensive controls and an extension system. ComfyUI, a node-based workflow tool, provides more flexible pipeline construction. Hugging Face hosts thousands of fine-tuned model variants and LoRA (Low-Rank Adaptation) weights contributed by the community.
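LoRA weights are small because they store a low-rank update to a layer rather than a full copy of it. The sketch below shows the idea for a single weight matrix, following the scaling convention of the original LoRA formulation (update scaled by alpha divided by rank). The 768x768 layer size, the rank of 8, and the function name are illustrative assumptions, not values taken from any specific model file.

```python
import numpy as np


def lora_merge(weight, lora_down, lora_up, alpha):
    """Merge a low-rank update into a weight matrix.

    weight:    (out_features, in_features)
    lora_down: (rank, in_features)
    lora_up:   (out_features, rank)
    """
    rank = lora_down.shape[0]
    return weight + (alpha / rank) * (lora_up @ lora_down)


rng = np.random.default_rng(0)
out_features, in_features, rank = 768, 768, 8   # illustrative sizes

weight = rng.standard_normal((out_features, in_features))
lora_down = rng.standard_normal((rank, in_features))
lora_up = np.zeros((out_features, rank))        # zero init: no change before training

merged = lora_merge(weight, lora_down, lora_up, alpha=8.0)

full_params = weight.size
lora_params = lora_down.size + lora_up.size
print(full_params, lora_params, full_params // lora_params)   # 589824 12288 48
```

With these sizes the adapter stores 48 times fewer values than the layer it modifies, which is why community LoRA files are typically far smaller than full fine-tuned checkpoints.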

This open ecosystem contrasts with closed models such as DALL-E and Midjourney, which are available only through hosted services. Open weights allow researchers and developers to study, modify, and redistribute the model and its derivatives within the terms of the model licence.
