CLIP

CLIP (Contrastive Language-Image Pre-training) is a multimodal neural network model developed by OpenAI that learns visual concepts from natural language descriptions by jointly training an image encoder and a text encoder on 400 million image-text pairs.

Last updated June 2026

CLIP (Contrastive Language-Image Pre-training) is a multimodal neural network model developed by OpenAI and published in January 2021. It learns visual and semantic representations by training jointly on images and their natural language descriptions, so it can associate visual content with textual concepts without being trained for any particular classification task. Its ability to perform zero-shot image classification, that is, assigning images to categories for which it received no task-specific labelled examples, marked a significant step towards more general visual intelligence.

Architecture

CLIP consists of two encoder networks that operate in parallel: an image encoder and a text encoder. The image encoder processes input images and produces a vector embedding summarising the visual content. The text encoder processes natural language descriptions and produces a corresponding vector embedding in the same shared latent space.
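The idea of a shared latent space can be illustrated with a minimal NumPy sketch. The feature sizes, the random projection matrices, and the helper name `project_to_shared_space` are illustrative assumptions rather than CLIP's actual weights or API. The sketch only shows that each encoder's output is mapped to a common dimension and L2-normalised, so that a dot product between an image embedding and a text embedding equals their cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: each encoder has its own feature width,
# but both are projected to the same shared dimension.
IMAGE_FEATURE_DIM, TEXT_FEATURE_DIM, SHARED_DIM = 768, 512, 256

# Stand-ins for learned linear projections (random here, learned in practice).
W_image = rng.normal(size=(IMAGE_FEATURE_DIM, SHARED_DIM))
W_text = rng.normal(size=(TEXT_FEATURE_DIM, SHARED_DIM))


def project_to_shared_space(features, projection):
    """Linearly project encoder features and L2-normalise each row."""
    embeddings = features @ projection
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)


# Stand-ins for encoder outputs for a batch of 4 images and 4 captions.
image_features = rng.normal(size=(4, IMAGE_FEATURE_DIM))
text_features = rng.normal(size=(4, TEXT_FEATURE_DIM))

image_emb = project_to_shared_space(image_features, W_image)
text_emb = project_to_shared_space(text_features, W_text)

# With unit-length rows, the dot product is the cosine similarity.
similarity = image_emb @ text_emb.T  # shape (4, 4), values in [-1, 1]
```

Entry `similarity[i, j]` compares image `i` with caption `j`; in a trained model the diagonal entries (the matching pairs) would be the largest in each row and column.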

For the image encoder, OpenAI experimented with two families of architectures: ResNet variants (CNN-based) and Vision Transformers (ViT-based). The ViT-based variants proved more computationally efficient at scale. The text encoder is a standard Transformer similar to GPT-2, and the final-layer activation at the end-of-text token serves as the representation of the whole text.
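Taking the end-of-text token as the text representation amounts to a simple indexing step, sketched below with NumPy. The function name `pool_end_of_text`, the token ids, and the choice of `eot_id = 2` are hypothetical and do not reflect CLIP's real tokenizer; the sketch also assumes every sequence contains an end-of-text token.

```python
import numpy as np


def pool_end_of_text(token_states, token_ids, eot_id):
    """Return the hidden state at the first end-of-text position of each sequence.

    token_states: (batch, seq_len, width) per-token outputs of the text encoder.
    token_ids:    (batch, seq_len) integer token ids, padded after end-of-text.
    """
    # argmax over a boolean array gives the index of the first True value.
    eot_positions = np.argmax(token_ids == eot_id, axis=1)
    batch_indices = np.arange(token_ids.shape[0])
    return token_states[batch_indices, eot_positions]


# Two padded toy sequences: 1 = start, 2 = end-of-text, 0 = padding (all hypothetical).
token_ids = np.array([
    [1, 5, 6, 2, 0, 0],
    [1, 7, 2, 0, 0, 0],
])
# Fake per-token hidden states of width 3, numbered so positions are easy to trace.
token_states = np.arange(2 * 6 * 3).reshape(2, 6, 3)

pooled = pool_end_of_text(token_states, token_ids, eot_id=2)  # shape (2, 3)
```

Each row of `pooled` is the hidden state at that sequence's end-of-text position (index 3 for the first sequence, index 2 for the second), which would then be projected into the shared embedding space.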

The key design principle is that matching image-text pairs have embeddings close together in the shared latent space, while non-matching pairs are pushed apart. This is achieved with a contrastive loss applied to each batch of image-text pairs: for each image, the model maximises the cosine similarity with its own text description and minimises the similarity with every other description in the batch. The same objective is applied symmetrically from each text description to the images in the batch.
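A minimal NumPy sketch of such a batch-wise contrastive loss follows. It is an illustration under stated assumptions, not OpenAI's training code: the function names are hypothetical, the temperature is a fixed illustrative value of 0.07 (in CLIP it is a learned parameter), and the embeddings are random toy data. The sketch averages an image-to-text and a text-to-image cross-entropy, following the symmetric formulation described in the CLIP paper.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities for N matched pairs.

    Row i of image_emb is assumed to match row i of text_emb.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) matrix of temperature-scaled cosine similarities.
    logits = image_emb @ text_emb.T / temperature
    diag = np.arange(logits.shape[0])

    # Each image should pick out its own text among all texts (rows) ...
    loss_image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    # ... and each text should pick out its own image among all images (columns).
    loss_text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_image_to_text + loss_text_to_image) / 2


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 32))
# Toy "image" embeddings that almost coincide with their paired text embeddings.
aligned_images = text_emb + 0.01 * rng.normal(size=(8, 32))

loss_aligned = contrastive_loss(aligned_images, text_emb)
# Reversing the image order breaks every pairing, so the loss should rise.
loss_shuffled = contrastive_loss(aligned_images[::-1], text_emb)
```

The loss is low when each image is most similar to its own caption and high when the pairing is broken, which is the behaviour the training objective rewards.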

Training Data

CLIP was trained on WebImageText (WIT), a dataset of approximately 400 million (image, text) pairs collected from publicly available internet sources. Unlike earlier image datasets such as ImageNet, which required extensive manual labelling, WIT was assembled through large-scale web crawling. This approach enabled training at a scale previously impractical for supervised image classification datasets.

Zero-Shot Classification

One of CLIP's most notable capabilities is zero-shot classification: categorising images into classes the model was never explicitly trained to distinguish. This is achieved by constructing text prompts for each candidate label — for example, "a photo of a dog" — and computing the cosine similarity between the image embedding and each text prompt embedding. The label whose text embedding is closest to the image embedding is selected as the predicted class.
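The procedure can be sketched in a few lines of NumPy. The embeddings below are hand-written toy vectors standing in for real encoder outputs, and `zero_shot_classify` is a hypothetical helper name; with an actual CLIP model the prompt strings would be passed through the text encoder and the image through the image encoder.

```python
import numpy as np


def zero_shot_classify(image_emb, prompt_embs, labels):
    """Pick the label whose prompt embedding has the highest cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    prompt_embs = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    similarities = prompt_embs @ image_emb  # one score per candidate label
    return labels[int(np.argmax(similarities))], similarities


labels = ["dog", "cat", "car"]
prompts = [f"a photo of a {label}" for label in labels]

# Toy stand-ins for the text-encoder outputs of the three prompts.
prompt_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
# Toy stand-in for the image-encoder output of a picture of a cat.
image_emb = np.array([0.2, 0.9, 0.1])

predicted, scores = zero_shot_classify(image_emb, prompt_embs, labels)
```

Because the candidate labels enter only through the prompt texts, the label set can be changed at inference time without retraining the model.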

On ImageNet zero-shot evaluation, CLIP achieved accuracy comparable to a supervised ResNet-50 trained directly on ImageNet, without using any of ImageNet's 1.28 million labelled training examples. This demonstrated that natural language supervision can be a viable alternative or complement to manually labelled classification datasets.

Applications and Downstream Uses

CLIP embeddings have become a standard component in multimodal AI pipelines. In DALL-E 2, a prior model maps the text prompt to a CLIP image embedding, and a diffusion decoder generates the image conditioned on that embedding. Stable Diffusion and many other diffusion-based models use CLIP text encoders to condition image generation on natural language prompts.

Beyond generation, CLIP embeddings are used for image-text retrieval (finding images that match a text query), visual question answering, content moderation, and multimodal search. The contrastive learning paradigm introduced by CLIP has influenced a broad family of subsequent multimodal models, including BLIP, BLIP-2, SigLIP, and MetaCLIP.
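Image-text retrieval reduces to a nearest-neighbour search in the shared embedding space. The sketch below assumes the image embeddings have already been computed and stored; the helper name `retrieve_top_k` and the two-dimensional toy vectors are illustrative only.

```python
import numpy as np


def retrieve_top_k(query_emb, image_embs, k=2):
    """Return indices and scores of the k images most similar to the query, best first."""
    query_emb = query_emb / np.linalg.norm(query_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = image_embs @ query_emb        # cosine similarity per image
    top = np.argsort(-scores)[:k]          # sort descending, keep the first k
    return top, scores[top]


# Toy stand-ins for precomputed image embeddings (one row per image).
image_embs = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
    [-1.0, 0.0],
])
# Toy stand-in for the text-encoder output of the search query.
query_emb = np.array([1.0, 0.2])

top_indices, top_scores = retrieve_top_k(query_emb, image_embs, k=2)
```

For large collections, the exhaustive dot product used here would typically be replaced by an approximate nearest-neighbour index, but the ranking criterion stays the same.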

Limitations

Despite its versatility, CLIP has notable limitations. It struggles with fine-grained classification tasks requiring subtle visual distinctions (for example, telling apart car models) and with more abstract tasks such as counting the objects in an image. It tends to rely on texture and high-level semantic content rather than fine spatial relationships. CLIP also inherits biases present in internet-scraped training data, including stereotyped associations between demographic groups and certain labels.

References

  1. Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. OpenAI.
  2. Viso AI. (2024). CLIP: Contrastive Language-Image Pre-Training. viso.ai.
  3. Zhai, X. et al. (2023). Sigmoid Loss for Language Image Pre-Training (SigLIP). ICCV 2023. Google.
  4. Hugging Face. (2024). CLIP: Contrastive Language-Image Pretraining. Hugging Face Computer Vision Course.
  5. OpenAI. (2021). CLIP: Connecting text and images. openai.com.