CLIP

CLIP (Contrastive Language-Image Pre-training) is a multimodal neural network model developed by OpenAI that learns visual concepts from natural language descriptions by jointly training an image encoder and a text encoder on 400 million image-text pairs.

5 min read | Last updated June 2026 | Category: Models

CLIP (Contrastive Language-Image Pre-training) is a multimodal neural network model developed by OpenAI and published in January 2021. CLIP learns rich visual and semantic representations by training jointly on images and their natural language descriptions, enabling it to associate visual content with textual concepts without being explicitly trained on any particular classification task. Its ability to perform zero-shot image classification, that is, to sort images into categories it was never explicitly trained to recognise, marked a significant step towards more general visual intelligence.

Architecture

CLIP consists of two encoder networks that operate in parallel: an image encoder and a text encoder. The image encoder processes input images and produces a vector embedding summarising the visual content. The text encoder processes natural language descriptions and produces a corresponding vector embedding in the same shared latent space.

For the image encoder, OpenAI experimented with two families of architectures: ResNet variants (CNN-based) and Vision Transformers (ViT-based). The ViT-based variants proved more computationally efficient at scale. For the text encoder, a standard Transformer architecture similar to GPT-2 is used, with the embedding of the end-of-text token serving as the text representation.
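The text-side pooling step described above can be illustrated with a small sketch. It assumes the Transformer has already produced one final-layer vector per token; the token ids, the `eot_token_id` value, and the two-dimensional vectors are all hypothetical placeholders rather than values from CLIP's actual tokenizer.

```python
def text_representation(token_ids, hidden_states, eot_token_id):
    """Return the final-layer vector at the end-of-text token position.

    token_ids:     list of int ids for one tokenised caption (padded)
    hidden_states: list of vectors, one per token position
    eot_token_id:  id of the end-of-text token (hypothetical value here)
    """
    eot_position = token_ids.index(eot_token_id)  # first end-of-text token
    return hidden_states[eot_position]


# Hypothetical caption: start token 1, two word tokens, end-of-text 2, padding 0
token_ids = [1, 7, 42, 2, 0, 0]
hidden_states = [[0.0, 0.1], [0.2, 0.3], [0.3, 0.1], [0.4, 0.5], [0.0, 0.0], [0.0, 0.0]]
text_vector = text_representation(token_ids, hidden_states, eot_token_id=2)
```

Because the text encoder attends left to right, the end-of-text position is the one that has seen the whole caption, which is why its vector can stand in for the full sentence.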

The key design principle is that matching image-text pairs have embeddings close together in the shared latent space, while non-matching pairs are pushed apart. This is achieved with a symmetric contrastive loss applied to each batch of image-text pairs: for each image, the model maximises the cosine similarity with its own text description and minimises similarity with every other description in the batch, and the same objective is applied in the other direction, from each text to the images in the batch.
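As a rough illustration of this objective, the sketch below computes a symmetric cross-entropy loss over a batch of already-encoded pairs using only the Python standard library. The function names are illustrative, and the temperature is fixed at 0.07 for simplicity (CLIP learns this value during training, starting from 0.07); treat it as a minimal sketch, not OpenAI's implementation.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i])."""
    imgs = [l2_normalise(v) for v in image_embs]
    txts = [l2_normalise(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: row i should assign high probability to column i
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should assign high probability to row j
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batch: correctly paired embeddings versus deliberately swapped ones
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

On this toy batch the loss is close to zero when each image sits next to its own caption and large when the captions are swapped, which is the behaviour the training objective rewards.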

Training Data

CLIP was trained on WebImageText (WIT), a dataset of approximately 400 million (image, text) pairs collected from publicly available internet sources. Unlike earlier image datasets such as ImageNet, which required extensive manual labelling, WIT was assembled through large-scale web crawling. This approach enabled training at a scale previously impractical for supervised image classification datasets.

Zero-Shot Classification

One of CLIP's most notable capabilities is zero-shot classification: categorising images into classes the model was never explicitly trained to distinguish. This is achieved by constructing text prompts for each candidate label — for example, "a photo of a dog" — and computing the cosine similarity between the image embedding and each text prompt embedding. The label whose text embedding is closest to the image embedding is selected as the predicted class.
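The procedure can be sketched in a few lines. The three-dimensional vectors below are hypothetical stand-ins for real encoder outputs (actual CLIP embeddings have hundreds of dimensions), and `zero_shot_classify` is an illustrative name, not part of any CLIP library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, prompt_embs):
    """Return (best_prompt, scores), where scores maps prompt -> cosine similarity."""
    scores = {prompt: cosine(image_emb, emb) for prompt, emb in prompt_embs.items()}
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Hypothetical embeddings; in practice these come from the text and image encoders
prompt_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_emb, prompt_embs)
print(label)  # a photo of a dog
```

Adding a new class requires only a new prompt and its text embedding; nothing in the model is retrained.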

On ImageNet zero-shot evaluation, CLIP matched the accuracy of a supervised ResNet-50 trained directly on ImageNet, without using any of ImageNet's 1.28 million labelled training examples. This demonstrated that natural language supervision can be a viable alternative or complement to manually labelled classification datasets.

Applications and Downstream Uses

CLIP embeddings have become a standard component in multimodal AI pipelines. DALL-E 2 generates images with a decoder conditioned on a CLIP image embedding, which a separate prior model predicts from the text caption. Stable Diffusion and many other diffusion-based models use CLIP text encoders to condition image generation on natural language prompts.

Beyond generation, CLIP embeddings are used for image-text retrieval (finding images that match a text query), visual question answering, content moderation, and multimodal search. The contrastive learning paradigm introduced by CLIP has influenced a broad family of subsequent multimodal models, including BLIP, BLIP-2, SigLIP, and MetaCLIP.
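Image-text retrieval with a shared embedding space reduces to a nearest-neighbour search. The sketch below ranks a small in-memory index by cosine similarity to a text query embedding; the file names and vectors are hypothetical, and a production system would typically replace the linear scan with an approximate nearest-neighbour index.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def top_k_images(query_emb, image_index, k=2):
    """Return the k (image_id, similarity) pairs most similar to the query."""
    q = normalise(query_emb)
    scored = []
    for image_id, emb in image_index.items():
        similarity = sum(a * b for a, b in zip(q, normalise(emb)))
        scored.append((similarity, image_id))
    scored.sort(reverse=True)  # highest cosine similarity first
    return [(image_id, similarity) for similarity, image_id in scored[:k]]


# Hypothetical image embeddings keyed by file name
image_index = {
    "batik_shirt.jpg": [0.9, 0.2, 0.1],
    "rattan_chair.jpg": [0.1, 0.9, 0.2],
    "phone.jpg": [0.1, 0.1, 0.95],
}
query_emb = [0.85, 0.3, 0.0]  # stand-in for the embedding of a text query

results = top_k_images(query_emb, image_index, k=2)
```

The same function covers image-to-image similarity search if `query_emb` comes from the image encoder instead of the text encoder, since both encoders write into the same space.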

Limitations

Despite its versatility, CLIP has notable limitations. It struggles with fine-grained classification tasks requiring subtle visual distinctions (for example, telling apart car models). It tends to rely on texture and high-level semantic content rather than fine spatial relationships. CLIP also inherits biases present in internet-scraped training data, including stereotyped associations between demographic groups and certain labels.

Malaysian Context — CLIP and Multimodal AI in Malaysia

CLIP and CLIP-derived models have become relevant to several Malaysian AI application areas, particularly involving image search, content moderation, and retail technology. The model's ability to bridge natural language and visual content without task-specific labelled data makes it attractive for organisations with limited annotation budgets.

Malaysian e-commerce platforms, including Shopee Malaysia and Lazada Malaysia, have explored CLIP-based visual search capabilities allowing consumers to search product catalogues using images rather than text queries. Visual similarity search powered by CLIP embeddings enables "find similar products" features particularly useful for fashion, furniture, and consumer electronics categories.

In the media and content industry, Malaysian digital publishers and social media platforms have applied CLIP-based content moderation systems to detect policy-violating image content across Malay and English-captioned posts. The model's multimodal capability allows filtering based on both image content and accompanying text simultaneously.

Malaysian universities including Universiti Putra Malaysia (UPM) and Universiti Kebangsaan Malaysia (UKM) have published research applying CLIP to agricultural and environmental monitoring tasks, including identifying plant species and land-use categories from aerial imagery of the Malaysian landscape. These applications benefit from CLIP's ability to generalise to new visual categories through natural language specification.

MDEC and the National AI Office have noted multimodal AI as a priority area within Malaysia's AI development agenda. Training programmes offered through HRD Corp-registered providers have begun incorporating multimodal model concepts, including CLIP-style contrastive pre-training, into AI engineer upskilling curricula.

References

Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. OpenAI.
Viso AI. (2024). CLIP: Contrastive Language-Image Pre-Training. viso.ai.
Zhai, X. et al. (2023). Sigmoid Loss for Language Image Pre-Training (SigLIP). ICCV 2023. Google.
Hugging Face. (2024). CLIP: Contrastive Language-Image Pretraining. Hugging Face Computer Vision Course.
OpenAI. (2021). CLIP: Connecting text and images. openai.com.

Tags: clip, multimodal, openai, image-text, contrastive-learning

Full name	Contrastive Language-Image Pre-training
Developed by	OpenAI
Released	January 2021
Training data	400 million image-text pairs (WIT)
Key use	Zero-shot image classification, image-text retrieval
Related	DALL-E, Stable Diffusion, Vision Transformer, BLIP