Vision-Language Model

A multimodal AI system that jointly processes and generates information from both images and text, extending large language models with visual perception capabilities through cross-modal alignment.

Last updated June 2026

A vision-language model (VLM) is a multimodal artificial intelligence system capable of jointly interpreting and generating information from both images and natural language text. VLMs extend the capabilities of large language models (LLMs) — which are limited to text — by incorporating visual perception, enabling applications that require understanding the content of images, charts, screenshots, documents, and video frames alongside linguistic reasoning.

VLMs let a single model perform tasks that previously required either separate vision and language pipelines or extensive task-specific engineering. As of 2025, the leading frontier models, including GPT-4V (OpenAI), Gemini 2.5 Pro (Google DeepMind), and Claude (Anthropic), all accept images as input and can answer detailed questions about them.

Architecture

VLMs generally follow a modular three-component design: a vision encoder, a projector, and a language model decoder.

The vision encoder processes an input image and produces a sequence of visual feature vectors. The dominant choice is a Vision Transformer (ViT), which divides the image into fixed-size patches, linearly projects each patch into a token embedding, and processes the sequence through Transformer self-attention layers. CLIP-style encoders such as ViT-L/14 and ViT-G/14 are widely used as vision encoders because they have been pretrained on image-text pairs and produce semantically rich visual representations.
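The patch-and-project step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's code: the function names patchify and linear_project, the 4 x 4 toy image, and the embedding size of 8 are all assumptions chosen for readability, and position embeddings and the self-attention layers are omitted.

```python
import random


def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append every channel value
            patches.append(flat)
    return patches


def linear_project(patches, weight, bias):
    """Map each flattened patch to a token embedding: token = patch @ weight + bias."""
    return [
        [sum(p * w_row[j] for p, w_row in zip(patch, weight)) + bias[j]
         for j in range(len(bias))]
        for patch in patches
    ]


# Toy input: a 4 x 4 RGB image split into 2 x 2 patches (illustrative sizes only).
rng = random.Random(0)
image = [[[rng.random() for _ in range(3)] for _ in range(4)] for _ in range(4)]
patches = patchify(image, patch_size=2)

patch_dim, embed_dim = len(patches[0]), 8
weight = [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)] for _ in range(patch_dim)]
bias = [0.0] * embed_dim
tokens = linear_project(patches, weight, bias)

print(len(tokens), len(tokens[0]))  # 4 8
```

By the same arithmetic, a 224 x 224 image with 14 x 14 patches yields (224 / 14)^2 = 256 patch tokens.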

The projector (also called a connector or adapter) bridges the vision encoder and the language model by mapping visual feature vectors into the same embedding space as language tokens. Common projector designs include a multi-layer perceptron (MLP), a cross-attention layer, or a Q-Former as used in BLIP-2.
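As a rough sketch of the MLP variant, the code below maps vision-encoder features of one width into vectors of the language model's embedding width using two linear layers with a GELU in between. The layer count, the helper names (make_projector, project), the initialisation, and the dimensions 6 and 10 are illustrative assumptions, not the configuration of any specific VLM.

```python
import math
import random


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(vec, weight, bias):
    """Compute vec @ weight + bias for a weight of shape (d_in, d_out)."""
    return [sum(v * row[j] for v, row in zip(vec, weight)) + bias[j]
            for j in range(len(bias))]


def init_linear(rng, d_in, d_out):
    """Small random weights and zero bias (an arbitrary initialisation for the demo)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]
    return weight, [0.0] * d_out


def make_projector(rng, vision_dim, text_dim):
    """Two-layer MLP: vision_dim -> text_dim -> text_dim."""
    return {
        "fc1": init_linear(rng, vision_dim, text_dim),
        "fc2": init_linear(rng, text_dim, text_dim),
    }


def project(projector, visual_features):
    """Map each visual feature vector into the language model's embedding width."""
    projected = []
    for feature in visual_features:
        hidden = [gelu(h) for h in linear(feature, *projector["fc1"])]
        projected.append(linear(hidden, *projector["fc2"]))
    return projected


rng = random.Random(0)
VISION_DIM, TEXT_DIM = 6, 10  # toy sizes; real models use hundreds or thousands
projector = make_projector(rng, VISION_DIM, TEXT_DIM)
visual_features = [[rng.random() for _ in range(VISION_DIM)] for _ in range(3)]
visual_tokens = project(projector, visual_features)

print(len(visual_tokens), len(visual_tokens[0]))  # 3 10
```

The number of visual tokens is unchanged by this kind of projector; only the width of each vector changes. A Q-Former, by contrast, also reduces the number of tokens.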

The language model decoder receives the combined sequence of visual and textual tokens and generates the output response autoregressively. State-of-the-art VLMs use instruction-tuned decoder-only Transformers such as Llama or proprietary models as their language backbone.
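The following toy sketch shows the control flow only: projected visual tokens are placed ahead of the prompt embeddings, and the decoder then emits one token at a time, feeding each choice back in. The functions toy_step and toy_embed are hypothetical stand-ins for a real Transformer and its embedding table, and greedy selection is assumed here for simplicity; production systems usually sample.

```python
def build_input_sequence(visual_tokens, text_embeddings):
    """Concatenate projected visual tokens with the prompt's text embeddings."""
    return list(visual_tokens) + list(text_embeddings)


def greedy_decode(step_fn, prefix, embed_fn, eos_id, max_new_tokens=16):
    """Generate token ids one at a time, always taking the highest-scoring token."""
    sequence = list(prefix)
    generated = []
    for _ in range(max_new_tokens):
        logits = step_fn(sequence)  # scores over the vocabulary
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            break
        generated.append(next_id)
        sequence.append(embed_fn(next_id))  # feed the chosen token back in
    return generated


VOCAB_SIZE = 4
EOS_ID = 0
visual_tokens = [[0.1, 0.2], [0.3, 0.4]]  # two projected image tokens
text_embeddings = [[0.5, 0.6]]            # one prompt token embedding
prefix = build_input_sequence(visual_tokens, text_embeddings)


def toy_embed(token_id):
    """Hypothetical embedding lookup."""
    return [float(token_id), 0.0]


def toy_step(sequence):
    """Hypothetical decoder: emits ids 1, 2, 3 and then the end-of-sequence id."""
    n_generated = len(sequence) - len(prefix)
    target = n_generated + 1 if n_generated < 3 else EOS_ID
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]


output_ids = greedy_decode(toy_step, prefix, toy_embed, EOS_ID)
print(output_ids)  # [1, 2, 3]
```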

Training Paradigm

VLMs are typically trained in multiple stages. First, the vision encoder is pretrained on large image or image-text datasets. Second, the projector is trained on image-text pairs to align visual and linguistic representations, usually keeping the vision encoder and language model frozen. Third, the full model is fine-tuned on visual instruction-following datasets to develop conversational capabilities. Finally, reinforcement learning from human feedback (RLHF) or direct preference optimisation (DPO) may be applied to align outputs with human preferences.
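One way to make the staged schedule concrete is to record, for each stage, which components receive gradient updates. The sketch below does this for the first three stages as described above. The Stage class and stage names are hypothetical, and the optional preference-optimisation stage is left out because the text does not say which parameters it updates.

```python
from dataclasses import dataclass

COMPONENTS = frozenset({"vision_encoder", "projector", "language_model"})


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: frozenset

    @property
    def frozen(self):
        """Components that are not updated (or not yet attached) in this stage."""
        return COMPONENTS - self.trainable


STAGES = [
    Stage("encoder_pretraining", "image or image-text datasets",
          frozenset({"vision_encoder"})),
    Stage("projector_alignment", "image-text pairs",
          frozenset({"projector"})),
    Stage("instruction_tuning", "visual instruction-following data",
          COMPONENTS),
]

for stage in STAGES:
    print(f"{stage.name}: train={sorted(stage.trainable)} frozen={sorted(stage.frozen)}")
```

In a real training framework the same table would typically drive which parameters have gradients enabled at the start of each stage.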

CLIP (Contrastive Language-Image Pretraining, OpenAI, 2021) pioneered large-scale image-text alignment using a contrastive objective across 400 million image-caption pairs, producing a vision encoder capable of matching images to text descriptions. CLIP's visual representations became the de facto starting point for subsequent VLMs.
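A contrastive objective of this kind can be sketched as a symmetric cross-entropy over a matrix of image-text similarities, where the i-th image and the i-th caption form the matching pair. The version below is a simplified illustration in plain Python: the function names are assumptions, and the temperature is fixed at 0.07 for the example, whereas CLIP learns it during training.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cross_entropy(logits, target_index):
    """Numerically stable softmax cross-entropy for a single example."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric loss: pair i (image i, text i) should score highest in its row and column."""
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    n = len(images)
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# Toy batch of two pairs: aligned embeddings versus deliberately swapped captions.
image_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(image_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = contrastive_loss(image_batch, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)  # True
```

Minimising this loss pulls matching image and caption embeddings together and pushes mismatched pairs in the same batch apart.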

LLaVA (Visual Instruction Tuning, Liu et al., 2023) introduced a simple and influential VLM architecture connecting a CLIP vision encoder to a LLaMA-based language model (Vicuna) via a linear projector, trained on GPT-4-generated visual instruction data. LLaVA demonstrated that high-quality multimodal instruction-following could be achieved with relatively modest training compute.

Capabilities and Benchmarks

VLMs are evaluated on a range of visual reasoning benchmarks. VQAv2 tests factual question answering about images. MMBench, MMMU, and MMStar assess broader multimodal understanding including charts, diagrams, and multi-image reasoning. TextVQA and DocVQA evaluate reading and understanding text within images. ScienceQA tests multimodal scientific reasoning.
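As an example of how such benchmarks are scored, VQA-style datasets collect several human answers per question and give a prediction full credit when at least three annotators agree with it. The sketch below is a simplified form of that rule. The function name is an assumption, and the official evaluation additionally normalises answer strings and averages over subsets of annotators.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: min(number of matching human answers / 3, 1)."""
    normalized = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)


human_answers = ["red", "red", "Red", "red", "dark red",
                 "red", "maroon", "red", "red", "red"]
full_credit = vqa_accuracy("red", human_answers)
partial_credit = vqa_accuracy("maroon", human_answers)
no_credit = vqa_accuracy("blue", human_answers)
print(full_credit, no_credit)  # 1.0 0.0
```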

Major 2025 VLMs perform strongly across these benchmarks. Gemini 2.5 Pro supports a context window of up to one million tokens, enabling video and long-document understanding. Qwen2.5-VL (Alibaba) is a competitive open-weight VLM that performs well on document and chart understanding.

Applications

In healthcare, VLMs analyse medical images — radiology scans, pathology slides, dermatology images — alongside clinical notes, enabling AI-assisted diagnosis and report generation. In document understanding, VLMs process scanned documents, forms, invoices, and tables, extracting structured information without requiring separate OCR pipelines. In robotics and embodied AI, VLMs serve as the perception and reasoning backbone for robots that must interpret visual scenes and follow natural language instructions. In accessibility, VLMs generate natural language descriptions of images for visually impaired users.

References

  1. Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. ICML 2021. OpenAI.
  2. Liu, H., et al. (2023). Visual instruction tuning. NeurIPS 2023.
  3. Li, J., et al. (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML 2023.
  4. Hugging Face. (2024). Vision language models explained. huggingface.co/blog/vlms.
  5. Wikipedia. (2025). Vision-language model. en.wikipedia.org.