Vision-Language Model

A multimodal AI system that jointly processes and generates information from both images and text, extending large language models with visual perception capabilities through cross-modal alignment.

5 min read | Last updated June 2026 | Models

A vision-language model (VLM) is a multimodal artificial intelligence system capable of jointly interpreting and generating information from both images and natural language text. VLMs extend the capabilities of large language models (LLMs) — which are limited to text — by incorporating visual perception, enabling applications that require understanding the content of images, charts, screenshots, documents, and video frames alongside linguistic reasoning.

VLMs are a major recent advance in AI: they handle tasks that previously required either separate vision and language pipelines or extensive task-specific engineering. As of 2025, the leading frontier model families, including OpenAI's GPT-4V and its successors, Gemini 2.5 Pro (Google DeepMind), and Claude (Anthropic), all accept images as input and can describe and reason about them in detail.

Architecture

VLMs generally follow a modular three-component design: a vision encoder, a projector, and a language model decoder.

The vision encoder processes an input image and produces a sequence of visual feature vectors. The dominant choice is a Vision Transformer (ViT), which divides the image into fixed-size patches, linearly projects each patch into a token embedding, and processes the resulting sequence through Transformer self-attention layers. CLIP-pretrained encoders such as ViT-L/14 (OpenAI) and ViT-G/14 (OpenCLIP) are widely used as vision encoders because they have been pretrained on image-text pairs and produce semantically rich visual representations.
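The patch-and-project step described above can be sketched in plain Python. This is a toy illustration rather than any particular model's code: the 8 x 8 image, the patch size of 4, the embedding width of 6, and the helper names `patchify` and `linear_project` are all assumptions chosen to keep the example small.

```python
import random


def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size x C block into a single vector.
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches


def linear_project(vectors, weights, bias):
    """Apply y = x W + b to each vector; weights has shape (in_dim, out_dim)."""
    return [
        [sum(x_i * w_row[j] for x_i, w_row in zip(x, weights)) + bias[j]
         for j in range(len(bias))]
        for x in vectors
    ]


rng = random.Random(0)
# Toy 8 x 8 RGB image with random pixel values (illustrative data only).
image = [[[rng.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]
patch_size, embed_dim = 4, 6

patches = patchify(image, patch_size)               # 4 patches of 4 * 4 * 3 = 48 values
weights = [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)] for _ in range(48)]
bias = [0.0] * embed_dim
tokens = linear_project(patches, weights, bias)     # 4 patch tokens of width 6
print(len(tokens), len(tokens[0]))
```

A real ViT also adds position embeddings to these tokens before the self-attention layers. At production scale the same arithmetic applies: with 14-pixel patches, a 224 x 224 image yields 16 x 16 = 256 patch tokens.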

The projector (also called a connector or adapter) bridges the vision encoder and the language model by mapping visual feature vectors into the same embedding space as language tokens. Common projector designs include a multi-layer perceptron (MLP), a cross-attention layer, or a Q-Former as used in BLIP-2.
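As a minimal sketch of the simplest projector design, the following two-layer MLP maps each visual feature vector to the width of the language model's token embeddings. The dimensions, the random initialisation, the GELU activation, and the helper names are assumptions for illustration; a trained projector would load learned weights instead.

```python
import math
import random


def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(rows, weights, bias):
    """Compute row @ weights + bias for each row; weights is (in_dim, out_dim)."""
    return [
        [sum(x * w[j] for x, w in zip(row, weights)) + bias[j] for j in range(len(bias))]
        for row in rows
    ]


def init_linear(rng, in_dim, out_dim):
    """Small random weights and a zero bias (illustrative initialisation only)."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]
    return weights, [0.0] * out_dim


def mlp_projector(visual_features, layer1, layer2):
    """Map vision-encoder features into the language model's embedding width."""
    hidden = [[gelu(v) for v in row] for row in linear(visual_features, *layer1)]
    return linear(hidden, *layer2)


rng = random.Random(0)
vision_dim, lm_dim, num_patches = 8, 12, 5          # toy sizes, assumed for illustration
layer1 = init_linear(rng, vision_dim, lm_dim)
layer2 = init_linear(rng, lm_dim, lm_dim)
visual_features = [[rng.random() for _ in range(vision_dim)] for _ in range(num_patches)]

visual_tokens = mlp_projector(visual_features, layer1, layer2)
print(len(visual_tokens), len(visual_tokens[0]))    # one output token per input patch
```

An MLP projector keeps one output token per input patch. Query-based designs such as the Q-Former instead attend over the patch features with a fixed set of learned queries, so the number of visual tokens passed to the language model does not grow with image size.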

The language model decoder receives the combined sequence of visual and textual tokens and generates the output response autoregressively. Current VLMs typically use an instruction-tuned decoder-only Transformer as the language backbone, either an open-weight model such as Llama or a proprietary one.
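The sketch below illustrates how the combined sequence might be assembled and then extended one token at a time. It is a toy under stated assumptions: `toy_decoder` is a hypothetical stand-in for a real Transformer decoder, the vocabulary has four tokens, and placing the visual tokens ahead of the prompt is one common layout rather than a fixed rule.

```python
def build_input_sequence(visual_tokens, prompt_ids, embedding_table):
    """Place projected visual tokens ahead of the embedded text prompt."""
    text_tokens = [embedding_table[token_id] for token_id in prompt_ids]
    return list(visual_tokens) + text_tokens


def greedy_decode(sequence, next_token_logits, embedding_table, eos_id, max_new_tokens=8):
    """Autoregressive loop: pick the top-scoring token, append its embedding, repeat."""
    sequence = list(sequence)
    generated = []
    for _ in range(max_new_tokens):
        logits = next_token_logits(sequence)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            break
        generated.append(next_id)
        sequence.append(embedding_table[next_id])
    return generated


def toy_decoder(sequence):
    """Hypothetical stand-in for a decoder: favours token id len(sequence) % 4."""
    logits = [0.0] * 4
    logits[len(sequence) % 4] = 1.0
    return logits


embedding_table = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]  # 4-token toy vocabulary
visual_tokens = [[1.0, 1.0], [2.0, 2.0]]    # stands in for projector output
prompt_ids = [0, 1]                         # stands in for a tokenised question

sequence = build_input_sequence(visual_tokens, prompt_ids, embedding_table)
generated = greedy_decode(sequence, toy_decoder, embedding_table, eos_id=3)
print(generated)
```

Because the visual tokens sit in the same sequence as the text, every generated token can attend to the image through the decoder's ordinary self-attention, with no separate vision pathway at generation time.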

Training Paradigm

VLMs are typically trained in multiple stages. First, the vision encoder is pretrained on large image or image-text datasets. Second, the projector is trained on image-text pairs to align visual and linguistic representations, usually with the vision encoder and language model kept frozen. Third, the full model is fine-tuned on visual instruction-following datasets to develop conversational capabilities. Finally, reinforcement learning from human feedback (RLHF) or direct preference optimisation (DPO) may be applied to align outputs with human preferences.
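The staged schedule above can be written down as a small table of which components are updated at each step. This is an illustrative data structure, not a training script; the `Stage` class and component names are assumptions, and the trainable set for the final preference stage is a guess because the description does not specify it.

```python
from dataclasses import dataclass

COMPONENTS = frozenset({"vision_encoder", "projector", "language_model"})


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset
    optional: bool = False

    @property
    def frozen(self):
        """Components that receive no gradient updates in this stage."""
        return COMPONENTS - self.trainable


STAGES = (
    # Stage 1 involves only the vision encoder; the other parts are not attached yet.
    Stage("vision encoder pretraining", frozenset({"vision_encoder"})),
    Stage("projector alignment", frozenset({"projector"})),
    # Follows the description above; some recipes keep the vision encoder frozen here.
    Stage("visual instruction tuning", COMPONENTS),
    # Assumption: the text does not say which parts the preference stage updates.
    Stage("preference optimisation (RLHF or DPO)", COMPONENTS, optional=True),
)

for stage in STAGES:
    print(f"{stage.name}: train={sorted(stage.trainable)} frozen={sorted(stage.frozen)}")
```

In a deep learning framework, each stage would translate into switching gradient computation on or off for the corresponding parameter groups before training resumes.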

CLIP (Contrastive Language-Image Pre-training, OpenAI, 2021) pioneered large-scale image-text alignment. It trains an image encoder and a text encoder jointly with a contrastive objective on 400 million image-text pairs, so that matching images and captions land close together in a shared embedding space. CLIP's visual representations became the de facto starting point for subsequent VLMs.
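A contrastive image-text objective of this kind can be sketched as a symmetric cross-entropy over a similarity matrix. The version below is a simplified plain-Python illustration: the two-dimensional embeddings are made up, and the temperature is fixed at 0.07 here as an assumption, whereas CLIP treats it as a learned parameter.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against a target index (stable log-sum-exp)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: pair i is the positive, every other pairing in the batch is a negative."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    n = len(images)
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# Toy batch of two pairs: aligned embeddings versus deliberately swapped captions.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimising this loss pulls each image towards its own caption and pushes it away from the other captions in the batch, which is what makes the resulting image encoder useful as a starting point for a VLM.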

LLaVA (Visual Instruction Tuning, Liu et al., 2023) introduced a simple and influential VLM architecture that connects a CLIP vision encoder to a LLaMA-based language model (Vicuna) through a linear projection layer, trained on visual instruction data generated with GPT-4. LLaVA demonstrated that high-quality multimodal instruction-following could be achieved with relatively modest training compute.

Capabilities and Benchmarks

VLMs are evaluated on a range of visual reasoning benchmarks. VQAv2 tests factual question answering about images. MMBench, MMMU, and MMStar assess broader multimodal understanding including charts, diagrams, and multi-image reasoning. TextVQA and DocVQA evaluate reading and understanding text within images. ScienceQA tests multimodal scientific reasoning.
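To make the scoring concrete, the commonly cited VQA accuracy rule gives a predicted answer full credit once at least three human annotators gave the same answer, and partial credit below that. The sketch below implements that simplified rule only; the official evaluation also normalises answers more thoroughly and averages over annotator subsets, and the sample answers here are hypothetical.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: min(number of matching annotators / 3, 1)."""
    normalised = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == normalised)
    return min(matches / 3.0, 1.0)


# Ten hypothetical annotator answers for a single image-question pair.
human_answers = ["red"] * 2 + ["maroon"] * 8

partial = vqa_accuracy("Red", human_answers)      # two annotators agree
full = vqa_accuracy("maroon", human_answers)      # eight annotators agree
miss = vqa_accuracy("blue", human_answers)        # nobody agrees
print(partial, full, miss)
```

The soft credit reflects that open-ended visual questions often have more than one acceptable answer, so a benchmark score is an average of these per-question values rather than a strict exact-match rate.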

VLMs released in 2025 perform strongly across these benchmarks. Gemini 2.5 Pro supports a context window of about one million tokens, which lets it take video and long documents as input. Qwen2.5-VL (Alibaba) is a competitive open-weight VLM that is particularly strong on document and chart understanding.

Applications

In healthcare, VLMs analyse medical images — radiology scans, pathology slides, dermatology images — alongside clinical notes, enabling AI-assisted diagnosis and report generation. In document understanding, VLMs process scanned documents, forms, invoices, and tables, extracting structured information without requiring separate OCR pipelines. In robotics and embodied AI, VLMs serve as the perception and reasoning backbone for robots that must interpret visual scenes and follow natural language instructions. In accessibility, VLMs generate natural language descriptions of images for visually impaired users.
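In document-understanding pipelines, the model's reply usually has to be checked before it reaches a downstream system. The sketch below assumes a VLM has been prompted to return invoice fields as JSON; the field names, the sample reply, and the `parse_extraction` helper are hypothetical, and no specific vendor API is implied.

```python
import json

# Fields the downstream system is assumed to need, with their expected types.
REQUIRED_FIELDS = {"invoice_number": str, "total_amount": (int, float), "currency": str}


def parse_extraction(raw_reply):
    """Parse a model's JSON reply and reject it if required fields are missing or mistyped."""
    try:
        record = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(record, dict):
        raise ValueError("reply must be a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        value = record.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {field!r} is missing or has the wrong type")
    return record


# Hypothetical reply text, standing in for what a VLM might return for a scanned invoice.
sample_reply = '{"invoice_number": "INV-0001", "total_amount": 1250.5, "currency": "MYR"}'
record = parse_extraction(sample_reply)
print(record["invoice_number"], record["total_amount"], record["currency"])
```

Validation of this kind matters because generated output can be malformed or incomplete; rejected replies can be retried or routed to manual review instead of being written into a system of record.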

Malaysian Context — VLM Adoption in Document Processing and Healthcare

Vision-language models are seeing early adoption in Malaysia primarily through document processing and financial services. Malaysian banks — including Maybank, CIMB, and AmBank — process large volumes of forms, contracts, and identity documents that require both visual layout understanding and textual extraction. VLM-based document automation pipelines, deployed via cloud APIs from Microsoft Azure and Amazon Bedrock, are reducing manual processing costs for loan applications, KYC verification, and trade finance documentation.

The Malaysian healthcare system is exploring VLMs for medical imaging analysis, particularly in underserved public hospitals where radiologist coverage is limited. Hospital Universiti Kebangsaan Malaysia (HUKM) and Universiti Malaya Medical Centre (UMMC) have published research on AI-assisted chest X-ray and retinal imaging analysis, with VLM-based systems offering the potential to triage high-priority cases.

Malaysia's National Language and Literary Council (Dewan Bahasa dan Pustaka) has expressed interest in VLMs capable of reading and interpreting Jawi script — the Arabic-derived writing system used for classical Malay texts — alongside romanised Bahasa Malaysia, enabling digitisation of historical manuscripts.

In manufacturing, Penang's electronics and semiconductor cluster uses VLM-based visual inspection systems for printed circuit board (PCB) quality control, combining visual defect detection with natural language reporting that allows engineers to query inspection findings conversationally.

MDEC's AI for All initiative and Digital Nasional Berhad's (DNB) 5G rollout are creating infrastructure conditions that make cloud-based VLM APIs more accessible to Malaysian SMEs, supporting adoption of visual AI capabilities without requiring on-premises model deployment.

References

Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. ICML 2021. OpenAI.
Liu, H., et al. (2023). Visual instruction tuning. NeurIPS 2023.
Li, J., et al. (2023). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML 2023.
Hugging Face. (2024). Vision language models explained. huggingface.co/blog/vlms.
Wikipedia. (2025). Vision-language model. en.wikipedia.org.

Tags: vision language model, VLM, multimodal AI, GPT-4V, visual reasoning

Type	Multimodal AI model
Key examples	GPT-4V, Gemini, Claude, LLaVA, Qwen2.5-VL
Core components	Vision encoder, projector, language model decoder
Key use	Image captioning, visual QA, medical imaging, robotics
Related	Multimodal AI, large language models, foundation model, CLIP
