Vision Transformer

The Vision Transformer (ViT) is a deep learning model that applies the transformer architecture, originally designed for NLP, directly to sequences of image patches, achieving state-of-the-art results on visual recognition tasks.

5 min read | Last updated June 2026 | Foundations

The Vision Transformer (ViT) is a deep learning model introduced by Google Research in October 2020 that adapts the Transformer architecture — originally developed for natural language processing — to image understanding tasks. By treating an image as a sequence of fixed-size patches, ViT demonstrated that pure self-attention mechanisms could match or exceed the performance of convolutional neural networks (CNNs) on image recognition benchmarks when trained on sufficiently large datasets. The ViT architecture has since become a widely adopted foundation for computer vision research and production systems.

Architecture

ViT processes images by dividing them into a grid of non-overlapping patches of fixed resolution. For a standard input image of 224x224 pixels with a patch size of 16x16, this produces (224 / 16)^2 = 14 x 14 = 196 patches. Each patch is flattened into a 1D vector and linearly projected into the model's embedding space; the embedding width is a model hyperparameter and is not necessarily smaller than the flattened patch.
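The patch arithmetic above can be checked with a small sketch. The function names `patch_stats` and `patchify` are illustrative rather than taken from any library, and the three-channel (RGB) input is an assumption the article does not state.

```python
def patch_stats(image_size, patch_size, channels=3):
    """Return (number of patches, length of each flattened patch)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side, patch_size * patch_size * channels


def patchify(image, patch_size):
    """Split a square single-channel image (list of rows) into flattened patches.

    Patches are returned in row-major order: left to right, then top to bottom.
    """
    size = len(image)
    patches = []
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


num_patches, patch_dim = patch_stats(224, 16)
print(num_patches, patch_dim)  # 196 768

# A toy 4x4 "image" split into four 2x2 patches.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = patchify(toy, 2)
print(toy_patches)
```

Under the RGB assumption, each flattened 16x16 patch has 768 values, and the linear projection maps that vector to whatever embedding width the model uses.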

A special classification token (analogous to BERT's CLS token) is prepended to the sequence of patch embeddings. To preserve spatial information, learnable positional embeddings are then added to every token before the sequence is fed into a standard Transformer encoder. After passing through the encoder, the classification token's output representation is used by a classification head to produce the final label prediction.
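As a rough illustration of how the encoder input is assembled, the sketch below prepends a class token and adds one positional embedding per token. The helper name `build_input_sequence` is hypothetical, the embedding width is kept tiny for readability, and random values stand in for parameters that would be learned in a real model.

```python
import random


def build_input_sequence(patch_embeddings, cls_token, pos_embeddings):
    """Prepend the class token, then add one positional embedding per token."""
    tokens = [cls_token] + patch_embeddings
    if len(tokens) != len(pos_embeddings):
        raise ValueError("need one positional embedding per token (patches + 1)")
    return [
        [t + p for t, p in zip(token, pos)]
        for token, pos in zip(tokens, pos_embeddings)
    ]


random.seed(0)
num_patches, dim = 196, 8  # dim is illustrative; real models use far wider embeddings

# Stand-ins for projected patches and learned parameters (random for illustration).
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
cls_token = [0.0] * dim
pos = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(num_patches + 1)]

sequence = build_input_sequence(patches, cls_token, pos)
print(len(sequence), len(sequence[0]))  # 197 8
```

The sequence is one token longer than the number of patches, so the positional embedding table needs 197 rows for a 224x224 input with 16x16 patches.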

The Transformer encoder consists of alternating layers of multi-head self-attention and feed-forward (MLP) networks, with layer normalisation applied before each sub-layer and a residual connection around it. This architecture allows every patch to attend to every other patch in the image from the very first layer, enabling global context modelling that differs fundamentally from CNNs, which build up receptive fields progressively through local convolutions.
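The following is a deliberately simplified sketch of one pre-norm encoder block, intended only to show the data flow. It assumes a single attention head with identity query, key, and value projections, uses an elementwise ReLU as a stand-in for the learned MLP, and includes the residual connections of the standard Transformer. All function names are illustrative.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalise one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[i] for w, value in zip(row, tokens)) for i in range(dim)]
        for row in weights
    ]
    return outputs, weights


def feed_forward(token):
    """Stand-in for the MLP: an elementwise ReLU instead of two learned linear layers."""
    return [max(0.0, v) for v in token]


def encoder_block(tokens):
    """Pre-norm block: x = x + Attn(LN(x)), then x = x + MLP(LN(x))."""
    attended, weights = self_attention([layer_norm(t) for t in tokens])
    tokens = [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, attended)]
    mlp_out = [feed_forward(layer_norm(t)) for t in tokens]
    tokens = [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, mlp_out)]
    return tokens, weights


tokens = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0], [-1.0, 2.0, 0.0, 1.0]]
out, weights = encoder_block(tokens)
# Each row of `weights` is one token's attention distribution over all tokens.
print([round(sum(row), 6) for row in weights])
```

Every row of the attention matrix has a non-zero weight for every token, which is the sense in which a single layer already has access to global context.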

Training and Data Requirements

A key finding from the original ViT paper was that pure self-attention models are data-hungry. When trained on mid-scale datasets such as ImageNet (1.2 million images), ViT underperformed well-regularised CNNs. However, when pre-trained on very large datasets — specifically Google's internal JFT-300M dataset with 300 million images — ViT achieved state-of-the-art results on ImageNet classification, surpassing leading CNN architectures such as EfficientNet and ResNet.

This data dependency motivated subsequent work on data-efficient training. The Data-efficient Image Transformers (DeiT) approach, developed by Facebook AI Research in 2021, introduced a knowledge distillation scheme in which a dedicated distillation token learns from a teacher network. This allows ViT-sized models to match CNN performance when trained only on ImageNet, without access to large proprietary datasets.
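One way to picture distillation in this setting is the hard-label variant described in the DeiT paper, where the student has two output heads: one trained against the true label and one trained against the teacher's predicted label. The sketch below assumes that variant with equal weighting; the function and argument names are hypothetical, and the logits are made-up numbers.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` against an integer class label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def hard_distillation_loss(class_logits, dist_logits, teacher_logits, label):
    """Average of the true-label loss and the teacher-label loss."""
    # The teacher's "hard" decision is simply its highest-scoring class.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    true_loss = cross_entropy(class_logits, label)
    teacher_loss = cross_entropy(dist_logits, teacher_label)
    return 0.5 * true_loss + 0.5 * teacher_loss


loss = hard_distillation_loss(
    class_logits=[2.0, 0.5, -1.0],    # student head fed by the class token
    dist_logits=[1.5, 1.0, -0.5],     # student head fed by the distillation token
    teacher_logits=[0.2, 3.0, 0.1],   # teacher prefers class 1
    label=0,                          # ground truth is class 0
)
print(round(loss, 4))
```

In this toy case the teacher and the ground truth disagree, so the two heads are pulled toward different classes; that is permitted because each head has its own token.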

Variants and Extensions

The ViT family has expanded substantially since the original release. The Swin Transformer (2021) introduced hierarchical feature maps and shifted window attention, making ViT-based models practical for dense prediction tasks such as object detection and semantic segmentation. The DINO method demonstrated that ViTs trained with self-supervised learning develop explicit semantic segmentation properties without any pixel-level supervision.
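One reason commonly given for windowed attention being practical on dense prediction tasks is cost: global attention grows quadratically with the number of tokens, while window attention grows linearly for a fixed window size. The sketch below uses the approximate cost expressions from the Swin Transformer paper; the channel width of 96 and window size of 7 are illustrative values, and the function names are hypothetical.

```python
def global_attention_cost(h, w, channels):
    """Approximate cost of global multi-head self-attention on an h x w token grid."""
    tokens = h * w
    return 4 * tokens * channels**2 + 2 * tokens**2 * channels


def window_attention_cost(h, w, channels, window):
    """Approximate cost when attention is restricted to window x window blocks."""
    tokens = h * w
    return 4 * tokens * channels**2 + 2 * window**2 * tokens * channels


for side in (14, 56, 112):
    full = global_attention_cost(side, side, 96)
    local = window_attention_cost(side, side, 96, 7)
    print(f"{side}x{side} tokens: global/window cost ratio = {full / local:.1f}")
```

The ratio widens as the token grid grows, which matters for detection and segmentation because those tasks typically need high-resolution feature maps.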

Larger ViT variants include ViT-G (1.8 billion parameters) and ViT-22B (22 billion parameters), the latter representing one of the largest vision models to date as of its release by Google in 2023. Hybrid architectures that combine convolutional stems with transformer encoders blend the inductive biases of CNNs with the global attention of transformers.
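To give a sense of where parameter counts come from, the sketch below counts the learnable parameters of a plain ViT classifier. It assumes the ViT-Base/16 configuration from the original paper (12 layers, width 768, MLP width 3072) with a single linear head for 1000 classes; the function name is hypothetical. The larger variants named above use different configurations that are not given here, so the sketch does not try to reproduce their counts.

```python
def vit_param_count(depth, width, mlp_dim, patch_size=16, image_size=224,
                    channels=3, num_classes=1000):
    """Count learnable parameters of a plain ViT classifier (weights and biases)."""
    num_patches = (image_size // patch_size) ** 2
    patch_embed = (patch_size * patch_size * channels) * width + width
    cls_and_pos = width + (num_patches + 1) * width
    attention = 4 * (width * width + width)      # Q, K, V and output projections
    mlp = (width * mlp_dim + mlp_dim) + (mlp_dim * width + width)
    layer_norms = 2 * (2 * width)                # scale and shift, two norms per block
    block = attention + mlp + layer_norms
    final_norm = 2 * width
    head = width * num_classes + num_classes
    return patch_embed + cls_and_pos + depth * block + final_norm + head


base_params = vit_param_count(depth=12, width=768, mlp_dim=3072)
print(base_params)  # 86567656
```

Almost all of the parameters sit in the encoder blocks, and each block scales with the square of the width, which is why widening and deepening the encoder quickly reaches the billion-parameter range.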

Comparison with CNNs

| Property | CNN | Vision Transformer |
|---|---|---|
| Global context | Only at deeper layers | From first layer |
| Data requirement | Lower | Higher (mitigated by pretraining) |
| Inductive bias | Strong (translation equivariance) | Weak |
| Scalability | Good | Excellent with scale |
| Edge deployment | Well-optimised | Improving |

By 2025, ViT-family models consistently outperform CNNs on large-scale benchmarks when pre-training data and compute are not constrained. On ImageNet-1k classification, the largest ViT models, pre-trained on very large external datasets, achieve over 90 percent top-1 accuracy. CNNs retain practical advantages in data-scarce regimes and on edge deployments where convolutional operations are hardware-optimised.

Applications

ViT and its variants have been deployed in medical image analysis (radiology, pathology slide classification), satellite and aerial imagery interpretation, autonomous vehicle perception, face recognition, and industrial quality inspection. The self-attention mechanism's ability to capture long-range dependencies makes ViT particularly well-suited to tasks where distant image regions need to be related — for example, identifying the relationship between a tumour and surrounding tissue in medical imaging.

Malaysian Context — Computer Vision and ViT Adoption in Malaysia

Computer vision applications underpinned by Vision Transformer architectures are increasingly deployed across Malaysian industries. The manufacturing sector is a cornerstone of Malaysia's economy and includes semiconductor fabrication and electronics assembly in Penang and Selangor. It has adopted ViT-based visual inspection systems for automated defect detection. Companies such as Inari Amertron and Globetronics have explored AI-powered optical inspection systems, targeting accuracy levels not achievable with rule-based machine vision.

In the agricultural sector, MARDI (Malaysian Agricultural Research and Development Institute) and private agritech companies have applied ViT-based image classification to crop disease detection, pest identification, and yield estimation from drone imagery. These applications are particularly relevant for Malaysia's oil palm and rubber plantation monitoring, where the scale of plantations makes manual inspection economically infeasible.

The Malaysian government's AI Roadmap and MyDigital Blueprint include computer vision as a priority technology for smart city and public safety applications. The Royal Malaysia Police (PDRM) and local municipal councils have piloted ViT-based surveillance analytics for traffic monitoring and public space management in urban centres including Kuala Lumpur and Penang.

MDEC and Cyberview have supported the development of AI talent in computer vision through programmes at digital hubs in Cyberjaya. Malaysian universities including Universiti Teknologi Malaysia (UTM) and Multimedia University (MMU) include Vision Transformer content in postgraduate AI curricula and have published research applying ViT architectures to Malay script recognition and regional satellite imagery analysis.

References

Dosovitskiy, A. et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. arXiv:2010.11929.
Touvron, H. et al. (2021). Training data-efficient image transformers and distillation through attention. ICML 2021.
Liu, Z. et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021.
Caron, M. et al. (2021). Emerging Properties in Self-Supervised Vision Transformers (DINO). ICCV 2021.
Hire AI Developer. (2025). How Vision Transformers (ViT) Are Redefining Computer Vision Architectures in 2025. medium.com.

Tags: vision-transformer, computer-vision, transformer, image-recognition, deep-learning

Developed by	Google Research, Brain Team
Released	October 2020
Architecture	Transformer encoder on image patches
Training data	JFT-300M, ImageNet-21k
Key use	Image classification, object detection
Related	DeiT, Swin Transformer, CLIP, DINO