Vision Transformer
The Vision Transformer (ViT) is a deep learning model introduced by Google Research in October 2020 that adapts the Transformer architecture, originally developed for natural language processing, to image recognition. By treating an image as a sequence of fixed-size patches, ViT showed that a pure self-attention model can match or exceed convolutional neural networks (CNNs) on image recognition benchmarks when pre-trained on sufficiently large datasets. The architecture has since become a widely adopted foundation for computer vision research and production systems.
Architecture
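ViT processes images by dividing them into a grid of non-overlapping patches of fixed resolution. For a standard input image of 224x224 pixels with a patch size of 16x16, this produces a 14x14 grid of 196 patches. Each patch is flattened into a 1D vector (16 x 16 x 3 = 768 values for an RGB image) and linearly projected into an embedding space whose dimension matches the Transformer's hidden size. In ViT-Base that size is also 768, so the projection is not necessarily a reduction in dimensionality.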
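The patch arithmetic above can be checked with a short sketch. It uses plain nested lists rather than a tensor library, and the function names `patch_grid` and `patchify` are illustrative, not taken from any ViT implementation.

```python
def patch_grid(height, width, patch):
    """Return (rows, cols) of the non-overlapping patch grid."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return height // patch, width // patch


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patches.

    Patches are emitted in row-major order; each one is a flat list of
    patch * patch * C values.
    """
    height, width = len(image), len(image[0])
    rows, cols = patch_grid(height, width, patch)
    patches = []
    for r in range(rows):
        for c in range(cols):
            flat = []
            for y in range(r * patch, (r + 1) * patch):
                for x in range(c * patch, (c + 1) * patch):
                    flat.extend(image[y][x])  # all channels of one pixel
            patches.append(flat)
    return patches


# A dummy 224 x 224 RGB image filled with zeros.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 16)
print(len(patches), len(patches[0]))  # 196 768
```

In a real model the flattened patches would then be multiplied by a learned projection matrix. That step is omitted here because it needs trained weights.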
A special learnable classification token (analogous to BERT's [CLS] token) is prepended to the sequence of patch embeddings. To preserve spatial information, learnable positional embeddings are then added to every token in the sequence before it is fed into a standard Transformer encoder. After the encoder, the classification token's output representation is passed to a classification head to produce the final label prediction.
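The sketch below shows one way the encoder input could be assembled. It is a toy version: the class token and positional embeddings are random stand-ins for learned parameters, the embedding width of 8 is chosen only for readability, and `build_input_sequence` is a hypothetical helper name.

```python
import random


def build_input_sequence(patch_embeddings, cls_token, pos_embeddings):
    """Prepend the class token, then add one positional embedding per token."""
    tokens = [cls_token] + patch_embeddings
    if len(tokens) != len(pos_embeddings):
        raise ValueError("need one positional embedding per token, "
                         "including the class token")
    return [
        [t + p for t, p in zip(token, pos)]
        for token, pos in zip(tokens, pos_embeddings)
    ]


def random_vector(dim):
    """Stand-in for a learned parameter vector."""
    return [random.gauss(0.0, 0.02) for _ in range(dim)]


random.seed(0)
num_patches, dim = 196, 8  # dim kept tiny for illustration

patch_embeddings = [random_vector(dim) for _ in range(num_patches)]
cls_token = random_vector(dim)
pos_embeddings = [random_vector(dim) for _ in range(num_patches + 1)]

sequence = build_input_sequence(patch_embeddings, cls_token, pos_embeddings)
print(len(sequence), len(sequence[0]))  # 197 8
```

The sequence is one token longer than the number of patches because of the class token, which is why 196 patches yield 197 encoder inputs.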
The Transformer encoder consists of alternating multi-head self-attention and feed-forward (MLP) sub-layers, with layer normalisation applied before each sub-layer and a residual connection around it. This architecture allows every patch to attend to every other patch in the image from the very first layer, enabling global context modelling that differs fundamentally from CNNs, which build up receptive fields progressively through local convolutions.
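A deliberately simplified sketch of one pre-norm encoder block follows. It makes several assumptions to stay short: a single attention head, identity query/key/value projections, no learned layer-norm scale or shift, and an element-wise GELU standing in for the two-layer MLP. It illustrates the data flow only and is not a faithful ViT layer.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalise one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (assumption)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)  # one weight for every token in the sequence
        weights.append(w)
        outputs.append([
            sum(w[j] * tokens[j][d] for j in range(len(tokens)))
            for d in range(dim)
        ])
    return outputs, weights


def feed_forward(x):
    """Toy MLP stand-in: element-wise GELU (tanh approximation)."""
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3))) for v in x]


def encoder_block(tokens):
    """Pre-norm block: x = x + Attn(LN(x)), then x = x + MLP(LN(x))."""
    normed = [layer_norm(t) for t in tokens]
    attended, weights = self_attention(normed)
    tokens = [[a + b for a, b in zip(t, u)] for t, u in zip(tokens, attended)]
    normed = [layer_norm(t) for t in tokens]
    tokens = [[a + b for a, b in zip(t, feed_forward(n))]
              for t, n in zip(tokens, normed)]
    return tokens, weights


tokens = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 2.0, 5.0], [4.0, 0.0, 1.0]]
output, attention = encoder_block(tokens)
print(len(attention), len(attention[0]))  # 4 4
```

Each row of `attention` has one strictly positive weight per token, which is the sense in which every patch can attend to every other patch within a single layer.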
Training and Data Requirements
A key finding from the original ViT paper was that pure self-attention models are data-hungry. When trained on mid-scale datasets such as ImageNet-1k (about 1.28 million training images), ViT underperformed well-regularised CNNs of comparable size. However, when pre-trained on very large datasets, specifically Google's internal JFT-300M dataset of roughly 300 million images, and then fine-tuned, ViT achieved state-of-the-art results on ImageNet classification, matching or surpassing leading CNN baselines built on ResNet and EfficientNet.
This data dependency motivated subsequent work on data-efficient training. The Data-efficient Image Transformers (DeiT) approach, released by Facebook AI Research in December 2020 and published at ICML 2021, combined a stronger training recipe with knowledge distillation through a dedicated distillation token. This allows ViT-sized models to match CNN performance when trained only on ImageNet, without access to large proprietary datasets.
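As a rough illustration of distillation in this setting, the sketch below computes a hard-label distillation loss in the style described in the DeiT paper: the class head is trained against the true label and a separate distillation head against the teacher's predicted label, with equal weight. The function names and the single-example formulation are simplifications assumed here, not DeiT's actual code.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one example, given raw logits and a class index."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Equal-weight sum of true-label and teacher-label cross-entropy.

    cls_logits:     student output from the class token head
    dist_logits:    student output from the distillation token head
    teacher_logits: output of a frozen teacher (for example, a CNN)
    """
    teacher_label = max(range(len(teacher_logits)),
                        key=teacher_logits.__getitem__)
    return (0.5 * cross_entropy(cls_logits, label)
            + 0.5 * cross_entropy(dist_logits, teacher_label))


# Illustrative logits for a 3-class problem; the true label is class 2.
loss = hard_distillation_loss(
    cls_logits=[0.2, 0.1, 1.5],
    dist_logits=[0.1, 1.2, 0.3],
    teacher_logits=[0.1, 0.9, 0.2],  # the teacher predicts class 1
    label=2,
)
print(round(loss, 4))
```

Note that the teacher may disagree with the ground truth, as in this example. The distillation head is still pushed toward the teacher's choice, which is how the student inherits some of the teacher's inductive bias.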
Variants and Extensions
The ViT family has expanded substantially since the original release. The Swin Transformer (2021) introduced hierarchical feature maps and shifted-window attention, which restricts self-attention to local windows and makes Transformer backbones practical for dense prediction tasks such as object detection and semantic segmentation. The DINO method demonstrated that ViTs trained with self-supervised learning produce features, visible in their self-attention maps, that contain explicit information about the semantic segmentation of an image, without any pixel-level supervision.
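The benefit of window attention for dense prediction can be seen from the complexity estimates given in the Swin Transformer paper: global attention grows quadratically with the number of tokens, while window attention grows linearly for a fixed window size. The sketch below evaluates both estimates. The example sizes (a 56x56 token grid, 96 channels, 7x7 windows) are chosen to resemble the first stage of a small Swin model and should be treated as illustrative.

```python
def global_attention_cost(h, w, c):
    """Estimated cost of global self-attention on an h x w grid, c channels.

    4*n*c^2 covers the query, key, value and output projections;
    2*n^2*c covers the attention matrix and the weighted sum.
    """
    n = h * w
    return 4 * n * c * c + 2 * n * n * c


def window_attention_cost(h, w, c, m):
    """Estimated cost when attention is limited to m x m windows."""
    n = h * w
    return 4 * n * c * c + 2 * m * m * n * c


h = w = 56   # token grid (assumed example size)
c = 96       # channels per token (assumed example size)
m = 7        # window side length (assumed example size)

full = global_attention_cost(h, w, c)
windowed = window_attention_cost(h, w, c, m)
print(f"global / windowed cost ratio: {full / windowed:.1f}")
```

Doubling the image side length quadruples the token count, which multiplies the quadratic term of the global cost by sixteen but the windowed cost by only four.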
Larger ViT variants include ViT-G (about 1.8 billion parameters) and ViT-22B (22 billion parameters), the latter being one of the largest vision models at the time of its release by Google in 2023. Hybrid architectures that combine convolutional stems with Transformer encoders blend the inductive biases of CNNs with the global attention of Transformers.
Comparison with CNNs
| Property | CNN | Vision Transformer |
|---|---|---|
| Global context | Only at deeper layers | From first layer |
| Data requirement | Lower | Higher (mitigated by pretraining) |
| Inductive bias | Strong (translation equivariance) | Weak |
| Scalability | Good | Excellent with scale |
| Edge deployment | Well-optimised | Improving |
By 2025, ViT-family models generally outperform CNNs on large-scale benchmarks when pre-training data and compute are not constrained. On ImageNet-1k classification, the strongest ViT-based models, which rely on pre-training with much larger datasets, exceed 90 percent top-1 accuracy. CNNs retain practical advantages in data-scarce regimes and on edge deployments where convolutional operations are hardware-optimised.
Applications
ViT and its variants have been deployed in medical image analysis (radiology, pathology slide classification), satellite and aerial imagery interpretation, autonomous vehicle perception, face recognition, and industrial quality inspection. The self-attention mechanism's ability to capture long-range dependencies makes ViT particularly well-suited to tasks where distant image regions need to be related, for example identifying the relationship between a tumour and surrounding tissue in medical imaging.
References
- Dosovitskiy, A. et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. arXiv:2010.11929.
- Touvron, H. et al. (2021). Training data-efficient image transformers and distillation through attention. ICML 2021.
- Liu, Z. et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021.
- Caron, M. et al. (2021). Emerging Properties in Self-Supervised Vision Transformers (DINO). ICCV 2021.
- Hire AI Developer. (2025). How Vision Transformers (ViT) Are Redefining Computer Vision Architectures in 2025. medium.com.