
Image Segmentation

A computer vision task that partitions an image into meaningful regions by assigning a class label to every pixel, enabling pixel-level understanding of visual scenes.

Last updated May 2026

Image segmentation is a computer vision task in which an image is divided into regions and a label is assigned to every pixel rather than to the image as a whole. Unlike image classification, which produces a single class for the entire image, and object detection, which produces bounding boxes, segmentation yields a dense, pixel-level map of the scene. This finer resolution makes segmentation the foundation for applications that need precise knowledge of object shape, boundary, or area, such as medical diagnosis, autonomous driving, agricultural monitoring, and augmented reality.

Variants

Segmentation tasks are usually grouped into four main variants.

Semantic segmentation assigns each pixel to a category from a fixed set of classes. All instances of the same class are treated identically — every car pixel receives the "car" label without distinguishing one vehicle from another.

Instance segmentation identifies and separates individual objects of the same class. Each car receives its own mask. Mask R-CNN, introduced by He and colleagues in 2017, is the canonical architecture and is still widely deployed.

Panoptic segmentation, proposed by Kirillov and colleagues in 2019, unifies the two by labelling every pixel with both a semantic class and an instance ID. Stuff classes (sky, road, grass) are handled semantically while thing classes (people, vehicles) are handled per instance.
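
The difference between the first three variants is easiest to see in how the labels are stored. The sketch below uses a made-up 4x6 scene with one stuff class (road) and one thing class (car). The class ids, the `to_panoptic` helper, and the `class_id * 1000 + instance_id` encoding are assumptions chosen for illustration, not a fixed standard.

```python
import numpy as np

# Hypothetical 4x6 scene: class ids 0 = road (stuff), 1 = car (thing).
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance ids: 0 means "no instance"; the two cars get ids 1 and 2.
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 2, 2],
    [1, 1, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

THING_CLASSES = {1}
LABEL_DIVISOR = 1000  # assumed encoding: class_id * 1000 + instance_id


def to_panoptic(semantic, instance, thing_classes, divisor=LABEL_DIVISOR):
    """Combine a semantic map and an instance map into one panoptic id map."""
    is_thing = np.isin(semantic, list(thing_classes))
    # Stuff pixels keep instance id 0, so all road pixels share one segment.
    return semantic * divisor + np.where(is_thing, instance, 0)


# Semantic view: a single mask covers both cars.
semantic_car_mask = semantic == 1
# Instance view: one mask per car.
instance_masks = [instance == i for i in (1, 2)]
# Panoptic view: every pixel carries both a class and an instance id.
panoptic = to_panoptic(semantic, instance, THING_CLASSES)
segment_ids = np.unique(panoptic)
```

In this toy scene the panoptic map contains three segments: one for the road and one for each car, while the semantic mask alone cannot tell the two cars apart.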

Interactive and promptable segmentation allows a user — or another model — to specify what to segment through clicks, boxes, or text prompts. The Segment Anything Model (SAM) and its successor SAM 2 popularised this style, producing masks for arbitrary objects with little or no per-domain training.

Key architectures

Early deep-learning segmentation relied on fully convolutional networks (FCNs), which replaced the fully connected classification layers with convolutional layers so that the network outputs a per-pixel prediction map instead of a single label. U-Net, introduced in 2015 for biomedical imaging, popularised the encoder-decoder structure with skip connections, which combines coarse semantic features with fine spatial detail. It remains a workhorse architecture in medical imaging and satellite analysis.
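
The data flow of an encoder-decoder with skip connections can be sketched without any learned weights. The example below is a shape-level illustration only: a real U-Net applies learned convolutions at every stage, whereas here the hypothetical helpers `downsample` and `upsample` use plain average pooling and nearest-neighbour upsampling so that the concatenation step is easy to follow.

```python
import numpy as np


def downsample(x):
    """2x2 average pooling on a (channels, height, width) array (even H, W)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def upsample(x):
    """Nearest-neighbour 2x upsampling on a (channels, height, width) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def toy_encoder_decoder(image):
    # Encoder: keep each resolution's features, then halve the resolution.
    enc_full = image                   # (C, H, W): fine spatial detail
    enc_half = downsample(enc_full)    # (C, H/2, W/2)
    bottleneck = downsample(enc_half)  # (C, H/4, W/4): coarse context

    # Decoder: upsample, then concatenate the matching encoder features
    # along the channel axis. These concatenations are the skip connections.
    dec_half = np.concatenate([upsample(bottleneck), enc_half], axis=0)
    dec_full = np.concatenate([upsample(dec_half), enc_full], axis=0)
    return dec_full


image = np.random.default_rng(0).random((3, 8, 8))
features = toy_encoder_decoder(image)
```

The output has the same height and width as the input, and its last channels are the untouched full-resolution input, which is how fine detail reaches the final layers.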

DeepLab, developed at Google, introduced atrous (dilated) convolutions and atrous spatial pyramid pooling (ASPP) to capture context at multiple scales without reducing feature-map resolution. DeepLabV3+ remains a strong baseline for semantic segmentation in 2025.
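
A one-dimensional sketch shows what dilation changes. The function `atrous_conv1d` below is a hypothetical helper written for this illustration, not DeepLab code. It applies the same three weights either to adjacent samples (rate 1) or to every second sample (rate 2), so the receptive field grows without adding parameters or downsampling the signal.

```python
import numpy as np


def atrous_conv1d(signal, kernel, rate=1):
    """Valid 1-D cross-correlation with dilation `rate` between kernel taps."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    span = (len(kernel) - 1) * rate + 1  # receptive field of one output sample
    n_out = len(signal) - span + 1
    out = np.zeros(n_out)
    for i in range(n_out):
        # Sample the input every `rate` steps instead of contiguously.
        out[i] = np.dot(signal[i : i + span : rate], kernel)
    return out


signal = np.arange(10)
kernel = [1, 1, 1]
dense = atrous_conv1d(signal, kernel, rate=1)    # each output sees 3 adjacent inputs
dilated = atrous_conv1d(signal, kernel, rate=2)  # same 3 weights span 5 inputs
```

Running several such convolutions with different rates in parallel and merging their outputs is the idea behind pooling context at multiple scales.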

Mask R-CNN extends the Faster R-CNN object detector with a parallel mask head that predicts a binary mask for each detected object, producing accurate instance segmentations. HRNet maintains high-resolution feature maps throughout the network and excels at fine boundary delineation.

Transformer-based methods including SegFormer, Mask2Former, and the SAM family have become dominant in recent years, often outperforming CNN baselines on standard benchmarks such as ADE20K, Cityscapes, and COCO. OMG-Seg, released in 2024, handles over ten different segmentation tasks in a single unified model.

Metrics

Segmentation quality is most often measured by intersection over union (IoU), defined as the ratio of the overlap between prediction and ground truth to the area of their union. Mean IoU (mIoU) averages this across classes. The Dice coefficient is preferred in medical imaging where class imbalance is severe. Panoptic quality (PQ), introduced with panoptic segmentation, combines a recognition term and a segmentation term to measure both detection and mask quality.
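
The metrics above can be written in a few lines of NumPy. This is a minimal sketch rather than a reference implementation. The function names and the tiny example arrays are made up for illustration. The panoptic quality helper assumes that predicted and ground-truth segments have already been matched (a pair counts as matched when its IoU exceeds 0.5), so it takes the matched IoUs and the counts of unmatched segments as inputs.

```python
import numpy as np


def iou(pred, target):
    """Intersection over union of two boolean masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return float("nan")  # class absent from both masks: undefined
    return float(np.logical_and(pred, target).sum() / union)


def dice(pred, target):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        return float("nan")
    return float(2 * np.logical_and(pred, target).sum() / total)


def mean_iou(pred_labels, target_labels, num_classes):
    """Average per-class IoU, skipping classes absent from both maps."""
    scores = [iou(pred_labels == c, target_labels == c) for c in range(num_classes)]
    return float(np.nanmean(scores))


def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = sum of matched IoUs / (TP + 0.5 * FP + 0.5 * FN)."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return float(sum(matched_ious) / denom) if denom > 0 else 0.0


# Tiny made-up label maps with two classes (0 and 1).
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 1, 1, 1],
                 [0, 0, 0, 1]])

iou_class1 = iou(pred == 1, target == 1)    # 3 shared pixels / 5 in the union
dice_class1 = dice(pred == 1, target == 1)  # 2 * 3 / (4 + 4)
miou = mean_iou(pred, target, num_classes=2)
pq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
```

For the same pair of masks, Dice is never lower than IoU, which is worth remembering when comparing numbers reported with different metrics.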

Applications

Medical imaging is one of the largest application areas. Segmentation of tumours in MRI and CT scans, of cell nuclei in histopathology slides, and of retinal structures in fundus photographs has become routine in research settings and increasingly common in clinical workflows.

In autonomous driving, semantic and instance segmentation of road, vehicle, pedestrian, and lane-marking pixels feeds into planning and control systems. Satellite and aerial segmentation supports land-use classification, deforestation monitoring, and disaster response. Augmented reality applications use segmentation to separate users from backgrounds for video calls and to insert virtual objects behind real ones.

Challenges

Segmentation requires dense, pixel-level annotations that are expensive and slow to produce. Class imbalance, where target structures occupy a small fraction of pixels, complicates training and evaluation. Domain shift between training and deployment imagery remains a persistent issue, particularly for agricultural, medical, and satellite applications in tropical regions, because most public datasets reflect temperate or Western contexts. The rise of promptable foundation models such as SAM 2 has reduced annotation cost to some extent but raised new questions around prompt design and downstream evaluation.
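
A small numerical example shows why class imbalance complicates evaluation. The scan size and the 1% foreground fraction below are invented for illustration. A degenerate model that labels every pixel as background still scores very high pixel accuracy, while its IoU on the target class is zero.

```python
import numpy as np

# Hypothetical 100x100 scan in which the target structure covers 1% of pixels.
target = np.zeros((100, 100), dtype=bool)
target[45:55, 45:55] = True  # 100 foreground pixels out of 10,000

# A degenerate model that predicts "background" everywhere.
pred = np.zeros_like(target)

# Fraction of pixels labelled correctly, dominated by the background.
pixel_accuracy = (pred == target).mean()

# IoU of the foreground class only.
intersection = np.logical_and(pred, target).sum()
union = np.logical_or(pred, target).sum()
foreground_iou = intersection / union
```

Here pixel accuracy is 0.99 and foreground IoU is 0.0, which is why overlap-based metrics are usually reported per class rather than relying on accuracy alone.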

References

  1. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI.
  2. He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. ICCV.
  3. Kirillov, A. et al. (2023). Segment Anything. ICCV.
  4. Ravi, N. et al. (2024). SAM 2: Segment Anything in Images and Videos. Meta AI Research.
  5. Minaee, S. et al. (2022). Image Segmentation Using Deep Learning: A Survey. IEEE TPAMI.