Segment Anything Model

The Segment Anything Model (SAM) is a foundation model from Meta AI for promptable image and video segmentation, able to isolate any object from a click, box, or mask with strong zero-shot generalisation.

Last updated June 2026

The Segment Anything Model, abbreviated SAM, is a foundation model for image segmentation developed by Meta AI's Fundamental AI Research lab and released in April 2023. Segmentation is the computer-vision task of identifying which pixels in an image belong to a given object. SAM reframed this task around promptability: instead of being trained to recognise a fixed set of object categories, the model accepts a prompt, such as a point click, a bounding box, or a rough mask, and returns a precise segmentation mask for the indicated object.

The model's most significant property is strong zero-shot generalisation. Because it was trained on a very large and diverse dataset of images and masks (SA-1B, described below), SAM can segment objects it was never explicitly trained to recognise, including objects in domains far from its training distribution. This makes it useful as a general-purpose building block that other systems can call, rather than as a model tied to one task or one fixed set of labels.

Architecture and training

SAM uses a transformer-based design with three parts: a heavyweight image encoder that processes the image once into an embedding, a lightweight prompt encoder that interprets the user's clicks or boxes, and a fast mask decoder that combines the two to produce masks in real time. Because the expensive image encoding is computed only once, the model can respond to many prompts on the same image interactively.
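The encode-once, prompt-many pattern can be sketched in a few lines of Python. This is a toy illustration, not SAM itself: the class `ToySegmenter`, its methods `set_image` and `predict`, and the `tolerance` parameter are hypothetical names chosen for this sketch. The "encoder" merely copies the image, and the "decoder" is a simple region-growing flood fill from the clicked pixel rather than a transformer. The only point is the control flow: the expensive step runs once per image, and each prompt reuses the cached result.

```python
from collections import deque
from typing import List, Optional, Tuple

Image = List[List[int]]   # grayscale image as rows of pixel intensities
Mask = List[List[bool]]   # True where a pixel belongs to the selected object


class ToySegmenter:
    """Hypothetical stand-in for an encode-once / decode-per-prompt segmenter."""

    def __init__(self) -> None:
        self.encoder_calls = 0
        self._embedding: Optional[Image] = None

    def set_image(self, image: Image) -> None:
        # Stand-in for the heavyweight image encoder: run once per image.
        # A real encoder would return learned features, not a copy.
        self.encoder_calls += 1
        self._embedding = [row[:] for row in image]

    def predict(self, point: Tuple[int, int], tolerance: int = 10) -> Mask:
        # Stand-in for the prompt encoder plus mask decoder: cheap, run per prompt.
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        emb = self._embedding
        height, width = len(emb), len(emb[0])
        row0, col0 = point
        seed = emb[row0][col0]
        mask = [[False] * width for _ in range(height)]
        mask[row0][col0] = True
        queue = deque([(row0, col0)])
        # Grow the region over 4-connected neighbours of similar intensity.
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < height and 0 <= nc < width
                        and not mask[nr][nc]
                        and abs(emb[nr][nc] - seed) <= tolerance):
                    mask[nr][nc] = True
                    queue.append((nr, nc))
        return mask


def count_pixels(mask: Mask) -> int:
    return sum(cell for row in mask for cell in row)


image = [
    [0,   0,   0, 0,  0],
    [0, 200, 200, 0,  0],
    [0, 200, 200, 0, 90],
    [0,   0,   0, 0, 90],
]

segmenter = ToySegmenter()
segmenter.set_image(image)                      # expensive step, done once
square_mask = segmenter.predict((1, 1))         # click on the bright square
strip_mask = segmenter.predict((3, 4))          # click on the grey strip
background_mask = segmenter.predict((0, 0))     # click on the background

print(segmenter.encoder_calls)        # 1
print(count_pixels(square_mask))      # 4
print(count_pixels(strip_mask))       # 2
print(count_pixels(background_mask))  # 14
```

Three different prompts produce three different masks while the encoder counter stays at one, which is the property that makes interactive use practical.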

To train the original model, Meta built a data engine in which the model and human annotators worked together in a loop: the model proposed masks and annotators refined them. This process progressively produced the SA-1B dataset of over one billion masks across eleven million images.

SAM 2 and video

In 2024 Meta released SAM 2, which it describes as the first unified model for segmenting objects in both images and video. SAM 2 introduced a streaming memory mechanism that lets it track an object through the frames of a video in real time, even as the object moves, is occluded, or reappears. On image segmentation it is reported to be more accurate than the original SAM and roughly six times faster; on video it is reported to achieve better accuracy while needing about one-third as many user interactions as earlier approaches. It was trained on the SA-V dataset, described at release as the largest video-segmentation dataset to date, comprising roughly 51,000 real-world videos and more than 600,000 spatio-temporal masks. An updated SAM 2.1 followed later in 2024, and the models were made available through repositories and cloud platforms including Amazon SageMaker.
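The general idea of carrying a bounded memory across frames can be illustrated with a deliberately small Python sketch. It is not SAM 2's mechanism: the function `track`, the parameters `memory_size` and `tolerance`, and the use of 1-D "frames" with a mean-intensity appearance model are all assumptions made for this example. It shows only the streaming shape of the problem: frames are processed one at a time, each prediction is conditioned on the prompted frame plus a first-in-first-out memory of recent frames, and the memory survives an occlusion so the object can be picked up again when it reappears.

```python
from collections import deque
from typing import Deque, List, Optional

Frame = List[int]  # a 1-D "frame" of pixel intensities, to keep the sketch tiny


def track(frames: List[Frame], prompt_index: int,
          memory_size: int = 3, tolerance: int = 15) -> List[Optional[List[int]]]:
    """Return, per frame, the pixel indices assigned to the object (None if not visible)."""
    # The prompted pixel in the first frame defines the object's appearance.
    prompted = float(frames[0][prompt_index])
    # Bounded FIFO memory of the object's appearance in recent frames.
    recent: Deque[float] = deque(maxlen=memory_size)
    results: List[Optional[List[int]]] = []

    for frame in frames:  # frames are consumed one at a time, in order
        memories = [prompted, *recent]
        reference = sum(memories) / len(memories)
        pixels = [i for i, value in enumerate(frame)
                  if abs(value - reference) <= tolerance]
        if pixels:
            results.append(pixels)
            # Remember how the object looked in this frame.
            recent.append(sum(frame[i] for i in pixels) / len(pixels))
        else:
            # Object not visible: report that, but keep the memory untouched.
            results.append(None)
    return results


frames = [
    [0, 100, 100,   0,   0,   0],  # object at positions 1-2 (prompted here)
    [0,   0, 104, 104,   0,   0],  # object moves right and brightens slightly
    [0,   0,   0,   0,   0,   0],  # object fully occluded
    [0,   0,   0,   0, 108, 108],  # object reappears further right
]

tracks = track(frames, prompt_index=1)
print(tracks)  # [[1, 2], [2, 3], None, [4, 5]]
```

A real video segmenter replaces the mean-intensity comparison with learned features and attention over stored memories, but the bounded, frame-by-frame bookkeeping follows the same outline.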

Applications

SAM and SAM 2 are widely used as components in larger pipelines. They accelerate data labelling for training other vision models, support medical-image analysis, enable object removal and editing in photo and video tools, assist in satellite and aerial image analysis, and provide segmentation for robotics and augmented reality. Because the weights are openly released, developers can integrate the models directly rather than relying on a closed service.

| Version | Year | Capability |
|---------|------|------------|
| SAM | 2023 | Promptable image segmentation |
| SAM 2 | 2024 | Unified image and video segmentation, real-time tracking |
| SAM 2.1 | 2024 | Improved accuracy and access |

References

  1. Kirillov, A., et al. (2023). Segment Anything. Meta AI / ICCV.
  2. Ravi, N., et al. (2024). SAM 2: Segment Anything in Images and Videos. Meta AI.
  3. Meta AI. (2024). Introducing Meta Segment Anything Model 2 (SAM 2). ai.meta.com/research/sam2.
  4. Meta AI. (2024). Expanding Access to Meta Segment Anything 2.1 on Amazon SageMaker JumpStart. ai.meta.com/blog.