Segment Anything Model (SAM)

Updated 25 January 2026
  • SAM is a promptable segmentation foundation model that unifies interactive and automatic segmentation using a modular architecture of an image encoder, a prompt encoder, and a mask decoder.
  • It delivers robust zero-shot segmentation performance across domains such as natural images, medical imaging, and remote sensing, often achieving high mIoU scores.
  • Extensive training with the SA-1B dataset and interactive prompt engineering enables high-quality mask generation, though challenges like texture bias and prompt instability remain.

The Segment Anything Model (SAM) is a foundation model for promptable image segmentation, introduced by Meta AI in 2023 and rapidly extended in multiple directions. SAM’s core architectural innovation is to unify interactive and automatic segmentation within a single, large-scale, prompt-driven framework, enabling robust zero-shot mask generation across diverse visual domains. This paradigm has catalyzed advances not only in vision-language models but also in high-throughput annotation, medical imaging, remote sensing, and efficient model design.

1. Core Architecture and Promptable Segmentation

SAM decomposes segmentation into three principal modules: image encoder, prompt encoder, and mask decoder (Kirillov et al., 2023, Sun et al., 2024).

  • Image Encoder: A Vision Transformer (ViT), initialized with masked-autoencoder (MAE) pre-training and subsequently trained on the SA-1B dataset (over 1 billion masks across 11 million images), serves as the backbone. It processes input images of arbitrary resolution (internally resized to a fixed size), producing a dense, low-resolution feature map (e.g., for ViT-H/16, a 1024×1024 input yields a 64×64×256 tensor).
  • Prompt Encoder: User-supplied segmentation prompts—points, bounding boxes, polygons, masks, or even text—are mapped to fixed-size embeddings. Points and boxes are encoded via positional encodings plus learned tokens, masks through convolutional downsampling, and text (in SAM 2+/SAM 3) via a CLIP-based encoder.
  • Mask Decoder: A lightweight transformer bridges image and prompt embeddings, using multi-head self- and cross-attention layers to fuse information. The decoder outputs one or more binary masks, each with an associated confidence score. Ambiguity is explicitly accommodated, with multiple output tokens corresponding to different plausible masks per prompt.

The canonical formulation is:

$$\hat{y}_s = \text{MaskDecoder}\big(\text{ImgEnc}(x),\ \text{PromptEnc}(P)\big), \qquad s = 1, \ldots, S$$

where $x$ is the image, $P$ the prompt, and $S$ the number of candidate masks returned per prompt (Kirillov et al., 2023).
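This decomposition is visible directly in the released segment_anything package, whose Sam model exposes the three modules as separate components. The sketch below composes them explicitly for a single point prompt; the checkpoint path is a placeholder, and image preprocessing (resizing/normalization, normally handled by the higher-level predictor) is elided.

```python
import torch
from segment_anything import sam_model_registry

# Load a SAM checkpoint (path is a placeholder) and call the three modules directly.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").eval()

image = torch.randn(1, 3, 1024, 1024)               # already-preprocessed image tensor
point_coords = torch.tensor([[[512.0, 512.0]]])     # (B, N, 2) in input-image coordinates
point_labels = torch.ones(1, 1)                     # 1 = foreground, 0 = background

with torch.no_grad():
    image_embeddings = sam.image_encoder(image)     # (1, 256, 64, 64) dense feature map
    sparse_emb, dense_emb = sam.prompt_encoder(
        points=(point_coords, point_labels), boxes=None, masks=None
    )
    low_res_masks, iou_predictions = sam.mask_decoder(
        image_embeddings=image_embeddings,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_emb,
        dense_prompt_embeddings=dense_emb,
        multimask_output=True,                      # return multiple plausible masks per prompt
    )
```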

This modular structure enables flexible “prompt engineering”: segmentation can be refined or redirected by varying the prompt type, number, and placement, unifying classical interactive approaches and fully automatic proposal generation.
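In practice, prompt engineering is usually done through the higher-level SamPredictor wrapper: the image is embedded once, and prompts of different types are then varied cheaply against the cached embedding. A minimal sketch, assuming the official segment_anything package and placeholder checkpoint/image paths and coordinates:

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)    # placeholder image
predictor.set_image(image)               # image encoder runs once; embedding is cached

# Point prompt: one foreground click, with ambiguity handled via multiple outputs.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,               # returns up to three candidate masks
)

# Box prompt: usually tighter guidance for objects with ambiguous interiors.
box_masks, box_scores, _ = predictor.predict(
    box=np.array([100, 80, 540, 400]),   # XYXY box in pixel coordinates
    multimask_output=False,
)
```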

2. Training Regime and Dataset Curation

SAM was trained with data produced by an iterative, bootstrapped data engine, which yielded the SA-1B dataset (Kirillov et al., 2023):

  • Assisted-Manual Stage: Human annotators click foreground/background points, iteratively refining masks with the assistance of early model predictions.
  • Semi-Automatic Stage: Object detectors propose candidate masks, which annotators correct or verify. Each refinement loop uses updated model weights, thus scaling the number of annotated objects.
  • Fully Automatic Stage: A dense grid of prompts covers each image, the model produces multiple candidates per prompt, and post-processing—including predicted IoU filtering, stability testing, non-maximum suppression, and removal of spurious components—yields high-quality masks. Faces and license plates are blurred for privacy.
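The released library exposes a post-processing pipeline of this kind through SamAutomaticMaskGenerator, which mirrors the fully automatic stage: grid prompting, predicted-IoU filtering, stability testing, NMS, and small-region removal. The thresholds below are the library's illustrative defaults, not necessarily the exact values used in the data engine.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,             # dense grid of point prompts over the image
    pred_iou_thresh=0.88,           # filter by the model's own IoU prediction
    stability_score_thresh=0.95,    # filter masks unstable under threshold perturbation
    box_nms_thresh=0.7,             # non-maximum suppression between candidates
    min_mask_region_area=100,       # drop spurious small components (requires OpenCV)
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)    # placeholder image
masks = mask_generator.generate(image)
# Each entry carries the mask plus its quality statistics:
# 'segmentation', 'area', 'bbox', 'predicted_iou', 'stability_score', ...
print(masks[0].keys())
```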

The result is 1.1 billion segmentation masks over 11 million privacy-preserving images, with rigorous automated and human-in-the-loop quality control (99.1% of masks generated automatically; >94% with IoU > 0.9). Training uses a compound loss involving focal and dice terms, with additional supervision for the mask confidence head.
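The paper describes the mask loss as a linear combination of focal and dice terms (reported at a 20:1 focal-to-dice ratio), with a mean-squared-error term supervising the predicted-IoU head. A generic PyTorch rendering is sketched below; the weights should be treated as illustrative rather than a faithful reproduction of the training recipe.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Sigmoid focal loss over per-pixel mask logits; logits/targets: (B, H, W).
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps=1.0):
    p = torch.sigmoid(logits).flatten(1)
    t = targets.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def compound_mask_loss(logits, targets, iou_pred, focal_weight=20.0, dice_weight=1.0):
    # Focal + dice on the mask, plus MSE on the predicted IoU against the
    # actual IoU of the thresholded prediction (confidence-head supervision).
    with torch.no_grad():
        pred_bin = (logits > 0).float()
        inter = (pred_bin * targets).flatten(1).sum(-1)
        union = (pred_bin + targets).clamp(0, 1).flatten(1).sum(-1)
        actual_iou = inter / (union + 1e-6)
    return (focal_weight * focal_loss(logits, targets)
            + dice_weight * dice_loss(logits, targets)
            + F.mse_loss(iou_pred, actual_iou))
```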

3. Zero-Shot Generalization and Prompt Engineering

Because the image encoder is pretrained on massive, diverse data, SAM exhibits robust zero-shot generalization: it can segment novel objects using only a prompt at inference, without any fine-tuning (Kirillov et al., 2023, Maquiling et al., 2023, Osco et al., 2023, Ren et al., 2023).

  • Point Prompts: One or more points indicate foreground (inside the object) or background (outside). With a single point prompt, SAM routinely outperforms SOTA interactive methods (mIoU ≈ 52.5 vs. 46.7 for RITM) and with three points (mIoU ≈ 90) approaches oracle segmentation (Kirillov et al., 2023).
  • Bounding Boxes: Typically provide better guidance, especially for objects with ambiguous interiors. Zero-shot mIoU on COCO and LVIS benchmarks is ≥85% for box-prompted masks.
  • Mask Prompts: Coarse masks allow stepwise refinement.
  • Text Prompts: Introduced via CLIP encoders in SAM 2 and SAM 3, enabling open-vocabulary and multi-modal prompting (Sun et al., 2024, Carion et al., 20 Nov 2025).
  • Automatic “Everything” Mode: A grid of prompts is placed over the image, and extracted masks are post-processed to yield an over-complete set of object proposals.

Prompt quantity and placement materially affect performance, especially in multi-species or low-contrast domains (see Section 4). SAM v1 does not learn prompt design end-to-end; prompt type, placement, and sampling logic require manual tuning by the user or the downstream pipeline.
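Because prompt count and placement matter, a common pattern is iterative refinement: predict from an initial click, then feed the best candidate's low-resolution logits back via mask_input together with additional points. A minimal sketch, reusing a SamPredictor with an image already set (coordinates are placeholders):

```python
import numpy as np

# First click: request multiple candidates and keep the highest-scoring one.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best = int(np.argmax(scores))

# Refinement: add a background click and pass the previous low-res logits back in.
refined_masks, refined_scores, _ = predictor.predict(
    point_coords=np.array([[320, 240], [500, 120]]),
    point_labels=np.array([1, 0]),           # 1 = foreground, 0 = background
    mask_input=logits[best][None, :, :],     # (1, 256, 256) logits from the previous step
    multimask_output=False,
)
```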

4. Application Domains and Empirical Performance

SAM provides a promptable foundation for segmentation in multiple domains:

Natural and Everyday Images

On common computer vision benchmarks—COCO, LVIS, ADE20K, BSDS500, Cityscapes—SAM achieves mIoU ≈ 85–90% in zero-shot mode for well-defined objects, with high boundary accuracy for high-contrast, compact objects (Kirillov et al., 2023, Sun et al., 2024, Zhang et al., 2023).

Medical Imaging

Prominent studies apply SAM to digital pathology, radiotherapy, brain/liver tumor CT/MRI, and annotation bootstrapping (Deng et al., 2023, Zhang et al., 2023, Hu et al., 2023, Peivandi et al., 2023, Zhang et al., 2023).

  • For large, well-defined structures (tumors, organs), Dice scores in zero-shot mode with an optimal prompt reach 0.70–0.90 (comparable to established models).
  • For dense instance (cell nuclei) segmentation, performance is markedly lower even with many prompts (Dice ≤0.4–0.7) (Deng et al., 2023).
  • With point prompts only in multi-phase CT, performance climbs steeply with more interactions (e.g., 1 point: Dice = 0.38–0.63; 20 points: Dice = 0.76), but remains several points below fully supervised U-Nets (Hu et al., 2023).
  • Domain-specific fine-tuning of only the mask decoder (leaving the ViT frozen) can boost zero-shot DSC from 0.54/0.38 (pretrained) to 0.88/0.67 (“improved SAM”/nnUNetv2) on brain tumor subregions (Peivandi et al., 2023); a decoder-only fine-tuning sketch follows this list.
  • In radiotherapy OAR tasks, box prompts yield Dice = 0.88–0.95 for large organs and 0.57–0.79 for smaller structures, comparable to clinical acceptability (Zhang et al., 2023).
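A minimal sketch of the decoder-only fine-tuning pattern referenced above: the image and prompt encoders stay frozen and only the mask decoder is optimized. The dataloader, checkpoint path, and loss are placeholders; images are assumed preprocessed to 1024×1024 and ground-truth masks matched to the 256×256 low-resolution output grid.

```python
import torch
import torch.nn.functional as F
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")   # placeholder path
for p in sam.image_encoder.parameters():
    p.requires_grad = False
for p in sam.prompt_encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(sam.mask_decoder.parameters(), lr=1e-4)

for image, box, gt_mask in dataloader:            # hypothetical loader, one image per step
    with torch.no_grad():
        embedding = sam.image_encoder(image)      # (1, 256, 64, 64), encoder frozen
    sparse, dense = sam.prompt_encoder(points=None, boxes=box, masks=None)  # box: (1, 4)
    low_res_masks, iou_pred = sam.mask_decoder(
        image_embeddings=embedding,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse,
        dense_prompt_embeddings=dense,
        multimask_output=False,
    )
    # BCE as a simple stand-in for the focal+dice combination; gt_mask is (1, 256, 256).
    loss = F.binary_cross_entropy_with_logits(low_res_masks.squeeze(1), gt_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```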

Remote Sensing

In overhead imagery, SAM achieves competitive mask IoU for compact instances (solar panels, buildings: IoU = 0.54–0.69), but degrades on ill-posed segments (roads: IoU ≪ 0.2) or for very small objects. Domain gap—object scale, texture, and ambiguous boundaries—is a key limitation in satellite settings (Ren et al., 2023, Osco et al., 2023).

Plant Phenotyping and Other Specialized Tasks

Zero-shot pipelines using SAM plus simple color/shape filtering (e.g., “Leaf Only SAM”) approach supervised Mask R-CNNs in recall (0.63 vs. 0.79), requiring no new training data but suffering higher false positive rates without domain adaptation (Williams et al., 2023).

5. Limitations, Architectural Biases, and Mitigation

Texture Bias

Despite training to “cover object shapes,” SAM’s ViT encoder is demonstrably biased toward textural cues when texture and shape are in conflict. Quantitative experiments show that, in ~85% of texture–shape conflicts, SAM’s predicted mask aligns with the texture region, not the shape (Zhang et al., 2023). This bias is attributed to the model’s lack of explicit boundary supervision and the self-attention structure of ViTs, which propagates local feature similarities (often textural) over large spatial extents. Proposed mitigations include boundary-aware losses, silhouette data augmentation, and multi-point/polygonal prompts.
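As one simple instantiation of the boundary-aware-loss idea (not a specific published recipe), per-pixel errors can be up-weighted near ground-truth boundaries so that outline mistakes cost more than mistakes in homogeneous texture regions:

```python
import torch
import torch.nn.functional as F

def boundary_map(mask, kernel_size=3):
    # Approximate the GT boundary as dilation(mask) - mask, using max-pooling
    # as a cheap morphological dilation. mask: (B, 1, H, W) with values in {0, 1}.
    pad = kernel_size // 2
    dilated = F.max_pool2d(mask, kernel_size, stride=1, padding=pad)
    return (dilated - mask).clamp(0.0, 1.0)

def boundary_weighted_bce(logits, targets, boundary_weight=5.0):
    # Standard per-pixel BCE, up-weighted on pixels adjacent to the GT boundary,
    # encouraging shape- rather than texture-driven mask predictions (illustrative).
    per_pixel = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    weights = 1.0 + boundary_weight * boundary_map(targets)
    return (weights * per_pixel).mean()
```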

Instability under Casual Prompts

The original SAM exhibits instability—drifting to background features or failing to fully delineate objects—when provided imprecise boxes or few points. The Stable-SAM variant incorporates a lightweight deformable sampling plugin and dynamic routing based on prompt quality, increasing both mIoU and mask stability (mSF) by up to 30–40 points vs. vanilla SAM with noisy prompts. The approach adds only 0.08 million trainable parameters (Fan et al., 2023).

Determinism and Uncertainty

SAM is fundamentally deterministic: given fixed image and prompt, it returns a single segmentation. In real-world scenarios with annotation ambiguity (medical imaging, inter-expert variability), this is limiting. The "Probabilistic SAM" extension incorporates a latent Gaussian variable injected via a CVAE framework, training the decoder on a β-ELBO objective. This yields a distribution over masks per prompt, quantitatively matching the range of contested ground truth and outperforming dropout‐based or probabilistic U-Nets on "uncertainty-aware" metrics (GED ↓, Dice ↑, IoU ↑ by 15–23%) (Ward et al., 6 Sep 2025).
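At the level of the objective, the β-ELBO combines a reconstruction term on the predicted mask with a β-weighted KL divergence between a posterior latent (conditioned on image, prompt, and ground truth) and a prior latent (conditioned on image and prompt only). The sketch below shows this generic CVAE-style formula; the actual posterior/prior networks and latent injection of Probabilistic SAM are not reproduced.

```python
import torch
import torch.nn.functional as F

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    # summed over latent dimensions and averaged over the batch.
    kl = 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    )
    return kl.sum(dim=-1).mean()

def beta_elbo_loss(mask_logits, gt_mask, posterior, prior, beta=1.0):
    # posterior / prior are (mu, logvar) pairs from hypothetical networks
    # conditioned on (image, prompt, ground truth) and (image, prompt) respectively.
    recon = F.binary_cross_entropy_with_logits(mask_logits, gt_mask)
    kl = kl_diag_gaussians(*posterior, *prior)
    return recon + beta * kl
```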

Resource Demand and Efficient Variants

SAM’s encoder dominates memory and computational cost (>98% of parameters, ≈90% of FLOPs). A growing ecosystem of efficient variants—MobileSAM, SqueezeSAM, EfficientSAM, StrongSAM, SlimSAM, Q-TinySAM—replace the ViT backbone with lighter networks, employ knowledge distillation, quantization, and pruning, or optimize mask decoders. Efficiency gains range from 5× to 100× reduction in runtime and memory with <3% accuracy drop (Sun et al., 2024). Video extensions (SAM 2) introduce memory banks and multi-scale feature pyramids to support near real-time mask propagation and multi-object tracking (Geetha et al., 2024).
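Several of these efficient variants distill the heavy image encoder into a lighter backbone by matching embeddings (MobileSAM describes a decoupled distillation of this kind). The sketch below shows the generic embedding-matching objective; TinyStudentEncoder and the dataloader are hypothetical stand-ins, and the checkpoint path is a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from segment_anything import sam_model_registry

class TinyStudentEncoder(nn.Module):
    # Toy student standing in for a real lightweight backbone: it only needs to
    # map (B, 3, 1024, 1024) images to (B, 256, 64, 64) SAM-compatible embeddings.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, 128, 2, stride=2), nn.GELU(),
            nn.Conv2d(128, 256, 2, stride=2),
        )

    def forward(self, x):
        return self.net(x)

teacher = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").image_encoder.eval()
student = TinyStudentEncoder()
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)

for images in dataloader:                    # hypothetical loader of preprocessed 1024x1024 crops
    with torch.no_grad():
        t_emb = teacher(images)              # (B, 256, 64, 64) frozen teacher embeddings
    loss = F.mse_loss(student(images), t_emb)
    opt.zero_grad()
    loss.backward()
    opt.step()
```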

6. Recent Extensions: Open Vocabulary and Concept Segmentation

SAM 2 (Geetha et al., 2024) and SAM 3 (Carion et al., 20 Nov 2025) extend promptable segmentation to video, arbitrary concept prompts, and joint detection-segmentation-tracking:

  • Text and Exemplar Prompts: SAM 3 supports promptable Concept Segmentation (PCS) via noun phrase and image exemplar prompts; a fusion transformer decouples recognition (“what”) via text/exemplar encoding from localization (“where”) via decoder queries and a memory-based video tracker.
  • Presence Head: A dedicated presence token separates detection of the category (“is the concept present?”) from mask localization (“where is each instance?”), boosting recognition accuracy; the decoupling is illustrated after this list.
  • Data Engine: 52M high-quality masks and 4M unique labels were assembled using an industrial-scale AI/human-in-the-loop engine, enabling exhaustive open-vocabulary benchmarks (SA-Co).
  • Performance: Roughly doubles promptable concept-segmentation accuracy relative to prior open-vocabulary detectors such as OWLv2 (+17 AP on LVIS), with strong video tracking and counting (MAE = 0.12) (Carion et al., 20 Nov 2025).
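The presence-head decoupling can be pictured as a single image- or prompt-level presence probability gating per-query instance scores. This is only an illustrative sketch of the idea, not the SAM 3 implementation:

```python
import torch

def decoupled_concept_scores(query_logits, presence_logit):
    # Recognition ("is this concept in the image?") is scored once per
    # (image, concept prompt); localization is scored per decoder query.
    presence = torch.sigmoid(presence_logit)     # scalar presence probability
    per_query = torch.sigmoid(query_logits)      # (num_queries,) localization scores
    return presence * per_query                  # final per-instance confidences
```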

7. Outlook and Future Directions

SAM establishes a new paradigm for vision foundation models: promptable, scalable, zero-shot segmenters with deep compositional flexibility. Key open directions include:

  • Domain adaptation via lightweight adapters (LoRA, visual prompt layers), especially in domains with annotation scarcity or extreme appearance shift; a generic adapter sketch follows this list.
  • Structured uncertainty estimation and probabilistic modeling at the prompt fusion stage, to better capture real-world ambiguity (Ward et al., 6 Sep 2025).
  • High-spatial-granularity improvements: integrating boundary-sensitive losses, higher-resolution decoders, or recursive multi-scale refinement for fine structures.
  • Hardware-aware model compression: dynamic pruning, quantization, and hybrid transformer-conv architectures, targeting real-time and edge deployment (Sun et al., 2024).
  • Multi-modal, multi-scale, and interactive pipelines: integrating text, language, and geometric cues to drive segmentation, supporting new applications in science, medicine, and geospatial analytics (Carion et al., 20 Nov 2025, Geetha et al., 2024).
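As a concrete example of the adapter direction, a generic low-rank (LoRA-style) wrapper around a frozen linear layer is sketched below; it is not tied to any specific published SAM adaptation, but adapter-based fine-tuning typically wraps the attention projections of the frozen image encoder so that only the low-rank matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # y = W x + scale * B(A(x)), with W frozen and only A, B trainable.
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # starts as an exact no-op around the frozen layer
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrap a frozen 768-dim projection, as one might do inside a ViT attention block.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
```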

Ongoing research is focused on narrowing the domain-generalization gap, mitigating texture biases, and further lowering computational barriers, toward achieving universal, uncertainty-aware, and real-time segmentation engines across the visual world.
