
Segment Anything 2 (SAM2): Promptable Segmentation

Updated 25 November 2025
  • Segment Anything 2 is a foundational promptable segmentation model that integrates a hierarchical Vision Transformer with a streaming memory subsystem for real-time, multi-frame processing.
  • The architecture features a flexible prompt encoder and a unified mask decoder that combine spatial, temporal, and multi-modal inputs to produce accurate segmentation masks.
  • SAM2 achieves state-of-the-art performance across diverse domains, offering efficient zero-shot and interactive segmentation along with parameter-efficient domain adaptation.

Segment Anything Model 2 (SAM2) defines a new paradigm in foundational promptable visual segmentation across both images and videos. Leveraging a hierarchical transformer backbone and an explicit temporal memory mechanism, SAM2 operates in real time and demonstrates state-of-the-art performance for zero-shot and interactive segmentation tasks, generalizing efficiently across diverse domains and input modalities.

1. Architectural Overview

SAM2 is composed of three principal modules: a hierarchical Vision Transformer (Hiera) image encoder, a prompt encoder that flexibly fuses spatial (points, boxes, masks) or textual inputs, and a unified mask decoder. The pivotal advancement over its predecessor is the inclusion of a streaming memory subsystem, which enables temporally consistent multi-frame and multi-modal reasoning. The design is summarized as follows:

  • Image Encoder (Hiera): Processes raw images or video frames into multi-scale features $F_e^i$ via patch merging and multi-resolution attention. High-resolution streams ($F_{high1}$, $F_{high2}$) preserve fine detail.
  • Prompt Encoder: Converts user input (sparse points, boxes, masks) into dense prompt embeddings, integrated with improved spatial localization and boundary adherence.
  • Mask Decoder: Employs a lightweight transformer stack that fuses image and prompt features via cross-attention to produce segmentation masks and associated object-presence scores.
  • Memory Encoder/Bank & Memory Attention: For videos (or pseudo-sequential data), previous frame embeddings and associated positional encodings form a memory bank. Memory attention fuses context across time by augmenting current frame features with stored memory:

$$F_c^i = \mathrm{Att}_m\big(F_e^i,\; [V_{fea}^{1..i-1}],\; [V_{pos}^{1..i-1}]\big)$$

This architecture serves classical video segmentation and extends to multi-modal data by interpreting different sensor modalities as virtual frames (Liao et al., 9 Mar 2025, Ravi et al., 1 Aug 2024).
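
A minimal PyTorch sketch of this streaming-memory pattern is given below. The feature dimension, bank size, and the use of a single standard multi-head cross-attention layer are illustrative assumptions; SAM2's released memory-attention stack is more elaborate.

```python
import torch
import torch.nn as nn
from collections import deque


class StreamingMemoryAttention(nn.Module):
    """Illustrative sketch: fuse current-frame features F_e^i with a FIFO
    bank of past-frame memories via cross-attention (cf. F_c^i = Att_m(...))."""

    def __init__(self, dim: int = 256, num_heads: int = 8, bank_size: int = 7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.bank = deque(maxlen=bank_size)  # FIFO memory bank of past frames

    def forward(self, frame_feats: torch.Tensor, frame_pos: torch.Tensor) -> torch.Tensor:
        # frame_feats, frame_pos: (B, N_tokens, dim) for the current frame i
        if len(self.bank) == 0:
            fused = frame_feats  # first frame: no memory to attend to yet
        else:
            mem_feats = torch.cat([m for m, _ in self.bank], dim=1)  # [V_fea^{1..i-1}]
            mem_pos = torch.cat([p for _, p in self.bank], dim=1)    # [V_pos^{1..i-1}]
            attn_out, _ = self.cross_attn(
                query=frame_feats + frame_pos,
                key=mem_feats + mem_pos,
                value=mem_feats,
            )
            fused = frame_feats + attn_out  # memory-conditioned features F_c^i
        # Store a detached memory of the current frame for later frames.
        self.bank.append((frame_feats.detach(), frame_pos.detach()))
        return fused


mem_attn = StreamingMemoryAttention()
for _ in range(3):  # three "frames" (or virtual frames, e.g. modalities)
    feats, pos = torch.randn(1, 4096, 256), torch.randn(1, 4096, 256)
    fused = mem_attn(feats, pos)
print(fused.shape)  # torch.Size([1, 4096, 256])
```

In SAM2 itself the stored entries come from the dedicated memory encoder described above; the sketch keeps only the FIFO-plus-cross-attention skeleton.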

2. Data Engine and Foundation Dataset

SAM2’s capabilities are underpinned by the SA-V video dataset, the largest open video segmentation dataset, constructed with an interactive three-phase data engine and human-in-the-loop curation. Key points include:

  • Interactive Data Engine: Rapid iterative annotation leveraging early SAM/SAM2 checkpoints allows human corrections to propagate across frames, drastically improving annotation throughput (from 37.8 s/frame in Phase 1 to 4.5 s/frame in Phase 3).
  • Dataset Scale and Distribution: SA-V comprises over 50,900 videos and 35.5 million masks, with broad geographic, scene, and object coverage, substantially exceeding previous datasets.
  • Augmentations and Preprocessing: Each frame is subject to resolution normalization (1024 × 1024), aggressive geometric, photometric, and cropping augmentations, and dense prompt grid generation for masklet bootstrapping (a grid-generation sketch follows below) (Ravi et al., 1 Aug 2024).

This extensive data backbone directly translates to strong zero-shot and data-efficient adaptation properties in new domains (Ravi et al., 1 Aug 2024, Geetha et al., 12 Aug 2024).
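
As an illustration of the dense prompt-grid bootstrapping mentioned in the preprocessing step, the sketch below lays out a uniform grid of point prompts over a 1024 × 1024 frame; the 32-points-per-side spacing and the function name are illustrative assumptions.

```python
import numpy as np


def dense_point_grid(points_per_side: int = 32, image_size: int = 1024) -> np.ndarray:
    """Return an (N, 2) array of (x, y) point prompts on a uniform grid over an
    image_size x image_size frame, for automatic masklet bootstrapping (sketch)."""
    step = image_size / points_per_side
    # Offset by half a cell so points sit at cell centers, away from the border.
    coords = (np.arange(points_per_side) + 0.5) * step
    xs, ys = np.meshgrid(coords, coords)
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)


grid = dense_point_grid()
print(grid.shape)  # (1024, 2): a 32 x 32 lattice of candidate point prompts per frame
```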

3. Memory Mechanisms and Extensions

The temporal memory subsystem is central to SAM2’s unique performance in multi-frame segmentation and generalized input handling:

  • Streaming Memory: A FIFO bank contains compressed representations from previous frames; these tokens provide temporal context during cross-attention, maintaining identity and mask coherence through occlusion, distractors, and abrupt scene changes.
  • Multi-modal Extension: In multi-modal semantic segmentation, the mechanism is reinterpreted by treating modalities (e.g., RGB, depth, LiDAR) as sequential “frames,” enabling fusion via memory attention—this approach outperforms dedicated multi-modal competitors, e.g., MemorySAM achieves 65.38% mIoU on DELIVER with four modalities (Liao et al., 9 Mar 2025).
  • Semantic Prototype Memory Module (SPMM): Transitioning from instance to semantic segmentation, SPMM maintains momentum-updated class prototypes across the dataset, with a prototypical adaptation loss (see the sketch after this list):

$$\mathcal{L}_{proto} = \mathrm{MSE}\big(\mathrm{vec}(P_{glob}),\; \mathrm{vec}(P_{cur})\big) \cdot \frac{H}{4} \cdot \frac{W}{4}$$

  • Adapter and LoRA-based Parameter Efficiency: Targeted fine-tuning via LoRA or depthwise-dilated adapters (DD-Adapter) in the memory or encoder modules allows parameter-efficient specialization for new domains or few-shot transfer, introducing only 0.9% additional trainable weights while achieving strong results in few-shot segmentation (FSS) and biomedical segmentation (Forni et al., 15 Sep 2025, Xu et al., 19 Jul 2025).
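
A minimal sketch of an SPMM-style prototype memory and the adaptation loss above is shown below; the momentum coefficient, the masked average pooling, and the assumption that features arrive at 1/4 resolution are illustrative choices rather than the MemorySAM implementation.

```python
import torch
import torch.nn.functional as F


class SemanticPrototypeMemory:
    """Illustrative sketch: momentum-updated global class prototypes P_glob
    compared against per-batch prototypes P_cur via an MSE-based loss."""

    def __init__(self, num_classes: int, dim: int, momentum: float = 0.9):
        self.momentum = momentum
        self.p_glob = torch.zeros(num_classes, dim)  # global prototypes

    def prototypes_from_batch(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H/4, W/4) decoder features; labels: (B, H/4, W/4) class ids
        num_classes, dim = self.p_glob.shape
        p_cur = torch.zeros(num_classes, dim, device=feats.device)
        for c in range(num_classes):
            mask = (labels == c).unsqueeze(1).float()  # (B, 1, H/4, W/4)
            if mask.sum() > 0:
                p_cur[c] = (feats * mask).sum(dim=(0, 2, 3)) / mask.sum()
        return p_cur

    def update_and_loss(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        p_cur = self.prototypes_from_batch(feats, labels)
        p_glob = self.p_glob.to(feats.device)
        h, w = feats.shape[-2:]  # features are already at H/4 x W/4 in this sketch
        # L_proto = MSE(vec(P_glob), vec(P_cur)) * (H/4) * (W/4)
        loss = F.mse_loss(p_glob, p_cur) * h * w
        # Momentum update of the global prototypes (no gradient through the memory).
        self.p_glob = self.momentum * p_glob + (1 - self.momentum) * p_cur.detach()
        return loss
```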

4. Performance Benchmarks and Behavioral Analysis

SAM2 delivers substantial gains over its predecessor and other state-of-the-art methods across multiple tasks, with fine-grained analysis in real-world and synthetic settings:

| Scenario | Metric / Dataset | SAM2 Result | Notable Comparison / Finding |
| --- | --- | --- | --- |
| Multi-modal semantic segmentation (DELIVER) | mIoU | 65.38% | +6.2 pt over CMNeXt, +1.3 pt over MLE-SAM (Liao et al., 9 Mar 2025) |
| Underwater instance segmentation | mAP (UIIS, BBox prompt) | 70.6 | +4.8 AP and 4× speedup vs. SAM ViT-Huge (Lian et al., 6 Aug 2024) |
| Class-agnostic instance segmentation (CIS) | AP (COD10K, BBox prompt) | 68.8 | Outperforms the prior best by a large margin (Pei et al., 4 Sep 2024) |
| High-resolution segmentation (HRSOD-TE) | $F_{max}$ | 0.980 (MGD-SAM2) | SAM2 baseline 0.963; 1–3% gain with multi-view adapters (Shen et al., 31 Mar 2025) |
| Biomedical (endoscopy) | DSC | 0.5382 (SAM2, 5-click) | U-Net: 0.6264; BioSAM2: 0.6251 (Yan et al., 6 Aug 2024) |

Qualitative inspection reveals:

  • Prompt Dependency: Performance in both natural and specialized domains (e.g., marine, biomedical) is highly sensitive to prompt quality. Ground-truth bounding box prompts yield state-of-the-art instance segmentation, whereas automatic uniform point strategies often underperform due to ambiguity and over-segmentation (see the prompting sketch after this list) (Lian et al., 6 Aug 2024, Pei et al., 4 Sep 2024).
  • Robustness: Analysis under complex video transformations demonstrates that each cross-attention stage (memory, prompt) incrementally suppresses noise and distractors, maintaining mask IoU within 1–2% of unperturbed frames even with occlusion or object confounders (Bromley et al., 25 Feb 2025).
  • Ablation Outcomes: Incorporating memory mechanisms and semantic prototypes into multi-modal pipelines consistently increases mIoU by 3–5% over simpler LoRA-tuned baselines. Fine-level adapters (MGD-SAM2, DD-SAM2) can further enhance detail without full retraining (Liao et al., 9 Mar 2025, Xu et al., 19 Jul 2025, Shen et al., 31 Mar 2025).
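
To make the prompt dependency concrete, the sketch below contrasts a bounding-box prompt with automatic uniform point prompts, assuming the image-predictor interface of the public facebookresearch/sam2 repository; the checkpoint and config names, the placeholder image, and the point layout are illustrative and may need to be adapted to the installed release.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Illustrative config/checkpoint names; substitute whichever SAM2 weights you use.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt"))

image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # placeholder HxWx3 RGB frame
with torch.inference_mode():
    predictor.set_image(image)

    # (a) High-quality spatial cue: a (ground-truth-like) bounding-box prompt.
    box = np.array([200, 150, 600, 700])  # x0, y0, x1, y1
    masks_box, scores_box, _ = predictor.predict(box=box, multimask_output=False)

    # (b) Automatic uniform point prompts: ambiguous, prone to over-segmentation.
    pts = np.array([[256, 256], [512, 512], [768, 768]], dtype=np.float32)
    labels = np.ones(len(pts), dtype=np.int32)  # 1 = foreground click
    masks_pts, scores_pts, _ = predictor.predict(
        point_coords=pts, point_labels=labels, multimask_output=True
    )
```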

5. Adaptation, Quantization, and Domain Specialization

SAM2’s modular structure supports rapid adaptation and efficiency improvements for resource-constrained or domain-specific contexts:

  • Few-shot and Medical Specialization: FS-SAM2 applies LoRA to the encoder and memory blocks, achieving 73.4% mIoU (1-shot, PASCAL-$5^i$) with only 0.9% additional parameters (see the LoRA sketch after this list) (Forni et al., 15 Sep 2025). DD-SAM2 injects depthwise-dilated adapters for medical video segmentation, reaching 0.93 Dice on TrackRad2025 (cine-MRI) and 0.97 on EchoNet-Dynamic (ultrasound) (Xu et al., 19 Jul 2025).
  • Efficient Quantization: Standard uniform quantization fails due to heavy-tailed weight distributions. Q-SAM2 introduces calibration by regularized pseudoinverse minimization and σ-based clipping in quantization-aware training, enabling 16× compression (2-bit) with minimal mIoU degradation and up to 66% mIoU gain in post-training quantization (see the clipping sketch after this list) (Farronato et al., 11 Jun 2025).
  • Unsupervised Cell Tracking: SAM2’s representation supports zero-shot cell tracking in biomedical video by prompt-conditioned cross-frame linking, achieving top-3 linking accuracy on 6/13 benchmarks without any supervised training or adaptation (Chen et al., 12 Sep 2025).
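
A minimal sketch of LoRA-style parameter-efficient adaptation, in the spirit of FS-SAM2's encoder and memory-block tuning, is shown below; the rank, scaling factor, and the choice of which linear projections to wrap are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) B A x."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained projection
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # the low-rank update starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def inject_lora(module: nn.Module, target_names=("qkv", "proj")) -> None:
    """Recursively replace selected linear layers (e.g. attention projections in the
    image encoder or memory-attention blocks) with LoRA-wrapped versions."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and any(t in name for t in target_names):
            setattr(module, name, LoRALinear(child))
        else:
            inject_lora(child, target_names)
```

After injection only the lora_a and lora_b weights remain trainable, which is the mechanism behind the small trainable-parameter overhead quoted above.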
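
The σ-based clipping behind Q-SAM2 can likewise be illustrated with a simple symmetric uniform quantizer, sketched below; the bit-width, the 3σ threshold, and the rounding scheme are illustrative choices, not the paper's calibration procedure.

```python
import torch


def sigma_clipped_quantize(w: torch.Tensor, bits: int = 2, k: float = 3.0) -> torch.Tensor:
    """Symmetric uniform quantization with a sigma-based clipping range (sketch).

    Heavy-tailed weight distributions make min/max ranges wasteful, so the clipping
    threshold is set to k standard deviations around the mean before quantizing."""
    mu, sigma = w.mean(), w.std()
    clip = float(k * sigma)
    w_c = torch.clamp(w - mu, -clip, clip)  # centre the weights and clip the tails
    qmax = 2 ** (bits - 1) - 1              # e.g. 1 for 2-bit symmetric quantization
    step = clip / qmax                      # uniform step size over the clipped range
    w_q = torch.round(w_c / step).clamp(-qmax - 1, qmax) * step + mu
    return w_q


w = torch.randn(256, 256) * 0.02
w[0, 0] = 1.5  # inject a heavy-tailed outlier
err = (sigma_clipped_quantize(w, bits=2) - w).abs().mean()
print(f"mean abs quantization error: {err.item():.4f}")
```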

6. Limitations, Open Challenges, and Research Directions

Empirical and ablation results across evaluations highlight key limitations and avenues for further research:

  • Prompt Sensitivity: Zero-shot instance and semantic segmentation in challenging domains (e.g., underwater, medical, camouflaged scenes) collapses without high-quality spatial cues. Investigating detector-based, confidence-guided, or multi-modal prompt generation is critical (Lian et al., 6 Aug 2024, Pei et al., 4 Sep 2024).
  • Fine Structure and Boundary Recovery: Despite hierarchical backbones and multi-scale adaptation, both standard SAM2 and its variants struggle with extremely fine details in high-resolution images—a challenge partly addressed by detail refinement modules and multi-view interaction (Shen et al., 31 Mar 2025).
  • Memory Control and Policy Learning: Fixed memory-update heuristics limit performance in high-variance video and object tracking. Reinforcement-learned memory policies (SAM2RL) achieve more than three times the improvement of hand-crafted rules in tracking quality, pointing to large untapped potential in memory-bank control (Adamyan et al., 11 Jul 2025).
  • Semantic and Contextual Awareness: SAM2 is class-agnostic by construction; bridging to semantic consistency (via SPMM/metaclass prototypes) raises segmentation performance but multi-class few-shot adaptation remains non-trivial (Liao et al., 9 Mar 2025, Forni et al., 15 Sep 2025).
  • Generalization and Domain Adaptation: Domain gaps, particularly in biomedical imaging and remote sensing, may require targeted fine-tuning of only the memory or adapter modules, or minimal retraining of the image encoder, to maintain generalization (Yan et al., 6 Aug 2024, Xu et al., 19 Jul 2025).

Collectively, SAM2 provides a highly modular, data-driven, and memory-augmented framework for promptable segmentation over images and videos. Its architectural choices and data-centric training pipeline underpin robust, efficient cross-domain performance, while its limitations invite continued research into prompt design, adaptive memory, parameter-efficient domain transfer, and high-fidelity boundary recovery (Ravi et al., 1 Aug 2024, Liao et al., 9 Mar 2025, Forni et al., 15 Sep 2025, Shen et al., 31 Mar 2025, Xu et al., 19 Jul 2025, Lian et al., 6 Aug 2024, Farronato et al., 11 Jun 2025, Chen et al., 12 Sep 2025, Pei et al., 4 Sep 2024, Yan et al., 6 Aug 2024, Xiong et al., 16 Aug 2024).
