Segment Anything Model 2 (SAM 2)
- SAM 2 is a promptable visual segmentation model that unifies image and video segmentation using a streaming-memory transformer architecture.
- The model introduces a hierarchical image encoder and a memory bank mechanism to support real-time, accurate segmentation across diverse benchmarks.
- Practical deployments, including 3D Slicer plugins and integration with object detectors, demonstrate SAM 2’s adaptability in biomedical and remote sensing applications.
Segment Anything Model 2 (SAM 2) is a foundation model developed by Meta AI for promptable visual segmentation in both static images and videos. SAM 2 extends the original Segment Anything Model (SAM), introducing a streaming-memory transformer architecture and a new paradigm for temporal and multi-frame interactive segmentation. It utilizes advanced memory mechanisms, comprehensive pretraining at scale, and an interactive data engine to achieve state-of-the-art performance across diverse segmentation tasks in both natural and specialized domains.
1. Architectural Innovations
SAM 2 unifies prompt-driven segmentation in images and videos through a modular transformer-based system. Key architectural enhancements over SAM include:
- Hierarchical Image Encoder (Hiera): Replaces the single-scale ViT backbone of SAM 1 with a four-stage MAE-pretrained Hiera encoder, producing multi-scale features at strides 4, 8, 16, and 32. Fine-resolution outputs directly enter mask upsampling, while coarse features form the memory path (Ravi et al., 2024, Geetha et al., 2024).
- Streaming Memory Bank: At each timestep, SAM 2 processes a frame, attends to a memory bank of the previous N=6 unprompted frames and M=1 prompted frames (storing spatial feature maps and object pointer vectors), and maintains temporal consistency via cross-attention with memory tokens; a minimal data-structure sketch follows this list (Ravi et al., 2024).
- Memory Encoder and Attention: For each frame t, SAM 2 encodes the predicted mask and image features into a memory entry $m_t$, which is appended to the memory bank. During inference or training, L=4 transformer blocks implement memory-equipped cross-attention with RoPE positional encoding (Ravi et al., 2024).
- Prompt Encoder and Mask Decoder: The prompt encoder handles points, boxes, masks, or text, projecting them as prompt tokens. The mask decoder integrates image, prompt, and memory-conditioned features via cross-attention to output multiple candidate masks and per-frame visibility scores (Ravi et al., 2024).
- Occlusion Head: Introduces an explicit occlusion prediction scalar to indicate object presence in each frame.
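The rolling memory bank referenced above can be pictured as a small FIFO structure holding the most recent unprompted-frame memories alongside memories from prompted frames. The following is a minimal data-structure sketch, with illustrative class and field names that are not part of the released codebase, assuming the N=6 / M=1 configuration reported in (Ravi et al., 2024):

```python
from collections import deque
from dataclasses import dataclass

import torch


@dataclass
class MemoryEntry:
    """One memory bank entry: a spatial feature map plus a compact object pointer."""
    spatial_feats: torch.Tensor   # e.g. (C, H/16, W/16) memory-encoded mask + image features
    object_pointer: torch.Tensor  # e.g. (D,) pointer vector produced by the mask decoder


class RollingMemoryBank:
    """FIFO bank of recent unprompted frames plus memories from prompted frames."""

    def __init__(self, num_recent: int = 6, num_prompted: int = 1):
        self.recent = deque(maxlen=num_recent)      # N most recent unprompted frames
        self.prompted = deque(maxlen=num_prompted)  # M prompted (user-interacted) frames

    def update(self, entry: MemoryEntry, is_prompted: bool) -> None:
        # Appending to a full deque silently drops the oldest entry (rolling behavior).
        (self.prompted if is_prompted else self.recent).append(entry)

    def tokens(self) -> list[MemoryEntry]:
        # Memory attention cross-attends to prompted entries first, then recent ones.
        return list(self.prompted) + list(self.recent)
```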
The full inference pipeline is:
- Extract frame features with Hiera.
- Integrate memory via transformer cross-attention.
- Fuse current prompts and decode mask candidates.
- Encode mask and update memory bank.
Mathematically, the memory update and memory-attention steps can be written as

$$m_t = \mathrm{MemEnc}\big(F_t, \hat{M}_t\big), \qquad \mathcal{B}_t = \{m_{t-N}, \dots, m_{t-1}\} \cup \mathcal{B}_{\text{prompt}},$$

$$\tilde{F}_t = \mathrm{CrossAttn}\big(Q = F_t,\; K = K_{\mathcal{B}_t},\; V = V_{\mathcal{B}_t}\big),$$

where $F_t$ are the current frame tokens, $\hat{M}_t$ is the predicted mask, and $K_{\mathcal{B}_t}$ / $V_{\mathcal{B}_t}$ are the memory keys/values (Ravi et al., 2024, Geetha et al., 2024, Bromley et al., 25 Feb 2025).
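Putting the pipeline steps and equations together, the per-frame streaming loop can be sketched as below. This is purely schematic: `hiera_encoder`, `memory_attention`, `prompt_encoder`, `mask_decoder`, and `memory_encoder` are illustrative stand-ins for the modules described above (not the released API), and `bank` can be the `RollingMemoryBank` sketch from the previous section.

```python
def segment_video_stream(frames, prompts, model, bank):
    """Schematic SAM 2-style streaming loop: one pass per frame, constant memory.

    `prompts` maps frame index -> user prompts (points/boxes/masks) and may be sparse;
    `model` is assumed to expose the four conceptual modules described above.
    """
    results = []
    for t, frame in enumerate(frames):
        # 1) Extract multi-scale frame features with the Hiera image encoder.
        feats = model.hiera_encoder(frame)

        # 2) Condition current features on the memory bank via cross-attention.
        feats = model.memory_attention(queries=feats, memories=bank.tokens())

        # 3) Fuse any prompts for this frame and decode candidate masks,
        #    per-mask IoU estimates, and an occlusion ("object present") score.
        prompt_tokens = model.prompt_encoder(prompts.get(t))
        masks, ious, occluded = model.mask_decoder(feats, prompt_tokens)
        best = masks[ious.argmax()]

        # 4) Encode the chosen mask with the frame features and update the bank.
        bank.update(model.memory_encoder(feats, best), is_prompted=t in prompts)
        results.append((best, occluded))
    return results
```

Because only the fixed-size bank is carried across frames, memory use stays constant regardless of video length.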
2. Training Regime and Datasets
SAM 2 is trained via a two-stage paradigm:
- Stage 1: Large-scale image-level pretraining using the SA-1B dataset (1.1 billion masks on 11 million images), employing MAE-style self-supervision and promptable segmentation objectives (Ravi et al., 2024).
- Stage 2: Joint image-video training using the SA-V dataset, an interactive corpus of 50.9k videos (642.6k masklets, 35.5M masks). Annotators interactively apply prompts (clicks, boxes, masks), with model-in-the-loop accelerations reducing edit time from 37.8s to 4.5s per frame (Ravi et al., 2024, Geetha et al., 2024).
Optimization uses AdamW, batch size 256, resolution up to 1024×1024, separate learning rates for the image encoder and the remaining modules, weighted focal and dice mask losses, and explicit IoU and occlusion prediction heads. Data mixing (SA-1B, SA-V, internal/open-source VOS data), strong augmentations, and multi-frame temporal reversal are used during training (Ravi et al., 2024).
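The supervision just described (weighted focal and dice mask losses plus IoU and occlusion heads) can be illustrated with the following minimal PyTorch sketch. The weights and exact formulation here are placeholders for illustration, not the authors' training code; `sigmoid_focal_loss` from torchvision supplies the focal term.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss


def sam2_style_loss(mask_logits, gt_mask, iou_pred, occ_logit, visible,
                    w_focal=20.0, w_dice=1.0, w_iou=1.0, w_occ=1.0):
    """Illustrative SAM 2-style loss: focal + dice on masks, L1 on the predicted IoU,
    binary cross-entropy on the occlusion ("object present") head.

    mask_logits, gt_mask: (B, H, W) float tensors; iou_pred, occ_logit, visible: (B,) floats.
    The loss weights are placeholder values, not the published configuration.
    """
    # Focal loss on per-pixel mask logits.
    focal = sigmoid_focal_loss(mask_logits, gt_mask, reduction="mean")

    # Soft dice loss on the predicted mask probabilities.
    probs = mask_logits.sigmoid().flatten(1)
    gt = gt_mask.flatten(1)
    dice = 1 - (2 * (probs * gt).sum(-1) + 1) / (probs.sum(-1) + gt.sum(-1) + 1)

    # Supervise the IoU head with the actual IoU of the thresholded prediction.
    pred_bin = (probs > 0.5).float()
    inter = (pred_bin * gt).sum(-1)
    union = pred_bin.sum(-1) + gt.sum(-1) - inter
    iou_target = inter / union.clamp(min=1)  # clamp avoids division by zero for empty masks
    iou_loss = F.l1_loss(iou_pred, iou_target)

    # Occlusion head: binary cross-entropy against frame-level visibility labels.
    occ_loss = F.binary_cross_entropy_with_logits(occ_logit, visible)

    return (w_focal * focal + w_dice * dice.mean()
            + w_iou * iou_loss + w_occ * occ_loss)
```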
3. Prompting, Memory, and Segmentation Modes
SAM 2 supports a flexible prompt interface and temporal propagation:
- Prompt Types: Points (foreground/background), bounding boxes, free-form masks, and text (via CLIP-based embeddings).
- Prompt Propagation: Single or sparse prompts can be placed on a subset of frames, after which SAM 2 propagates object masks through the entire sequence via its memory bank. Bi-directional or uni-directional propagation is supported (see the usage sketch after this list) (Dong et al., 2024, Yildiz et al., 2024).
- Interactive Correction: The propagate_in_video mechanism enables correction of segmentation across previous and future frames upon addition of new prompts, supporting interactive refinements in real-time (Wang et al., 2024).
- Application to 3D Data: Slices of 3D volumes are treated as video frames for volumetric annotation; point prompts on a subset of slices are propagated, and memory-based continuity maintains anatomical coherence across slices (Yildiz et al., 2024, Dong et al., 2024).
- Memory Control: While the standard memory replacement uses a fixed-size rolling bank, recent work has explored reinforcement learning-based control for memory replacement, yielding significant gains in tracking and segmentation accuracy (Adamyan et al., 11 Jul 2025).
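A concrete usage sketch of sparse prompting followed by propagation is shown below, using the interface of the publicly released `sam2` package (method names such as `add_new_points_or_box` have varied across releases, so treat this as a sketch to verify against the current repository); paths, config names, and coordinates are placeholders. The same pattern applies to 3D volumes exported as ordered slice images.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint names are placeholders; use the files you downloaded.
predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_l.yaml",
                                       "checkpoints/sam2.1_hiera_large.pt")

with torch.inference_mode():
    # A directory of JPEG frames; slices of a 3D volume can be exported the same way.
    state = predictor.init_state(video_path="path/to/frames")

    # One positive click on frame 0 for object id 1 (x, y in pixel coordinates).
    points = np.array([[210, 350]], dtype=np.float32)
    labels = np.array([1], dtype=np.int32)  # 1 = foreground, 0 = background
    predictor.add_new_points_or_box(state, frame_idx=0, obj_id=1,
                                    points=points, labels=labels)

    # Propagate the mask through the rest of the sequence via the memory bank.
    masks_per_frame = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks_per_frame[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```

Adding a corrective click on a later frame and re-running propagation updates the affected frames, which is the interactive-refinement workflow described above.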
4. Quantitative Performance and Benchmarks
SAM 2 sets new state-of-the-art results on a range of image and video segmentation tasks. Quantitative highlights include:
- Video Object Segmentation (VOS): Sets new state-of-the-art J&F scores on DAVIS17 val and MOSE val (see the table below) and outperforms prior foundation and tailored VOS models in zero-shot settings (Ravi et al., 2024).
- Interactive Video Segmentation: Requires roughly 3× fewer user interactions than prior approaches (e.g., SAM+XMem++, SAM+Cutie) at equivalent or higher accuracy. Phase 3 annotation time dropped to 4.5 s/frame (from 37.8 s) and click count per frame to 2.68 (from 4.8) (Ravi et al., 2024).
- Image Segmentation: On SA-23 and 14 new video-derived image benchmarks, SAM 2 (Hiera-L, mixed data) attains higher 1-click and 5-click mIoU than SAM and HQ-SAM (see the table below), and matches or exceeds best-in-class supervised models without fine-tuning (Ravi et al., 2024).
- Medical/Specialized Domains: SAM 2 (zero-shot) yields Dice scores around 0.9 in surgical video under frame-sparse prompting, approaching or exceeding domain-supervised baselines (UNet, DeepLabv3+). In remote sensing, SAM 2 with user-box prompts surpasses CNN- and YOLO-based pipelines under challenging lighting/resolution conditions (Shen et al., 2024, Rafaeli et al., 2024, Dong et al., 2024).
Selected results:

| Task/Domain | Metric | SAM 2 | Best Baseline | Reference |
|---|---|---|---|---|
| DAVIS17 (val) VOS | J&F | 91.6 | 88.1 (Cutie+) | (Ravi et al., 2024) |
| SA-23 zero-shot image | mIoU (1-/5-click) | 63.5/83.5 | 60.8/82.1 (SAM) | (Ravi et al., 2024) |
| Surgical video, reg. | Dice (N=300 prompts) | 0.91 | 0.94 (UNet) | (Shen et al., 2024) |
| Medical 3D, bidir. propagation | IoU (avg) | up to 0.19 | — | (Dong et al., 2024) |
| Remote sensing (PV) | IoU (box prompt) | up to 0.79 | 0.71 (Eff-UNet) | (Rafaeli et al., 2024) |
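For reference, the overlap metrics in this table (IoU and Dice; J&F additionally averages region similarity with a boundary F-measure) reduce to simple set overlaps between predicted and ground-truth masks, as in this minimal NumPy implementation:

```python
import numpy as np


def iou_and_dice(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """IoU (Jaccard) and Dice coefficients for boolean segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0  # both masks empty -> perfect agreement
    total = pred.sum() + gt.sum()
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)
```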
5. Practical Deployments and Extensions
Several studies have demonstrated the practicality and extensibility of SAM 2:
- 3D Slicer Plugin: The SegmentWithSAM extension integrates SAM 2 into 3D Slicer for annotation of CT/MR volumes, enabling point prompts and propagation across slices in unidirectional or bidirectional mode (Yildiz et al., 2024).
- Det-SAM2 Framework: Automates video segmentation by coupling SAM 2 with a YOLOv8 object detector, providing self-prompting and buffer/window-based memory optimization for arbitrarily long video streams at constant memory. This supports real-time applications such as AI-based sports refereeing with efficient memory control; a schematic detector-to-prompt coupling is sketched after this list (Wang et al., 2024).
- Domain Adaptation (BioSAM 2): Fine-tuning the image encoder and mask decoder while freezing the prompt and memory modules enables BioSAM 2 to match or surpass specialist models (nnU-Net, U-Mamba) in biomedical image and video tasks, supporting the value of foundation models with targeted adaptation (Yan et al., 2024).
- Reinforcement Learning for Memory Control: RL policies for memory bank updates (SAM2RL) yield substantial gains in tracking quality over the fixed rolling-bank baseline, exceeding heuristic-based memory improvements by a wide margin (Adamyan et al., 11 Jul 2025).
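As referenced above, a Det-SAM2-style self-prompting loop can be sketched as follows: detector boxes on a key frame become box prompts for the video predictor, which then propagates masks through the stream. The detector and predictor calls follow the public `ultralytics` and `sam2` packages, but the wiring, checkpoints, and paths are illustrative rather than the Det-SAM2 implementation itself.

```python
import torch
from ultralytics import YOLO
from sam2.build_sam import build_sam2_video_predictor

detector = YOLO("yolov8n.pt")  # any YOLOv8 checkpoint; the model choice is illustrative
predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_s.yaml",
                                       "checkpoints/sam2.1_hiera_small.pt")

with torch.inference_mode():
    state = predictor.init_state(video_path="path/to/frames")

    # Self-prompting: run the detector on the first frame and feed each detection
    # to SAM 2 as a box prompt, one object id per detected box.
    detections = detector("path/to/frames/00000.jpg")[0]
    for obj_id, box in enumerate(detections.boxes.xyxy.cpu().numpy()):
        predictor.add_new_points_or_box(state, frame_idx=0, obj_id=obj_id, box=box)

    # Propagate all prompted objects through the remaining frames.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # one mask per tracked object
```

Re-running the detector periodically and pruning or capping the memory state (as Det-SAM2 does with its buffer/window scheme) is what keeps memory bounded on long streams.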
6. Limitations and Open Challenges
Despite its advances, SAM 2 presents several limitations and ongoing research questions:
- Prompt Dependence: Performance in zero-shot, promptless (auto-mode) settings often sharply degrades, as observed in camouflaged object detection where recall and region proposals drastically decrease relative to SAM (Tang et al., 2024).
- Temporal Range and Occlusions: Segmentation quality degrades under rapid object motion, long-term occlusion, or when large inter-slice changes occur in volumetric data (Dong et al., 2024, Geetha et al., 2024).
- Multi-object Handling: SAM 2 processes objects independently, lacking inter-object attention, leading to confusion among visually similar instances (Geetha et al., 2024).
- Compute and Latency: While per-frame inference is fast (Hiera-B+ at 130 FPS), prompt processing and mask decoding with many candidates or large memory banks incur overhead, especially in live deployments (Ravi et al., 2024).
- Automatic Segmentation: Fully automatic mode (no prompts) is not competitive with specialist models, and adaptive proposals or learned self-prompting remain an open area (Rafaeli et al., 2024, Tang et al., 2024, Wang et al., 2024).
7. Future Directions
Ongoing and proposed directions for advancing SAM 2 and similar models include:
- Enhancing Memory Mechanisms: Richer RL-based update policies, hierarchical or transformer-based memory embeddings, support for continuous weighting and automatic memory management (Adamyan et al., 11 Jul 2025).
- Inter-object and Motion-aware Modules: Incorporation of cross-object attention, learned motion priors, and explicit spatiotemporal constraints in the memory encoder to address crowded scenes and rapid changes (Geetha et al., 2024, Ravi et al., 2024).
- Semi/Unsupervised and Active Learning: Automatic expansion of mask annotation via pseudo-labeling, active learning, and reduced human-in-the-loop correction during data collection (Ravi et al., 2024).
- Domain Adaptation: Efficient fine-tuning strategies for specialized domains (medical, remote sensing), including adapter layers and prompt augmentation (Yan et al., 2024, Dong et al., 2024).
- Zero-Prompt and Self-Prompted Segmentation: Optimization of mask proposal thresholds and integration of self-prompting detectors to close the gap between user-driven and fully automatic segmentation workflows (Wang et al., 2024, Tang et al., 2024).
- Deployment at Scale: Optimization of inference time and memory for large-scale annotation, streaming video, and 3D segmentation, including batching, VRAM/RAM capping, and offloading strategies (Wang et al., 2024, Yildiz et al., 2024).
SAM 2 thus constitutes a foundational shift toward unified, promptable segmentation across modalities and domains, with memory-equipped transformers enabling consistent, interactive, and efficient video and volumetric segmentation (Ravi et al., 2024, Geetha et al., 2024, Wang et al., 2024). The released codebase and dataset (SA-V) support broad experimental and practical adoption.