Segment Anything Model 2 (SAM 2)
- SAM 2 is a promptable visual segmentation model that unifies image and video segmentation using a streaming-memory transformer architecture.
- The model introduces a hierarchical image encoder and a memory bank mechanism to support real-time, accurate segmentation across diverse benchmarks.
- Practical deployments, including 3D Slicer plugins and integration with object detectors, demonstrate SAM 2’s adaptability in biomedical and remote sensing applications.
Segment Anything Model 2 (SAM 2) is a foundation model developed by Meta AI for promptable visual segmentation in both static images and videos. SAM 2 extends the original Segment Anything Model (SAM), introducing a streaming-memory transformer architecture and a new paradigm for temporal and multi-frame interactive segmentation. It utilizes advanced memory mechanisms, comprehensive pretraining at scale, and an interactive data engine to achieve state-of-the-art performance across diverse segmentation tasks in both natural and specialized domains.
1. Architectural Innovations
SAM 2 unifies prompt-driven segmentation in images and videos through a modular transformer-based system. Key architectural enhancements over SAM include:
- Hierarchical Image Encoder (Hiera): Replaces the single-scale ViT backbone of SAM 1 with a four-stage MAE-pretrained Hiera encoder, producing multi-scale features at strides 4, 8, 16, and 32. Fine-resolution outputs directly enter mask upsampling, while coarse features form the memory path (Ravi et al., 2024, Geetha et al., 2024).
- Streaming Memory Bank: At each timestep, SAM 2 processes a frame, attends to a memory bank of the previous N=6 unprompted frames and M=1 prompted frames (storing spatial feature maps and object pointer vectors), and maintains temporal consistency via cross-attention with memory tokens; a minimal data-structure sketch follows this list (Ravi et al., 2024).
- Memory Encoder and Attention: For each frame t, SAM 2 encodes the predicted mask and image features into a memory entry $m_t$, which is appended to the memory bank. During inference or training, L=4 transformer blocks implement memory-equipped cross-attention with RoPE positional encoding (Ravi et al., 2024).
- Prompt Encoder and Mask Decoder: The prompt encoder handles points, boxes, masks, or text, projecting them as prompt tokens. The mask decoder integrates image, prompt, and memory-conditioned features via cross-attention to output multiple candidate masks and per-frame visibility scores (Ravi et al., 2024).
- Occlusion Head: Introduces an explicit occlusion prediction scalar to indicate object presence in each frame.
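The rolling memory bank referenced above can be pictured as a small FIFO structure holding the most recent unprompted-frame memories alongside memories from prompted frames. The following is a minimal data-structure sketch, with illustrative class and field names that are not part of the released codebase, assuming the N=6 / M=1 configuration reported in (Ravi et al., 2024):

```python
from collections import deque
from dataclasses import dataclass

import torch


@dataclass
class MemoryEntry:
    """One memory bank entry: a spatial feature map plus a compact object pointer."""
    spatial_feats: torch.Tensor   # e.g. (C, H/16, W/16) memory-encoded mask + image features
    object_pointer: torch.Tensor  # e.g. (D,) pointer vector produced by the mask decoder


class RollingMemoryBank:
    """FIFO bank of recent unprompted frames plus memories from prompted frames."""

    def __init__(self, num_recent: int = 6, num_prompted: int = 1):
        self.recent = deque(maxlen=num_recent)      # N most recent unprompted frames
        self.prompted = deque(maxlen=num_prompted)  # M prompted (user-interacted) frames

    def update(self, entry: MemoryEntry, is_prompted: bool) -> None:
        # Appending to a full deque silently drops the oldest entry (rolling behavior).
        (self.prompted if is_prompted else self.recent).append(entry)

    def tokens(self) -> list[MemoryEntry]:
        # Memory attention cross-attends to prompted entries first, then recent ones.
        return list(self.prompted) + list(self.recent)
```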
The full inference pipeline is:
- Extract frame features with Hiera.
- Integrate memory via transformer cross-attention.
- Fuse current prompts and decode mask candidates.
- Encode mask and update memory bank.
Mathematically, the memory update and memory-attention steps can be written as

$$m_t = \mathrm{MemEnc}\big(F_t, \hat{M}_t\big), \qquad \mathcal{B}_t = \{m_{t-N}, \dots, m_{t-1}\} \cup \mathcal{B}_{\text{prompt}},$$

$$\tilde{F}_t = \mathrm{CrossAttn}\big(Q = F_t,\; K = K_{\mathcal{B}_t},\; V = V_{\mathcal{B}_t}\big),$$

where $F_t$ are the current frame tokens, $\hat{M}_t$ is the predicted mask, and $K_{\mathcal{B}_t}$ / $V_{\mathcal{B}_t}$ are the memory keys/values (Ravi et al., 2024, Geetha et al., 2024, Bromley et al., 25 Feb 2025).
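Putting the pipeline steps and equations together, the per-frame streaming loop can be sketched as below. This is purely schematic: `hiera_encoder`, `memory_attention`, `prompt_encoder`, `mask_decoder`, and `memory_encoder` are illustrative stand-ins for the modules described above (not the released API), and `bank` can be the `RollingMemoryBank` sketch from the previous section.

```python
def segment_video_stream(frames, prompts, model, bank):
    """Schematic SAM 2-style streaming loop: one pass per frame, constant memory.

    `prompts` maps frame index -> user prompts (points/boxes/masks) and may be sparse;
    `model` is assumed to expose the four conceptual modules described above.
    """
    results = []
    for t, frame in enumerate(frames):
        # 1) Extract multi-scale frame features with the Hiera image encoder.
        feats = model.hiera_encoder(frame)

        # 2) Condition current features on the memory bank via cross-attention.
        feats = model.memory_attention(queries=feats, memories=bank.tokens())

        # 3) Fuse any prompts for this frame and decode candidate masks,
        #    per-mask IoU estimates, and an occlusion ("object present") score.
        prompt_tokens = model.prompt_encoder(prompts.get(t))
        masks, ious, occluded = model.mask_decoder(feats, prompt_tokens)
        best = masks[ious.argmax()]

        # 4) Encode the chosen mask with the frame features and update the bank.
        bank.update(model.memory_encoder(feats, best), is_prompted=t in prompts)
        results.append((best, occluded))
    return results
```

Because only the fixed-size bank is carried across frames, memory use stays constant regardless of video length.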
2. Training Regime and Datasets
SAM 2 is trained via a two-stage paradigm:
- Stage 1: Large-scale image-level pretraining using the SA-1B dataset (1.1 billion masks on 11 million images), employing MAE-style self-supervision and promptable segmentation objectives (Ravi et al., 2024).
- Stage 2: Joint image-video training using the SA-V dataset, an interactive corpus of 50.9k videos (642.6k masklets, 35.5M masks). Annotators interactively apply prompts (clicks, boxes, masks), with model-in-the-loop accelerations reducing edit time from 37.8s to 4.5s per frame (Ravi et al., 2024, Geetha et al., 2024).
Optimization uses AdamW, batch size 256, resolution up to 1024×1024, separate learning rates for the image encoder and the remaining modules, weighted focal and dice mask losses, and explicit IoU and occlusion prediction heads. Data mixing (SA-1B, SA-V, internal/open-source VOS data), strong augmentations, and multi-frame temporal reversal are used during training (Ravi et al., 2024).
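The supervision just described (weighted focal and dice mask losses plus IoU and occlusion heads) can be illustrated with the following minimal PyTorch sketch. The weights and exact formulation here are placeholders for illustration, not the authors' training code; `sigmoid_focal_loss` from torchvision supplies the focal term.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss


def sam2_style_loss(mask_logits, gt_mask, iou_pred, occ_logit, visible,
                    w_focal=20.0, w_dice=1.0, w_iou=1.0, w_occ=1.0):
    """Illustrative SAM 2-style loss: focal + dice on masks, L1 on the predicted IoU,
    binary cross-entropy on the occlusion ("object present") head.

    mask_logits, gt_mask: (B, H, W) float tensors; iou_pred, occ_logit, visible: (B,) floats.
    The loss weights are placeholder values, not the published configuration.
    """
    # Focal loss on per-pixel mask logits.
    focal = sigmoid_focal_loss(mask_logits, gt_mask, reduction="mean")

    # Soft dice loss on the predicted mask probabilities.
    probs = mask_logits.sigmoid().flatten(1)
    gt = gt_mask.flatten(1)
    dice = 1 - (2 * (probs * gt).sum(-1) + 1) / (probs.sum(-1) + gt.sum(-1) + 1)

    # Supervise the IoU head with the actual IoU of the thresholded prediction.
    pred_bin = (probs > 0.5).float()
    inter = (pred_bin * gt).sum(-1)
    union = pred_bin.sum(-1) + gt.sum(-1) - inter
    iou_target = inter / union.clamp(min=1)  # clamp avoids division by zero for empty masks
    iou_loss = F.l1_loss(iou_pred, iou_target)

    # Occlusion head: binary cross-entropy against frame-level visibility labels.
    occ_loss = F.binary_cross_entropy_with_logits(occ_logit, visible)

    return (w_focal * focal + w_dice * dice.mean()
            + w_iou * iou_loss + w_occ * occ_loss)
```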
3. Prompting, Memory, and Segmentation Modes
SAM 2 supports a flexible prompt interface and temporal propagation:
- Prompt Types: Points (foreground/background), bounding boxes, free-form masks, and text (via CLIP-based embeddings).
- Prompt Propagation: Single or sparse prompts can be placed on a subset of frames, after which SAM 2 propagates object masks through the entire sequence via its memory bank. Bi-directional or uni-directional propagation is supported (see the usage sketch after this list) (Dong et al., 2024, Yildiz et al., 2024).
- Interactive Correction: The propagate_in_video mechanism enables correction of segmentation across previous and future frames upon addition of new prompts, supporting interactive refinements in real-time (Wang et al., 2024).
- Application to 3D Data: Slices of 3D volumes are treated as video frames for volumetric annotation; point prompts on a subset of slices are propagated, and memory-based continuity maintains anatomical coherence across slices (Yildiz et al., 2024, Dong et al., 2024).
- Memory Control: While the standard memory replacement uses a fixed-size rolling bank, recent work has explored reinforcement learning-based control for memory replacement, yielding significant gains in tracking and segmentation accuracy (Adamyan et al., 11 Jul 2025).
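A concrete usage sketch of sparse prompting followed by propagation is shown below, using the interface of the publicly released `sam2` package (method names such as `add_new_points_or_box` have varied across releases, so treat this as a sketch to verify against the current repository); paths, config names, and coordinates are placeholders. The same pattern applies to 3D volumes exported as ordered slice images.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint names are placeholders; use the files you downloaded.
predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_l.yaml",
                                       "checkpoints/sam2.1_hiera_large.pt")

with torch.inference_mode():
    # A directory of JPEG frames; slices of a 3D volume can be exported the same way.
    state = predictor.init_state(video_path="path/to/frames")

    # One positive click on frame 0 for object id 1 (x, y in pixel coordinates).
    points = np.array([[210, 350]], dtype=np.float32)
    labels = np.array([1], dtype=np.int32)  # 1 = foreground, 0 = background
    predictor.add_new_points_or_box(state, frame_idx=0, obj_id=1,
                                    points=points, labels=labels)

    # Propagate the mask through the rest of the sequence via the memory bank.
    masks_per_frame = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks_per_frame[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```

Adding a corrective click on a later frame and re-running propagation updates the affected frames, which is the interactive-refinement workflow described above.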
4. Quantitative Performance and Benchmarks
SAM 2 sets new state-of-the-art results on a range of image and video segmentation tasks. Quantitative highlights include:
- Video Object Segmentation (VOS): Sets new state-of-the-art J&F scores on DAVIS17 val and MOSE val (see the table below) and outperforms prior foundation and tailored VOS models in zero-shot settings (Ravi et al., 2024).
- Interactive Video Segmentation: Requires roughly 3× fewer user interactions than prior approaches (e.g., SAM+XMem++, SAM+Cutie) at equivalent or higher accuracy. Phase 3 annotation time dropped to 4.5 s/frame (from 37.8 s) and click count per frame to 2.68 (from 4.8) (Ravi et al., 2024).
- Image Segmentation: On SA-23 and 14 new video-derived image benchmarks, SAM 2 (Hiera-L, mixed data) attains higher 1-click and 5-click mIoU than SAM and HQ-SAM (see the table below), and matches or exceeds best-in-class supervised models without fine-tuning (Ravi et al., 2024).
- Medical/Specialized Domains: SAM 2 (zero-shot) yields Dice scores around 0.9 in surgical video under frame-sparse prompting, approaching or exceeding domain-supervised baselines (UNet, DeepLabv3+). In remote sensing, SAM 2 with user-box prompts surpasses CNN- and YOLO-based pipelines under challenging lighting/resolution conditions (Shen et al., 2024, Rafaeli et al., 2024, Dong et al., 2024).
Selected results:

| Task/Domain | Metric | SAM 2 | Best Baseline | Reference |
|---|---|---|---|---|
| DAVIS17 (val) VOS | J&F | 91.6 | 88.1 (Cutie+) | (Ravi et al., 2024) |
| SA-23 zero-shot image | mIoU (1-/5-click) | 63.5/83.5 | 60.8/82.1 (SAM) | (Ravi et al., 2024) |
| Surgical video, reg. | Dice (N=300 prompts) | 0.91 | 0.94 (UNet) | (Shen et al., 2024) |
| Medical 3D, bidir. propagation | IoU (avg) | up to 0.19 | — | (Dong et al., 2024) |
| Remote sensing (PV) | IoU (box prompt) | up to 0.79 | 0.71 (Eff-UNet) | (Rafaeli et al., 2024) |
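For reference, the overlap metrics in this table (IoU and Dice; J&F additionally averages region similarity with a boundary F-measure) reduce to simple set overlaps between predicted and ground-truth masks, as in this minimal NumPy implementation:

```python
import numpy as np


def iou_and_dice(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """IoU (Jaccard) and Dice coefficients for boolean segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0  # both masks empty -> perfect agreement
    total = pred.sum() + gt.sum()
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)
```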
5. Practical Deployments and Extensions
Several studies have demonstrated the practicality and extensibility of SAM 2:
- 3D Slicer Plugin: The SegmentWithSAM extension integrates SAM 2 into 3D Slicer for annotation of CT/MR volumes, enabling point prompts and propagation across slices in unidirectional or bidirectional mode (Yildiz et al., 2024).
- Det-SAM2 Framework: Automates video segmentation by coupling SAM 2 with a YOLOv8 object detector, providing self-prompting and buffer/window-based memory optimization for arbitrarily long video streams at constant memory. This supports real-time applications such as AI-based sports refereeing with efficient memory control; a schematic detector-to-prompt coupling is sketched after this list (Wang et al., 2024).
- Domain Adaptation (BioSAM 2): Fine-tuning the image encoder and mask decoder while freezing the prompt and memory modules enables BioSAM 2 to match or surpass specialist models (nnU-Net, U-Mamba) in biomedical image and video tasks, supporting the value of foundation models with targeted adaptation (Yan et al., 2024).
- Reinforcement Learning for Memory Control: RL policies for memory bank updates (SAM2RL) yield substantial gains in tracking quality over the fixed rolling-bank baseline, exceeding heuristic-based memory improvements by a wide margin (Adamyan et al., 11 Jul 2025).
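As referenced above, a Det-SAM2-style self-prompting loop can be sketched as follows: detector boxes on a key frame become box prompts for the video predictor, which then propagates masks through the stream. The detector and predictor calls follow the public `ultralytics` and `sam2` packages, but the wiring, checkpoints, and paths are illustrative rather than the Det-SAM2 implementation itself.

```python
import torch
from ultralytics import YOLO
from sam2.build_sam import build_sam2_video_predictor

detector = YOLO("yolov8n.pt")  # any YOLOv8 checkpoint; the model choice is illustrative
predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_s.yaml",
                                       "checkpoints/sam2.1_hiera_small.pt")

with torch.inference_mode():
    state = predictor.init_state(video_path="path/to/frames")

    # Self-prompting: run the detector on the first frame and feed each detection
    # to SAM 2 as a box prompt, one object id per detected box.
    detections = detector("path/to/frames/00000.jpg")[0]
    for obj_id, box in enumerate(detections.boxes.xyxy.cpu().numpy()):
        predictor.add_new_points_or_box(state, frame_idx=0, obj_id=obj_id, box=box)

    # Propagate all prompted objects through the remaining frames.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # one mask per tracked object
```

Re-running the detector periodically and pruning or capping the memory state (as Det-SAM2 does with its buffer/window scheme) is what keeps memory bounded on long streams.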
6. Limitations and Open Challenges
Despite its advances, SAM 2 presents several limitations and ongoing research questions:
- Prompt Dependence: Performance in zero-shot, promptless (auto-mode) settings often sharply degrades, as observed in camouflaged object detection where recall and region proposals drastically decrease relative to SAM (Tang et al., 2024).
- Temporal Range and Occlusions: Segmentation quality degrades under rapid object motion, long-term occlusion, or when large inter-slice changes occur in volumetric data (Dong et al., 2024, Geetha et al., 2024).
- Multi-object Handling: SAM 2 processes objects independently, lacking inter-object attention, leading to confusion among visually similar instances (Geetha et al., 2024).
- Compute and Latency: While per-frame inference is fast (Hiera-B+ at 130 FPS), prompt processing and mask decoding with many candidates or large memory banks incur overhead, especially in live deployments (Ravi et al., 2024).
- Automatic Segmentation: Fully automatic mode (no prompts) is not competitive with specialist models, and adaptive proposals or learned self-prompting remain an open area (Rafaeli et al., 2024, Tang et al., 2024, Wang et al., 2024).
7. Future Directions
Ongoing and proposed directions for advancing SAM 2 and similar models include:
- Enhancing Memory Mechanisms: Richer RL-based update policies, hierarchical or transformer-based memory embeddings, support for continuous weighting and automatic memory management (Adamyan et al., 11 Jul 2025).
- Inter-object and Motion-aware Modules: Incorporation of cross-object attention, learned motion priors, and explicit spatiotemporal constraints in the memory encoder to address crowded scenes and rapid changes (Geetha et al., 2024, Ravi et al., 2024).
- Semi/Unsupervised and Active Learning: Automatic expansion of mask annotation via pseudo-labeling, active learning, and reduced human-in-the-loop correction during data collection (Ravi et al., 2024).
- Domain Adaptation: Efficient fine-tuning strategies for specialized domains (medical, remote sensing), including adapter layers and prompt augmentation (Yan et al., 2024, Dong et al., 2024).
- Zero-Prompt and Self-Prompted Segmentation: Optimization of mask proposal thresholds and integration of self-prompting detectors to close the gap between user-driven and fully automatic segmentation workflows (Wang et al., 2024, Tang et al., 2024).
- Deployment at Scale: Optimization of inference time and memory for large-scale annotation, streaming video, and 3D segmentation, including batching, VRAM/RAM capping, and offloading strategies (Wang et al., 2024, Yildiz et al., 2024).
SAM 2 thus constitutes a foundational shift toward unified, promptable segmentation across modalities and domains, with memory-equipped transformers enabling consistent, interactive, and efficient video and volumetric segmentation (Ravi et al., 2024, Geetha et al., 2024, Wang et al., 2024). The released codebase and dataset (SA-V) support broad experimental and practical adoption.