SAM-2: Memory-Augmented Video Segmentation

Updated 12 November 2025
  • SAM-2 is a promptable, memory-augmented video segmentation model that utilizes a novel memory encoder and Hiera+FPN backbone for consistent mask propagation.
  • The model integrates a sliding window memory mechanism supporting both causal and acausal propagation, enabling efficient, interactive video annotation with minimal human input.
  • Deployment of SAM-2 demonstrates near-real-time performance and significant gains in annotation speed, and it points toward future advances in multi-object tracking and context-aware segmentation.

The Segment Anything Model 2 (SAM-2) is a promptable visual segmentation architecture from Meta that generalizes the original Segment Anything Model (SAM) from image to real-time, interactive video segmentation. SAM-2 introduces a memory-augmented, streaming architecture that supports consistent mask propagation across frames, exploits a novel backbone and feature-pyramid design, and underpins large-scale, interactive data engines for segmentation annotation. This marks a shift in segmentation foundation models from stateless per-frame masking to temporally coherent video object segmentation, with implications for annotation efficiency, real-time deployment, and future work in multi-object tracking, motion modeling, and context-aware segmentation (Geetha et al., 12 Aug 2024).

1. Architectural Advances and Model Pipeline

SAM-2 maintains the promptable foundation of SAM—handling points, boxes, masks, or text prompts—but substantially revises the pipeline to support video and temporal consistency. The architecture consists of an image encoder, a new memory encoder and memory bank, a prompt encoder, and a multi-source mask decoder.

  • Image Encoder (Backbone):
    • SAM-1 uses a pre-trained Vision Transformer (ViT).
    • SAM-2 replaces ViT with Hiera, a hierarchical masked-encoder transformer with a feature pyramid network (FPN).
    • The FPN component merges stride-16 and stride-32 features for coarse mask generation and injects stride-4 and stride-8 outputs through skip connections for higher-resolution mask detail (see the sketch after this list).
  • Prompt Encoder:
    • Accepts both sparse (points, boxes) and dense (masks) prompts.
    • Integrates positional encoding for sparse prompts and a convolutional mechanism for dense masks, enhancing their injection into the temporal attention pipeline.
  • Memory Encoder/Bank:
    • Maintains a sliding window of key–value pairs representing prior frame embeddings and predicted masks.
    • The current frame's feature embedding ($E_t$) attends to memory bank entries via a four-layer transformer with both rotary and sinusoidal positional embeddings.
    • After each decoding, new keys and values derived from current features and outputs augment the memory bank.
  • Mask Decoder:
    • Receives: (1) memory-refined embeddings, (2) skip-connected high-resolution encoder features, and (3) prompt encodings.
    • Produces a segmentation mask for the given frame, capable of leveraging both temporally and spatially proximate information for accuracy.
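
To make the feature-pyramid description concrete, the following is a minimal sketch of an FPN-style neck, assuming Hiera-like channel widths at strides 4/8/16/32; the module name `SimpleFpnNeck` and all dimensions are illustrative, not the actual SAM-2 implementation.

```python
# Minimal sketch of an FPN-style neck over hierarchical encoder features
# (assumed Hiera-like channel widths and module names; not the actual SAM-2 code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleFpnNeck(nn.Module):
    def __init__(self, channels=(96, 192, 384, 768), dim: int = 256):
        super().__init__()
        # 1x1 projections bring each pyramid level (strides 4/8/16/32) to a common dim.
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, kernel_size=1) for c in channels])

    def forward(self, feats):
        # feats: stride-4, stride-8, stride-16, stride-32 maps, finest first.
        p4, p8, p16, p32 = (proj(f) for proj, f in zip(self.proj, feats))
        # Merge the stride-32 level into stride-16 for the coarse mask pathway.
        coarse = p16 + F.interpolate(p32, size=p16.shape[-2:], mode="nearest")
        # Keep stride-4/8 maps as high-resolution skip connections for the decoder.
        return coarse, (p4, p8)


# Toy usage with a 256x256 input and assumed channel widths at strides 4/8/16/32.
feats = [torch.randn(1, c, 256 // s, 256 // s)
         for c, s in zip((96, 192, 384, 768), (4, 8, 16, 32))]
coarse_feat, skip_feats = SimpleFpnNeck()(feats)
```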

The model supports frame-level and full-video propagation modes, enforcing temporal mask consistency through memory attention and facilitating correction via user clicks at sparse intervals.
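
The per-frame data flow described above can be summarized in a minimal streaming sketch. All module names (`MemoryAttention`, `Sam2LikeModel`) and the toy convolutional stand-ins for the Hiera+FPN encoder and mask decoder are assumptions for illustration; the prompt encoder is omitted for brevity.

```python
# Minimal per-frame sketch of a SAM-2-style streaming pipeline (hypothetical
# module names; the real Hiera+FPN encoder, prompt encoder, and mask decoder
# are far richer, and prompts are omitted here for brevity).
import torch
import torch.nn as nn


class MemoryAttention(nn.Module):
    """Cross-attends current-frame tokens to memory-bank key/value pairs."""

    def __init__(self, dim: int, layers: int = 4, heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)]
        )

    def forward(self, tokens, mem_keys, mem_vals):
        for attn in self.blocks:
            out, _ = attn(query=tokens, key=mem_keys, value=mem_vals)
            tokens = tokens + out                      # residual refinement
        return tokens


class Sam2LikeModel(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in backbone
        self.memory_attn = MemoryAttention(dim)
        self.mask_decoder = nn.Conv2d(dim, 1, kernel_size=1)               # stand-in decoder

    def forward(self, frame, memory):
        feat = self.image_encoder(frame)               # (B, C, H/16, W/16)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)       # (B, HW, C) token sequence
        if memory:                                     # attend to prior frames, if any
            keys = torch.cat([m["k"] for m in memory], dim=1)
            vals = torch.cat([m["v"] for m in memory], dim=1)
            tokens = self.memory_attn(tokens, keys, vals)
        refined = tokens.transpose(1, 2).reshape(b, c, h, w)
        mask_logits = self.mask_decoder(refined)       # per-frame mask prediction
        memory.append({"k": tokens.detach(), "v": tokens.detach()})  # grow the memory bank
        return mask_logits, memory


# Toy streaming usage over a 4-frame "video".
model = Sam2LikeModel()
memory: list = []
for frame in torch.randn(4, 1, 3, 256, 256):
    mask_logits, memory = model(frame, memory)
```

Frame-level and full-video propagation then amount to running this forward pass over a single prompted frame or over the whole stream.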

2. Temporal Memory Mechanism

The core innovation in SAM-2 is recurrent memory attention, enabling masks to be propagated and refined over long video sequences while maintaining efficiency.

  • Memory Bank ($M_t$) Operation:
    • Stores encoded key–value pairs from recent and initial reference frames.
    • For the current frame $t$, the embedding $E_t$ cross-attends to previous memory entries:

      $\mathrm{Attn}(Q_t, K_{1:t-1}, V_{1:t-1}) = \mathrm{softmax}\big(Q_t K_{1:t-1}^{\top} / \sqrt{d}\big)\, V_{1:t-1}$

    • Upon mask decoding at frame $t$, the new embedding and its associated mask token are appended to $M_t$ (a code sketch of this mechanism follows at the end of this list).

  • Sliding Window:
    • The memory bank operates as a sliding temporal window, supporting both causal (forward-only) and acausal (forward/backward, offline) propagation modes.
    • In offline annotation or refinement, this facilitates efficient correction and propagation along the entire video sequence.
  • Propagation and Correction:
    • If new user prompts are introduced at a given frame, the system re-infers all subsequent (and possibly prior) frames, correcting masks as needed via memory attention.
    • The memory mechanism enables annotation with minimal manual input, relying on prompt-based corrections only when transformer memory fails.
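
A compact sketch of the memory cross-attention and the sliding-window memory bank described above, written in NumPy for clarity; the class name `MemoryBank`, the window size, and the toy token shapes are all assumptions.

```python
# Sketch of the memory cross-attention and sliding-window memory bank
# (illustrative names, window size, and token shapes; NumPy for clarity).
from collections import deque
import numpy as np


class MemoryBank:
    def __init__(self, window: int = 6):
        # Keep only the most recent `window` frames' (key, value) pairs.
        self.entries = deque(maxlen=window)

    def append(self, keys: np.ndarray, values: np.ndarray) -> None:
        self.entries.append((keys, values))

    def concat(self):
        keys = np.concatenate([k for k, _ in self.entries], axis=0)
        values = np.concatenate([v for _, v in self.entries], axis=0)
        return keys, values


def memory_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V, as in the equation above."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (n_query, n_memory)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                               # (n_query, d)


# Per-frame use: attend the current embedding E_t to the bank, then push new keys/values.
bank = MemoryBank(window=6)
rng = np.random.default_rng(0)
for t in range(10):
    E_t = rng.standard_normal((64, 32))              # 64 tokens of dim 32 (toy sizes)
    if len(bank.entries) > 0:
        K, V = bank.concat()
        E_t = E_t + memory_attention(E_t, K, V)      # memory-refined embedding
    bank.append(E_t, E_t)                            # keys/values derived from the output
```

The `deque(maxlen=...)` captures the sliding-window behavior: older entries are evicted automatically, while a fuller implementation could additionally pin the initial reference frame as a persistent memory entry.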

3. Dataset Construction and Supervision

SAM-2's capabilities are enabled by a large-scale, interactive data engine and comprehensive multi-phase dataset curation strategy.

  • Video Data (SA-V Dataset):
    • 50.9K videos (average 14 s), comprising 642.6K “masklets” (spatio-temporal object masks).
    • 451.7K masklets auto-generated, 190.9K manually screened.
    • Various data augmentation procedures yielded an additional 69.6K masklets and 62.9K videos.
  • Annotation Protocol:
  1. SAM-only initial phase: Per-frame auto generation, manual refinement; low efficiency (37.8 s/frame).
  2. SAM1+SAM2 phase: SAM1 for key frames, SAM2 propagation, annotator click corrections, retraining; moderate efficiency (7.4 s/frame).
  3. SAM2-only phase: Minimal human input with heavy reliance on temporal memory and prompt propagation; highest efficiency (4.5 s/frame). A schematic of this interactive loop is sketched after this list.
  • Supervision:
    • The only explicitly stated loss is the standard mask prediction loss from SAM-1 (no new temporal consistency or smoothness loss is specified).
    • Temporal consistency is learned implicitly due to memory-based frame propagation during training.
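
The following is a schematic of the phase-3 interaction loop referenced above, assuming a `propagate` callable standing in for SAM-2's memory-based propagation and a `get_correction_clicks` callback standing in for the annotator; both names and the review stride are hypothetical, and no real SAM-2 API is implied.

```python
# Schematic of a phase-3-style, human-in-the-loop annotation loop. The
# `propagate` and `get_correction_clicks` callables are hypothetical stand-ins
# for the SAM-2 propagation engine and the annotator UI.
from typing import Callable, Dict, List, Optional


def annotate_video(
    frames: List,                                    # decoded video frames
    initial_prompt: Dict,                            # e.g. clicks on the first frame
    propagate: Callable[[List, int, Dict], List],    # memory-based mask propagation
    get_correction_clicks: Callable[[int, object], Optional[Dict]],
    review_stride: int = 10,                         # annotator spot-checks every Nth frame
) -> List:
    # Prompt the first frame, then let memory attention carry the masklet forward.
    masks = propagate(frames, 0, initial_prompt)

    # Spot-check intermediate frames; re-propagate only where the memory fails.
    for t in range(0, len(frames), review_stride):
        clicks = get_correction_clicks(t, masks[t])
        if clicks is not None:                       # annotator disagreed with the mask
            masks = propagate(frames, t, clicks)     # re-infer subsequent (and prior) frames
    return masks
```

In an acausal, offline setting, `propagate` could also re-run backward from the corrected frame, matching the bidirectional propagation mode described in Section 2.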

4. Performance, Throughput, and Resource Characteristics

Performance analysis in (Geetha et al., 12 Aug 2024) is restricted to qualitative descriptors; no quantitative benchmarks are reported.

  • Efficiency:
    • SAM-2 is reported to run at “near-real-time performance.”
    • Substantial annotation speed-up is achieved in data engine phases, reflecting the impact of memory-propagation and improved architecture.
  • Resource Requirements:
    • The Hiera backbone and efficient FPN/skip-connections balance the extra cost of the temporal module.
    • No explicit hardware benchmarks, FPS curves, or comparative tables are provided, but the qualitative claims support deployment in real-time or interactive settings.
  • Deployment Considerations:
    • Achieves practical throughput by limiting the memory bank’s window and by efficiently reusing multi-scale features.
    • Designed for interactive correction and annotation with minimal latency overhead compared to stateless (per-frame) segmentation.

5. Limitations and Failure Cases

Several bottlenecks and failure modes are identified in the documentation (Geetha et al., 12 Aug 2024).

  • Reported failure modes include:
    • Prolonged object occlusion.
    • Heavy scene clutter.
    • Rapid shot transitions.
    • Distinguishing visually similar objects in close proximity.
    • Multi-object coherence (objects are tracked independently, not contextually).
    • A continued need for manual verification of intermediate frames, especially in challenging sequences.
  • No architectural mechanisms for:
    • Explicit multi-object attention.
    • Explicit temporal-smoothness or inter-frame mask regularization.

A plausible implication is that advanced applications requiring inter-object contextual reasoning, robust re-detection after occlusion, or fine-grained sequence regularity will need supplementary modules beyond those offered in SAM-2 as described.

6. Future Directions

The study indicates several avenues for improvement (Geetha et al., 12 Aug 2024):

  • Motion Modeling:
    • Incorporation of explicit motion-compensation or re-identification modules to better handle occlusions and rapid disappearance/re-entry of objects.
  • Contextual Attention:
    • Designing inter-object attention or context modules for coherent segmentation of multiple objects in close spatial or temporal proximity.
  • Automation and Annotation Reduction:
    • Automated intermediate-frame annotation within the data engine to further reduce annotator burden and improve masklet consistency.
  • Temporal Regularization:
    • Exploration of direct temporal-consistency losses or regularizers in the training objective to stabilize propagation and reduce drift.
  • Scalability and Robustness:
    • Further optimizations in backbone selection, memory management, and mask-decoder scaling to support even higher-resolution video and more complex scenes.

In summary, SAM-2 extends image-only, prompt-based segmentation to interactive, memory-augmented video object segmentation by integrating a Hiera+FPN image encoder, a sliding-window memory attention module, and an efficient multi-source mask decoder. This design supports propagation of user-guided masks over extended video spans with minimal intervention, setting a foundation for further research at the intersection of promptable foundation models, real-time video segmentation, and scalable human-in-the-loop annotation (Geetha et al., 12 Aug 2024).
