SAM-2: Memory-Augmented Video Segmentation
- SAM-2 is a promptable, memory-augmented video segmentation model that utilizes a novel memory encoder and Hiera+FPN backbone for consistent mask propagation.
- The model integrates a sliding window memory mechanism supporting both causal and acausal propagation, enabling efficient, interactive video annotation with minimal human input.
- SAM-2 reportedly runs at near-real-time speed, delivers significant gains in annotation efficiency, and motivates further work in multi-object tracking and context-aware segmentation.
The Segment Anything Model 2 (SAM-2) is a promptable visual segmentation architecture from Meta that generalizes the original Segment Anything Model (SAM) from image to real-time, interactive video segmentation. SAM-2 introduces a memory-augmented, streaming architecture that supports consistent mask propagation across frames, exploits a novel backbone and feature-pyramid design, and underpins large-scale, interactive data engines for segmentation annotation. This marks a shift in segmentation foundation models from stateless per-frame masking to temporally coherent video object segmentation, with implications for annotation efficiency, real-time deployment, and future work in multi-object tracking, motion modeling, and context-aware segmentation (Geetha et al., 12 Aug 2024).
1. Architectural Advances and Model Pipeline
SAM-2 maintains the promptable foundation of SAM—handling points, boxes, masks, or text prompts—but substantially revises the pipeline to support video and temporal consistency. The architecture consists of an image encoder, a new memory encoder and memory bank, a prompt encoder, and a multi-source mask decoder.
- Image Encoder (Backbone):
- SAM-1 uses a pre-trained Vision Transformer (ViT).
- SAM-2 replaces the ViT with Hiera, a hierarchical vision transformer pretrained with masked autoencoding, paired with a feature pyramid network (FPN).
- The FPN merges stride-16 and stride-32 features for coarse mask generation and injects stride-4 and stride-8 features through skip connections for higher-resolution mask detail.
- Prompt Encoder:
- Accepts both sparse (points, boxes) and dense (masks) prompts.
- Integrates positional encoding for sparse prompts and a convolutional mechanism for dense masks, enhancing their injection into the temporal attention pipeline.
- Memory Encoder/Bank:
- Maintains a sliding window of key–value pairs representing prior frame embeddings and predicted masks.
- The current frame's feature embedding f_t attends to memory-bank entries via a four-layer transformer with both rotary and sinusoidal positional embeddings.
- After each decoding, new keys and values derived from current features and outputs augment the memory bank.
- Mask Decoder:
- Receives: (1) memory-refined embeddings, (2) skip-connected high-resolution encoder features, and (3) prompt encodings.
- Produces a segmentation mask for the given frame, capable of leveraging both temporally and spatially proximate information for accuracy.
The model supports frame-level and full-video propagation modes, enforcing temporal mask consistency through memory attention and facilitating correction via user clicks at sparse intervals.
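The per-frame control flow described above can be condensed into a schematic sketch. Everything here is an assumption for exposition rather than the released SAM-2 API: the class name ToyVideoSegmenter, the single-convolution stand-ins for the Hiera+FPN backbone, prompt encoder, memory encoder, and mask decoder, and the coarse 16×16 token grid; positional embeddings, the high-resolution skip connections, and the four-layer memory transformer are deliberately omitted.

```python
from collections import deque

import torch
import torch.nn as nn


class ToyVideoSegmenter(nn.Module):
    """Schematic per-frame loop: encode -> memory-condition -> decode -> memory-update."""

    def __init__(self, dim=256, window=7):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # stand-in for Hiera+FPN
        self.prompt_proj = nn.Linear(2, dim)                           # stand-in prompt encoder (xy points -> tokens)
        self.memory_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)              # stand-in mask decoder
        self.memory_enc = nn.Conv2d(dim + 1, dim, kernel_size=1)       # fuse features + predicted mask
        self.memory = deque(maxlen=window)                             # sliding-window memory bank

    def forward(self, frame, points=None):
        feat = self.backbone(frame)                                    # (B, C, H/16, W/16)
        B, C, H, W = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)                       # (B, H*W, C)
        if self.memory:                                                # cross-attend to stored frame memories
            bank = torch.cat(list(self.memory), dim=1)                 # (B, T*H*W, C)
            tokens, _ = self.memory_attn(tokens, bank, bank)
        if points is not None:                                         # sparse prompts join the token sequence
            tokens = torch.cat([tokens, self.prompt_proj(points)], dim=1)
        feat = tokens[:, : H * W].transpose(1, 2).reshape(B, C, H, W)
        mask_logits = self.mask_head(feat)                             # coarse mask (high-res skip paths omitted)
        entry = self.memory_enc(torch.cat([feat, mask_logits.sigmoid()], dim=1))
        self.memory.append(entry.flatten(2).transpose(1, 2))           # update memory with frame + mask
        return mask_logits


# Usage: prompt once on the first frame, then let memory carry the mask forward.
model = ToyVideoSegmenter()
video = torch.randn(8, 3, 256, 256)                                    # 8 dummy frames
click = torch.tensor([[[0.4, 0.6]]])                                   # one normalized (x, y) point
masks = [model(video[t : t + 1], click if t == 0 else None) for t in range(8)]
```

The structural point the sketch preserves is that each frame's decoded mask is re-encoded into the sliding-window memory bank, which is what allows a single first-frame prompt to propagate across the clip.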
2. Temporal Memory Mechanism
The core innovation in SAM-2 is recurrent memory attention, enabling masks to be propagated and refined over long video sequences while maintaining efficiency.
- Memory Bank (M) Operation:
- Stores encoded key–value pairs from recent frames and the initial (prompted) reference frame.
- For the current frame t, the embedding f_t cross-attends to previous memory entries, f'_t = CrossAttn(Q = f_t, K = K_M, V = V_M), where K_M and V_M are the keys and values stored in M.
- Upon mask decoding at frame t, the new embedding and its associated mask token are appended to M.
- Sliding Window:
- The memory bank operates as a sliding temporal window, supporting both causal (forward-only) and acausal (forward/backward, offline) propagation modes.
- In offline annotation or refinement, this facilitates efficient correction and propagation along the entire video sequence.
- Propagation and Correction:
- If new user prompts are introduced at a given frame, the system re-infers all subsequent (and possibly prior) frames, correcting masks as needed via memory attention (see the scheduling sketch after this list).
- The memory mechanism enables annotation with minimal manual input, relying on prompt-based corrections only where memory-based propagation fails.
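The distinction between causal and acausal propagation, and the re-inference triggered by a new prompt, can be made concrete with a small scheduling sketch; the function names and exact visiting order below are illustrative assumptions, not SAM-2's documented behavior.

```python
def causal_schedule(num_frames):
    """Streaming (forward-only) order: each frame attends only to earlier memories."""
    return list(range(num_frames))


def acausal_schedule(num_frames, prompted_frame):
    """Offline refinement: re-propagate outward from a newly prompted frame,
    first forward to the end of the clip, then backward to its start."""
    forward = list(range(prompted_frame, num_frames))
    backward = list(range(prompted_frame - 1, -1, -1))
    return forward + backward


print(causal_schedule(6))                     # [0, 1, 2, 3, 4, 5]
print(acausal_schedule(6, prompted_frame=3))  # [3, 4, 5, 2, 1, 0]
```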
3. Dataset Construction and Supervision
SAM-2's capabilities are enabled by a large-scale, interactive data engine and comprehensive multi-phase dataset curation strategy.
- Video Data (SA-V Dataset):
- 50.9K videos (average length 14 s), comprising 642.6K “masklets” (spatio-temporal object masks spanning multiple frames).
- 451.7K masklets were auto-generated and 190.9K were manually annotated.
- An additional 69.6K masklets across 62.9K videos were contributed from supplementary (internally sourced) video data.
- Annotation Protocol:
- SAM-only initial phase: per-frame mask generation with SAM followed by manual refinement; low efficiency (37.8 s/frame).
- SAM1+SAM2 phase: SAM1 on key frames, SAM2 propagation, annotator click corrections, and retraining; moderate efficiency (7.4 s/frame).
- SAM2-only phase: minimal human input, relying heavily on temporal memory and prompt propagation; highest efficiency (4.5 s/frame). The implied speed-ups are worked out in the sketch after this list.
- Supervision:
- The only explicitly stated loss is the standard mask prediction loss from SAM-1 (no new temporal consistency or smoothness loss is specified).
- Temporal consistency is learned implicitly due to memory-based frame propagation during training.
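As a quick sanity check on the reported per-frame times, the implied speed-ups of the later data-engine phases over the SAM-only baseline are simple ratios:

```python
# Per-frame annotation times for the three data-engine phases (seconds), as
# reported above; the speed-up factors follow directly.
phase_times = {"SAM-only": 37.8, "SAM1+SAM2": 7.4, "SAM2-only": 4.5}

baseline = phase_times["SAM-only"]
for phase, seconds in phase_times.items():
    print(f"{phase}: {seconds:.1f} s/frame -> {baseline / seconds:.1f}x vs SAM-only")
# SAM-only: 37.8 s/frame -> 1.0x vs SAM-only
# SAM1+SAM2: 7.4 s/frame -> 5.1x vs SAM-only
# SAM2-only: 4.5 s/frame -> 8.4x vs SAM-only
```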
4. Performance, Throughput, and Resource Characteristics
Performance analysis in (Geetha et al., 12 Aug 2024) is limited to qualitative descriptors; no quantitative benchmarks are reported.
- Efficiency:
- SAM-2 is reported to run at “near-real-time” performance.
- Substantial annotation speed-up is achieved in data engine phases, reflecting the impact of memory-propagation and improved architecture.
- Resource Requirements:
- The Hiera backbone and the efficient FPN/skip-connection design offset the extra cost of the temporal memory module.
- No explicit hardware benchmarks, FPS curves, or comparative tables are provided, but the qualitative claims support real-time or interactive deployment.
- Deployment Considerations:
- Achieves practical throughput by bounding the memory bank’s sliding window and by reusing multi-scale encoder features (an illustrative back-of-envelope estimate follows this list).
- Designed for interactive correction and annotation with minimal latency overhead compared to stateless (per-frame) segmentation.
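To make the window-size/throughput trade-off tangible, a back-of-envelope estimate of memory-bank size follows; the input resolution, feature stride, channel width, and fp16 storage are assumed values chosen for illustration, not SAM-2's published configuration.

```python
# Illustrative memory-bank sizing. All constants (1024x1024 input, stride-16
# features, 64 channels, fp16 storage) are assumptions for the estimate.
def memory_tokens(frames, height=1024, width=1024, stride=16):
    return frames * (height // stride) * (width // stride)


def memory_bytes(frames, channels=64, bytes_per_value=2, **kwargs):
    return memory_tokens(frames, **kwargs) * channels * bytes_per_value


for window in (4, 8, 16):
    print(f"window={window:2d}: {memory_tokens(window):6d} tokens, "
          f"~{memory_bytes(window) / 2**20:.0f} MiB")
# window= 4:  16384 tokens, ~2 MiB
# window= 8:  32768 tokens, ~4 MiB
# window=16:  65536 tokens, ~8 MiB
```

Because cross-attention cost per frame grows linearly with the number of stored memory tokens, capping the window keeps per-frame latency roughly constant regardless of video length.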
5. Limitations and Failure Cases
Several bottlenecks and failure modes are identified in the documentation (Geetha et al., 12 Aug 2024).
- The model struggles with:
- Prolonged object occlusion.
- Heavy scene clutter.
- Rapid shot transitions.
- Distinguishing visually similar objects in close proximity.
- Multi-object coherence (objects are tracked independently, not contextually).
- Necessity for manual verification of intermediate frames, especially in challenging sequences.
- No architectural mechanisms for:
- Explicit multi-object attention.
- Explicit temporal-smoothness or inter-frame mask regularization.
A plausible implication is that advanced applications requiring inter-object contextual reasoning, robust re-detection after occlusion, or fine-grained sequence regularity will need supplementary modules beyond those offered in SAM-2 as described.
6. Future Directions
The paper indicates several avenues for improvement (Geetha et al., 12 Aug 2024):
- Motion Modeling:
- Incorporation of explicit motion-compensation or re-identification modules to better handle occlusions and rapid disappearance/re-entry of objects.
- Contextual Attention:
- Designing inter-object attention or context modules for coherent segmentation of multiple objects in close spatial or temporal proximity.
- Automation and Annotation Reduction:
- Automated intermediate-frame annotation within the data engine to further reduce annotator burden and improve masklet consistency.
- Temporal Regularization:
- Exploration of direct temporal-consistency losses or regularizers in the training objective to stabilize propagation and reduce drift (a minimal illustrative formulation is sketched after this list).
- Scalability and Robustness:
- Further optimizations in backbone selection, memory management, and mask-decoder scaling to support even higher-resolution video and more complex scenes.
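As one purely illustrative instance of the temporal-regularization direction, a training objective could add a penalty on frame-to-frame changes in predicted mask probabilities; the sketch below illustrates the idea and is not a loss proposed in the SAM-2 paper.

```python
import torch
import torch.nn.functional as F


def temporal_consistency_loss(mask_logits_t, mask_logits_prev):
    """Penalize abrupt changes between consecutive predicted masks.
    Illustrative only: SAM-2 as described trains with the standard mask loss
    and learns temporal coherence implicitly through memory propagation."""
    return F.l1_loss(torch.sigmoid(mask_logits_t), torch.sigmoid(mask_logits_prev))


# total_loss = mask_loss + lambda_temporal * temporal_consistency_loss(logits_t, logits_prev)
```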
In summary, SAM-2 extends image-only, prompt-based segmentation to interactive, memory-augmented video object segmentation by integrating a Hiera+FPN image encoder, a sliding-window memory attention module, and an efficient multi-source mask decoder. This design supports propagation of user-guided masks over extended video spans with minimal intervention, setting a foundation for further research at the intersection of promptable foundation models, real-time video segmentation, and scalable human-in-the-loop annotation (Geetha et al., 12 Aug 2024).