
Segment Anything Model 3 (SAM3)

Updated 21 November 2025
  • SAM3 is a unified model for promptable concept segmentation across static images and video sequences, integrating detection and tracking components.
  • It combines a high-capacity vision backbone, DETR-style promptable detection, and dense-memory video tracking to process multi-modal prompts efficiently.
  • The Progressive Hierarchical Distillation framework enables lightweight student models that achieve real-time on-device performance with minimal accuracy loss.

The Segment Anything Model 3 (SAM3) introduces a unified architecture for Promptable Concept Segmentation (PCS) across both static images and video sequences, extending the "Segment Anything" paradigm from generic object masks to open-vocabulary, prompt-driven segmentation via natural language noun phrases or image exemplars. SAM3 incorporates a shared vision backbone, a DETR-style promptable detector, and a dense-memory video tracker, achieving strong segmentation and tracking performance, but with a computational footprint unsuited to on-device deployment. EfficientSAM3 leverages Progressive Hierarchical Distillation (PHD), transferring SAM3's capabilities into a spectrum of lightweight student models without severe accuracy degradation, thus enabling real-time, hardware-constrained deployment (Zeng et al., 19 Nov 2025).

1. Unified Architecture for Promptable Concept Segmentation

SAM3's architecture integrates three tightly coupled components optimized for PCS:

  • Shared Vision Backbone: A high-capacity Vision Transformer (ViT-H) extracts spatial features from the input image $I_t$:

$$F_t = E_{\mathrm{vision}}(I_t) \in \mathbb{R}^{C \times H \times W}$$

  • DETR-Style Concept Detector: The concept prompt $P_c$ (tokenized text or an exemplar embedding) is converted to continuous queries. A set of $N$ object queries $Q \in \mathbb{R}^{N \times d}$ jointly attends to the vision features $F_t$ and the prompt $P_c$ using multi-head cross-attention. Each query yields (1) a presence probability $p_i$, (2) a bounding box $b_i \in \mathbb{R}^4$, and (3) a per-pixel mask $M_i \in [0,1]^{H \times W}$. Training combines a standard Hungarian-matched detection loss with mask-specific Dice and Focal terms.

  • Dense-Memory Video Tracker: SAM3 tracks each object by maintaining a memory bank of appearance features and masks. At each frame, the memory is updated and the tracker module reads it with spatiotemporal attention to predict the object's mask.
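The Dice and Focal mask terms named for the concept detector can be sketched as follows. This is a minimal illustration in plain Python, not SAM3's implementation: the smoothing constant `eps`, the focal parameters `alpha` and `gamma` (set here to commonly used defaults), and the weights `w_dice` and `w_focal` are all assumptions, and masks are flattened to lists for brevity.

```python
import math

def dice_loss(probs, targets, eps=1.0):
    """Soft Dice loss between predicted probabilities and a binary target (flat lists)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def focal_loss(probs, targets, alpha=0.25, gamma=2.0):
    """Mean binary focal loss; alpha and gamma are assumed defaults, not SAM3's values."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, 1e-7), 1.0 - 1e-7)        # clamp to keep log() finite
        p_t = p if t == 1 else 1.0 - p           # probability assigned to the true class
        a_t = alpha if t == 1 else 1.0 - alpha   # class-balancing weight
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)

def mask_loss(probs, targets, w_dice=1.0, w_focal=1.0):
    """Weighted sum of the two mask terms (weights are hypothetical)."""
    return w_dice * dice_loss(probs, targets) + w_focal * focal_loss(probs, targets)

target = [1, 0, 1, 0]
good = mask_loss([1.0, 0.0, 1.0, 0.0], target)
bad = mask_loss([0.2, 0.8, 0.3, 0.9], target)
```

Dice rewards overlap at the level of the whole mask, while the focal term down-weights pixels that are already classified confidently, which is why the two are often paired for segmentation.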

Both prompt cross-attention and dense memory attention grow in cost with spatial resolution and sequence length, making inference expensive for long sequences and large spatial inputs.
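To make the cost concern concrete, the sketch below counts query-key pairs for a dense memory read. It assumes that every token of the current frame attends to every token of every stored frame; that counting model, the function name, and the grid sizes are illustrative assumptions rather than figures from the paper.

```python
def dense_memory_pairs(h, w, frames):
    """Query-key pairs when each of h*w current-frame tokens attends to all tokens of `frames` stored frames."""
    tokens = h * w
    return tokens * (frames * tokens)

base = dense_memory_pairs(64, 64, 7)           # 64x64 feature grid, 7 stored frames
double_res = dense_memory_pairs(128, 128, 7)   # doubling each spatial side
double_mem = dense_memory_pairs(64, 64, 14)    # doubling the number of stored frames
print(double_res // base, double_mem // base)
```

Under this assumption, doubling the side length of the feature grid multiplies the pair count by 16, while doubling the number of stored frames multiplies it by 2.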

2. Progressive Hierarchical Distillation (PHD)

EfficientSAM3 deploys a three-stage PHD curriculum to distill SAM3's representation and operational fidelity into compact student models:

  • Stage 1 — Encoder Distillation: The ViT-H backbone is distilled into a compact student encoder (RepViT, TinyViT, or EfficientViT), supervised "prompt-in-the-loop" on the SA-1B dataset. The distillation objectives align intermediate feature spaces and directly match output masks, and they are combined with the original task loss.

  • Stage 2 — Temporal Memory Distillation: The resource-intensive dense-memory module is replaced by a Perceiver-based compact memory bank built from a small set of learnable latents. Using cross-attention, the latents compress memory features for efficient tracking.

Supervisory losses match predicted masks and latent readouts between teacher and student across frames.

  • Stage 3 — End-to-End Fine-Tuning: The entire student model is jointly fine-tuned with multi-modal prompts on the official PCS dataset (SA-Co), using both images and video clips. The loss includes concept-aware BCE, box/mask losses, memory supervision, and prompt-conditioned knowledge distillation.
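The Perceiver-style memory read described in Stage 2 can be illustrated with a small cross-attention sketch in plain Python. It is a simplification under stated assumptions: a single attention head, no learned query/key/value projections, and toy dimensions. The helper names (`cross_attend`, `compress_memory`) are hypothetical and do not come from EfficientSAM3.

```python
import math

def softmax(xs):
    m = max(xs)                                  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention over lists of vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def compress_memory(latents, memory_tokens):
    """Summarize N memory tokens into len(latents) vectors; latents act as queries."""
    return cross_attend(latents, memory_tokens, memory_tokens)

latents = [[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]]                      # 2 latents (learned in practice)
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
          [1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]      # 6 memory tokens
summary = compress_memory(latents, memory)                        # 2 vectors of size 3
```

The tracker then attends over the fixed-size `summary` instead of the full memory, so the read cost no longer grows with the number of stored tokens.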

A key feature of PHD is "prompt-in-the-loop" distillation at every stage, ensuring transfer of both visual and prompt-conditioned behaviors.
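A prompt-in-the-loop distillation objective of the kind described for Stage 1 might be assembled as below. The exact loss forms are not given in this text, so the mean squared error on features, the binary cross-entropy on masks, and the weights `w_feat`, `w_mask`, and `w_task` are stand-in assumptions. The sketch also assumes that student features have already been projected to the teacher's dimensionality and that teacher and student masks were produced under the same prompt.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be soft (teacher) or hard (ground truth)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)          # clamp to keep log() finite
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)

def stage1_loss(student_feat, teacher_feat, student_mask, teacher_mask, gt_mask,
                w_feat=1.0, w_mask=1.0, w_task=1.0):
    """Feature alignment + teacher-mask matching + original task loss (weights hypothetical)."""
    feat_term = mse(student_feat, teacher_feat)
    mask_term = bce(student_mask, teacher_mask)  # match the teacher's prompted output
    task_term = bce(student_mask, gt_mask)       # supervise against ground truth
    return w_feat * feat_term + w_mask * mask_term + w_task * task_term

aligned = stage1_loss([0.5, -0.5], [0.5, -0.5], [1.0, 0.0], [1.0, 0.0], [1, 0])
drifted = stage1_loss([0.9, 0.1], [0.5, -0.5], [0.6, 0.4], [1.0, 0.0], [1, 0])
```

Because the teacher's mask depends on the prompt, matching it transfers prompt-conditioned behavior in addition to the purely visual features captured by the feature term.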

3. Student Model Variants and Efficiency–Accuracy Spectrum

EfficientSAM3-PHD defines nine student variants by pairing three backbone architectures with three size/parameter regimes. The following table summarizes variant families, their sizes, and on-device runtime characteristics:

| Family | Model | Params (M) | Jetson NX FPS |
|---|---|---|---|
| RepViT | ES-RV-S | 5.1 | — |
| RepViT | ES-RV-M | 6.8 | — |
| RepViT | ES-RV-L | 8.2 | 18 |
| TinyViT | ES-TV-S | 5.4 | — |
| TinyViT | ES-TV-M | 11 | 25 |
| TinyViT | ES-TV-L | 21 | — |
| EfficientViT | ES-EV-S | 0.7 | 60 |
| EfficientViT | ES-EV-M | 4.8 | — |
| EfficientViT | ES-EV-L | 15 | 30 |

Note: FPS is listed only for selected variants, and the listed values are illustrative highlights; "—" marks variants without a reported value.

  • RepViT employs depthwise convolutions and structural re-parameterization targeting fast mobile NPU execution.
  • TinyViT is a lightweight vision transformer with distilled attention.
  • EfficientViT replaces standard self-attention with multi-scale, linearized attention for efficient high-resolution processing.

On-device speed varies widely across the variants. ES-EV-S reaches ~60 FPS with a ~10% drop in boundary accuracy relative to the teacher; ES-RV-L stays within 2–3% of the teacher in J&F at real-time rates (~18–20 FPS). The intermediate operating points (ES-TV-M at 25 FPS, ES-EV-L at 30 FPS) offer a balance, with a drop of under 5% (Zeng et al., 19 Nov 2025).

4. Empirical Evaluation and Benchmarks

Proposed evaluation covers standard video object segmentation (VOS) benchmarks: DAVIS17, YouTube-VOS 2019, MOSE, and SA-V. Performance is measured via:

  • Region similarity ($\mathcal{J}$, IoU)
  • Boundary F-score ($\mathcal{F}$)
  • Combined score ($\mathcal{J}\&\mathcal{F}$)
  • Concept-grounded F1 for open-vocabulary prompts
  • Inference speed on edge hardware (Jetson NX, A-series iPhone)
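As a reference point for the first and third metrics, the sketch below computes region similarity as intersection-over-union on binary masks and the combined score as the arithmetic mean of the region and boundary scores, following the usual DAVIS convention. The boundary F-score is taken as a given input because its boundary-matching step is beyond this sketch; returning 1.0 when both masks are empty is an assumed convention.

```python
def region_similarity(pred, gt):
    """IoU of two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return 1.0 if union == 0 else inter / union   # both masks empty: treated as a match

def combined_score(j, f):
    """Arithmetic mean of region similarity and boundary F-score."""
    return (j + f) / 2.0

pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
j = region_similarity(pred, gt)          # 1 overlapping pixel out of 2 in the union
jf = combined_score(j, 0.7)              # 0.7 is a made-up boundary score
```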

Illustrative performance–efficiency trade-offs demonstrate that PHD students maintain strong performance, losing only a few percentage points in $\mathcal{J}\&\mathcal{F}$ while gaining 3–12× throughput. For example:

| Model | DAVIS17 $\mathcal{J}\&\mathcal{F}$ | YTVOS $\mathcal{J}\&\mathcal{F}$ | FPS (Jetson NX) |
|---|---|---|---|
| SAM3 | 90.1 | 88.5 | 5 |
| ES-RV-L | 87.6 | 85.3 | 18 |
| ES-TV-M | 85.2 | 83.0 | 25 |
| ES-EV-L | 86.3 | 84.1 | 30 |
| ES-EV-S | 78.9 | 75.4 | 60 |

These results establish PHD's ability to preserve most of SAM3’s open-vocabulary segmentation and tracking strength while enabling real-time edge inference (Zeng et al., 19 Nov 2025).
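The trade-off can be read directly off the benchmark table. The short script below copies the table's illustrative figures and computes, for each student, the throughput gain and the score drop relative to SAM3; the function and field names are hypothetical.

```python
# (model, DAVIS17 score, YouTube-VOS score, Jetson NX FPS), copied from the benchmark table
RESULTS = [
    ("SAM3",    90.1, 88.5, 5),
    ("ES-RV-L", 87.6, 85.3, 18),
    ("ES-TV-M", 85.2, 83.0, 25),
    ("ES-EV-L", 86.3, 84.1, 30),
    ("ES-EV-S", 78.9, 75.4, 60),
]

def tradeoffs(rows, teacher="SAM3"):
    """Speedup and absolute score drop of each student relative to the teacher row."""
    base = next(r for r in rows if r[0] == teacher)
    out = {}
    for name, davis, ytvos, fps in rows:
        if name == teacher:
            continue
        out[name] = {
            "speedup": fps / base[3],
            "davis_drop": round(base[1] - davis, 1),
            "ytvos_drop": round(base[2] - ytvos, 1),
        }
    return out

summary = tradeoffs(RESULTS)
for name, stats in summary.items():
    print(name, stats)
```

On these figures the speedups span 3.6× (ES-RV-L) to 12× (ES-EV-S), matching the 3–12× range quoted in the text, with DAVIS17 drops between 2.5 and 11.2 points.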

5. Insights, Limitations, and Prospective Directions

SAM3 unifies object detection, segmentation, and robust long-term tracking, all governed by open-vocabulary, multi-modal prompts within a single computational graph. PHD addresses the two key sources of computational expense—heavy vision backbones and dense memory—by methodically distilling promptable concept behavior and temporal memory into smaller, hardware-adapted designs.

However, efficiency comes at a cost. The smallest EfficientSAM3 variants underperform on fine boundary localization and rare concept classes. Perceiver-based memory modules can lose spatial detail over extended sequences, potentially affecting long-horizon video coherence. The current approach does not leverage quantization or pruning strategies, which could further reduce memory and computation.

Planned next steps include:

6. Context and Significance in Segmentation Research

SAM3 establishes a new standard in promptable open-vocabulary segmentation and tracking by transferring segmentation from static categories to real-time, concept-driven interaction across modalities and time. The PHD framework systematically bridges the gap between foundation models optimized for server-scale hardware and practical deployment in latency- and memory-constrained environments. This suggests a pathway for future segmentation research: integrating multi-modal promptability, temporal coherence, and on-device tractability in a unified, scalable architecture.
