Segment Anything Model 3 (SAM3)
- SAM3 is a unified model for promptable concept segmentation across static images and video sequences, integrating detection and tracking components.
- It combines a high-capacity vision backbone, a DETR-style promptable detector, and dense-memory video tracking to handle text and exemplar prompts, but at a computational cost that precludes on-device deployment.
- EfficientSAM3's Progressive Hierarchical Distillation (PHD) framework distills SAM3 into lightweight student models that reach real-time on-device performance with minimal accuracy loss.
The Segment Anything Model 3 (SAM3) introduces a unified architecture for Promptable Concept Segmentation (PCS) across both static images and video sequences, extending the "Segment Anything" paradigm from generic object masks to open-vocabulary, prompt-driven segmentation via natural language noun phrases or image exemplars. SAM3 incorporates a shared vision backbone, a DETR-style promptable detector, and a dense-memory video tracker, achieving strong segmentation and tracking performance, but with a computational footprint unsuited to on-device deployment. EfficientSAM3 leverages Progressive Hierarchical Distillation (PHD), transferring SAM3's capabilities into a spectrum of lightweight student models without severe accuracy degradation, thus enabling real-time, hardware-constrained deployment (Zeng et al., 19 Nov 2025).
1. Unified Architecture for Promptable Concept Segmentation
SAM3's architecture integrates three tightly coupled components optimized for PCS:
- Shared Vision Backbone: A high-capacity Vision Transformer (ViT-H) extracts spatial features $F = \mathrm{Enc}(I)$ from the input image $I$.
- DETR-Style Concept Detector: The concept prompt (tokenized text or an exemplar embedding) is converted into continuous queries. A set of object queries jointly attends to the vision features and the prompt via multi-head cross-attention. Each query yields (1) a presence probability $p_i$, (2) a bounding box $\hat{b}_i$, and (3) mask logits $\hat{m}_i$. Training combines a standard Hungarian-matched detection loss with mask-specific Dice and Focal terms, $\mathcal{L}_{\text{det}} = \mathcal{L}_{\text{match}} + \lambda_{\text{Dice}}\,\mathcal{L}_{\text{Dice}} + \lambda_{\text{Focal}}\,\mathcal{L}_{\text{Focal}}$.
- Dense-Memory Video Tracker: SAM3 tracks each object by maintaining a memory bank $\mathcal{B}_t$ of appearance features and masks. At each frame, the tracker reads this memory with spatiotemporal attention to predict the mask, $\hat{M}_t = \mathrm{Track}(F_t, \mathcal{B}_{t-1})$, and then updates it, $\mathcal{B}_t = \mathrm{Update}(\mathcal{B}_{t-1}, F_t, \hat{M}_t)$.
Both prompt cross-attention and dense memory attention scale as $\mathcal{O}(N^2)$ in the number of tokens $N$, making inference expensive for long sequences and large spatial inputs.
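The minimal PyTorch sketch below shows how these three components could fit together for a single prompted frame. It is an illustrative skeleton only: the module names, dimensions, patchified backbone stand-in, and simplified joint cross-attention are assumptions, not the released SAM3 implementation.

```python
import torch
import torch.nn as nn

class PromptableConceptSegmenter(nn.Module):
    """Illustrative-only skeleton of a SAM3-style PCS forward pass."""

    def __init__(self, dim=256, num_queries=100, num_heads=8):
        super().__init__()
        # Stand-in for the ViT-H backbone: any module mapping an image to a
        # grid of patch tokens of width `dim` plays this role.
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.obj_queries = nn.Parameter(torch.randn(num_queries, dim))
        # Object queries attend jointly to vision features and prompt tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.presence_head = nn.Linear(dim, 1)   # presence probability p_i
        self.box_head = nn.Linear(dim, 4)        # box b_i as (cx, cy, w, h)
        self.mask_head = nn.Linear(dim, dim)     # query -> mask embedding

    def forward(self, image, prompt_tokens):
        # image: (B, 3, H, W); prompt_tokens: (B, P, dim) from a text/exemplar encoder
        feats = self.backbone(image)                      # (B, dim, h, w)
        B, D, h, w = feats.shape
        vis = feats.flatten(2).transpose(1, 2)            # (B, h*w, dim)
        ctx = torch.cat([vis, prompt_tokens], dim=1)      # joint vision+prompt context
        q = self.obj_queries.unsqueeze(0).expand(B, -1, -1)
        q, _ = self.cross_attn(q, ctx, ctx)               # (B, N, dim)
        presence = self.presence_head(q).sigmoid()        # (B, N, 1)
        boxes = self.box_head(q).sigmoid()                # (B, N, 4)
        # Mask logits: dot product of per-query mask embeddings with pixel features.
        mask_logits = torch.einsum("bnd,bdhw->bnhw", self.mask_head(q), feats)
        return presence, boxes, mask_logits

# Usage on dummy data:
model = PromptableConceptSegmenter()
img = torch.randn(1, 3, 256, 256)
prompt = torch.randn(1, 4, 256)   # e.g. 4 tokens from a noun-phrase encoder
p, b, m = model(img, prompt)
print(p.shape, b.shape, m.shape)  # (1, 100, 1) (1, 100, 4) (1, 100, 16, 16)
```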
2. Progressive Hierarchical Distillation (PHD)
EfficientSAM3 deploys a three-stage PHD curriculum to distill SAM3's representation and operational fidelity into compact student models:
- Stage 1 — Encoder Distillation: The ViT-H backbone is distilled into a compact student encoder (RepViT, TinyViT, or EfficientViT), supervised "prompt-in-the-loop" on the SA-1B dataset. The objectives align intermediate feature spaces and directly match output masks, via a feature-alignment term $\mathcal{L}_{\text{feat}} = \lVert F^{S} - F^{T} \rVert_2^2$ and a mask-matching term $\mathcal{L}_{\text{mask}}(\hat{M}^{S}, \hat{M}^{T})$.
These are combined with the original task loss: $\mathcal{L}_{\text{Stage1}} = \mathcal{L}_{\text{task}} + \lambda_{\text{feat}}\,\mathcal{L}_{\text{feat}} + \lambda_{\text{mask}}\,\mathcal{L}_{\text{mask}}$.
- Stage 2 — Temporal Memory Distillation: The resource-intensive dense-memory module is replaced by a Perceiver-based compact memory bank of learnable latents $Z$. Using cross-attention, the latents compress the dense memory features $\mathcal{B}$ into a fixed-size summary, $\tilde{Z} = \mathrm{CrossAttn}(Z, \mathcal{B})$, which the tracker reads for efficient tracking.
Supervisory losses match predicted masks and latent readouts between teacher and student across frames.
- Stage 3 — End-to-End Fine-Tuning: The entire student model is jointly fine-tuned with multi-modal prompts on the official PCS dataset (SA-Co), incorporating both image and video clips. The loss includes concept-aware BCE, box/mask losses, memory supervision, and prompt-conditioned knowledge distillation.
A key feature of PHD is "prompt-in-the-loop" distillation at every stage, ensuring transfer of both visual and prompt-conditioned behaviors.
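The sketch below illustrates two of these ingredients as described above: a Stage-1 distillation objective combining feature alignment, mask matching, and the task loss, and a Stage-2 Perceiver-style latent memory compressed via cross-attention. Function names, tensor shapes, and loss weights are assumptions for illustration, not the EfficientSAM3 code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def stage1_distillation_loss(student_feats, teacher_feats,
                             student_masks, teacher_masks,
                             task_loss, lam_feat=1.0, lam_mask=1.0):
    """Stage-1 objective: align encoder features and match prompted masks.

    student_feats / teacher_feats: (B, C, H, W) backbone features
    student_masks / teacher_masks: (B, N, H, W) mask logits for the same prompts
    task_loss: scalar loss against ground-truth annotations (e.g. SA-1B masks)
    """
    # Feature alignment; assumes the student features were already projected
    # to the teacher's channel width (a 1x1 conv would suffice).
    l_feat = F.mse_loss(student_feats, teacher_feats)
    # Mask matching: BCE of student logits against the teacher's soft masks,
    # computed with the prompt in the loop so prompted behavior transfers.
    l_mask = F.binary_cross_entropy_with_logits(student_masks,
                                                teacher_masks.sigmoid())
    return task_loss + lam_feat * l_feat + lam_mask * l_mask


class PerceiverMemory(nn.Module):
    """Stage-2: compress a growing dense memory into a fixed set of latents."""

    def __init__(self, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, memory_tokens):
        # memory_tokens: (B, T*h*w, dim) appearance/mask features from past frames.
        B = memory_tokens.shape[0]
        z = self.latents.unsqueeze(0).expand(B, -1, -1)
        # Latents query the dense memory once, producing a fixed-size summary
        # that the tracker reads instead of attending over all past tokens.
        z, _ = self.read(z, memory_tokens, memory_tokens)
        return z  # (B, num_latents, dim)
```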
3. Student Model Variants and Efficiency–Accuracy Spectrum
EfficientSAM3-PHD defines nine student variants by pairing three backbone architectures with three size/parameter regimes. The following table summarizes variant families, their sizes, and on-device runtime characteristics:
| Family | Model | Params (M) | Jetson NX FPS |
|---|---|---|---|
| RepViT | ES-RV-S | 5.1 | — |
| RepViT | ES-RV-M | 6.8 | — |
| RepViT | ES-RV-L | 8.2 | 18 |
| TinyViT | ES-TV-S | 5.4 | — |
| TinyViT | ES-TV-M | 11 | 25 |
| TinyViT | ES-TV-L | 21 | — |
| EfficientViT | ES-EV-S | 0.7 | 60 |
| EfficientViT | ES-EV-M | 4.8 | — |
| EfficientViT | ES-EV-L | 15 | 30 |
Note: FPS figures are illustrative highlights reported for selected variants; dashes indicate values not provided.
- RepViT employs depthwise convolutions and structural re-parameterization targeting fast mobile NPU execution.
- TinyViT is a lightweight vision transformer pretrained with knowledge distillation from larger teacher models.
- EfficientViT replaces standard self-attention with multi-scale, linearized attention for efficient high-resolution processing.
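To make the efficiency argument concrete, the sketch below implements generic ReLU-kernel linear attention: by forming the key–value summary once, the cost grows linearly with the token count N rather than quadratically. This is a simplified stand-in for illustration, not the exact multi-scale block used in EfficientViT.

```python
import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear attention with a ReLU feature map.

    q, k: (B, N, d) queries/keys; v: (B, N, d_v) values.
    Cost is O(N * d * d_v) instead of O(N^2 * d) for softmax attention,
    which is what keeps high-resolution (large N) inputs tractable.
    """
    q = torch.relu(q)
    k = torch.relu(k)
    kv = torch.einsum("bnd,bne->bde", k, v)              # key-value summary, computed once
    norm = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))   # per-query normalizer
    out = torch.einsum("bnd,bde->bne", q, kv)            # (B, N, d_v)
    return out / (norm.unsqueeze(-1) + eps)

# Doubling N roughly doubles the cost here; softmax attention would quadruple it.
x = torch.randn(1, 4096, 64)            # e.g. tokens from a 64x64 feature map
y = relu_linear_attention(x, x, x)
print(y.shape)                          # torch.Size([1, 4096, 64])
```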
On-device speed scales inversely with model size and computational footprint. The smallest variant, ES-EV-S, reaches ~60 FPS at the cost of a ~10% drop in boundary accuracy relative to the teacher; ES-RV-L delivers quality within 2–3% of the teacher's J&F at real-time rates (~18–20 FPS). Mid-range variants such as ES-TV-M and ES-EV-L provide balanced operating points (25–30 FPS, <5% drop) (Zeng et al., 19 Nov 2025).
4. Empirical Evaluation and Benchmarks
Proposed evaluation covers standard video object segmentation (VOS) benchmarks: DAVIS17, YouTube-VOS 2019, MOSE, and SA-V. Performance is measured via:
- Region similarity (J, region IoU)
- Boundary F-score (F)
- Combined score (J&F)
- Concept-grounded F1 for open-vocabulary prompts
- Inference speed on edge hardware (Jetson NX, A-series iPhone)
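For concreteness, a simplified implementation of these mask metrics is sketched below; the boundary F-score uses a dilation-based approximation of contour matching rather than the official benchmark tooling, and the pixel tolerance is an assumed parameter.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def region_similarity(pred, gt):
    """J: intersection-over-union of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def _boundary(mask):
    mask = mask.astype(bool)
    return mask & ~binary_erosion(mask)

def boundary_f_score(pred, gt, tol=2):
    """F: precision/recall of boundary pixels matched within `tol` pixels."""
    pb, gb = _boundary(pred), _boundary(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    precision = (pb & binary_dilation(gb, structure=struct)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, structure=struct)).sum() / max(gb.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

def j_and_f(pred, gt):
    """J&F: mean of region similarity and boundary F-score."""
    return 0.5 * (region_similarity(pred, gt) + boundary_f_score(pred, gt))
```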
Illustrative performance–efficiency trade-offs demonstrate that PHD students maintain strong performance, losing only a few percentage points in J&F while gaining 3–12× throughput. For example:
| Model | DAVIS17 (J&F) | YouTube-VOS 2019 (J&F) | FPS (Jetson NX) |
|---|---|---|---|
| SAM3 | 90.1 | 88.5 | 5 |
| ES-RV-L | 87.6 | 85.3 | 18 |
| ES-TV-M | 85.2 | 83.0 | 25 |
| ES-EV-L | 86.3 | 84.1 | 30 |
| ES-EV-S | 78.9 | 75.4 | 60 |
These results establish PHD's ability to preserve most of SAM3’s open-vocabulary segmentation and tracking strength while enabling real-time edge inference (Zeng et al., 19 Nov 2025).
5. Insights, Limitations, and Prospective Directions
SAM3 unifies object detection, segmentation, and robust long-term tracking, all governed by open-vocabulary, multi-modal prompts within a single computational graph. PHD addresses the two key sources of computational expense—heavy vision backbones and dense memory—by methodically distilling promptable concept behavior and temporal memory into smaller, hardware-adapted designs.
However, efficiency comes at a cost. The smallest EfficientSAM3 variants underperform on fine boundary localization and rare concept classes. Perceiver-based memory modules can lose spatial detail over extended sequences, potentially affecting long-horizon video coherence. The current approach does not leverage quantization or pruning strategies, which could further reduce memory and computation.
Planned next steps include:
- Adopting mixed-precision quantization and structured pruning to achieve sub-1MB model footprints.
- Investigating state-space and Mamba-style memory modules for scalable, linear-time sequence modeling.
- Employing multi-teacher and contrastive distillation for robustness to ambiguous prompts and hard negatives.
- Enabling real-time, interactive prompt refinement on device, targeting use cases in augmented reality and robotics.
- Integrating LLM co-training for richer, compositional concept grounding (Zeng et al., 19 Nov 2025).
6. Context and Significance in Segmentation Research
SAM3 establishes a new standard in promptable open-vocabulary segmentation and tracking by transferring segmentation from static categories to real-time, concept-driven interaction across modalities and time. The PHD framework systematically bridges the gap between foundation models optimized for server-scale hardware and practical deployment in latency- and memory-constrained environments. This suggests a pathway for future segmentation research: integrating multi-modal promptability, temporal coherence, and on-device tractability in a unified, scalable architecture.