MobileSAMv2: Efficient Segmentation Framework
- MobileSAMv2 is a high-efficiency segmentation framework that unifies prompt-guided (SegAny) and fully automatic (SegEvery) segmentation using decoupled knowledge distillation and object-aware prompt sampling.
- It replaces dense grid search with a lightweight YOLOv8 detector, reducing mask decoding time from 6400 ms to 97 ms per image while maintaining fine-grained segmentation quality.
- By deploying a distilled lightweight encoder, MobileSAMv2 achieves a ~20× speedup with less than 3% IoU degradation, offering significant runtime improvements without sacrificing accuracy.
MobileSAMv2 is a segmentation framework focused on achieving high-efficiency and high-accuracy solutions for both prompt-based single-object segmentation (“segment anything”, SegAny) and fully automatic multi-object segmentation (“segment everything”, SegEvery). It addresses computational bottlenecks inherent in vanilla SAM, particularly in the image encoding and mask decoding stages, by combining decoupled knowledge distillation with an object-aware prompt sampling mechanism (Zhang et al., 2023).
1. Task Definitions and Bottleneck Analysis
The framework formalizes two distinct segmentation scenarios:
- SegAny: Given an image and a user-supplied prompt (point, box, or mask), predict exactly one mask corresponding to the designated object.
- SegEvery: Given an image with no external prompts, return masks for all object instances and parts present within the scene.
Standard SAM configurations exhibit notable inefficiencies:
- In SegAny, computational load is dominated by the ViT-H image encoder (≈450 ms per image), while the lightweight mask decoder contributes minimally (≈4 ms).
- For SegEvery, SAM employs a dense point-based grid (n × n, commonly 64×64), running the mask decoder for 4096 prompts (often with three masks each), resulting in approximately 12,288 masks and a mask decoding stage exceeding 6 seconds per image. This stage, coupled with post-hoc filtering, becomes the primary bottleneck.
| Task | Image Encoder | Mask Decoder |
|---|---|---|
| SegAny | 450 ms | 4 ms |
| SegEvery | 450 ms | 6,400 ms (64×64 grid) |
This analysis isolates encoder- and decoder-related inefficiencies for targeted optimization.
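The grid-search arithmetic behind the table can be reproduced directly; the per-prompt cost below is derived from the measured totals, not independently measured:

```python
# Back-of-the-envelope accounting for SAM's SegEvery grid search.
GRID_SIZE = 64                       # 64x64 point grid
MASKS_PER_PROMPT = 3                 # SAM's multi-mask output mode
DECODE_MS_TOTAL = 6400               # measured mask-decoding time (ms)

prompts = GRID_SIZE * GRID_SIZE              # 4096 point prompts
masks = prompts * MASKS_PER_PROMPT           # ~12,288 candidate masks
per_prompt_ms = DECODE_MS_TOTAL / prompts    # implied cost per prompt

print(prompts, masks, round(per_prompt_ms, 2))
```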
2. Model Architecture and Knowledge Distillation
2.1 Decoupled Knowledge Distillation (MobileSAM v1)
The original SAM employs a heavy “teacher” encoder (ViT-H). To accelerate feature extraction without degrading mask quality, MobileSAM introduces a lighter “student” encoder (TinyViT or EfficientViT), trained via decoupled knowledge distillation while leaving the prompt encoder and mask decoder fixed.
Distillation proceeds by minimizing:
- Feature-level discrepancy across a selected set of layers $\mathcal{S}$:

$$\mathcal{L}_{\text{feat}} = \sum_{l \in \mathcal{S}} \left\| F^{S}_{l}(x) - F^{T}_{l}(x) \right\|_2^2$$

- Decoder-output loss under randomized prompts $p$:

$$\mathcal{L}_{\text{mask}} = \left\| D\big(E_{S}(x), p\big) - D\big(E_{T}(x), p\big) \right\|_2^2$$

- Total loss:

$$\mathcal{L} = \mathcal{L}_{\text{feat}} + \lambda\, \mathcal{L}_{\text{mask}}$$

where $E_{S}$ and $E_{T}$ denote the student and teacher encoders, $D$ the frozen mask decoder, and $\lambda$ is selected through validation.
MobileSAM achieves an approximately 20× speedup in the encoder stage with less than 3% IoU degradation for SegAny.
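The objective above can be sketched as follows. This is a minimal NumPy stand-in, not the training code: the feature lists and mask arrays are illustrative placeholders for per-layer encoder features and decoder outputs under a shared prompt, and `lam` plays the role of the validated $\lambda$:

```python
import numpy as np

def feature_loss(feats_student, feats_teacher):
    """Sum of per-layer MSE discrepancies over the selected layer set S."""
    return sum(np.mean((fs - ft) ** 2)
               for fs, ft in zip(feats_student, feats_teacher))

def total_loss(feats_s, feats_t, mask_s, mask_t, lam=1.0):
    """L = L_feat + lambda * L_mask; lam is a validated hyperparameter."""
    l_feat = feature_loss(feats_s, feats_t)
    l_mask = np.mean((mask_s - mask_t) ** 2)
    return l_feat + lam * l_mask

rng = np.random.default_rng(0)
f = [rng.standard_normal((8, 16)) for _ in range(2)]   # mock layer features
m = rng.standard_normal((4, 4))                        # mock decoder output
assert total_loss(f, f, m, m) == 0.0  # identical student/teacher -> zero loss
```

Because the prompt encoder and mask decoder are frozen, only the student encoder receives gradients from this loss in the actual pipeline.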
2.2 Unified Deployment in MobileSAMv2
MobileSAMv2 retains the distilled encoder for both tasks. For SegEvery, a new object-aware prompt generation module is introduced post-embedding, but the prompt encoder and mask decoder remain unchanged. This yields a unified model supporting both user-prompted and fully automatic segmentation with consistent components.
3. Object-Aware Prompt Sampling via Detection
3.1 Object Detector Replacement of Grid Search
Vanilla SegEvery in SAM utilizes dense grid sampling of points as prompts, resulting in computational redundancy. MobileSAMv2 replaces this with a lightweight object detector (YOLOv8) to identify candidate object regions, which serve as input prompts.
Pseudo-code for object-aware prompt generation:
```python
def sample_box_prompts(I, detector, K, tau):
    B = detector(I)                        # B: {(x1, y1, x2, y2, score)}
    B_prime = non_max_suppression(B, tau)  # drop overlapping detections
    return top_k_boxes(B_prime, K)         # keep the K highest-scoring boxes
```
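A runnable NumPy version of this sampling step is sketched below; the detector is mocked as a precomputed box list, and the threshold and budget values are illustrative:

```python
import numpy as np

def iou(a, b):
    """IoU between one box a and an array of boxes b, format (x1, y1, x2, y2)."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def sample_box_prompts(boxes, scores, k=320, tau=0.7):
    """Greedy NMS, then keep at most k surviving boxes as prompts."""
    order = np.argsort(-scores)
    keep = []
    while order.size and len(keep) < k:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= tau]
    return boxes[keep]

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
prompts = sample_box_prompts(boxes, scores)
print(len(prompts))  # the two overlapping boxes collapse to one -> 2 prompts
```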
3.2 Batch Decoding with Reduced Prompts
Each detected box prompt $b_i$ is processed through the prompt encoder and mask decoder:

$$M_i = D\big(E(I),\, P_{\text{box}}(b_i)\big), \quad i = 1, \dots, K$$

With $K \ll 4096$ (e.g., $K = 320$), the number of decoder calls and the overall inference time are sharply reduced.
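Batched decoding can be sketched as below. The prompt encoder and mask decoder are heavyweight models in practice; the lambdas here are placeholders, and the batch size and tensor shapes are illustrative assumptions:

```python
import numpy as np

def decode_masks(embedding, box_prompts, batch_size=64):
    """Run the mask decoder over box prompts in mini-batches; one mask per box."""
    prompt_encoder = lambda boxes: boxes                         # placeholder
    mask_decoder = lambda emb, tok: np.zeros((len(tok), 4, 4))   # placeholder
    masks = []
    for start in range(0, len(box_prompts), batch_size):
        tokens = prompt_encoder(box_prompts[start:start + batch_size])
        masks.append(mask_decoder(embedding, tokens))
    return np.concatenate(masks, axis=0)

emb = np.zeros((256, 64, 64))   # image embedding (shape illustrative)
boxes = np.zeros((320, 4))      # K = 320 box prompts
masks = decode_masks(emb, boxes)
print(masks.shape[0])           # one mask per prompt, vs ~12,288 under grid search
```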
4. Theoretical and Empirical Efficiency Gains
4.1 Complexity and Runtime Advantages
Let $N$ denote the number of prompts and $c$ the per-prompt computational cost. The overall mask decoding time is:

$$T_{\text{decode}} = N \cdot c$$

For object-aware sampling, $N = K$.
- Grid search: $N = 64 \times 64 = 4096$, measured ≈6,400 ms.
- Object-aware: $N = 320$, measured ≈50 ms.
End-to-end, including detector overhead (≈47 ms), the mask stage runs in 97 ms versus ≈6,400 ms, a ~66× acceleration. Using a smaller 32×32 grid as the baseline, the speedup is still ~16.6×.
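The speedup figures follow from simple arithmetic on the measured times; the 32×32 baseline below is estimated from the per-prompt cost implied by the 64×64 measurement, so it lands near (not exactly at) the quoted figure:

```python
grid_ms = 6400                  # mask decoding with the 64x64 grid (measured)
stage_ms = 97                   # object-aware mask stage incl. ~47 ms detector

speedup_64 = grid_ms / stage_ms                 # ~66x vs the 64x64 grid
per_prompt_ms = grid_ms / (64 * 64)             # implied linear per-prompt cost
small_grid_ms = 32 * 32 * per_prompt_ms         # estimated 32x32-grid decode time
speedup_32 = small_grid_ms / stage_ms           # ~16.5x under this linear model

print(round(speedup_64, 1), round(speedup_32, 1))
```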
The theoretical FLOPs for a cross-attention layer are:

$$\text{FLOPs} \approx 2 \cdot N_q \cdot N_{\text{img}} \cdot d$$

where $N_{\text{img}}$ (image tokens) ≈ 1024, $N_q$ is the number of query tokens per prompt, and $d$ (hidden size) varies by layer.
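Plugging illustrative numbers into this estimate shows why the prompt count dominates; the query-token count of 2 and $d = 256$ below are assumptions for illustration, not figures from the text:

```python
def cross_attention_flops(n_q, n_img, d):
    """FLOPs ~ 2 * N_q * N_img * d for one cross-attention score/value pass."""
    return 2 * n_q * n_img * d

n_img, d = 1024, 256                              # image tokens, hidden size
per_prompt = cross_attention_flops(2, n_img, d)   # cost of a single prompt
ratio = (per_prompt * 4096) / (per_prompt * 320)  # grid vs object-aware prompts
print(per_prompt, round(ratio, 1))                # decoder FLOPs scale with N
```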
4.2 Empirical Validation
On the LVIS subset for zero-shot object proposal:
- Grid (64×64, multi-mask): AR@1000 = 59.2%
- MobileSAMv2 (320 object boxes, single-mask): AR@1000 = 59.3%
Averaged over AR@10, AR@100, and AR@1000, MobileSAMv2 achieves 42.5% AR versus SAM's 38.9% (a +3.6 point gain).
| Method | AR@1000 | AR@100 | AR@10 |
|---|---|---|---|
| SAM (64×64 grid, multi-mask) | 59.2 | 44.8 | 12.6 |
| MobileSAMv2 (320 boxes, single) | 59.3 | 50.6 | 17.6 |
Qualitatively, masks are fine-grained and exhibit reduced over-segmentation, attributed to prompt localization on actual object regions. Alternatives eschewing prompts (e.g., FastSAM) offer even greater speed but show degraded boundary quality.
5. Framework Unification: A Single Model for SegAny and SegEvery
MobileSAMv2 consolidates SegAny and SegEvery within a single pipeline:
- Compute an image embedding $F = E(I)$ once per image.
- For SegAny, accept a user-supplied prompt $p$; for SegEvery, generate box prompts $\{b_i\}_{i=1}^{K}$ via object detection.
- Batch the prompt encodings and apply the mask decoder as:

$$M = D\big(F,\, P(p)\big) \quad \text{or} \quad \{M_i\}_{i=1}^{K} = D\big(F,\, \{P_{\text{box}}(b_i)\}_{i=1}^{K}\big)$$
No architecture changes or retraining are required for alternating between tasks—the same distilled encoder, prompt encoder, and mask decoder are utilized. This enables a fully unified deployment of both interactive and automatic segmentation modes.
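The unified dispatch can be sketched as a single function; the encoder, prompt encoder, detector, and decoder are stand-ins for the shared distilled components, and all names are illustrative rather than an official API:

```python
def segment(image, user_prompt=None, *, encoder, prompt_encoder, detector, decoder):
    """One pipeline for both modes: SegAny if a prompt is given, else SegEvery."""
    embedding = encoder(image)                 # computed once per image
    if user_prompt is not None:
        prompts = [user_prompt]                # SegAny: single user prompt
    else:
        prompts = detector(image)              # SegEvery: detector-derived boxes
    return [decoder(embedding, prompt_encoder(p)) for p in prompts]

# Stub components showing that both modes reuse identical pieces.
enc = lambda img: "emb"
penc = lambda p: p
det = lambda img: ["box1", "box2"]
dec = lambda emb, tok: f"mask({tok})"

one = segment("img", user_prompt="point",
              encoder=enc, prompt_encoder=penc, detector=det, decoder=dec)
many = segment("img", encoder=enc, prompt_encoder=penc, detector=det, decoder=dec)
print(len(one), len(many))  # 1 mask for SegAny, one mask per detection for SegEvery
```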
6. Significance and Implications
MobileSAMv2 demonstrates that task-specific segmentation pipelines can be accelerated by identifying and addressing core inefficiencies (encoder for SegAny, decoder/prompt for SegEvery) via a combination of knowledge distillation and object-aware prompt generation. Notably, performance and mask quality are maintained or improved even under aggressive FLOPs and runtime reductions. A plausible implication is that future segmentation systems may further benefit from dynamic prompt selection strategies and tighter integration of lightweight detection modules with universal mask decoders (Zhang et al., 2023).