
MobileSAMv2: Efficient Segmentation Framework

Updated 9 February 2026
  • MobileSAMv2 is a high-efficiency segmentation framework that unifies prompt-guided (SegAny) and fully automatic (SegEvery) segmentation using decoupled knowledge distillation and object-aware prompt sampling.
  • It replaces dense grid search with a lightweight YOLOv8 detector, reducing mask decoding time from 6400 ms to 97 ms per image while maintaining fine-grained segmentation quality.
  • By deploying a distilled lightweight encoder, MobileSAMv2 achieves a ~20× speedup with less than 3% IoU degradation, offering significant runtime improvements without sacrificing accuracy.

MobileSAMv2 is a segmentation framework focused on achieving high-efficiency and high-accuracy solutions for both prompt-based single-object segmentation (“segment anything”, SegAny) and fully automatic multi-object segmentation (“segment everything”, SegEvery). It addresses computational bottlenecks inherent in vanilla SAM, particularly in the image encoding and mask decoding stages, by combining decoupled knowledge distillation with an object-aware prompt sampling mechanism (Zhang et al., 2023).

1. Task Definitions and Bottleneck Analysis

The framework formalizes two distinct segmentation scenarios:

  • SegAny: Given an image I and a user-supplied prompt (point, box, or mask), predict exactly one mask corresponding to the designated object.
  • SegEvery: Given an image I with no external prompts, return masks for all object instances and parts present within the scene.

Standard SAM configurations exhibit notable inefficiencies:

  • In SegAny, computational load is dominated by the ViT-H image encoder (~450 ms per image), while the lightweight mask decoder contributes minimally (~4 ms).
  • For SegEvery, SAM employs a dense point-based grid (R × R, commonly 64 × 64), running the mask decoder for 4096 prompts (often with three masks each), resulting in approximately 12,288 masks and a mask decoding stage exceeding 6 seconds per image. This stage, coupled with post-hoc filtering, becomes the primary bottleneck.

Task       Image Encoder   Mask Decoder
SegAny     450 ms          4 ms
SegEvery   450 ms          6,400 ms (64 × 64 grid)

This analysis isolates encoder- and decoder-related inefficiencies for targeted optimization.
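
The grid-decoding figures above can be reproduced with a short sanity check (the per-prompt decoder cost is implied by the quoted totals rather than measured independently):

```python
# Decode-stage cost of SAM's dense-grid SegEvery, using the figures above.
GRID = 64                          # 64 x 64 point grid
MASKS_PER_PROMPT = 3               # SAM's multi-mask output per prompt
PER_PROMPT_MS = 6400 / (64 * 64)   # implied per-prompt decoder cost (~1.56 ms)

prompts = GRID * GRID                 # number of point prompts
masks = prompts * MASKS_PER_PROMPT    # candidate masks before filtering
decode_ms = prompts * PER_PROMPT_MS   # total mask-decoding time

print(prompts, masks, round(decode_ms))  # 4096 12288 6400
```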

2. Model Architecture and Knowledge Distillation

2.1 Decoupled Knowledge Distillation (MobileSAM v1)

The original SAM employs a heavy “teacher” encoder E_t (ViT-H). To accelerate feature extraction without degrading mask quality, MobileSAM introduces a lighter “student” encoder E_s (TinyViT or EfficientViT), trained via decoupled knowledge distillation while leaving the prompt encoder P and mask decoder D fixed.

Distillation proceeds by minimizing:

  • Feature-level discrepancy across selected layers l:

\mathcal{L}_{\rm feat} = \sum_{l} w_{l} \| E_s^{(l)}(I) - E_t^{(l)}(I) \|_2^2

  • Decoder-output loss under randomized prompts p:

\mathcal{L}_{\rm mask} = \mathbb{E}_{I,p} \| D(P(p), E_s(I)) - D(P(p), E_t(I)) \|_2^2

  • Total loss:

\mathcal{L}_{\rm KD} = \lambda \mathcal{L}_{\rm feat} + \mu \mathcal{L}_{\rm mask}

with λ and μ selected through validation.
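
The three losses can be sketched in a few lines of numpy. This is a minimal stand-in (arrays in place of encoder features and decoder outputs; the layer weights and the λ, μ defaults are illustrative, not the paper's values):

```python
import numpy as np

def feat_loss(student_feats, teacher_feats, weights):
    """Weighted squared-L2 gap between student and teacher features, summed over layers l."""
    return sum(w * np.sum((fs - ft) ** 2)
               for w, fs, ft in zip(weights, student_feats, teacher_feats))

def mask_loss(student_masks, teacher_masks):
    """Squared-L2 gap between decoder outputs under the same randomized prompts."""
    return float(np.mean([np.sum((ms - mt) ** 2)
                          for ms, mt in zip(student_masks, teacher_masks)]))

def kd_loss(student_feats, teacher_feats, weights,
            student_masks, teacher_masks, lam=1.0, mu=0.5):
    """Total decoupled-distillation objective L_KD = lam * L_feat + mu * L_mask."""
    return (lam * feat_loss(student_feats, teacher_feats, weights)
            + mu * mask_loss(student_masks, teacher_masks))
```

When student and teacher outputs coincide, the loss is exactly zero, which is the fixed point the distillation drives toward.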

MobileSAM achieves a ~20× speedup in the encoder stage with less than 3% IoU degradation for SegAny.

2.2 Unified Deployment in MobileSAMv2

MobileSAMv2 retains the distilled encoder E_s for both tasks. For SegEvery, a new object-aware prompt generation module is introduced post-embedding, but the prompt encoder P and mask decoder D remain unchanged. This yields a unified model supporting both user-prompted and fully automatic segmentation with consistent components.

3. Object-Aware Prompt Sampling via Detection

Vanilla SegEvery in SAM utilizes dense grid sampling of points as prompts, resulting in computational redundancy. MobileSAMv2 replaces this with a lightweight object detector (YOLOv8) to identify candidate object regions, which serve as input prompts.

3.1 Detection-Based Prompt Generation

Pseudo-code for object-aware prompt generation:

def sample_box_prompts(image, detector, k, tau):
    boxes = detector(image)                  # [(x1, y1, x2, y2, score), ...]
    kept = non_max_suppression(boxes, tau)   # remove highly overlapping boxes
    return top_k_boxes_from(kept, k)         # retain at most k highest-scoring boxes

Parameters typically include τ = 0.5 for NMS and a limit of K = 320 box prompts. Each box is guaranteed to correspond to a candidate object region, circumventing the need for subsequent mask filtering.
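
For concreteness, here is a self-contained greedy IoU-based NMS, a simple stand-in for the suppression step in the pseudo-code above (not YOLOv8's internal routine; boxes are (x1, y1, x2, y2, score) tuples):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, tau=0.5):
    """Greedy NMS: keep boxes in descending score order, dropping any whose IoU with an already-kept box exceeds tau."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box[:4], k[:4]) <= tau for k in kept):
            kept.append(box)
    return kept
```

With τ = 0.5, two boxes that mostly overlap collapse to the higher-scoring one, which is exactly the deduplication the box-prompt sampler relies on.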

3.2 Batch Decoding with Reduced Prompts

Each detected box prompt b_i is processed through the prompt encoder and mask decoder:

\mathrm{Mask}_i = D(P(b_i), E_s(I)), \quad i = 1, \ldots, K

With K ≪ R², the number of decoder calls and overall inference time are sharply reduced.

4. Theoretical and Empirical Efficiency Gains

4.1 Complexity and Runtime Advantages

Let n denote the number of prompts and C_D the per-prompt computational cost. The overall mask decoding time is:

T_{\mathrm{mask}} = n \cdot C_D + T_{\mathrm{filter}}

For object-aware sampling, T_filter ≈ 0.

  • Grid search: n = R² = 4096, measured T_mask ≈ 6,400 ms.
  • Object-aware: n = K = 320, measured T_mask ≈ 50 ms.

End-to-end, including detector overhead (≈47 ms), the mask stage runs in 97 ms versus 6,464 ms, a roughly 66× acceleration. Against a smaller 32 × 32 grid baseline, the speedup is still about 16.6×.
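
The quoted speedups follow directly from the stage timings (the 32 × 32 entry scales decode cost linearly with prompt count, an assumption for illustration rather than a reported measurement):

```python
# Speedups implied by the stage timings quoted above.
grid_total_ms = 6464     # 64 x 64 grid: mask decoding plus filtering
box_total_ms = 97        # detector (~47 ms) + 320 box-prompt decodes (~50 ms)

speedup_64 = grid_total_ms / box_total_ms
# Assume a 32 x 32 grid costs 1/4 as much (prompt count drops 4x).
speedup_32 = (grid_total_ms / 4) / box_total_ms

print(round(speedup_64, 1), round(speedup_32, 1))  # 66.6 16.7
```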

The theoretical FLOPs for a cross-attention layer are:

O((n+m)d^2 + nmd)

where n is the number of prompt tokens, m (image tokens) ≈ 1024, and d (hidden size) varies by layer.
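
Plugging the section's token counts into this cost shows how the prompt count n dominates per-layer FLOPs (d = 256 is an assumed illustrative hidden size, not a value from the text):

```python
def cross_attention_flops(n, m, d):
    """FLOPs of one cross-attention layer: token projections ((n+m)d^2) plus attention matmuls (nmd)."""
    return (n + m) * d ** 2 + n * m * d

m, d = 1024, 256                             # image tokens (from the text), assumed hidden size
grid = cross_attention_flops(4096, m, d)     # 64 x 64 grid prompts
boxes = cross_attention_flops(320, m, d)     # 320 detected-box prompts
print(round(grid / boxes, 1))                # 8.2
```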

4.2 Empirical Validation

On the LVIS subset for zero-shot object proposal:

  • Grid (64×64, multi-mask): AR@1000 = 59.2%
  • MobileSAMv2 (320 object boxes, single-mask): AR@1000 = 59.3%

Averaged over K ∈ {10, 100, 1000}, MobileSAMv2 achieves 42.5% AR versus SAM’s 38.9% (a +3.6 point gain).

Method                             AR@1000   AR@100   AR@10
SAM (64 × 64 grid, multi-mask)     59.2      44.8     12.6
MobileSAMv2 (320 boxes, single)    59.3      50.6     17.6
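
The averaged figures reported above can be reproduced directly from the table:

```python
# AR at K = 1000, 100, 10, from the table above.
sam = [59.2, 44.8, 12.6]           # SAM, 64 x 64 grid, multi-mask
mobilesam_v2 = [59.3, 50.6, 17.6]  # MobileSAMv2, 320 boxes, single-mask

def avg(xs):
    """Mean AR over the three K values, rounded to one decimal."""
    return round(sum(xs) / len(xs), 1)

print(avg(sam), avg(mobilesam_v2))  # 38.9 42.5
```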

Qualitatively, masks are fine-grained and exhibit reduced over-segmentation, attributed to prompt localization on actual object regions. Alternatives eschewing prompts (e.g., FastSAM) offer even greater speed but show degraded boundary quality.

5. Framework Unification: A Single Model for SegAny and SegEvery

MobileSAMv2 consolidates SegAny and SegEvery within a single pipeline:

  1. Compute an image embedding Φ = E_s(I).
  2. For SegAny, accept a user-supplied prompt p; for SegEvery, generate box prompts {b_i} via object detection.
  3. Batch the prompt encodings P(·) and apply the mask decoder D as:

M_i = D(P(\cdot), \Phi)

No architecture changes or retraining are required for alternating between tasks—the same distilled encoder, prompt encoder, and mask decoder are utilized. This enables a fully unified deployment of both interactive and automatic segmentation modes.
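
The three-step pipeline can be sketched as a single entry point. The encoder, prompt encoder, decoder, and detector here are stand-in callables; the dispatch logic, not their internals, is what the sketch illustrates:

```python
def segment(image, encoder, prompt_encoder, decoder, detector,
            user_prompt=None, k=320, tau=0.5):
    """SegAny when user_prompt is given; SegEvery via detected box prompts otherwise."""
    embedding = encoder(image)                    # shared E_s(I), computed once
    if user_prompt is not None:
        prompts = [user_prompt]                   # SegAny: one user prompt
    else:
        prompts = detector(image, k=k, tau=tau)   # SegEvery: box prompts
    # Same prompt encoder P and mask decoder D in both modes.
    return [decoder(prompt_encoder(p), embedding) for p in prompts]
```

Both modes share one embedding and one decoder; only the prompt source changes, which is the sense in which the deployment is unified.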

6. Significance and Implications

MobileSAMv2 demonstrates that task-specific segmentation pipelines can be accelerated by identifying and addressing core inefficiencies (encoder for SegAny, decoder/prompt for SegEvery) via a combination of knowledge distillation and object-aware prompt generation. Notably, performance and mask quality are maintained or improved even under aggressive FLOPs and runtime reductions. A plausible implication is that future segmentation systems may further benefit from dynamic prompt selection strategies and tighter integration of lightweight detection modules with universal mask decoders (Zhang et al., 2023).
