MobileSAM: Lightweight Mobile Segmentation

Updated 16 December 2025
  • MobileSAM is a lightweight segmentation model designed for mobile and edge devices, using a TinyViT encoder and decoupled knowledge distillation to replace SAM’s heavy Vision Transformer.
  • It maintains compatibility with SAM’s prompt-based pipeline by reusing the original mask decoder and supporting point, box, and text prompts without retraining.
  • Benchmark results show MobileSAM is up to 66× smaller and 5–38× faster than SAM while nearly matching its segmentation quality on standard datasets.

MobileSAM is a lightweight variant of the Segment Anything Model (SAM), engineered for high-throughput image segmentation on resource-constrained mobile and edge devices. MobileSAM rearchitects SAM by replacing its computationally intensive Vision Transformer encoder (ViT-H, ∼632M parameters) with a compact ViT-Tiny backbone (∼5.78M parameters) trained through a novel decoupled knowledge distillation scheme. This enables mobile and CPU inference with significant speed and memory advantages, while segmentation performance nearly matches the original SAM on common benchmarks (Zhang et al., 2023).

1. Architectural Foundations

MobileSAM replaces SAM’s ViT-H encoder with a TinyViT structure, retaining the original prompt encoder and mask decoder—ensuring full compatibility with the standard mask-prediction pipeline. The TinyViT backbone consists of MobileNet-style inverted residual blocks (early downsampling) followed by multi-stage transformer layers and depthwise convolutional downsampling. Spatial resolution reduction is fixed at 1/16 in height and width. The parameter budget drops from ≈632M in SAM to ≈5.78M in MobileSAM’s encoder; total model size is ≈9.66M including the unchanged prompt encoder and mask decoder.
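Compatibility with SAM's decoder hinges on the encoder's output contract: a 256-channel feature map at 1/16 of the input resolution (64×64 for a 1024×1024 input). The following is a minimal schematic sketch of that contract in PyTorch; it is not the actual TinyViT implementation, and all layer widths are illustrative placeholders.

```python
import torch
import torch.nn as nn

class SchematicTinyEncoder(nn.Module):
    """Illustrative stand-in for a TinyViT-style encoder.

    The hard requirement for SAM-decoder compatibility is the output shape:
    a 256-channel feature map at 1/16 of the input resolution (64x64 for a
    1024x1024 input). Block widths here are placeholders, not TinyViT's.
    """

    def __init__(self, embed_dim: int = 160, out_chans: int = 256):
        super().__init__()
        # Convolutional stem: four stride-2 stages give the fixed 1/16
        # reduction, loosely mirroring the MobileNet-style early downsampling.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1), nn.GELU(),
        )
        # Neck projects to the 256 channels expected by SAM's mask decoder.
        self.neck = nn.Conv2d(embed_dim, out_chans, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.neck(self.stem(x))

encoder = SchematicTinyEncoder()
feat = encoder(torch.randn(1, 3, 1024, 1024))
print(feat.shape)  # torch.Size([1, 256, 64, 64]) -- SAM's embedding grid
```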

All prompt modes supported by SAM (point, box, text) are compatible with MobileSAM, relying on the same feature manifold due to the distillation strategy. The mask decoder is reused without retraining or architecture change.
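A minimal usage sketch of this drop-in behavior, assuming the mobile_sam package distributed with the MobileSAM repository (whose interface mirrors Meta's segment_anything API); the checkpoint filename, image path, and prompt coordinates are placeholders:

```python
import cv2
import numpy as np
# Assumes the mobile_sam package from the MobileSAM repository, which mirrors
# the segment_anything API and registers the TinyViT encoder as "vit_t".
from mobile_sam import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")  # checkpoint path is a placeholder
sam.eval()

predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the TinyViT encoder once; prompts reuse the embedding

# Point prompt: one foreground click.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)

# Box prompt: same reused mask decoder, no retraining needed.
masks_box, _, _ = predictor.predict(box=np.array([100, 100, 600, 500]))
```

Because the image embedding is computed once in set_image, additional point or box prompts reuse it at negligible cost, which is also the basis of the embedding-caching strategy mentioned in Section 6.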

2. Decoupled Distillation Training

Naïve knowledge distillation of small encoders with prompt-based mask decoders results in inferior accuracy due to the coupled optimization of encoder and decoder. MobileSAM introduces decoupled distillation: the lightweight encoder is trained to match the teacher’s embedding (ViT-H output) via a simple mean-squared error (MSE) loss, bypassing the mask decoder during optimization:

$L_{\mathrm{feature}}(x) = \Vert f_T(x) - f_S(x) \Vert_2^2$

where $f_T(x)$ is the teacher (ViT-H) embedding and $f_S(x)$ is the student (TinyViT) embedding. No mask or prompt loss terms are applied, and prompts are omitted during distillation. Batch size, learning rate, and scheduler settings follow standard Transformer training conventions (AdamW, learning rate ≈1e-4–3e-4, cosine annealing).
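A compact sketch of this decoupled objective as a training loop is given below; the data loader and encoder constructors are hypothetical stand-ins, and the batch size and weight decay are illustrative rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

# Hypothetical helpers: a frozen SAM ViT-H image encoder (teacher), a
# TinyViT-style image encoder (student), and a DataLoader over SA-1B images.
teacher = load_sam_vit_h_encoder().eval().cuda()    # hypothetical loader
student = build_tinyvit_encoder().train().cuda()    # hypothetical builder
loader = build_sa1b_image_loader(batch_size=8)      # hypothetical loader

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=len(loader))

for images in loader:                     # prompts and masks are never used
    images = images.cuda(non_blocking=True)
    with torch.no_grad():
        target = teacher(images)          # ViT-H embedding: (B, 256, 64, 64)
    pred = student(images)                # TinyViT embedding, same shape
    loss = F.mse_loss(pred, target)       # L_feature: plain MSE in embedding space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```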

The result is a student encoder whose output lives on the same feature manifold as the teacher, allowing it to serve as a drop-in replacement for the original encoder in front of the unchanged mask decoder. Training can be performed on a single high-end GPU in under 24 hours using only a subset of SA-1B (as little as 0.1–1% of the data).

3. Benchmark Performance and Comparative Analysis

MobileSAM achieves drastic reductions in model size and latency compared with SAM and with alternative fast segmentation frameworks. Key metrics are summarized below (Zhang et al., 2023):

Model         Params (M)   GPU latency (ms/image)   mIoU (vs. SAM masks)   Notes
SAM (ViT-H)   636          452                      1.00 (reference)       gold masks
FastSAM       68           40                       0.27                   YOLOv8 + YOLACT
MobileSAM     9.66         12                       0.73–0.74              TinyViT encoder

MobileSAM is ≈66× smaller than SAM and ≈5–38× faster, and its masks agree closely with those of the full SAM (mIoU ≈0.73–0.74 when SAM’s predictions are used as the reference). It also clearly outperforms FastSAM on this agreement metric. CPU inference is reported to run “relatively smoothly,” and no practically significant degradation is observed in real-world applications.
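As a quick sanity check, the headline ratios follow directly from the table values (the quoted 5× lower bound on speedup is not derived from this table):

```python
# Sanity check of the headline ratios using the table above.
sam_params, mobilesam_params = 636, 9.66   # parameters, millions
sam_ms, mobilesam_ms = 452, 12             # GPU latency per image, ms

print(f"size reduction: {sam_params / mobilesam_params:.1f}x")  # ~65.8x ("up to 66x")
print(f"GPU speedup:    {sam_ms / mobilesam_ms:.1f}x")          # ~37.7x (upper end of 5-38x)
```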

Specific applications confirm these findings:

  • Fire segmentation: MobileSAM achieves mIoU=0.659 on the Roboflow fire dataset, with ≈4.83 FPS and a memory footprint of 346MB (Ugwu et al., 18 Oct 2025).
  • Wildlife segmentation: On the challenging Houbara Bustard dataset, pairing YOLOv10 with MobileSAM yields mIoU=0.7421 at a per-frame latency of ≈107.5 ms, enabling real-time conservation monitoring on an NVIDIA Jetson AGX Xavier (Saoud et al., 3 Oct 2025).
  • Satellite onboard segmentation: MobileSAM runs onboard Unibap iX10-100, with distributed fine-tuning across satellite constellations. Rapid adaptation to disaster scenes improves IoU from 0.47 to 0.69 within six hours of federated learning (Plumridge et al., 26 Nov 2024).

4. Advances in Prompting and Segmentation Modes

MobileSAMv2 (Zhang et al., 2023) extends MobileSAM to fast “segment everything” (SegEvery) inference. Instead of SAM’s dense grid prompt sampling (e.g., a 64×64 grid), MobileSAMv2 employs object-aware prompt selection, using YOLOv8 bounding boxes filtered by non-maximum suppression as prompts. This scheme achieves mask-decoding speedups of 16–128× compared to vanilla SAM, with gains in average recall (+3.6 pp AR@K on LVIS). SegAny and SegEvery modes are unified, with both benefiting from the same lightweight encoder and decoder.
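A simplified sketch of this object-aware prompting pattern is shown below, assuming the ultralytics YOLO package for box proposals and the mobile_sam predictor for batched box decoding; the detector weights, thresholds, and image path are placeholders, and this is a schematic of the idea rather than the MobileSAMv2 pipeline itself.

```python
import cv2
import torch
from torchvision.ops import nms
from ultralytics import YOLO                      # assumed detector package
from mobile_sam import sam_model_registry, SamPredictor

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# 1. Object-aware prompt proposal: detector boxes replace the dense point grid.
det = YOLO("yolov8n.pt")(image, verbose=False)[0]     # weights file is a placeholder
boxes, scores = det.boxes.xyxy, det.boxes.conf
boxes = boxes[nms(boxes, scores, iou_threshold=0.7)]  # prune overlapping prompts

# 2. One encoder pass, then all boxes decoded in a single batched call.
sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")
predictor = SamPredictor(sam)
predictor.set_image(image)
boxes = boxes.to(predictor.device)
tboxes = predictor.transform.apply_boxes_torch(boxes, image.shape[:2])
masks, _, _ = predictor.predict_torch(
    point_coords=None, point_labels=None, boxes=tboxes, multimask_output=False,
)
print(masks.shape)  # (num_objects, 1, H, W) boolean masks
```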

Bounding-box and hybrid point–box prompting strategies have also been shown to work best for MobileSAM in fire segmentation and other edge scenarios (Ugwu et al., 18 Oct 2025). Object detection models (YOLO variants) now routinely precede MobileSAM in detection–segmentation pipelines.

5. Competing and Complementary Models

TinySAM (Shu et al., 2023) and Group-Mix SAM (Liang et al., 15 Mar 2024) both build upon MobileSAM’s core recipe—distilling compact image encoders for mask-decoder compatibility—but deploy alternative architectures and quantization:

  • TinySAM distills a TinyViT-5M encoder, applies hierarchical “segment everything” acceleration, and supports 8-bit post-training quantization. It consistently improves COCO and LVIS AP by ~1–2 points over MobileSAM, halves FLOPs/latency after quantization, and operates within 60MB (FP32) or 30MB (int8).
  • Group-Mix SAM uses a GroupMixFormer backbone, achieving a further 37.6% reduction in parameters and a 42.5% reduction in FLOPs relative to MobileSAM, with only a minor (0.8 pp) mIoU penalty on industrial datasets (Liang et al., 15 Mar 2024).

RepViT-SAM (Wang et al., 2023) further reduces computation and latency by employing a purely convolutional network (RepViT), distilling with decoupled MSE. It demonstrates ≈10× faster inference vs. MobileSAM on Apple M1 hardware and outperforms MobileSAM by 1–3 points in AP and J&F, at similar memory budgets.

6. Practical Deployment and Integration

MobileSAM is engineered for plug-and-play operation within SAM-codebases: the encoder swap is structurally drop-in, with no weight adaptation or decoder fine-tuning required in most cases. Suggested optimizations include encoder quantization to 8-bit, ONNX/TensorRT conversion for mobile inference, vectorized batched mask decoding, and precomputed embedding caches for repeated prompt scenarios.
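Two of these optimizations can be sketched generically, assuming the mobile_sam package; the dynamic int8 pass and the ONNX export below use standard PyTorch facilities as an illustration, not tooling specific to MobileSAM.

```python
import copy
import torch
# Assumes the mobile_sam package from the MobileSAM repository.
from mobile_sam import sam_model_registry

sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt").eval()

# (a) Dynamic int8 quantization of the encoder's Linear layers: a generic
#     post-training pass that needs no calibration data.
quant_encoder = torch.quantization.quantize_dynamic(
    copy.deepcopy(sam.image_encoder), {torch.nn.Linear}, dtype=torch.qint8
)

# (b) ONNX export of the FP32 encoder for mobile runtimes such as ONNX Runtime
#     or TensorRT (input size and opset version are assumptions).
dummy = torch.randn(1, 3, 1024, 1024)
torch.onnx.export(sam.image_encoder, dummy, "mobile_sam_encoder.onnx", opset_version=17)
```

Dynamic quantization rewrites only the Linear-layer weights to int8 at load time, while the exported ONNX graph can be consumed by mobile inference runtimes for further acceleration.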

In production settings, MobileSAM’s ability to run at ≈10–12 ms/image on GPU and <300 ms/image on ARM CPUs, with <50 MB RAM footprint, makes it suited for video streams, real-time editing in computational photography, edge deployment in industrial and wildlife monitoring, and distributed learning under bandwidth/thermal constraints (Plumridge et al., 26 Nov 2024, Saoud et al., 3 Oct 2025). No additional pruning or quantization is required for baseline deployment; advanced scenarios employ further compression and NPU kernels for lower latency.

7. Impact and Future Directions

MobileSAM has triggered a rethinking of promptable segmentation for mobile vision. The decoupled distillation paradigm enables rapid adaptation of large-scale vision models while preserving compatibility with original decoders and prompt engines. Models such as MobileSAMv2, TinySAM, RepViT-SAM, and Group-Mix SAM elaborate this recipe, further optimizing for latency, memory usage, and application-specific accuracy.

Ongoing research explores distributed/federated adaptation on specialized hardware (e.g., satellites), domain-specific prompting strategies (e.g., bounding-box/grid for fire or wildlife detection), and integration with real-time detection for composite vision pipelines. The convergence of deep knowledge distillation, lightweight architectures, and efficient prompt mechanisms positions MobileSAM and its derivatives as foundational components for real-time segmentation across mobile, edge, and embedded environments (Zhang et al., 2023, Zhang et al., 2023, Shu et al., 2023, Wang et al., 2023, Liang et al., 15 Mar 2024, Plumridge et al., 26 Nov 2024, Saoud et al., 3 Oct 2025, Ugwu et al., 18 Oct 2025).
