EdgeSAM: Real-Time Edge Segmentation

Updated 16 December 2025
  • EdgeSAM is an interactive segmentation framework that compresses SAM’s transformer-heavy encoder to a lightweight CNN-based backbone while maintaining comparable accuracy.
  • It employs prompt-in-the-loop distillation and tailored memory attention optimizations to achieve real-time processing (30+ FPS) on smartphones, Jetson GPUs, and ARM devices.
  • EdgeSAM extends to specialized domains like crack segmentation and video analytics, offering efficient, on-device integration for practical, resource-limited applications.

EdgeSAM is a family of methods and model architectures designed to realize real-time, high-accuracy interactive segmentation on resource-constrained devices, distilling the core capabilities of the Segment Anything Model (SAM) into compact, mobile-optimized forms. The defining feature of EdgeSAM is its ability to deliver segmentation accuracy comparable to foundation models such as SAM or SAM 2, executing at frame rates sufficient for smooth user or video interaction—typically 30 FPS or higher—on contemporary edge hardware, including smartphones, Jetson-class edge GPUs, and low-power ARM devices. EdgeSAM builds upon CNN-based architectural reductions of transformer-heavy backbones, prompt-in-the-loop knowledge distillation, and tailored memory attention optimizations for video, with active research expanding deployment scenarios, input modalities, and downstream vision integration (Zhou et al., 2023, Zhou et al., 13 Jan 2025, Wang et al., 10 Dec 2024).

1. Architectural Basis and Model Compression

The canonical EdgeSAM replaces the transformer-oriented image encoder of SAM (originally >600M parameters, >2.7T FLOPs per input) with a streamlined, mobile-centric CNN backbone. The dominant configuration leverages RepViT-M1, a five-stage mobile CNN producing $256 \times 64 \times 64$ features, combined with a lightweight feature pyramid network (FPN) for resolution matching. Total encoder complexity drops to 9.6M parameters and 22.1 GFLOPs, an order-of-magnitude reduction compared to SAM and roughly 2× below MobileSAM (Zhou et al., 2023). A minimal sketch of the resulting encoder/decoder interface follows the list below.

  • Encoder Replacement: SAM's heavy ViT encoder is discarded in favor of a compact RepViT-M1+FPN; image features from the CNN are directly substituted for the ViT features expected by the SAM mask decoder.
  • Prompt Encoder & Mask Decoder Retention: The original SAM prompt encoder (box, point, mask, IoU token) and two-stream mask decoder are reused with weights initialized from the teacher model. This ensures full architectural compatibility with interactive prompt sets.

2. Prompt-In-The-Loop Distillation

Performance parity with SAM is not attainable via encoder-only knowledge distillation. EdgeSAM introduces a "prompt-in-the-loop" distillation strategy where box and point prompts are sampled and injected into both teacher and student, allowing the student to capture both feature representations and prompt-to-mask generation dynamics (Zhou et al., 2023).

  • Stage 1 (Encoder KD): Pixel-wise MSE matches the CNN student to the teacher's ViT features over large subsets of SA-1B.
  • Stage 2 (Prompt-Integrated KD): Box and point prompts are sampled iteratively. Regions where the teacher predicts a mask but the student does not (false negatives), or vice versa (false positives), yield additional corrective prompts (positive or negative points). Losses are aggregated over these dynamic prompt loops, training the encoder and decoder jointly (see the sketch after this list).
  • Granularity Correction: A small region proposal network (RPN), appended to the encoder, integrates object-scale priors—mitigating the bias toward part/micro-object masks common in prompt-based SAM variants when trained on class-agnostic annotations.
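
A minimal sketch of the stage-2 loop is given below, assuming hypothetical `teacher_predict` / `student_predict` callables in place of the actual SAM and student decoders, and a plain BCE loss standing in for the paper's mask losses.

```python
import torch
import torch.nn.functional as F


def prompt_in_the_loop_loss(teacher_predict, student_predict, box, num_loops=3):
    """One stage-2 distillation step for a single object.

    `teacher_predict`/`student_predict` are hypothetical callables; each maps
    (point_coords, point_labels, box) -> mask logits of shape (H, W).
    """
    points, labels = [], []   # corrective point prompts accumulated over the loop
    loss = 0.0
    for _ in range(num_loops):
        pc = torch.stack(points) if points else None
        pl = torch.tensor(labels) if labels else None
        with torch.no_grad():
            t_mask = teacher_predict(pc, pl, box) > 0        # teacher pseudo-label
        s_logits = student_predict(pc, pl, box)
        loss = loss + F.binary_cross_entropy_with_logits(s_logits, t_mask.float())

        # Disagreement regions: false negatives (teacher on, student off) and
        # false positives (student on, teacher off).
        s_mask = s_logits.detach() > 0
        fn, fp = t_mask & ~s_mask, s_mask & ~t_mask
        use_fn = fn.sum() >= fp.sum()
        region = fn if use_fn else fp
        if region.sum() == 0:
            break                                            # student already agrees
        ys, xs = torch.nonzero(region, as_tuple=True)
        i = torch.randint(len(ys), (1,)).item()
        points.append(torch.stack([xs[i], ys[i]]).float())
        labels.append(1 if use_fn else 0)   # positive point for FN, negative for FP

    return loss


# Toy usage on a 64x64 mask grid with random "decoders":
teacher = lambda pc, pl, box: torch.randn(64, 64)
student = lambda pc, pl, box: torch.randn(64, 64, requires_grad=True) + 0.0
loss = prompt_in_the_loop_loss(teacher, student, box=torch.tensor([8., 8., 48., 48.]))
loss.backward()
```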

3. EdgeSAM for Video: Memory Attention and 2D Spatial Perceiver

The extension to video segmentation, as in "EdgeTAM" (Zhou et al., 13 Jan 2025), reveals a new bottleneck: high-dimensional memory attention across multiple frames. While prior EdgeSAM efforts focus on encoder compression, SAM 2's latency on mobile is dominated by transformer-based cross-attention over frame-memory tensors ($O(TCH^2W^2)$). EdgeTAM introduces a 2D Spatial Perceiver, splitting attention between "global" and "local" learnable query latents, to compress memories while preserving spatial structure and enabling efficient on-device attention (Zhou et al., 13 Jan 2025).

  • Global Perceiver: $N_g \ll HW$ learnable latents attend densely over the per-frame memory map $M_t$, reducing the cost of global context to $O(TCHWN_g)$.
  • 2D Spatial Perceiver: $N_l$ local latents partition $M_t$ into windows, each attending over its own localized region, forming a "downsampled" spatial memory (see the sketch after this list).
  • Fusion: Concatenation yields $N_g + N_l$ compressed tokens per frame, reducing the overall attention cost to $O(TCHW(N_g + N_l))$ and achieving a $\sim 8\times$ speed-up over vanilla SAM 2 memory attention at the recommended settings ($N_g = N_l = 256$, $T = 7$, $HW = 4096$).
  • Distillation Pipeline: Follows a two-stage approach (image segmentation pre-training; video segmentation training with occlusion-aware and memory distillation terms), ensuring that the student mimics both boundary localization and memory retrieval.
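
A simplified PyTorch sketch of the two Perceiver branches follows. Positional embeddings, normalization, and feed-forward layers are omitted, the default sizes are illustrative rather than EdgeTAM's exact configuration, and it assumes $HW$ is divisible by $N_l$.

```python
import torch
import torch.nn as nn


class MemoryPerceiver(nn.Module):
    """Compress a per-frame memory map M_t (B, C, H, W) into N_g + N_l tokens."""

    def __init__(self, dim=64, n_global=256, n_local=256, heads=4):
        super().__init__()
        self.global_latents = nn.Parameter(torch.randn(n_global, dim))
        self.local_latents = nn.Parameter(torch.randn(n_local, dim))
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.n_local = n_local

    def forward(self, memory):
        B, C, H, W = memory.shape
        tokens = memory.flatten(2).transpose(1, 2)             # (B, HW, C)

        # Global Perceiver: N_g latents cross-attend over all HW memory tokens.
        g = self.global_latents.expand(B, -1, -1)              # (B, N_g, C)
        g, _ = self.global_attn(g, tokens, tokens)

        # 2D Spatial Perceiver: split M_t into N_l non-overlapping windows and
        # let one local latent attend over each window, preserving 2D layout.
        win = int((H * W // self.n_local) ** 0.5)              # window side length
        wins = memory.unfold(2, win, win).unfold(3, win, win)  # (B, C, H/w, W/w, w, w)
        wins = wins.reshape(B, C, self.n_local, win * win)
        wins = wins.permute(0, 2, 3, 1).reshape(B * self.n_local, win * win, C)
        q = self.local_latents.expand(B, -1, -1).reshape(B * self.n_local, 1, C)
        l, _ = self.local_attn(q, wins, wins)                  # (B*N_l, 1, C)
        l = l.reshape(B, self.n_local, C)

        # Fusion: downstream memory attention now sees N_g + N_l tokens per frame
        # instead of HW, cutting its cost from O(TCH^2W^2) to O(TCHW(N_g + N_l)).
        return torch.cat([g, l], dim=1)                        # (B, N_g + N_l, C)


compressed = MemoryPerceiver()(torch.randn(2, 64, 64, 64))
assert compressed.shape == (2, 512, 64)
```

With $HW = 4096$ and $N_g + N_l = 512$, the per-frame key/value count shrinks by a factor of 8, which is the source of the quoted speed-up.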

4. EdgeSAM in Specialized Domains: Crack Segmentation

Recent edge-centric adaptations such as CrackESS deploy EdgeSAM as a plug-and-play module for infrastructure health monitoring (Wang et al., 10 Dec 2024).

  • Self-Prompting: A YOLOv8n detector generates bounding box prompts for cracks, feeding directly into EdgeSAM’s prompt encoder.
  • LoRA Fine-Tuning: Only the convolutional layers governing local attention and channel-mixing are updated via ConvLoRA (rank $r=4$ recommended); the rest of the encoder and decoder remain frozen.
  • Mask Refinement: Segmentation outputs undergo a post-processing pipeline (thresholding, morphological closing, hole-filling, small-blob removal, and smoothing via median blur); an illustrative pipeline is sketched after this list.
  • Efficiency: EdgeSAM yields $>40$ FPS at $1024^2$ input resolution on a Jetson Orin Nano, with precision/recall/F1/IoU performance outperforming prior lightweight crack segmentation frameworks.
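
The snippet below is an illustrative OpenCV/NumPy version of the mask-refinement step; the thresholds and kernel sizes are assumptions, not values from the CrackESS paper.

```python
import cv2
import numpy as np


def refine_crack_mask(prob_map, thresh=0.5, kernel_size=5, min_blob_area=64):
    # 1. Threshold probabilities to a binary mask.
    mask = (prob_map > thresh).astype(np.uint8)

    # 2. Morphological closing to bridge small gaps along thin cracks.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

    # 3. Hole filling: flood-fill the background from the top-left corner
    #    (assumed to be background) and OR in the complement.
    h, w = mask.shape
    flood = mask.copy()
    ff_mask = np.zeros((h + 2, w + 2), np.uint8)
    cv2.floodFill(flood, ff_mask, (0, 0), 1)
    mask = mask | (1 - flood)

    # 4. Remove small blobs (spurious detections) via connected components.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] < min_blob_area:
            mask[labels == i] = 0

    # 5. Smooth jagged boundaries with a median blur, then re-binarize.
    mask = cv2.medianBlur((mask * 255).astype(np.uint8), 5)
    return (mask > 127).astype(np.uint8)


refined = refine_crack_mask(np.random.rand(1024, 1024).astype(np.float32))
```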
| Task | Dataset | FPS | EdgeSAM Accuracy | Drop vs. SAM |
| --- | --- | --- | --- | --- |
| Image Segmentation | SA-23 (5 clicks) | 40.4 | 81.7 mIoU | ≈0 |
| Video Segmentation | DAVIS 2017 | 16 | 87.7 | ↓3.2 |
| Crack Segmentation | Khanhha's Test | 46.1 | 0.7143 (F1) | — |

5. Deployment Guidelines and Latency/Accuracy Trade-Offs

EdgeSAM's real-world utility follows from its explicit hardware-aware optimizations and quantization support (Zhou et al., 13 Jan 2025, Zhou et al., 2023).

  • Recommended Backbone: RepViT-M1 (mobile CNN); optionally ViT-Tiny for extreme compression.
  • Memory Attention Blocks: 2 blocks give the best latency/accuracy balance (16 FPS on iPhone 15 Pro Max); 1 block increases throughput at a mild accuracy cost, while 4 blocks slow inference to ~10 FPS.
  • Perceiver Latents: $(N_g, N_l) = (256, 256)$.
  • Model Compression: Quantization to bfloat16/float16, CoreML conversion, and linear+LayerNorm fusion (a conversion sketch follows this list).
  • Hardware: iPhone 14/15 Pro Max, Jetson Nano-class edge GPUs (8GB RAM recommended).
  • Runtime Tips: Pre-allocate Perceiver buffers, pin constants to NPU, use accelerated graph ops.
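
A minimal Core ML conversion sketch, assuming coremltools ≥ 6: the encoder below is a toy placeholder for the trained RepViT+FPN model, and the input name and deployment target are assumptions.

```python
import coremltools as ct
import torch
import torch.nn as nn

# Toy stand-in for the distilled RepViT+FPN encoder; in practice the trained
# EdgeSAM encoder would be loaded here instead.
encoder = nn.Sequential(
    nn.Conv2d(3, 256, kernel_size=16, stride=16),   # 1024x1024 -> 64x64
    nn.GELU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1),
).eval()

example = torch.randn(1, 3, 1024, 1024)
traced = torch.jit.trace(encoder, example)

# Convert to Core ML with float16 compute precision (weights and activations).
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=example.shape)],
    compute_precision=ct.precision.FLOAT16,
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("edge_sam_encoder.mlpackage")
```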

6. Applications, Extensibility, and Limitations

EdgeSAM's distillation and memory attention strategies are generalizable to interactive segmentation, tracking, medical imaging, and boundary estimation in uncertain or multi-granular scenarios.

  • Plug-In for Edge-Cloud Architectures: EdgeSAM deployments integrate readily with device-server split architectures (see SAMEdge (Lu et al., 23 Sep 2024)), enabling dynamic workload partitioning for low-latency video analytics.
  • Limitations: EdgeSAM is optimized for sparse interactive prompting; dense prompt grids ("everything mode") remain infeasible on-device, and fine boundary details may require additional negative/positive prompt refinement.
  • Performance Trade-Offs: Compression reduces FLOPs by ~100×, yielding $>30$ FPS on mobile devices, with typical accuracy degradation of 1-3 mIoU (box prompts) or 4-5 mIoU (point prompts, recoverable via the RPN).
  • Future Directions: Further structured pruning, mixed-precision acceleration, and extensions for temporal memory and streaming video.

EdgeSAM establishes real-time interactive segmentation as a tractable problem on commodity edge hardware through architectural innovation, adaptive distillation, and domain-specific integration, facilitating broad adoption for on-device vision tasks (Zhou et al., 2023, Zhou et al., 13 Jan 2025, Wang et al., 10 Dec 2024, Lu et al., 23 Sep 2024).
