EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM (2312.06660v2)

Published 11 Dec 2023 in cs.CV

Abstract: This paper presents EdgeSAM, an accelerated variant of the Segment Anything Model (SAM), optimized for efficient execution on edge devices with minimal compromise in performance. Our approach involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture, better suited for edge devices. We carefully benchmark various distillation strategies and demonstrate that task-agnostic encoder distillation fails to capture the full knowledge embodied in SAM. To overcome this bottleneck, we include both the prompt encoder and mask decoder in the distillation process, with box and point prompts in the loop, so that the distilled model can accurately capture the intricate dynamics between user input and mask generation. To mitigate dataset bias issues stemming from point prompt distillation, we incorporate a lightweight module within the encoder. As a result, EdgeSAM achieves a 37-fold speed increase compared to the original SAM, and it also outperforms MobileSAM/EfficientSAM, being over 7 times as fast when deployed on edge devices while enhancing the mIoUs on COCO and LVIS by 2.3/1.5 and 3.1/1.6, respectively. It is also the first SAM variant that can run at over 30 FPS on an iPhone 14. Code and demo are available at https://www.mmlab-ntu.com/project/edgesam.

Authors (4)
  1. Chong Zhou (12 papers)
  2. Xiangtai Li (128 papers)
  3. Chen Change Loy (288 papers)
  4. Bo Dai (245 papers)
Citations (33)

Summary

Introduction to EdgeSAM

The Segment Anything Model (SAM) has been difficult to deploy directly on edge devices such as smartphones because it was designed around powerful hardware that these devices lack. Its substantial computational requirements have largely kept its interactive segmentation capabilities out of reach for mobile users. EdgeSAM addresses this gap, aiming to unlock real-time interactive segmentation on edge devices.

Overcoming Performance Barriers

The core of this approach is a distillation process that transforms the heavy ViT-based SAM image encoder into a leaner, CNN-based architecture better suited to mobile platforms. The distillation goes beyond conventional task-agnostic encoder distillation, which the authors show cannot fully transfer SAM's knowledge to a compact model. By including both the prompt encoder and mask decoder in the distillation loop, with box and point prompts sampled during training, EdgeSAM preserves the intricate dynamics between user prompts and mask generation that are crucial to SAM's interactivity on device. A rough sketch of this training objective is given below.
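
To make the recipe concrete, the following is a minimal, hedged sketch of the two losses described above in PyTorch-style code: a feature-level distillation loss between the teacher (ViT) and student (CNN) encoders, plus a prompt-in-the-loop mask loss in which both encoders' features are passed through the shared, frozen SAM prompt encoder and mask decoder with sampled box/point prompts. Names such as `teacher_encoder`, `student_encoder`, `prompt_encoder`, `mask_decoder`, and `sample_prompts` are illustrative placeholders, not the released EdgeSAM API, and the loss weighting and prompt-sampling details only loosely follow the paper (which iteratively samples new point prompts from regions where teacher and student disagree).

```python
import torch
import torch.nn.functional as F

def distill_step(image, gt_masks, teacher_encoder, student_encoder,
                 prompt_encoder, mask_decoder, sample_prompts,
                 w_feat=1.0, w_mask=1.0):
    """One illustrative step of encoder + prompt-in-the-loop distillation.

    All module/function names are hypothetical; the frozen SAM prompt
    encoder and mask decoder are reused for both teacher and student.
    """
    with torch.no_grad():
        t_feat = teacher_encoder(image)      # frozen ViT-based SAM encoder
    s_feat = student_encoder(image)          # CNN-based student encoder (trained)

    # 1) Task-agnostic feature distillation on the encoder output.
    loss_feat = F.mse_loss(s_feat, t_feat)

    # 2) Prompt-in-the-loop distillation: sample box/point prompts
    #    (e.g. from GT boxes) and align the decoded masks of the
    #    student against the teacher's masks as soft targets.
    points, boxes = sample_prompts(gt_masks)            # hypothetical sampler
    sparse, dense = prompt_encoder(points=points, boxes=boxes)

    with torch.no_grad():
        t_masks, _ = mask_decoder(t_feat, sparse, dense)
    s_masks, _ = mask_decoder(s_feat, sparse, dense)

    loss_mask = F.binary_cross_entropy_with_logits(
        s_masks, torch.sigmoid(t_masks))

    return w_feat * loss_feat + w_mask * loss_mask
```

Because the prompt encoder and mask decoder stay frozen and shared, the student encoder is pushed to produce features that behave like the teacher's under actual interactive use, not merely to match them pointwise.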

Performance and Speed

EdgeSAM not only brings SAM's interactive functionality to edge devices but does so with remarkable efficiency, reporting roughly a 37-fold speedup over the original model. Compared with MobileSAM and EfficientSAM, it runs over 7 times as fast when deployed on edge devices while also improving accuracy, and it sustains more than 30 frames per second on an iPhone 14. The CNN-based architecture largely explains this efficiency, since on-device AI accelerators are typically optimized for convolution operations rather than transformer workloads.
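
Throughput figures like these are device-dependent. As a rough way to sanity-check encoder latency on one's own hardware (a desktop GPU here; on-phone numbers additionally require CoreML/ONNX export), one could use a simple warm-up-then-average timing loop like the sketch below, where `model` stands in for the distilled CNN encoder.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 1024, 1024),
                warmup=10, iters=100, device="cuda"):
    """Rough frames-per-second estimate for an image encoder (placeholder model)."""
    model = model.eval().to(device)
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):          # warm-up to stabilize clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```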

Fine-Tuning Distillation and Real-World Applications

A further challenge is the dataset bias that can arise during point-prompt distillation. EdgeSAM addresses it with a lightweight module inside the encoder that embeds dataset-specific granularity priors, allowing the model to respond appropriately to prompts at different levels of detail. Empirical benchmarks show EdgeSAM staying close to SAM's accuracy across prompt types and datasets, while surpassing MobileSAM/EfficientSAM on COCO and LVIS. Combining accurate segmentation with real-time performance on mobile devices opens up applications in video editing, instance segmentation, and other interactive mobile tasks.
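
The summary does not reproduce the module's exact design, so the snippet below is only a hedged illustration of the general idea: a small, dataset-conditioned adapter on the encoder output that injects a granularity prior before the features reach the mask decoder. The class and parameter names (`GranularityPrior`, `num_datasets`, `embed_dim`) are assumptions made for illustration, not the released EdgeSAM code.

```python
import torch
import torch.nn as nn

class GranularityPrior(nn.Module):
    """Illustrative lightweight adapter that conditions encoder features
    on a learned, dataset-specific granularity embedding (names hypothetical)."""

    def __init__(self, embed_dim=256, num_datasets=2):
        super().__init__()
        # One learned embedding per training dataset (e.g. COCO vs. SA-1B).
        self.prior = nn.Embedding(num_datasets, embed_dim)
        # A small bottleneck keeps the module cheap at inference time.
        self.proj = nn.Sequential(
            nn.Conv2d(embed_dim, embed_dim // 4, 1),
            nn.GELU(),
            nn.Conv2d(embed_dim // 4, embed_dim, 1),
        )

    def forward(self, feats, dataset_id):
        # feats: (B, C, H, W) encoder features; dataset_id: (B,) long tensor.
        prior = self.prior(dataset_id)[:, :, None, None]   # (B, C, 1, 1)
        return feats + self.proj(feats + prior)            # residual modulation
```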
