EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM (2312.06660v2)

Published 11 Dec 2023 in cs.CV

Abstract: This paper presents EdgeSAM, an accelerated variant of the Segment Anything Model (SAM), optimized for efficient execution on edge devices with minimal compromise in performance. Our approach involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture, better suited for edge devices. We carefully benchmark various distillation strategies and demonstrate that task-agnostic encoder distillation fails to capture the full knowledge embodied in SAM. To overcome this bottleneck, we include both the prompt encoder and mask decoder in the distillation process, with box and point prompts in the loop, so that the distilled model can accurately capture the intricate dynamics between user input and mask generation. To mitigate dataset bias issues stemming from point prompt distillation, we incorporate a lightweight module within the encoder. As a result, EdgeSAM achieves a 37-fold speed increase compared to the original SAM, and it also outperforms MobileSAM/EfficientSAM, being over 7 times as fast when deployed on edge devices while enhancing the mIoUs on COCO and LVIS by 2.3/1.5 and 3.1/1.6, respectively. It is also the first SAM variant that can run at over 30 FPS on an iPhone 14. Code and demo are available at https://www.mmlab-ntu.com/project/edgesam.

Authors (4)
  1. Chong Zhou (12 papers)
  2. Xiangtai Li (128 papers)
  3. Chen Change Loy (288 papers)
  4. Bo Dai (245 papers)
Citations (33)

Summary

Introduction to EdgeSAM

The Segment Anything Model (SAM) has been difficult to deploy directly on edge devices such as smartphones because it was designed around powerful hardware that these devices lack. Its substantial computational requirements have largely kept its interactive segmentation capabilities out of reach for mobile users. EdgeSAM addresses this gap, aiming to unlock real-time interactive segmentation on edge devices.

Overcoming Performance Barriers

The core of this approach is a distillation process that transforms the heavy ViT-based SAM image encoder into a leaner, CNN-based architecture better suited to mobile platforms. The distillation goes beyond conventional task-agnostic encoder distillation, which the authors show cannot fully transfer SAM's knowledge to a compact model. By including both the prompt encoder and mask decoder in the distillation loop, with box and point prompts sampled during training, EdgeSAM preserves the intricate dynamics between user prompts and mask generation that are crucial to SAM's interactivity on device. A rough sketch of this training objective is given below.
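
To make the recipe concrete, the following is a minimal, hedged sketch of the two losses described above in PyTorch-style code: a feature-level distillation loss between the teacher (ViT) and student (CNN) encoders, plus a prompt-in-the-loop mask loss in which both encoders' features are passed through the shared, frozen SAM prompt encoder and mask decoder with sampled box/point prompts. Names such as `teacher_encoder`, `student_encoder`, `prompt_encoder`, `mask_decoder`, and `sample_prompts` are illustrative placeholders, not the released EdgeSAM API, and the loss weighting and prompt-sampling details only loosely follow the paper (which iteratively samples new point prompts from regions where teacher and student disagree).

```python
import torch
import torch.nn.functional as F

def distill_step(image, gt_masks, teacher_encoder, student_encoder,
                 prompt_encoder, mask_decoder, sample_prompts,
                 w_feat=1.0, w_mask=1.0):
    """One illustrative step of encoder + prompt-in-the-loop distillation.

    All module/function names are hypothetical; the frozen SAM prompt
    encoder and mask decoder are reused for both teacher and student.
    """
    with torch.no_grad():
        t_feat = teacher_encoder(image)      # frozen ViT-based SAM encoder
    s_feat = student_encoder(image)          # CNN-based student encoder (trained)

    # 1) Task-agnostic feature distillation on the encoder output.
    loss_feat = F.mse_loss(s_feat, t_feat)

    # 2) Prompt-in-the-loop distillation: sample box/point prompts
    #    (e.g. from GT boxes) and align the decoded masks of the
    #    student against the teacher's masks as soft targets.
    points, boxes = sample_prompts(gt_masks)            # hypothetical sampler
    sparse, dense = prompt_encoder(points=points, boxes=boxes)

    with torch.no_grad():
        t_masks, _ = mask_decoder(t_feat, sparse, dense)
    s_masks, _ = mask_decoder(s_feat, sparse, dense)

    loss_mask = F.binary_cross_entropy_with_logits(
        s_masks, torch.sigmoid(t_masks))

    return w_feat * loss_feat + w_mask * loss_mask
```

Because the prompt encoder and mask decoder stay frozen and shared, the student encoder is pushed to produce features that behave like the teacher's under actual interactive use, not merely to match them pointwise.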

Performance and Speed

EdgeSAM not only brings SAM's interactive functionality to edge devices but does so with remarkable efficiency, reporting roughly a 37-fold speedup over the original model. Compared with MobileSAM and EfficientSAM, it runs over 7 times as fast when deployed on edge devices while also improving accuracy, and it sustains more than 30 frames per second on an iPhone 14. The CNN-based architecture largely explains this efficiency, since on-device AI accelerators are typically optimized for convolution operations rather than transformer workloads.
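
Throughput figures like these are device-dependent. As a rough way to sanity-check encoder latency on one's own hardware (a desktop GPU here; on-phone numbers additionally require CoreML/ONNX export), one could use a simple warm-up-then-average timing loop like the sketch below, where `model` stands in for the distilled CNN encoder.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 1024, 1024),
                warmup=10, iters=100, device="cuda"):
    """Rough frames-per-second estimate for an image encoder (placeholder model)."""
    model = model.eval().to(device)
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):          # warm-up to stabilize clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```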

Fine-Tuning Distillation and Real-World Applications

A further challenge is the dataset bias that can arise during point-prompt distillation. EdgeSAM addresses it with a lightweight module inside the encoder that embeds dataset-specific granularity priors, allowing the model to respond appropriately to prompts at different levels of detail. Empirical benchmarks show EdgeSAM staying close to SAM's accuracy across prompt types and datasets, while surpassing MobileSAM/EfficientSAM on COCO and LVIS. Combining accurate segmentation with real-time performance on mobile devices opens up applications in video editing, instance segmentation, and other interactive mobile tasks.
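
The summary does not reproduce the module's exact design, so the snippet below is only a hedged illustration of the general idea: a small, dataset-conditioned adapter on the encoder output that injects a granularity prior before the features reach the mask decoder. The class and parameter names (`GranularityPrior`, `num_datasets`, `embed_dim`) are assumptions made for illustration, not the released EdgeSAM code.

```python
import torch
import torch.nn as nn

class GranularityPrior(nn.Module):
    """Illustrative lightweight adapter that conditions encoder features
    on a learned, dataset-specific granularity embedding (names hypothetical)."""

    def __init__(self, embed_dim=256, num_datasets=2):
        super().__init__()
        # One learned embedding per training dataset (e.g. COCO vs. SA-1B).
        self.prior = nn.Embedding(num_datasets, embed_dim)
        # A small bottleneck keeps the module cheap at inference time.
        self.proj = nn.Sequential(
            nn.Conv2d(embed_dim, embed_dim // 4, 1),
            nn.GELU(),
            nn.Conv2d(embed_dim // 4, embed_dim, 1),
        )

    def forward(self, feats, dataset_id):
        # feats: (B, C, H, W) encoder features; dataset_id: (B,) long tensor.
        prior = self.prior(dataset_id)[:, :, None, None]   # (B, C, 1, 1)
        return feats + self.proj(feats + prior)            # residual modulation
```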
