
Faster Segment Anything: Towards Lightweight SAM for Mobile Applications (2306.14289v2)

Published 25 Jun 2023 in cs.CV

Abstract: The Segment Anything Model (SAM) has attracted significant attention due to its impressive zero-shot transfer performance and high versatility for numerous vision applications (such as image editing with fine-grained control). Many such applications need to run on resource-constrained edge devices, like mobile phones. In this work, we aim to make SAM mobile-friendly by replacing the heavyweight image encoder with a lightweight one. Naively training such a new SAM as in the original SAM paper leads to unsatisfactory performance, especially when limited training resources are available. We find that this is mainly caused by the coupled optimization of the image encoder and mask decoder, and we therefore propose decoupled distillation. Concretely, we distill the knowledge from the heavy image encoder (ViT-H in the original SAM) into a lightweight image encoder, which is automatically compatible with the mask decoder of the original SAM. Training can be completed on a single GPU in less than a day, and the resulting lightweight SAM, termed MobileSAM, is more than 60 times smaller yet performs on par with the original SAM. For inference speed, with a single GPU, MobileSAM runs at around 10 ms per image: 8 ms for the image encoder and 4 ms for the mask decoder. With superior performance, MobileSAM is around 5 times faster than the concurrent FastSAM and 7 times smaller, making it more suitable for mobile applications. Moreover, we show that MobileSAM can run relatively smoothly on CPU. The code is available at https://github.com/ChaoningZhang/MobileSAM, together with a demo showing that MobileSAM runs relatively smoothly on CPU.

Authors (7)
  1. Chaoning Zhang (66 papers)
  2. Dongshen Han (6 papers)
  3. Yu Qiao (563 papers)
  4. Jung Uk Kim (15 papers)
  5. Sung-Ho Bae (29 papers)
  6. Seungkyu Lee (13 papers)
  7. Choong Seon Hong (165 papers)
Citations (247)

Summary

Overview of "Faster Segment Anything: Towards Lightweight SAM for Mobile Applications"

This paper addresses the challenge of adapting the Segment Anything Model (SAM) for mobile applications. SAM has attracted attention for its zero-shot, promptable segmentation, but it relies on a large ViT-H image encoder that is computationally intensive. The focus here is on building a more efficient variant, termed MobileSAM, by replacing the heavyweight image encoder with a lightweight alternative.

Key Contributions

The authors introduce a novel approach called decoupled distillation for adapting SAM. This strategy involves transferring knowledge from the original heavy ViT-H image encoder to a smaller one, while ensuring compatibility with the existing mask decoder. The method significantly reduces the computational resources required without sacrificing performance.
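
As a rough illustration, here is a minimal PyTorch sketch of the decoupled-distillation idea, assuming hypothetical `teacher_encoder` (the frozen ViT-H) and `student_encoder` (a lightweight backbone) modules that emit image embeddings of the same shape; the mask decoder never enters the loss.

```python
import torch
import torch.nn.functional as F

def distillation_step(student_encoder, teacher_encoder, images, optimizer):
    """One decoupled-distillation step: the student learns to reproduce the
    teacher's image embeddings, so the original mask decoder stays untouched."""
    with torch.no_grad():                 # the heavy teacher (ViT-H) is frozen
        target = teacher_encoder(images)  # teacher image embeddings
    pred = student_encoder(images)        # lightweight student embeddings
    loss = F.mse_loss(pred, target)       # simple feature-matching objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the target is the teacher's embedding rather than the final mask, this objective sidesteps the coupled encoder-decoder optimization that the authors identify as the main obstacle to training a smaller SAM from scratch.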

Numerical Results

MobileSAM demonstrates remarkable efficiency gains:

  • The lightweight image encoder has over 100 times fewer parameters than the original ViT-H encoder, and the overall model is more than 60 times smaller than the original SAM.
  • Inference runs in roughly 10 ms per image on a single GPU: about 8 ms for the image encoder and 4 ms for the mask decoder.

Compared with the concurrent FastSAM, MobileSAM is about seven times smaller and around five times faster, while achieving superior segmentation quality as measured by mean Intersection over Union (mIoU).
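
The mIoU referenced above is the standard overlap metric; as a small, self-contained sketch (not the authors' evaluation code), it can be computed over paired binary masks as follows:

```python
import numpy as np

def mean_iou(pred_masks, ref_masks):
    """Mean Intersection over Union across paired boolean masks of equal shape."""
    ious = []
    for pred, ref in zip(pred_masks, ref_masks):
        intersection = np.logical_and(pred, ref).sum()
        union = np.logical_or(pred, ref).sum()
        ious.append(intersection / union if union > 0 else 1.0)
    return float(np.mean(ious))
```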

Methodological Insights

The decoupled distillation approach involves two main phases:

  1. Distillation of the small image encoder from the ViT-H encoder without immediately involving the mask decoder.
  2. Optional finetuning of the mask decoder to better align with the new image encoder, though the initial results suggest this may not always be necessary.

This decoupled approach keeps resource usage low: training completes on a single GPU in under a day, making the method accessible to researchers with limited computational resources.
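
To make the two phases concrete, below is a hedged sketch (hypothetical module and argument names, not the actual MobileSAM code) of how the distilled encoder slots into the original pipeline: because phase 1 aligns the student with ViT-H's embedding space, the original prompt encoder and mask decoder can be reused as-is.

```python
import torch

class MobileSAMSketch(torch.nn.Module):
    """Hypothetical composition: the distilled lightweight encoder replaces
    ViT-H, while the prompt encoder and mask decoder come from the original
    SAM checkpoint and are reused without retraining."""

    def __init__(self, lightweight_encoder, prompt_encoder, mask_decoder):
        super().__init__()
        self.image_encoder = lightweight_encoder  # phase-1 distilled student
        self.prompt_encoder = prompt_encoder      # original SAM module, frozen
        self.mask_decoder = mask_decoder          # original SAM module, frozen

    def forward(self, image, prompt):
        # The student's embedding matches the teacher's shape, so the decoder
        # consumes it exactly as it would a ViT-H embedding.
        embedding = self.image_encoder(image)
        prompt_tokens = self.prompt_encoder(prompt)
        return self.mask_decoder(embedding, prompt_tokens)
```

An optional phase 2 would unfreeze the mask decoder and finetune it on the student's embeddings, but the reported results suggest performance is already on par with the original SAM without this step.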

Implications and Future Directions

The development of MobileSAM paves the way for deploying advanced segmentation tasks on resource-constrained devices, such as mobile phones, facilitating real-time applications in various fields like augmented reality and mobile-based image editing. The decoupled distillation technique could be applicable beyond SAM, offering a framework for optimizing other vision models for mobile deployment.

Future research could explore further optimizations of the image encoder or extend the SAM framework to handle a broader range of mobile-based applications. Additionally, investigating the integration with other foundational vision models could broaden the applicability and functionality of the SAM framework in diverse environments.

In summary, this work takes a significant step towards bringing foundational vision capabilities to mobile platforms, enabling efficient, practical deployment without prohibitive computational demands.
