Faster Segment Anything: Towards Lightweight SAM for Mobile Applications (2306.14289v2)
Abstract: The Segment Anything Model (SAM) has attracted significant attention due to its impressive zero-shot transfer performance and its high versatility across numerous vision applications (such as image editing with fine-grained control). Many of these applications need to run on resource-constrained edge devices, like mobile phones. In this work, we aim to make SAM mobile-friendly by replacing the heavyweight image encoder with a lightweight one. Naively training such a new SAM as in the original SAM paper leads to unsatisfactory performance, especially when limited training resources are available. We find that this is mainly caused by the coupled optimization of the image encoder and mask decoder, which motivates our proposed decoupled distillation. Concretely, we distill the knowledge from the heavy image encoder (ViT-H in the original SAM) into a lightweight image encoder, which is automatically compatible with the mask decoder of the original SAM. The training can be completed on a single GPU in less than one day, and the resulting lightweight SAM, termed MobileSAM, is more than 60 times smaller yet performs on par with the original SAM. Regarding inference speed, on a single GPU MobileSAM runs in around 10ms per image: 8ms on the image encoder and 4ms on the mask decoder. With its superior performance, MobileSAM is around 5 times faster and 7 times smaller than the concurrent FastSAM, making it more suitable for mobile applications. Moreover, we show that MobileSAM can run relatively smoothly on CPU. The code, together with a demo of MobileSAM running on CPU, is available at https://github.com/ChaoningZhang/MobileSAM.
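To make the decoupled-distillation idea concrete, below is a minimal PyTorch-style sketch of how the image-encoder distillation could look. It is illustrative only: the module names, embedding shape, loss choice (plain MSE on image embeddings), and hyperparameters are assumptions rather than the paper's exact recipe. The key point it shows is that only the lightweight student encoder is optimized against the frozen ViT-H teacher, so the original SAM mask decoder can be reused unchanged.

```python
# Minimal sketch of decoupled distillation (not the official training script).
# Assumptions: `teacher_encoder` is SAM's frozen ViT-H image encoder and
# `student_encoder` is a lightweight ViT; both map a preprocessed image to the
# same embedding shape so the original mask decoder stays compatible.
import torch
import torch.nn.functional as F

def distill_image_encoder(teacher_encoder, student_encoder, dataloader,
                          epochs=8, lr=1e-3, device="cuda"):
    teacher_encoder.to(device).eval()          # teacher stays frozen
    for p in teacher_encoder.parameters():
        p.requires_grad_(False)

    student_encoder.to(device).train()
    optimizer = torch.optim.AdamW(student_encoder.parameters(), lr=lr)

    for _ in range(epochs):
        for images in dataloader:              # images: (B, 3, H, W), already preprocessed
            images = images.to(device)
            with torch.no_grad():
                target = teacher_encoder(images)   # teacher image embedding
            pred = student_encoder(images)          # student image embedding
            loss = F.mse_loss(pred, target)         # match embeddings; decoder untouched
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student_encoder
```

After distillation, the trained student encoder can simply replace ViT-H in the SAM pipeline, since the mask decoder consumes the same embedding format.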
- One small step for generative AI, one giant leap for AGI: A complete survey on ChatGPT in AIGC era. arXiv preprint arXiv:2304.06488, 2023a.
- A complete survey on generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 all you need? arXiv preprint arXiv:2303.11717, 2023b.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020.
- Improving language understanding by generative pre-training. 2018.
- Language models are unsupervised multitask learners. OpenAI blog, 2019.
- On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
- MP-FedCL: Multi-prototype federated contrastive learning for edge intelligence. arXiv preprint arXiv:2304.01950, 2023a.
- How does SimSiam avoid collapse without negative samples? A unified understanding with self-supervised contrastive learning. In ICLR, 2022a.
- Dual temperature helps contrastive learning without many negative samples: Towards understanding and simplifying MoCo. In CVPR, 2022b.
- Segment anything. arXiv preprint arXiv:2304.02643, 2023.
- A survey on Segment Anything Model (SAM): Vision foundation model meets prompt engineering. 2023c.
- Fast segment anything. arXiv preprint arXiv:2306.12156, 2023.
- Jun Ma and Bo Wang. Segment anything in medical images. arXiv preprint arXiv:2304.12306, 2023.
- Input augmentation with SAM: Boosting medical image segmentation with segmentation foundation model. arXiv preprint arXiv:2304.11332, 2023d.
- Can SAM segment anything? When SAM meets camouflaged object detection. arXiv preprint arXiv:2304.04709, 2023.
- Segment Anything Model (SAM) meets glass: Mirror and transparent objects cannot be easily detected. arXiv preprint, 2023.
- Attack-SAM: Towards evaluating adversarial robustness of Segment Anything Model. arXiv preprint, 2023e.
- Robustness of SAM: Segment anything under corruptions and beyond. arXiv preprint arXiv:2306.07713, 2023b.
- IDEA-Research. Grounded segment anything, 2023. URL https://github.com/IDEA-Research/Grounded-Segment-Anything. GitHub repository.
- Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023a.
- Semantic-Segment-Anything, 2023. URL https://github.com/fudan-zvg/Semantic-Segment-Anything. GitHub repository.
- Curt Park. Segment anything with CLIP, 2023. URL https://github.com/Curt-Park/segment-anything-with-clip. GitHub repository.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790, 2023.
- Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968, 2023.
- Zxyang. Segment and track anything, 2023. URL https://github.com/z-x-yang/Segment-and-Track-Anything. GitHub repository.
- Anything-3D: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:2304.10261, 2023.
- Any-speaker adaptive text-to-speech synthesis with diffusion models. arXiv preprint arXiv:2211.09383, 2022.
- MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
- Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1314–1324, 2019.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
- MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178, 2021.
- EfficientFormer: Vision transformers at MobileNet speed. Advances in Neural Information Processing Systems, 35:12934–12949, 2022a.
- EfficientViT: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14420–14430, 2023b.
- Next-ViT: Next generation vision transformer for efficient deployment in realistic industrial scenarios. arXiv preprint arXiv:2207.05501, 2022b.
- TinyViT: Fast pretraining distillation for small vision transformers. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXI, pages 68–85. Springer, 2022.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Decoupled adversarial contrastive learning for self-supervised adversarial robustness. In ECCV, pages 725–742. Springer, 2022c.
- Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
- V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.
- Chaoning Zhang (66 papers)
- Dongshen Han (6 papers)
- Yu Qiao (563 papers)
- Jung Uk Kim (15 papers)
- Sung-Ho Bae (29 papers)
- Seungkyu Lee (13 papers)
- Choong Seon Hong (165 papers)