RepViT-SAM: Towards Real-Time Segmenting Anything (2312.05760v2)

Published 10 Dec 2023 in cs.CV

Abstract: Segment Anything Model (SAM) has shown impressive zero-shot transfer performance for various computer vision tasks recently. However, its heavy computation costs remain daunting for practical applications. MobileSAM proposes to replace the heavyweight image encoder in SAM with TinyViT by employing distillation, which results in a significant reduction in computational requirements. However, its deployment on resource-constrained mobile devices still encounters challenges due to the substantial memory and computational overhead caused by self-attention mechanisms. Recently, RepViT achieves the state-of-the-art performance and latency trade-off on mobile devices by incorporating efficient architectural designs of ViTs into CNNs. Here, to achieve real-time segmenting anything on mobile devices, following MobileSAM, we replace the heavyweight image encoder in SAM with RepViT model, ending up with the RepViT-SAM model. Extensive experiments show that RepViT-SAM can enjoy significantly better zero-shot transfer capability than MobileSAM, along with nearly $10\times$ faster inference speed. The code and models are available at \url{https://github.com/THU-MIG/RepViT}.

Introduction

In the field of computer vision, the Segment Anything Model (SAM) has recently been recognized for its exceptional ability to adapt to a wide range of tasks without additional training. Despite this flexibility, SAM's computational cost has hindered its deployment on mobile devices, which heavily limits its practicality. MobileSAM addressed part of the problem by distilling SAM's heavyweight image encoder into the lightweight TinyViT, but its self-attention layers still incur substantial memory and compute overhead, especially on mobile platforms.

Methodology

The proposed RepViT-SAM model adapts SAM's architecture for real-time performance on mobile devices. It replaces the heavyweight image encoder of the original SAM with a RepViT model, a CNN architecture that incorporates efficient design choices from Vision Transformers (ViTs). By leveraging components such as early convolutions, structurally reparameterized depthwise convolutions, and squeeze-and-excitation layers, RepViT-SAM delivers high-quality segmentation at a significantly reduced computational cost: in the authors' tests it ran inference nearly 10× faster than MobileSAM on a MacBook without compromising zero-shot transfer capability.
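To make these architectural ingredients concrete, below is a minimal PyTorch sketch of two of the components named above: a structurally reparameterizable depthwise convolution block and a squeeze-and-excitation layer. The class names and the exact branch layout are illustrative assumptions, not the authors' released implementation (see the RepViT repository for that).

```python
import torch
import torch.nn as nn

class RepDWConv(nn.Module):
    """Structurally reparameterized depthwise conv (sketch).

    During training, a 3x3 depthwise branch, a 1x1 depthwise branch
    (a learned per-channel scale), and an identity branch run in
    parallel. The branch layout here is a hypothetical simplification
    of RepVGG-style reparameterization, not the authors' code.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim, bias=False)
        self.conv1x1 = nn.Conv2d(dim, dim, 1, groups=dim, bias=False)
        self.bn = nn.BatchNorm2d(dim)

    def forward(self, x):
        # Identity branch plus the two conv branches, merged by addition.
        return self.bn(x + self.conv3x3(x) + self.conv1x1(x))

class SqueezeExcite(nn.Module):
    """Standard squeeze-and-excitation channel attention."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),             # squeeze: global context
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),                        # per-channel gates in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x)
```

The point of the reparameterized block is that, at deployment time, its parallel branches can be algebraically folded into a single 3×3 depthwise convolution, so the multi-branch expressiveness is paid for only during training, not at inference.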

Experimental Results

To assess RepViT-SAM, the authors compared it against other models across a comprehensive set of zero-shot tasks: edge detection, instance segmentation, segmentation in the wild, video object segmentation, and several other real-world applications. RepViT-SAM showed superior zero-shot transfer ability and significantly faster inference than its MobileSAM counterpart. Moreover, it achieved performance comparable to the heavyweight ViT-based SAM models, striking a promising balance between efficiency and effectiveness.
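For readers who want to reproduce the latency comparison in spirit, the following is a generic timing sketch. It assumes any SAM-compatible image encoder taking 1024×1024 inputs; it is not the paper's benchmarking harness, and real numbers depend heavily on the device and backend.

```python
import time
import torch

@torch.no_grad()
def benchmark_encoder(encoder: torch.nn.Module, image_size: int = 1024,
                      warmup: int = 5, runs: int = 20) -> float:
    """Rough wall-clock latency (ms) of one encoder forward pass.

    `encoder` stands in for any SAM-style image encoder (ViT-H,
    TinyViT, RepViT); this is a generic sketch, not the paper's setup.
    """
    encoder.eval()
    x = torch.randn(1, 3, image_size, image_size)
    for _ in range(warmup):          # warm up caches / lazy init
        encoder(x)
    start = time.perf_counter()
    for _ in range(runs):
        encoder(x)
    return (time.perf_counter() - start) * 1000 / runs
```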

Conclusion

RepViT-SAM establishes itself as a strong model for efficient image and video segmentation, particularly for applications on mobile and other resource-constrained devices. Its distillation strategy and architectural choices pave the way for further work on lightweight, real-time computer vision systems. With the code and models publicly released, developers and researchers have a robust starting point for building segmentation solutions that run in real time on mobile devices.
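As a final illustration, the distillation mentioned above follows MobileSAM's decoupled recipe: the lightweight student encoder is trained to reproduce the frozen SAM teacher's image embeddings, leaving the prompt encoder and mask decoder untouched. The sketch below shows one such training step; the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, images, optimizer):
    """One encoder-distillation step (sketch of the decoupled recipe:
    match the frozen teacher's image embeddings with an MSE loss)."""
    with torch.no_grad():
        target = teacher(images)          # frozen SAM ViT-H embeddings
    loss = F.mse_loss(student(images), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```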

References (29)
  1. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2010.
  2. MVTec AD: A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9592–9600, 2019.
  3. Segment Any Anomaly without training via hybrid prompt regularization. arXiv preprint arXiv:2305.10724, 2023.
  4. Make RepVGG greater again: A quantization-aware approach. arXiv preprint arXiv:2212.01593, 2022.
  5. RepVGG: Making VGG-style ConvNets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13733–13742, 2021.
  6. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  7. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2918–2928, 2021.
  8. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
  9. Segment Anything is not always perfect: An investigation of SAM on different real-world applications. arXiv preprint arXiv:2304.05750, 2023.
  10. DETRs with hybrid matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19702–19712, 2023.
  11. CoTracker: It is better to track together. arXiv preprint arXiv:2307.07635, 2023.
  12. Segment Anything in high quality. arXiv preprint arXiv:2306.01567, 2023.
  13. Segment Anything. arXiv preprint arXiv:2304.02643, 2023.
  14. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
  15. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  16. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
  17. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), pages 416–423. IEEE, 2001.
  18. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  19. Segment Anything meets point tracking. arXiv preprint arXiv:2307.01197, 2023.
  20. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  21. RepViT: Revisiting mobile CNN from ViT perspective. arXiv preprint arXiv:2307.09283, 2023.
  22. Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 136–145, 2017.
  23. Unidentified Video Objects: A benchmark for dense, open-world segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10776–10785, 2021.
  24. TinyViT: Fast pretraining distillation for small vision transformers. In European Conference on Computer Vision, pages 68–85. Springer, 2022.
  25. Early convolutions help transformers see better. Advances in Neural Information Processing Systems, 34:30392–30400, 2021.
  26. EfficientSAM: Leveraged masked image pretraining for efficient segment anything. arXiv preprint arXiv:2312.00863, 2023.
  27. Faster Segment Anything: Towards lightweight SAM for mobile applications. arXiv preprint arXiv:2306.14289, 2023.
  28. Fast Segment Anything. arXiv preprint arXiv:2306.12156, 2023.
  29. Generalized decoding for pixel, image, and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15116–15127, 2023.
Authors (5)
  1. Ao Wang (43 papers)
  2. Hui Chen (298 papers)
  3. Zijia Lin (43 papers)
  4. Jungong Han (111 papers)
  5. Guiguang Ding (79 papers)
Citations (9)