
Group-Mix SAM: Lightweight Solution for Industrial Assembly Line Applications (2403.10053v1)

Published 15 Mar 2024 in cs.CV

Abstract: Since the advent of the Segment Anything Model (SAM) roughly one year ago, it has attracted significant academic interest and spawned numerous investigations and publications from various perspectives. However, SAM has yet to be deployed in practical assembly line scenarios because of its heavyweight image encoder, which contains 632M parameters. In this study, we replace that heavyweight image encoder with a lightweight one, enabling deployment of SAM on real assembly lines. Specifically, we use decoupled distillation to train the encoder of MobileSAM in a resource-limited setting; the entire knowledge distillation experiment completes in a single day on one RTX 4090. The resulting lightweight SAM, referred to as Group-Mix SAM, has 37.63% (2.16M) fewer parameters and 42.5% (15614.7M) fewer floating-point operations than MobileSAM. Nevertheless, on our constructed industrial dataset, MALSD, its mIoU is only marginally lower than MobileSAM's, at 0.615. Finally, we conduct a comprehensive comparative experiment demonstrating the advantages of Group-Mix SAM in the industrial domain. With this performance, Group-Mix SAM is better suited to practical assembly line applications.
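The abstract's "decoupled distillation" trains the student image encoder to reproduce the frozen teacher encoder's embeddings directly, so SAM's mask decoder never enters the training loop. The sketch below illustrates that idea with an MSE feature-matching loss; the linear "encoders", shapes, and hyperparameters are illustrative stand-ins, not the paper's actual architecture or recipe.

```python
import numpy as np

# Decoupled distillation, minimally: freeze the teacher encoder, compute its
# embeddings once, then fit the student encoder to those embeddings with an
# MSE loss. No mask decoder or segmentation labels are involved.

rng = np.random.default_rng(0)

W_teacher = np.ones((4, 2))          # frozen "teacher" projection (stand-in for SAM's ViT-H)
x = rng.normal(size=(8, 4))          # a batch of flattened image features
target = x @ W_teacher               # teacher embeddings, computed once up front

W_student = rng.normal(size=(4, 2)) * 0.1   # tiny linear "student" encoder

def mse(a, b):
    return float(np.mean((a - b) ** 2))

init_loss = mse(x @ W_student, target)
for _ in range(500):                 # plain gradient descent on the embedding MSE
    grad = 2.0 * x.T @ (x @ W_student - target) / x.shape[0]
    W_student -= 0.05 * grad
loss = mse(x @ W_student, target)
print(f"distillation loss: {init_loss:.4f} -> {loss:.6f}")
```

Because the target embeddings are fixed, they can be precomputed and cached, which is what makes the full experiment feasible on a single GPU in a day.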


