
Fast Segment Anything (2306.12156v1)

Published 21 Jun 2023 in cs.CV and cs.AI

Abstract: The recently proposed Segment Anything Model (SAM) has had a significant impact on many computer vision tasks. It is becoming a foundational step for many high-level tasks, such as image segmentation, image captioning, and image editing. However, its huge computation costs prevent it from wider application in industry scenarios. The computation mainly comes from the Transformer architecture at high-resolution inputs. In this paper, we propose a speed-up alternative method for this fundamental task with comparable performance. By reformulating the task as segments generation and prompting, we find that a regular CNN detector with an instance segmentation branch can also accomplish this task well. Specifically, we convert this task to the well-studied instance segmentation task and directly train an existing instance segmentation method using only 1/50 of the SA-1B dataset published by the SAM authors. With our method, we achieve performance comparable to SAM at 50 times higher run-time speed. We give sufficient experimental results to demonstrate its effectiveness. The code and demos will be released at https://github.com/CASIA-IVA-Lab/FastSAM.

References (36)
  1. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2010.
  2. Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 328–335, 2014.
  3. MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection. In CVPR, pages 9592–9600, 2019.
  4. YOLACT: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9157–9166, 2019.
  5. Salient object detection: A survey. Computational Visual Media, 5:117–150, 2019.
  6. John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):679–698, 1986.
  7. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer, 2020.
  8. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.
  9. Efficient graph-based image segmentation. International Journal of Computer Vision, 59:167–181, 2004.
  10. YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430, 2021.
  11. Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
  12. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142–158, 2015.
  13. LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5356–5364, 2019.
  14. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Transactions on Geoscience and Remote Sensing, 57(1):574–586, 2018.
  15. Glenn Jocher. YOLOv5 by Ultralytics, 2020. https://github.com/ultralytics/yolov5.
  16. YOLO by Ultralytics, 2023. https://github.com/ultralytics/ultralytics.
  17. Learning open-world object proposals without learning to classify. IEEE Robotics and Automation Letters, 7(2):5453–5460, 2022.
  18. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9404–9413, 2019.
  19. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  20. Geodesic object proposals. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 725–739. Springer, 2014.
  21. YOLOv6 v3.0: A full-scale reloading, 2023.
  22. Exploring denoised cross-video contrast for weakly-supervised temporal action localization. In CVPR, pages 19914–19924, 2022.
  23. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
  24. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  25. Learning selective mutual attention and contrast for RGB-D saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9026–9042, 2021.
  26. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, volume 2, pages 416–423. IEEE, 2001.
  27. Learning to segment object candidates. Advances in Neural Information Processing Systems, 28, 2015.
  28. EDTER: Edge detection with transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1402–1412, 2022.
  29. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  30. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
  31. A 3x3 isotropic gradient operator for image processing. A talk at the Stanford Artificial Intelligence Project, pages 271–272, 1968.
  32. Selective search for object recognition. International Journal of Computer Vision, 104:154–171, 2013.
  33. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696, 2022.
  34. Detecting everything in the open world: Towards universal object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11433–11443, 2023.
  35. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1395–1403, 2015.
  36. Edge boxes: Locating object proposals from edges. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 391–405. Springer, 2014.
Authors (8)
  1. Xu Zhao (64 papers)
  2. Wenchao Ding (33 papers)
  3. Yongqi An (5 papers)
  4. Yinglong Du (1 paper)
  5. Tao Yu (282 papers)
  6. Min Li (246 papers)
  7. Ming Tang (199 papers)
  8. Jinqiao Wang (76 papers)
Citations (195)

Summary

Fast Segment Anything: A High-Efficiency Approach to Instance Segmentation

The paper "Fast Segment Anything" by Zhao et al. presents a compelling alternative to the Segment Anything Model (SAM), offering a more computationally efficient approach to instance segmentation without significant sacrifices in performance. This work specifically targets the computational challenges of SAM, which relies heavily on the Transformer architecture, thus hindering its practical applicability in real-time scenarios due to high resource demands.

Methodology Overview

The proposed solution, FastSAM, reframes segmentation as a two-stage process: all-instance segmentation followed by prompt-guided selection. This decoupling is pivotal in reducing computational demands. The first stage employs a CNN detector, the YOLOv8-seg model, equipped with an instance segmentation branch in the style of YOLACT; the second stage selects the region of interest from the candidate masks according to a point, box, or text prompt. By exploiting the computational efficiency of CNNs, the authors report roughly 50x faster runtime than SAM on a single NVIDIA GeForce RTX 3090 without significantly compromising performance.
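
The following sketch illustrates the two-stage idea, assuming the `ultralytics` package and a YOLOv8-seg checkpoint (the weight filename and the confidence/IoU thresholds are illustrative, not FastSAM's exact implementation):

```python
import numpy as np
from ultralytics import YOLO

# Stage 1: all-instance segmentation with a YOLOv8-seg model
# ("FastSAM.pt" is an assumed checkpoint name).
model = YOLO("FastSAM.pt")
results = model("image.jpg", retina_masks=True, conf=0.4, iou=0.9)
masks = results[0].masks.data.cpu().numpy() > 0  # (N, H, W) boolean masks

# Stage 2a: point prompt -- keep the masks that contain the clicked point.
def point_prompt(masks: np.ndarray, x: int, y: int) -> np.ndarray:
    return masks[masks[:, y, x]]

# Stage 2b: box prompt -- keep the mask with the highest IoU to the box.
def box_prompt(masks: np.ndarray, box: tuple) -> np.ndarray:
    x0, y0, x1, y1 = box
    box_mask = np.zeros(masks.shape[1:], dtype=bool)
    box_mask[y0:y1, x0:x1] = True
    inter = (masks & box_mask).sum(axis=(1, 2))
    union = (masks | box_mask).sum(axis=(1, 2))
    return masks[np.argmax(inter / np.maximum(union, 1))]
```

Because stage 1 is prompt-independent, its masks can be computed once and reused across prompts; the heavy lifting is done by an efficient CNN rather than a high-resolution Transformer encoder.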

Key Contributions and Results

FastSAM is particularly notable for matching SAM's quality at a fraction of the computational cost. Trained on only 1/50 of the SA-1B dataset, it performs comparably to SAM on downstream tasks such as edge detection, object proposal generation, instance segmentation, and text-prompted object localization, while running roughly 50 times faster than SAM's standard inference.

The paper reports robust performance on well-known benchmarks, including COCO and LVIS. For object proposal generation, FastSAM surpasses previous methods with an AR@1000 of 63.7, where AR@1000 denotes the recall of ground-truth objects among the top 1,000 proposals, averaged over a range of IoU thresholds. However, mask quality suffers on fine-grained structures, notably for small objects, suggesting limitations of prototype-based mask heads such as YOLACT's.
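
For concreteness, a COCO-style computation of this metric might look like the following (a minimal sketch assuming a precomputed IoU matrix between ground-truth masks and ranked proposals):

```python
import numpy as np

def ar_at_1000(iou: np.ndarray,
               thresholds: np.ndarray = np.arange(0.5, 1.0, 0.05)) -> float:
    """iou: (num_gt, num_proposals) IoU matrix, proposals sorted by score."""
    best = iou[:, :1000].max(axis=1)           # best proposal IoU per GT object
    recalls = [(best >= t).mean() for t in thresholds]
    return float(np.mean(recalls))             # average recall over thresholds
```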

Practical Implications and Future Directions

FastSAM's efficiency positions it as an attractive solution for industrial applications requiring real-time processing, such as anomaly detection and video tracking. The work also demonstrates that CNNs can remain competitive on tasks recently dominated by Transformer models, suggesting that efficiency-accuracy trade-offs tailored to a particular task can still favor specialized architectures.

Moving forward, enhancements could target the mask scoring mechanism and the capacity of the prototype-based mask head to better handle small objects and improve mask quality. Scaling up training beyond 1/50 of SA-1B could further refine accuracy. The integration of CLIP for text prompts, sketched below, also opens avenues for more sophisticated multimodal tasks.
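
As an illustration of how text prompting can work, each candidate mask's crop can be scored against the prompt with CLIP and the best match selected. This is a minimal sketch using a public Hugging Face CLIP checkpoint, not FastSAM's exact implementation:

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_prompt(image: Image.Image, masks: np.ndarray, text: str) -> np.ndarray:
    # Crop each mask's bounding box from the image.
    crops = []
    for m in masks:
        ys, xs = np.nonzero(m)
        crops.append(image.crop((xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)))
    # Score every crop against the text prompt and return the best mask.
    inputs = proc(text=[text], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image.squeeze(1)  # shape (N,)
    return masks[int(scores.argmax())]
```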

In conclusion, FastSAM signifies a meaningful step in balancing efficiency with performance, raising pertinent discussions about architectural choices in computer vision models and their deployment in resource-constrained environments.
