
Fast Segment Anything (2306.12156v1)

Published 21 Jun 2023 in cs.CV and cs.AI

Abstract: The recently proposed Segment Anything Model (SAM) has had a significant impact on many computer vision tasks. It is becoming a foundational step for many high-level tasks, such as image segmentation, image captioning, and image editing. However, its huge computation costs prevent it from wider application in industry scenarios. The computation mainly comes from the Transformer architecture at high-resolution inputs. In this paper, we propose a speed-up alternative method for this fundamental task with comparable performance. By reformulating the task as segments generation and prompting, we find that a regular CNN detector with an instance segmentation branch can also accomplish this task well. Specifically, we convert this task to the well-studied instance segmentation task and directly train an existing instance segmentation method using only 1/50 of the SA-1B dataset published by the SAM authors. With our method, we achieve performance comparable to SAM at 50 times higher run-time speed. We give sufficient experimental results to demonstrate its effectiveness. The code and demos will be released at https://github.com/CASIA-IVA-Lab/FastSAM.

References (36)
  1. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2010.
  2. Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 328–335, 2014.
  3. MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection. In CVPR, pages 9592–9600, 2019.
  4. YOLACT: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9157–9166, 2019.
  5. Salient object detection: A survey. Computational Visual Media, 5:117–150, 2019.
  6. John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):679–698, 1986.
  7. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer, 2020.
  8. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.
  9. Efficient graph-based image segmentation. International Journal of Computer Vision, 59:167–181, 2004.
  10. YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430, 2021.
  11. Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
  12. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142–158, 2015.
  13. LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5356–5364, 2019.
  14. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Transactions on Geoscience and Remote Sensing, 57(1):574–586, 2018.
  15. Glenn Jocher. YOLOv5 by Ultralytics, 2020. https://github.com/ultralytics/yolov5.
  16. YOLO by Ultralytics, 2023. https://github.com/ultralytics/ultralytics.
  17. Learning open-world object proposals without learning to classify. IEEE Robotics and Automation Letters, 7(2):5453–5460, 2022.
  18. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9404–9413, 2019.
  19. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  20. Geodesic object proposals. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 725–739. Springer, 2014.
  21. YOLOv6 v3.0: A full-scale reloading, 2023.
  22. Exploring denoised cross-video contrast for weakly-supervised temporal action localization. In CVPR, pages 19914–19924, 2022.
  23. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
  24. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  25. Learning selective mutual attention and contrast for RGB-D saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9026–9042, 2021.
  26. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, volume 2, pages 416–423. IEEE, 2001.
  27. Learning to segment object candidates. Advances in Neural Information Processing Systems, 28, 2015.
  28. EDTER: Edge detection with transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1402–1412, 2022.
  29. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  30. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
  31. A 3x3 isotropic gradient operator for image processing. A talk at the Stanford Artificial Intelligence Project, pages 271–272, 1968.
  32. Selective search for object recognition. International Journal of Computer Vision, 104:154–171, 2013.
  33. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696, 2022.
  34. Detecting everything in the open world: Towards universal object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11433–11443, 2023.
  35. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1395–1403, 2015.
  36. Edge boxes: Locating object proposals from edges. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 391–405. Springer, 2014.
Authors (8)
  1. Xu Zhao (64 papers)
  2. Wenchao Ding (33 papers)
  3. Yongqi An (5 papers)
  4. Yinglong Du (1 paper)
  5. Tao Yu (282 papers)
  6. Min Li (246 papers)
  7. Ming Tang (199 papers)
  8. Jinqiao Wang (76 papers)
Citations (195)

Summary

Fast Segment Anything: A High-Efficiency Approach to Instance Segmentation

The paper "Fast Segment Anything" by Zhao et al. presents a compelling alternative to the Segment Anything Model (SAM), offering a more computationally efficient approach to instance segmentation without significant sacrifices in performance. This work specifically targets the computational challenges of SAM, which relies heavily on the Transformer architecture, thus hindering its practical applicability in real-time scenarios due to high resource demands.

Methodology Overview

The proposed solution, FastSAM, reframes segmentation as a two-stage process: all-instance segmentation followed by prompt-guided selection. This decoupling is pivotal in reducing computational demands. The first stage employs a CNN detector, the YOLOv8-seg model, equipped with an instance segmentation branch in the style of YOLACT; the second stage selects the region of interest from the candidate masks according to a point, box, or text prompt. By exploiting the computational efficiency of CNNs, the authors report roughly 50x faster runtime than SAM on a single NVIDIA GeForce RTX 3090 without significantly compromising performance.
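
The following sketch illustrates the two-stage idea, assuming the `ultralytics` package and a YOLOv8-seg checkpoint (the weight filename and the confidence/IoU thresholds are illustrative, not FastSAM's exact implementation):

```python
import numpy as np
from ultralytics import YOLO

# Stage 1: all-instance segmentation with a YOLOv8-seg model
# ("FastSAM.pt" is an assumed checkpoint name).
model = YOLO("FastSAM.pt")
results = model("image.jpg", retina_masks=True, conf=0.4, iou=0.9)
masks = results[0].masks.data.cpu().numpy() > 0  # (N, H, W) boolean masks

# Stage 2a: point prompt -- keep the masks that contain the clicked point.
def point_prompt(masks: np.ndarray, x: int, y: int) -> np.ndarray:
    return masks[masks[:, y, x]]

# Stage 2b: box prompt -- keep the mask with the highest IoU to the box.
def box_prompt(masks: np.ndarray, box: tuple) -> np.ndarray:
    x0, y0, x1, y1 = box
    box_mask = np.zeros(masks.shape[1:], dtype=bool)
    box_mask[y0:y1, x0:x1] = True
    inter = (masks & box_mask).sum(axis=(1, 2))
    union = (masks | box_mask).sum(axis=(1, 2))
    return masks[np.argmax(inter / np.maximum(union, 1))]
```

Because stage 1 is prompt-independent, its masks can be computed once and reused across prompts; the heavy lifting is done by an efficient CNN rather than a high-resolution Transformer encoder.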

Key Contributions and Results

FastSAM is particularly notable for matching SAM's quality at a fraction of the computational cost. Trained on only 1/50 of the SA-1B dataset, it performs comparably to SAM on downstream tasks such as edge detection, object proposal generation, instance segmentation, and text-prompted object localization, while running roughly 50 times faster than SAM's standard inference.

The paper reports robust performance on well-known benchmarks, including COCO and LVIS. For object proposal generation, FastSAM surpasses previous methods with an AR@1000 of 63.7, where AR@1000 denotes the recall of ground-truth objects among the top 1,000 proposals, averaged over a range of IoU thresholds. However, mask quality suffers on fine-grained structures, notably for small objects, suggesting limitations of prototype-based mask heads such as YOLACT's.
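
For concreteness, a COCO-style computation of this metric might look like the following (a minimal sketch assuming a precomputed IoU matrix between ground-truth masks and ranked proposals):

```python
import numpy as np

def ar_at_1000(iou: np.ndarray,
               thresholds: np.ndarray = np.arange(0.5, 1.0, 0.05)) -> float:
    """iou: (num_gt, num_proposals) IoU matrix, proposals sorted by score."""
    best = iou[:, :1000].max(axis=1)           # best proposal IoU per GT object
    recalls = [(best >= t).mean() for t in thresholds]
    return float(np.mean(recalls))             # average recall over thresholds
```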

Practical Implications and Future Directions

FastSAM's efficiency positions it as an attractive solution for industrial applications requiring real-time processing, such as anomaly detection and video tracking. The work also demonstrates that CNNs can remain competitive on tasks recently dominated by Transformer models, suggesting that efficiency-accuracy trade-offs tailored to a particular task can still favor specialized architectures.

Moving forward, enhancements could target the mask scoring mechanism and the capacity of the prototype-based mask head to better handle small objects and improve mask quality. Scaling up training beyond 1/50 of SA-1B could further refine accuracy. The integration of CLIP for text prompts, sketched below, also opens avenues for more sophisticated multimodal tasks.
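
As an illustration of how text prompting can work, each candidate mask's crop can be scored against the prompt with CLIP and the best match selected. This is a minimal sketch using a public Hugging Face CLIP checkpoint, not FastSAM's exact implementation:

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_prompt(image: Image.Image, masks: np.ndarray, text: str) -> np.ndarray:
    # Crop each mask's bounding box from the image.
    crops = []
    for m in masks:
        ys, xs = np.nonzero(m)
        crops.append(image.crop((xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)))
    # Score every crop against the text prompt and return the best mask.
    inputs = proc(text=[text], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image.squeeze(1)  # shape (N,)
    return masks[int(scores.argmax())]
```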

In conclusion, FastSAM signifies a meaningful step in balancing efficiency with performance, raising pertinent discussions about architectural choices in computer vision models and their deployment in resource-constrained environments.
