EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything (2312.00863v1)

Published 1 Dec 2023 in cs.CV

Abstract: Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives the impressive performance for zero-shot transfer and high versatility is a super large Transformer model trained on the extensive high-quality SA-1B dataset. While beneficial, the huge computation cost of the SAM model has limited its use in wider real-world applications. To address this limitation, we propose EfficientSAMs, light-weight SAM models that exhibit decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from the SAM image encoder for effective visual representation learning. Further, we take SAMI-pretrained light-weight image encoders and mask decoder to build EfficientSAMs, and finetune the models on SA-1B for the segment anything task. We perform evaluations on multiple vision tasks including image classification, object detection, instance segmentation, and semantic object detection, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything tasks such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models.

The paper introduces EfficientSAM (Efficient Segment Anything Model), a methodology that aims to lower the computational cost of the large Transformer image encoder behind the Segment Anything Model (SAM). SAM is highly effective across a wide range of image segmentation tasks, but its size often restricts deployment in real-world applications due to high computational demands.

To make SAM more accessible and practical to deploy, the researchers use lightweight Vision Transformer (ViT) encoders that retain respectable performance while significantly reducing complexity. The key innovation is a masked image pretraining scheme, referred to as "SAMI", which teaches the smaller encoder to reconstruct the features produced by SAM's large image encoder. After SAMI pretraining, the lightweight encoder is combined with SAM's mask decoder and fine-tuned on the SA-1B dataset to carry out the segment anything task; a minimal sketch of the reconstruction objective appears below.
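
The core of this recipe is a simple feature-regression objective. Below is a minimal PyTorch-style sketch of such a masked feature-reconstruction loss, written from the description above; the `ToyPatchEncoder`, the zero-valued mask tokens, and the single-layer decoder are illustrative assumptions, not the authors' implementation (in the paper the teacher is the frozen SAM ViT-H encoder and the student is a lightweight ViT).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a patch-feature encoder (hypothetical, for illustration only).
class ToyPatchEncoder(nn.Module):
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                        # (B, 3, H, W) -> (B, N, dim)
        return self.proj(x).flatten(2).transpose(1, 2)

def sami_style_loss(images, teacher, student, decoder, mask_ratio=0.75):
    """Masked feature reconstruction: student features at masked positions are
    dropped and replaced by mask tokens, a light decoder predicts a feature for
    every patch, and the prediction is regressed onto the frozen teacher's
    (SAM-like) patch features. For brevity the toy student encodes the full
    image and masking is applied to its output; the loss placement and masking
    details are simplifications of the paper's recipe."""
    with torch.no_grad():
        target = teacher(images)                 # (B, N, D) frozen teacher features

    B, N, D = target.shape
    num_keep = int(N * (1 - mask_ratio))
    keep = torch.rand(B, N, device=images.device).argsort(1)[:, :num_keep]
    keep_idx = keep.unsqueeze(-1).repeat(1, 1, D)

    tokens = student(images)                     # (B, N, D) student features
    visible = torch.gather(tokens, 1, keep_idx)  # keep only the visible patches

    # Zeros stand in for a learned mask token; visible tokens are scattered back.
    full = torch.zeros_like(target).scatter(1, keep_idx, visible)
    pred = decoder(full)                         # (B, N, D) reconstructed features

    return F.mse_loss(pred, target)

# Minimal usage with toy modules (dimensions chosen arbitrarily).
teacher = ToyPatchEncoder().eval()
student = ToyPatchEncoder()
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=1)
loss = sami_style_loss(torch.randn(2, 3, 224, 224), teacher, student, decoder)
loss.backward()
```

The design point to note is that the regression target is the SAM encoder's features rather than raw pixels, which is what transfers SAM's representation into the small encoder.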

The research team conducted extensive evaluations across multiple vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The results show that SAMI pretraining surpasses other masked image pretraining approaches, and that the EfficientSAM models, with their lightweight encoders, achieve significant gains over existing models in terms of the accuracy-complexity trade-off.

EfficientSAM models are especially noteworthy because they can segment objects they were never specifically trained on, a setting referred to as zero-shot instance segmentation. For instance, on the COCO and LVIS datasets, EfficientSAM outperforms other fast, lightweight SAM variants by roughly 4 AP; the sketch below illustrates how such prompted, zero-shot segmentation is typically run.
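
To make the zero-shot setup concrete: boxes from an off-the-shelf detector are used as prompts, and the promptable model predicts a mask per box even though it was never trained on that dataset's mask annotations. The sketch below is schematic; the `encode_image` and `predict_mask` methods are hypothetical placeholders, not the released EfficientSAM API.

```python
def zero_shot_instance_masks(model, image, detector_boxes, score_thresh=0.5):
    """Schematic zero-shot instance segmentation with a promptable SAM-style model.

    `model` is assumed (hypothetically) to expose:
      encode_image(image)          -> image embedding
      predict_mask(embedding, box) -> (mask_logits, predicted_iou)
    The boxes come from a separate detector; the segmenter itself has never
    been trained on this dataset's masks, hence "zero-shot".
    """
    embedding = model.encode_image(image)            # run the heavy image encoder once
    instances = []
    for box in detector_boxes:                       # one box prompt per detected object
        mask_logits, predicted_iou = model.predict_mask(embedding, box)
        if predicted_iou >= score_thresh:            # keep only confident masks
            instances.append({
                "box": box,
                "mask": mask_logits > 0,             # binarize the mask logits
                "score": predicted_iou,
            })
    return instances
```

Because the image embedding is computed once and reused for every prompt, the image encoder dominates the cost, which is exactly the component EfficientSAM shrinks.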

In summary, EfficientSAM offers a more computationally efficient alternative to SAM while maintaining strong accuracy across vision tasks, including zero-shot instance segmentation. The researchers plan to release the code and models to support further development and application of efficient SAM models.

Authors (12)
  1. Yunyang Xiong
  2. Bala Varadarajan
  3. Lemeng Wu
  4. Xiaoyu Xiang
  5. Fanyi Xiao
  6. Chenchen Zhu
  7. Xiaoliang Dai
  8. Dilin Wang
  9. Fei Sun
  10. Forrest Iandola
  11. Raghuraman Krishnamoorthi
  12. Vikas Chandra