
EfficientViT-SAM: Accelerated Segment Anything Model Without Accuracy Loss (2402.05008v2)

Published 7 Feb 2024 in cs.CV, cs.AI, and cs.LG

Abstract: We present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM's lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT. For the training, we begin with the knowledge distillation from the SAM-ViT-H image encoder to EfficientViT. Subsequently, we conduct end-to-end training on the SA-1B dataset. Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers 48.9x measured TensorRT speedup on A100 GPU over SAM-ViT-H without sacrificing performance. Our code and pre-trained models are released at https://github.com/mit-han-lab/efficientvit.

Introduction

The Segment Anything Model (SAM), introduced by Kirillov et al. [1], is a powerful paradigm for promptable image segmentation and achieves impressive zero-shot performance. However, it is computationally demanding, primarily because of its image encoder, which hinders deployment in latency-critical settings despite SAM's broad applicability in AR/VR, data annotation, and beyond. To address this limitation, the authors propose EfficientViT-SAM, which retains SAM's lightweight components, the prompt encoder and mask decoder, while replacing the heavy image encoder with the more efficient EfficientViT.
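
A minimal sketch of this composition, assuming a generic PyTorch interface; the module names and forward signature below are illustrative, not the repository's actual API:

```python
import torch
import torch.nn as nn

class EfficientViTSAM(nn.Module):
    """Sketch: keep SAM's lightweight prompt encoder and mask decoder,
    swap the heavy ViT-H image encoder for an EfficientViT backbone."""

    def __init__(self, image_encoder: nn.Module, prompt_encoder: nn.Module,
                 mask_decoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder    # EfficientViT (replaces SAM-ViT-H)
        self.prompt_encoder = prompt_encoder  # reused from SAM
        self.mask_decoder = mask_decoder      # reused from SAM

    def forward(self, image: torch.Tensor, points=None, boxes=None):
        # Encode the image once; prompts can then be decoded cheaply.
        image_embedding = self.image_encoder(image)
        sparse_embed, dense_embed = self.prompt_encoder(points=points, boxes=boxes)
        masks, iou_pred = self.mask_decoder(image_embedding, sparse_embed, dense_embed)
        return masks, iou_pred
```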

Related Work

Prior efforts to accelerate SAM, such as MobileSAM, EdgeSAM, and EfficientSAM, rely on optimization strategies including knowledge distillation and architectural innovation. The EfficientViT architecture, which serves as the backbone of EfficientViT-SAM, employs ReLU-based linear attention and multi-scale learning to capture global context while remaining hardware efficient. The aim is to strike a balance between model capacity and execution speed, an essential criterion for deploying neural networks in the real world.
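
To make the attention mechanism concrete, here is a minimal sketch of ReLU-based linear attention in the standard linear-attention form; the tensor shapes and normalization term are assumptions, not the paper's exact implementation:

```python
import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear attention with ReLU feature maps.
    q, k: (batch, heads, tokens, dim); v: (batch, heads, tokens, dim_v).
    Keys and values are aggregated first, so the cost scales linearly
    with the number of tokens instead of quadratically."""
    q = torch.relu(q)
    k = torch.relu(k)
    # Global context matrix per head: (dim x dim_v) instead of an (N x N) attention map.
    kv = torch.einsum("bhnd,bhne->bhde", k, v)
    # Normalization term for each query token.
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
```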

Method

EfficientViT-SAM is trained in two stages to balance efficiency and performance. In the first stage, knowledge is distilled from the SAM-ViT-H image encoder into EfficientViT using an L2 loss on the image embeddings. In the second stage, the full model, the EfficientViT image encoder together with the prompt encoder and mask decoder, is trained end-to-end on the extensive SA-1B dataset with a combination of focal loss and dice loss. This two-stage recipe yields an architecture that is robust, lightweight, and notably faster, achieving a 48.9x TensorRT speedup on an A100 GPU with no performance drop relative to the baseline SAM-ViT-H.
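
A sketch of the losses involved in the two stages; the focal/dice weighting (borrowed from SAM's common 20:1 ratio) and other hyperparameters are assumptions rather than values confirmed by this summary:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_embedding, teacher_embedding):
    """Stage 1 (sketch): L2 loss between EfficientViT features and
    frozen SAM-ViT-H image-encoder features on the same images."""
    return F.mse_loss(student_embedding, teacher_embedding)

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Standard binary focal loss on mask logits (hyperparameters assumed)."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps=1.0):
    """Standard dice loss on predicted masks."""
    prob = torch.sigmoid(logits).flatten(1)
    targets = targets.flatten(1)
    inter = (prob * targets).sum(-1)
    union = prob.sum(-1) + targets.sum(-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def segmentation_loss(mask_logits, gt_masks, focal_weight=20.0, dice_weight=1.0):
    """Stage 2 (sketch): end-to-end mask loss on SA-1B."""
    return focal_weight * focal_loss(mask_logits, gt_masks) + \
           dice_weight * dice_loss(mask_logits, gt_masks)
```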

Experiment

Evaluations comparing EfficientViT-SAM against SAM-ViT-H and other lightweight variants are striking. EfficientViT-SAM delivers significantly higher throughput than SAM-ViT-H on an A100 GPU while matching, and in some settings exceeding, its mean Average Precision (mAP). In zero-shot evaluation on datasets such as COCO [8] and LVIS [21], EfficientViT-SAM achieves on-par or better mAP with substantially less computation.
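
The sketch below illustrates one common zero-shot evaluation protocol, feeding detector (or ground-truth) boxes to the model as prompts and scoring the predicted masks with COCO-style mAP; `model.predict_masks` and the data layout are hypothetical, and only the pycocotools calls reflect a real API:

```python
import numpy as np
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
from pycocotools import mask as mask_utils

def evaluate_box_prompted(model, ann_file, images, detections):
    """Score box-prompted masks with COCO segmentation mAP (sketch)."""
    coco_gt = COCO(ann_file)
    results = []
    for image_id, image in images:
        for det in detections[image_id]:  # each det: {"bbox", "category_id", "score"}
            mask = model.predict_masks(image, box=det["bbox"])  # hypothetical call
            rle = mask_utils.encode(np.asfortranarray(mask.astype(np.uint8)))
            rle["counts"] = rle["counts"].decode("utf-8")
            results.append({"image_id": image_id,
                            "category_id": det["category_id"],
                            "segmentation": rle,
                            "score": det["score"]})
    coco_dt = coco_gt.loadRes(results)
    evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()  # prints mAP and related metrics
```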

EfficientViT-SAM also produces strong qualitative segmentation results across point- and box-prompt modalities, underscoring its adaptability and precision on in-the-wild segmentation tasks, as evidenced by the Segmentation in the Wild benchmark.

Conclusion

EfficientViT-SAM delivers state-of-the-art efficiency among segment anything models, making segmentation more practical in real-time and resource-constrained environments. The pretrained models released on GitHub give the community a rich resource for building more efficient applications of the segment anything paradigm. This work, a collaboration between Tsinghua University and MIT with support from NVIDIA, bodes well for the future of image segmentation and strengthens the case for efficiency in the AI landscape.

Authors (3)
  1. Zhuoyang Zhang (11 papers)
  2. Han Cai (79 papers)
  3. Song Han (155 papers)
Citations (3)