Introduction
In the field of computer vision, the Segment Anything Model (SAM) has recently been recognized for its exceptional ability to adapt to various segmentation tasks without additional training. Despite this flexibility, SAM's computational intensity has hindered its deployment on mobile devices, heavily limiting its practicality. MobileSAM addressed some of these issues by replacing SAM's image encoder with a lightweight one trained via knowledge distillation, but it still fell short on speed and memory requirements, especially on mobile platforms.
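For intuition, MobileSAM-style decoupled distillation trains the compact image encoder to reproduce the image embeddings of the original SAM encoder, so the prompt encoder and mask decoder can be reused unchanged. Below is a minimal PyTorch sketch of one such training step; the module names `student_encoder` and `teacher_encoder`, the embedding shape, and the optimizer setup are illustrative assumptions, not the actual training code of either project.

```python
import torch
import torch.nn.functional as F

def distill_step(student_encoder, teacher_encoder, images, optimizer):
    """One step of decoupled distillation: the lightweight student
    encoder learns to reproduce the frozen teacher's image embeddings."""
    with torch.no_grad():
        target = teacher_encoder(images)  # e.g. SAM ViT-H embeddings, (B, 256, 64, 64) — assumed shape
    pred = student_encoder(images)        # lightweight encoder output, same shape
    loss = F.mse_loss(pred, target)       # element-wise MSE on the embeddings
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```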
Methodology
The newly proposed RepViT-SAM model refines SAM's architecture for real-time performance on mobile devices. It replaces the heavy image encoder of the original SAM with a RepViT model, an architecture that incorporates efficient design choices from Convolutional Neural Networks (CNNs) into the Vision Transformer (ViT) framework. By leveraging efficient components such as early convolutions, structurally reparameterized depthwise convolutions, and squeeze-and-excitation layers, RepViT-SAM aims to deliver high-quality segmentation at a significantly reduced computational cost. In tests, it demonstrated an impressive 10× faster inference speed on a MacBook compared to MobileSAM, without compromising zero-shot transfer capability.
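To make these components concrete, here is a minimal PyTorch sketch of two of the ingredients named above: a structurally reparameterized depthwise convolution (multi-branch while training, fused into a single 3×3 depthwise convolution for inference) and a squeeze-and-excitation layer. This illustrates the general techniques, not the exact RepViT implementation; the class names and the branch layout are our assumptions.

```python
import torch
import torch.nn as nn

class RepDWConv(nn.Module):
    """Depthwise 3x3 conv with structural reparameterization: a 3x3 branch
    plus a BN-only identity branch at training time, fused into one conv."""

    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim, bias=False)
        self.bn_conv = nn.BatchNorm2d(dim)
        self.bn_id = nn.BatchNorm2d(dim)  # identity branch (BatchNorm only)
        self.fused = None

    def forward(self, x):
        if self.fused is not None:        # single-branch inference path
            return self.fused(x)
        return self.bn_conv(self.conv(x)) + self.bn_id(x)

    @torch.no_grad()
    def fuse(self):
        """Merge both branches into one 3x3 depthwise conv with bias."""
        def fuse_bn(kernel, bn):
            std = (bn.running_var + bn.eps).sqrt()
            w = kernel * (bn.weight / std).reshape(-1, 1, 1, 1)
            b = bn.bias - bn.running_mean * bn.weight / std
            return w, b

        w_conv, b_conv = fuse_bn(self.conv.weight, self.bn_conv)
        # The identity branch is an equivalent 3x3 DW kernel (1 at the center).
        dim = self.conv.weight.shape[0]
        id_kernel = torch.zeros_like(self.conv.weight)
        id_kernel[:, 0, 1, 1] = 1.0
        w_id, b_id = fuse_bn(id_kernel, self.bn_id)

        self.fused = nn.Conv2d(dim, dim, 3, padding=1, groups=dim, bias=True)
        self.fused.weight.copy_(w_conv + w_id)
        self.fused.bias.copy_(b_conv + b_id)
        return self

class SELayer(nn.Module):
    """Squeeze-and-excitation: channel-wise reweighting from global context."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),               # squeeze: global average pool
            nn.Conv2d(dim, dim // reduction, 1),   # excitation: bottleneck MLP
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)
```

Calling `fuse()` once after training collapses the extra branch, so inference pays for only a single depthwise convolution per block; this fusion is where the reparameterization speedup comes from.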
Experimental Results
To assess RepViT-SAM, its performance was compared with other models in the domain across a comprehensive set of experiments, including zero-shot edge detection, zero-shot instance segmentation, segmentation in the wild, video object segmentation, and other real-world applications. RepViT-SAM exhibited superior zero-shot transfer ability and significantly faster inference than its MobileSAM counterpart. Moreover, it achieved performance comparable to the heavyweight ViT-based SAM models, striking a promising balance between efficiency and effectiveness.
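For context, zero-shot prompted segmentation with SAM-style models is typically driven through the upstream `segment_anything` predictor interface. The sketch below assumes RepViT-SAM keeps that interface; the registry key `"repvit"` and the checkpoint filename are hypothetical placeholders, not the actual repository API.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Hypothetical registry key and checkpoint name; the actual repository
# may register the RepViT encoder under a different identifier.
sam = sam_model_registry["repvit"](checkpoint="repvit_sam.pt")
predictor = SamPredictor(sam)

image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)                          # encode the image once

# Zero-shot segmentation from a single foreground point prompt.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[512, 512]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
```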
Conclusion
RepViT-SAM has established itself as a strong model for efficient image and video segmentation, especially suited to mobile and other resource-constrained devices. Its distillation strategy and architectural choices pave the way for further innovation in lightweight, real-time computer vision systems. With its code and models publicly released, developers and researchers now have access to a robust framework for exploring and developing segmentation solutions that run in real time on mobile devices.