Overview of "Faster Segment Anything: Towards Lightweight SAM for Mobile Applications"
This paper addresses the challenge of adapting the Segment Anything Model (SAM) for mobile applications. SAM has drawn wide attention for its prompt-guided, class-agnostic segmentation, but it relies on a heavyweight image encoder (ViT-H) that makes inference computationally expensive. The focus here is on developing a more efficient version, termed MobileSAM, by replacing the heavyweight image encoder with a lightweight alternative while leaving the rest of the pipeline intact.
Key Contributions
The authors introduce an approach called decoupled distillation for adapting SAM. Rather than distilling the whole pipeline end to end (which they term coupled distillation), they transfer knowledge from the heavy ViT-H image encoder directly to a smaller encoder, using the teacher's image embeddings as the training target. Because the student learns to reproduce those embeddings, it remains compatible with the existing prompt encoder and mask decoder, and the computational cost drops sharply without sacrificing performance.
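To make the compatibility constraint concrete, here is a minimal PyTorch sketch (hypothetical module names, not the paper's actual architecture): the student encoder only needs to emit image embeddings of the same shape as ViT-H's output (256 channels on a 64×64 grid for SAM's 1024×1024 input) for the original prompt encoder and mask decoder to be reused unchanged.

```python
import torch
import torch.nn as nn

class TinyImageEncoder(nn.Module):
    """Hypothetical lightweight student encoder.

    Any backbone works as long as it matches ViT-H's output embedding
    shape: (B, 256, 64, 64) for (B, 3, 1024, 1024) inputs.
    """

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Placeholder stack of strided convs standing in for a real
        # lightweight backbone such as a TinyViT-style network.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(64, 128, kernel_size=2, stride=2), nn.GELU(),
            nn.Conv2d(128, embed_dim, kernel_size=2, stride=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1024 -> 256 -> 128 -> 64 spatially; channels end at embed_dim.
        return self.backbone(x)

# The frozen prompt encoder and mask decoder from the original SAM can
# then be attached downstream of this encoder without modification.
```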
Numerical Results
MobileSAM demonstrates remarkable efficiency gains:
- The lightweight image encoder has over 100 times fewer parameters than the original ViT-H encoder, and the overall model is roughly 60 times smaller.
- Inference is fast: the full pipeline processes an image in approximately 10 ms on a single GPU, a dramatic speedup over the original SAM.
When compared with the concurrent FastSAM, MobileSAM is shown to be about seven times smaller and roughly five times faster, while maintaining superior performance on metrics such as mean Intersection over Union (mIoU).
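Figures like these are straightforward to sanity-check; a generic PyTorch harness along the following lines (illustrative only; `model` stands for any image encoder) counts parameters and times a forward pass:

```python
import time
import torch

def count_params(model: torch.nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def time_forward(model, x, warmup: int = 10, iters: int = 100) -> float:
    """Average forward latency in milliseconds on the current device."""
    for _ in range(warmup):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3
```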
Methodological Insights
The decoupled distillation approach involves two main phases:
- Distillation of the small image encoder from the ViT-H encoder, with the student trained to match the teacher's image embeddings; the mask decoder is not involved at this stage.
- Optional finetuning of the mask decoder to align it with the new image encoder, though the reported results suggest this step is often unnecessary.
This method keeps resource usage low: the distillation runs on a single GPU in less than a day, putting the process within reach of researchers with limited computational resources. A minimal sketch of the first phase is given below.
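The sketch assumes a simple mean-squared-error loss between the teacher's and student's image embeddings (a natural choice for feature-level distillation; all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, images, optimizer):
    """One decoupled-distillation step: match the teacher's embeddings.

    teacher: frozen SAM ViT-H image encoder.
    student: lightweight encoder with the same output shape.
    images:  batch of preprocessed images, e.g. (B, 3, 1024, 1024).
    """
    teacher.eval()
    with torch.no_grad():
        target = teacher(images)      # teacher embeddings, no gradients
    pred = student(images)            # student embeddings
    loss = F.mse_loss(pred, target)   # feature-level distillation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since the teacher is frozen, its embeddings can also be precomputed once and cached, which further reduces the cost of the distillation run.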
Implications and Future Directions
The development of MobileSAM paves the way for deploying advanced segmentation tasks on resource-constrained devices, such as mobile phones, facilitating real-time applications in various fields like augmented reality and mobile-based image editing. The decoupled distillation technique could be applicable beyond SAM, offering a framework for optimizing other vision models for mobile deployment.
Future research could explore further optimizations of the image encoder or extend the SAM framework to handle a broader range of mobile-based applications. Additionally, investigating the integration with other foundational vision models could broaden the applicability and functionality of the SAM framework in diverse environments.
In summary, this work takes a significant step toward bringing foundational vision tasks to mobile platforms, enabling efficient, practical deployment without excessive computational demands.