Segment Anything in High Quality: An Expert Overview
The paper "Segment Anything in High Quality" presents HQ-SAM, an incremental advancement over the Segment Anything Model (SAM) for improving segmentation precision while maintaining computational efficiency and zero-shot generalization.
SAM, a foundational model for segmentation tasks, was originally trained on an extensive dataset containing 1.1 billion masks and exhibited robust zero-shot capabilities. However, the model showed limitations in handling objects with intricate structures, leading to coarse boundaries and errors when thin or complex features were present. HQ-SAM addresses these deficiencies through a minimal adaptation of SAM, preserving most of its architecture while introducing novel components to enhance mask quality.
Key Contributions and Methodology
HQ-SAM introduces a High-Quality Output Token, integrated into SAM's mask decoder, that is trained to predict detailed masks. This token is combined with a Global-local Feature Fusion mechanism that reuses early and late features from the Vision Transformer (ViT) image encoder, capturing both global context and localized detail. The integration is lightweight, adding less than 0.5% to SAM's parameter count and allowing training to complete in just 4 hours on 8 GPUs.
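To make the fusion step concrete, the following PyTorch-style sketch shows one way early (detail-rich) and late (context-rich) ViT features could be projected, upsampled, and summed with SAM's upscaled decoder mask feature, with the HQ token's decoder output applied as a per-location dot product to produce mask logits. The module names, channel widths, and upsampling factors are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class GlobalLocalFusion(nn.Module):
    """Sketch of fusing an early (detail-rich) and a late (context-rich) ViT
    feature with SAM's upscaled decoder mask feature. Channel widths and
    upsampling factors are assumptions, not the authors' configuration."""

    def __init__(self, vit_dim: int = 768, decoder_dim: int = 256):
        super().__init__()
        # project both 64x64 ViT feature maps to the decoder width
        self.early_proj = nn.Conv2d(vit_dim, decoder_dim, kernel_size=1)
        self.late_proj = nn.Conv2d(vit_dim, decoder_dim, kernel_size=1)
        # upsample 4x to match the decoder's upscaled mask-feature resolution
        self.up = nn.Sequential(
            nn.ConvTranspose2d(decoder_dim, decoder_dim, 2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(decoder_dim, decoder_dim // 8, 2, stride=2),
        )

    def forward(self, early_feat, late_feat, decoder_feat):
        # element-wise sum merges global context with local detail
        fused = self.early_proj(early_feat) + self.late_proj(late_feat)
        fused = self.up(fused)
        # combine with SAM's upscaled mask feature (B, decoder_dim // 8, 256, 256)
        return fused + decoder_feat


def predict_hq_mask(hq_token_out: torch.Tensor, hq_feat: torch.Tensor) -> torch.Tensor:
    """hq_token_out: (B, C) decoder output of the HQ token after a small MLP;
    hq_feat: (B, C, H, W) fused feature. A per-location dot product yields
    the high-quality mask logits (B, 1, H, W)."""
    return torch.einsum("bc,bchw->bhw", hq_token_out, hq_feat).unsqueeze(1)
```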
To train HQ-SAM, the authors compiled a dataset named HQSeg-44K, consisting of 44K fine-grained mask annotations curated to span a wide variety of semantic classes, supporting robustness across different segmentation tasks.
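As an illustration of how such fine-grained supervision might be consumed during training, the sketch below pairs each image with one accurate binary mask and derives a box prompt from that mask. The JSON annotation layout and file naming are hypothetical, not the released HQSeg-44K format.

```python
import json
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class FineGrainedMaskDataset(Dataset):
    """Loads (image, box prompt, fine-grained mask) triples. The JSON layout
    and file naming below are hypothetical, not the released HQSeg-44K format."""

    def __init__(self, annotation_file: str, image_root: str):
        # each record: {"image": "imgs/0001.jpg", "mask": "masks/0001.png"}
        self.records = json.loads(Path(annotation_file).read_text())
        self.image_root = Path(image_root)

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = np.array(Image.open(self.image_root / rec["image"]).convert("RGB"))
        mask = np.array(Image.open(self.image_root / rec["mask"])) > 0
        # derive a box prompt (x0, y0, x1, y1) from the ground-truth mask
        ys, xs = np.nonzero(mask)
        box = np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float32)
        return (
            torch.from_numpy(image).permute(2, 0, 1).float(),
            torch.from_numpy(box),
            torch.from_numpy(mask.astype(np.float32)),
        )
```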
Experimental Validation and Numerical Results
The efficacy of HQ-SAM is demonstrated through evaluation on 10 diverse segmentation datasets spanning both image and video tasks. The model consistently outperformed SAM, particularly in preserving fine details and boundary accuracy. Gains were observed across datasets including COCO, UVO, and LVIS, where HQ-SAM improved mask AP and boundary metrics while retaining zero-shot capability.
The results indicated a notable improvement in boundary precision, a critical factor in applications demanding high accuracy such as robotic perception and image editing. HQ-SAM improved mBIoU and boundary AP over SAM, validating its ability to handle complex scene structures that were previously problematic.
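Since boundary-oriented metrics drive these comparisons, the following NumPy/SciPy sketch illustrates the general boundary IoU computation: each mask is reduced to a thin band around its contour, and IoU is computed over those bands only, so contour errors are penalized far more than in standard mask IoU. The 2% dilation ratio and erosion-based band extraction follow the common convention and are assumptions here, not the paper's exact evaluation code.

```python
import numpy as np
from scipy.ndimage import binary_erosion


def mask_to_boundary(mask: np.ndarray, dilation_ratio: float = 0.02) -> np.ndarray:
    """Reduce a binary mask to a thin band around its contour; the band width
    scales with the image diagonal (2% is the common convention)."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    d = max(1, int(round(dilation_ratio * np.sqrt(h ** 2 + w ** 2))))
    eroded = binary_erosion(mask, structure=np.ones((3, 3), bool),
                            iterations=d, border_value=0)
    return mask & ~eroded


def boundary_iou(gt: np.ndarray, pred: np.ndarray, dilation_ratio: float = 0.02) -> float:
    """IoU restricted to the two boundary bands, which is more sensitive to
    contour errors than whole-mask IoU."""
    gt_b = mask_to_boundary(gt, dilation_ratio)
    pred_b = mask_to_boundary(pred, dilation_ratio)
    union = np.logical_or(gt_b, pred_b).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(gt_b, pred_b).sum()) / float(union)
```

mBIoU then averages this per-instance boundary IoU over a dataset, which is why gains on thin or elongated structures show up much more clearly there than in ordinary mIoU.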
Theoretical and Practical Implications
From a theoretical standpoint, HQ-SAM exemplifies how minimal modifications to a pretrained foundation model can yield significant improvements in output quality without sacrificing generalization. The reuse of SAM's architecture, especially its ViT-based features, underscores the potential of efficient token learning and feature fusion for improving model performance.
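The training recipe this implies, freezing the pretrained weights and optimizing only the newly added parameters, can be sketched as follows; the optimizer choice and learning rate are illustrative placeholders, not the authors' settings.

```python
import torch


def freeze_base_train_new(base_model: torch.nn.Module, new_modules) -> torch.optim.Optimizer:
    """Freeze the pretrained model and optimize only the newly added modules.
    `base_model` and `new_modules` are placeholders; the optimizer and
    learning rate below are illustrative defaults."""
    for p in base_model.parameters():
        p.requires_grad_(False)
    trainable = [p for m in new_modules for p in m.parameters()]
    for p in trainable:
        p.requires_grad_(True)
    return torch.optim.AdamW(trainable, lr=1e-3)
```

Because gradients flow only through the added token, MLP, and fusion layers, the pretrained zero-shot behavior of the base model is preserved by construction while the new parameters learn to refine mask detail.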
Practically, HQ-SAM's advancements in segmentation precision have immediate implications for industries relying on computer vision for automation and augmented reality. The model's ability to produce detailed masks without extensive retraining costs makes it attractive for rapidly evolving technological landscapes where precision and adaptability are paramount.
Future Directions
Building upon HQ-SAM, future research could explore further optimization of computational efficiency, extending beyond current limits to suit real-time applications. Investigating the scalability of the HQ-Output Token to larger, more diverse datasets could expand its applicability to more extensive and complex segmentation tasks. Additionally, the fusion strategies and token adaptations introduced here could inspire adaptations in other foundational vision models beyond segmentation tasks.
In summary, HQ-SAM represents a precise and efficient enhancement of SAM, maintaining zero-shot generalization while significantly elevating mask quality. Its contributions pave the way for more refined visual understanding, marking an advancement in segmentation model capabilities.