
Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt for Segmenting Camouflaged Objects (2312.07374v3)

Published 12 Dec 2023 in cs.CV

Abstract: Camouflaged object detection (COD) approaches heavily rely on pixel-level annotated datasets. Weakly-supervised COD (WSCOD) approaches use sparse annotations like scribbles or points to reduce annotation effort, but this can lead to decreased accuracy. The Segment Anything Model (SAM) shows remarkable segmentation ability with sparse prompts like points. However, manual prompts are not always feasible, as they may not be accessible in real-world applications. Additionally, they provide only localization information rather than semantic information, which can intrinsically cause ambiguity in interpreting the targets. In this work, we aim to eliminate the need for manual prompts. The key idea is to employ Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts from the semantic information given by a generic text prompt. To that end, we introduce a per-instance test-time adaptation mechanism called Generalizable SAM (GenSAM) that automatically generates and optimizes visual prompts from the generic task prompt for WSCOD. In particular, CCTP maps a single generic text prompt onto image-specific consensus foreground and background heatmaps using vision-language models, acquiring reliable visual prompts. Moreover, to adapt the visual prompts at test time, we further propose Progressive Mask Generation (PMG), which iteratively reweights the input image, guiding the model to focus on the targets in a coarse-to-fine manner. Crucially, all network parameters are fixed, avoiding the need for additional training. Experiments on three benchmarks demonstrate that GenSAM outperforms point-supervision approaches and achieves results comparable to scribble-supervision ones, relying solely on general task descriptions as prompts. Our code is available at: https://lwpyh.github.io/GenSAM/.


Summary

  • The paper presents GenSAM, a method that automates visual prompt generation using a generic text prompt for camouflaged object segmentation.
  • It introduces Cross-modal Chains of Thought Prompting and Progressive Mask Generation to iteratively refine segmentation without manual annotation.
  • Experiments on three benchmarks show that GenSAM outperforms point-supervised methods and achieves results comparable to scribble supervision, using only a generic task description as the prompt.

Introduction to Camouflaged Object Detection

Camouflaged Object Detection (COD) focuses on identifying objects that blend into their surroundings, with practical applications ranging from medical image analysis (e.g., polyp segmentation) to wildlife monitoring. Traditionally, the task has required models trained on pixel-level annotated datasets, which are labor-intensive to produce.

The Challenge of Sparse Annotations

Recently, the field has shifted toward weakly supervised approaches that use sparser annotations, such as scribbles or points, to reduce human effort. However, these methods face a trade-off: less annotation typically means lower accuracy. The Segment Anything Model (SAM) offers promise here, since sparse prompts such as points are enough to elicit strong segmentation. Nonetheless, relying on manually selected prompts is impractical in real-world scenarios and introduces ambiguity, because manual prompts vary between annotators and convey localization cues without semantic information.
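
For context, the kind of point prompting discussed above looks roughly like the following sketch using the segment-anything library; the checkpoint path, image file, and click coordinates are placeholders, not values from the paper.

```python
# Sketch: prompting SAM with a single manually chosen foreground point.
# Assumes the official segment-anything package and a downloaded ViT-H
# checkpoint; "example.jpg" and the (x, y) point are illustrative only.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # hypothetical click location
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=True,                # return several candidate masks
)
best_mask = masks[np.argmax(scores)]      # keep the highest-scoring mask
```

GenSAM's goal is to remove exactly this manual step: the (x, y) click above is what CCTP infers automatically from a generic text prompt.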

GenSAM: A Novel Approach

In this paper, a new mechanism called Generalizable SAM (GenSAM) is proposed to automatically generate and optimize visual prompts from a single generic text prompt, targeting weakly supervised COD. GenSAM uses Cross-modal Chains of Thought Prompting (CCTP) to derive semantically rich visual prompts from the generic task description without manual intervention.
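
To make the idea concrete, here is a minimal sketch, assuming a sliding-window CLIP scoring scheme, of how a generic foreground/background text pair can be turned into an image-specific heatmap. This is not the authors' CCTP implementation (which reasons over vision-language model heatmaps with chain-of-thought prompting); the prompt strings, window size, and stride are all assumptions made for illustration.

```python
# Sketch: derive a foreground heatmap from generic text prompts by
# scoring sliding-window crops against foreground and background
# phrases with CLIP. A crude stand-in for CCTP's heatmap step.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Generic task prompt and a background counterpart (assumed phrasings).
text = clip.tokenize(["the camouflaged animal",
                      "the background environment"]).to(device)

image = Image.open("example.jpg").convert("RGB")
W, H = image.size
win, stride = 128, 64                         # assumed window configuration
heat = torch.zeros(H, W)
count = torch.zeros(H, W)

with torch.no_grad():
    text_feat = model.encode_text(text).float()
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    for top in range(0, max(H - win, 1), stride):
        for left in range(0, max(W - win, 1), stride):
            crop = preprocess(image.crop((left, top, left + win, top + win)))
            img_feat = model.encode_image(crop.unsqueeze(0).to(device)).float()
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            sim = (img_feat @ text_feat.T).squeeze(0)
            score = (sim[0] - sim[1]).item()  # foreground minus background
            heat[top:top + win, left:left + win] += score
            count[top:top + win, left:left + win] += 1

heatmap = heat / count.clamp(min=1)           # crude consensus foreground map
y, x = map(int, (heatmap == heatmap.max()).nonzero()[0])
print(f"candidate point prompt at (x={x}, y={y})")
```

In GenSAM, the resulting consensus heatmaps supply the point prompts that SAM would otherwise need a human to provide.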

Moreover, a method called Progressive Mask Generation (PMG) is proposed. This test-time prompt tuning method iteratively reweights the input image so that the model focuses on the target in a coarse-to-fine manner, with all network parameters kept frozen and no instance-specific annotation required. GenSAM therefore needs only a high-level task description to segment camouflaged objects across different datasets, without per-instance prompts.
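
The reweighting step can be pictured with a small sketch; the blend rule below is an assumption in the spirit of PMG, not the paper's exact update.

```python
# Sketch: mask-guided image reweighting, assumed blend rule. Background
# pixels are pushed toward the mean color while likely-foreground
# pixels are kept, so the target stands out in the next iteration.
import numpy as np

def reweight_image(image: np.ndarray, heatmap: np.ndarray) -> np.ndarray:
    """image: HxWx3 floats in [0, 1]; heatmap: HxW weights in [0, 1]."""
    alpha = heatmap[..., None]                   # per-pixel foreground weight
    mean_color = image.reshape(-1, 3).mean(axis=0)
    return alpha * image + (1.0 - alpha) * mean_color

# Schematic coarse-to-fine loop: each round's heatmap (e.g., from the
# CCTP-style step above) reweights the input for the next round.
# for _ in range(num_iters):                     # num_iters is hypothetical
#     heatmap = text_to_heatmap(image)           # hypothetical helper
#     image = reweight_image(image, heatmap)
```

Each iteration suppresses the background a little more, so the prompt extraction in the next round operates on an image in which the camouflaged target is progressively easier to isolate.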

Demonstrated Results

Experimental results across three benchmarks show GenSAM's strength: it outperforms existing point-supervision methods and is competitive with scribble supervision despite using only general task descriptions as prompts. This robustness matters in real-world scenarios where instance-specific manual prompts are not feasible or available.

Through this research, GenSAM paves the way for more accessible and generalizable concealed object detection, offering a significant step forward in the development of automated visual systems that can perform complex segmentation tasks with minimal human input.
