
Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt for Segmenting Camouflaged Objects (2312.07374v3)

Published 12 Dec 2023 in cs.CV

Abstract: Camouflaged object detection (COD) approaches heavily rely on pixel-level annotated datasets. Weakly-supervised COD (WSCOD) approaches use sparse annotations like scribbles or points to reduce annotation effort, but this can lead to decreased accuracy. The Segment Anything Model (SAM) shows remarkable segmentation ability with sparse prompts like points. However, manual prompts are not always feasible, as they may not be accessible in real-world applications. Additionally, they provide only localization information rather than semantic information, which can intrinsically cause ambiguity in interpreting the targets. In this work, we aim to eliminate the need for manual prompts. The key idea is to employ Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts from the semantic information given by a generic text prompt. To that end, we introduce a per-instance test-time adaptation mechanism called Generalizable SAM (GenSAM) that automatically generates and optimizes visual prompts from the generic task prompt for WSCOD. In particular, CCTP maps a single generic text prompt onto image-specific consensus foreground and background heatmaps using vision-language models, acquiring reliable visual prompts. Moreover, to adapt the visual prompts at test time, we further propose Progressive Mask Generation (PMG), which iteratively reweights the input image, guiding the model to focus on the targets in a coarse-to-fine manner. Crucially, all network parameters are fixed, avoiding the need for additional training. Experiments on three benchmarks demonstrate that GenSAM outperforms point-supervision approaches and achieves results comparable to scribble-supervision ones, relying solely on general task descriptions as prompts. Our code is available at: https://lwpyh.github.io/GenSAM/.


Summary

  • The paper presents GenSAM, a method that automates visual prompt generation using a generic text prompt for camouflaged object segmentation.
  • It introduces Cross-modal Chains of Thought Prompting and Progressive Mask Generation to iteratively refine segmentation without manual annotation.
  • Experiments on three benchmarks show that GenSAM outperforms point-supervised methods and achieves results comparable to scribble supervision, using only a generic task description as the prompt.

Introduction to Camouflaged Object Detection

Camouflaged Object Detection (COD) focuses on identifying objects that blend into their surroundings, with practical applications ranging from medical image analysis (e.g., polyp segmentation) to wildlife monitoring. Traditionally, the task has required models trained on pixel-level annotated datasets, which are labor-intensive to produce.

The Challenge of Sparse Annotations

Recently, the field has shifted toward weakly supervised approaches that use sparser annotations, such as scribbles or points, to reduce human effort. However, these methods face a trade-off: less annotation typically means lower accuracy. The Segment Anything Model (SAM) offers promise here, since sparse prompts such as points are enough to elicit strong segmentation. Nonetheless, relying on manually selected prompts is impractical in real-world scenarios and introduces ambiguity, because manual prompts vary between annotators and convey localization cues without semantic information.
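
For context, the kind of point prompting discussed above looks roughly like the following sketch using the segment-anything library; the checkpoint path, image file, and click coordinates are placeholders, not values from the paper.

```python
# Sketch: prompting SAM with a single manually chosen foreground point.
# Assumes the official segment-anything package and a downloaded ViT-H
# checkpoint; "example.jpg" and the (x, y) point are illustrative only.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # hypothetical click location
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=True,                # return several candidate masks
)
best_mask = masks[np.argmax(scores)]      # keep the highest-scoring mask
```

GenSAM's goal is to remove exactly this manual step: the (x, y) click above is what CCTP infers automatically from a generic text prompt.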

GenSAM: A Novel Approach

In this paper, a new mechanism called Generalizable SAM (GenSAM) is proposed to automatically generate and optimize visual prompts from a single generic text prompt, targeting weakly supervised COD. GenSAM uses Cross-modal Chains of Thought Prompting (CCTP) to derive semantically rich visual prompts from the generic task description without manual intervention.
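
To make the idea concrete, here is a minimal sketch, assuming a sliding-window CLIP scoring scheme, of how a generic foreground/background text pair can be turned into an image-specific heatmap. This is not the authors' CCTP implementation (which reasons over vision-language model heatmaps with chain-of-thought prompting); the prompt strings, window size, and stride are all assumptions made for illustration.

```python
# Sketch: derive a foreground heatmap from generic text prompts by
# scoring sliding-window crops against foreground and background
# phrases with CLIP. A crude stand-in for CCTP's heatmap step.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Generic task prompt and a background counterpart (assumed phrasings).
text = clip.tokenize(["the camouflaged animal",
                      "the background environment"]).to(device)

image = Image.open("example.jpg").convert("RGB")
W, H = image.size
win, stride = 128, 64                         # assumed window configuration
heat = torch.zeros(H, W)
count = torch.zeros(H, W)

with torch.no_grad():
    text_feat = model.encode_text(text).float()
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    for top in range(0, max(H - win, 1), stride):
        for left in range(0, max(W - win, 1), stride):
            crop = preprocess(image.crop((left, top, left + win, top + win)))
            img_feat = model.encode_image(crop.unsqueeze(0).to(device)).float()
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            sim = (img_feat @ text_feat.T).squeeze(0)
            score = (sim[0] - sim[1]).item()  # foreground minus background
            heat[top:top + win, left:left + win] += score
            count[top:top + win, left:left + win] += 1

heatmap = heat / count.clamp(min=1)           # crude consensus foreground map
y, x = map(int, (heatmap == heatmap.max()).nonzero()[0])
print(f"candidate point prompt at (x={x}, y={y})")
```

In GenSAM, the resulting consensus heatmaps supply the point prompts that SAM would otherwise need a human to provide.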

Moreover, a method called Progressive Mask Generation (PMG) is proposed. This test-time prompt tuning method iteratively reweights the input image so that the model focuses on the target in a coarse-to-fine manner, with all network parameters kept frozen and no instance-specific annotation required. GenSAM therefore needs only a high-level task description to segment camouflaged objects across different datasets, without per-instance prompts.
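
The reweighting step can be pictured with a small sketch; the blend rule below is an assumption in the spirit of PMG, not the paper's exact update.

```python
# Sketch: mask-guided image reweighting, assumed blend rule. Background
# pixels are pushed toward the mean color while likely-foreground
# pixels are kept, so the target stands out in the next iteration.
import numpy as np

def reweight_image(image: np.ndarray, heatmap: np.ndarray) -> np.ndarray:
    """image: HxWx3 floats in [0, 1]; heatmap: HxW weights in [0, 1]."""
    alpha = heatmap[..., None]                   # per-pixel foreground weight
    mean_color = image.reshape(-1, 3).mean(axis=0)
    return alpha * image + (1.0 - alpha) * mean_color

# Schematic coarse-to-fine loop: each round's heatmap (e.g., from the
# CCTP-style step above) reweights the input for the next round.
# for _ in range(num_iters):                     # num_iters is hypothetical
#     heatmap = text_to_heatmap(image)           # hypothetical helper
#     image = reweight_image(image, heatmap)
```

Each iteration suppresses the background a little more, so the prompt extraction in the next round operates on an image in which the camouflaged target is progressively easier to isolate.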

Demonstrated Results

Experimental results across three benchmarks show GenSAM's strength: it outperforms existing point-supervision methods and is competitive with scribble supervision despite using only general task descriptions as prompts. This robustness matters in real-world scenarios where instance-specific manual prompts are not feasible or available.

Through this research, GenSAM paves the way for more accessible and generalizable concealed object detection, offering a significant step forward in the development of automated visual systems that can perform complex segmentation tasks with minimal human input.
