Analysis of Explicit Visual Prompting for Universal Foreground Segmentations
The paper "Explicit Visual Prompting for Universal Foreground Segmentations" introduces an innovative approach for tackling a range of foreground segmentation tasks, leveraging the concept of visual prompting within a unified framework. This concept draws inspiration from the field of natural language processing, where prompting has already proven its utility in adapting pre-trained models to various downstream applications with minimal parameter modifications. Explicitly, the method named Explicit Visual Prompting (EVP) is proposed to address multiple segmentation tasks, inclusive of salient object detection, forgery detection, defocus blur detection, shadow detection, and camouflaged object detection.
Methodological Overview
At the core of this work is the EVP model, which keeps a pre-trained vision backbone frozen and learns task-specific knowledge through a small set of additional parameters. Its effectiveness comes from learning explicit prompts from two principal sources: the patch embeddings of the original frozen layers of the vision transformer and the high-frequency components of the input image. The latter are considered especially informative because pre-trained models tend to become invariant to them through the data augmentations commonly applied during training.
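To make the two-source prompting idea concrete, the sketch below shows a minimal adaptor-style module in PyTorch. It is an illustration under assumptions, not the paper's exact implementation: the class name PromptAdaptor, the dimension names, and the additive combination of the two projections are hypothetical choices that mirror the description of tuning frozen patch embeddings and high-frequency features with a small number of extra parameters.

```python
import torch
import torch.nn as nn

class PromptAdaptor(nn.Module):
    """Hypothetical sketch of an EVP-style adaptor: it projects frozen patch
    embeddings and high-frequency features into a small shared space and
    emits an additive prompt for one transformer stage."""

    def __init__(self, embed_dim: int, hf_dim: int, prompt_dim: int = 32):
        super().__init__()
        self.embed_proj = nn.Linear(embed_dim, prompt_dim)  # tunes frozen patch embeddings
        self.hf_proj = nn.Linear(hf_dim, prompt_dim)        # tunes high-frequency features
        self.up = nn.Linear(prompt_dim, embed_dim)          # maps the prompt back to backbone width

    def forward(self, patch_tokens: torch.Tensor, hf_tokens: torch.Tensor) -> torch.Tensor:
        # Combine both explicit sources; the result is added to the frozen
        # backbone features of this stage (the backbone itself is not trained).
        prompt = torch.relu(self.embed_proj(patch_tokens) + self.hf_proj(hf_tokens))
        return self.up(prompt)
```

Only the small linear layers above would receive gradients; the backbone producing patch_tokens stays frozen.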
Two variants of EVP, EVPv1 and EVPv2, are presented. EVPv1 extracts high-frequency components manually, applying a fixed mask in the frequency domain to separate them from the input image; the resulting features are then tuned through a linear layer inside an adaptor. EVPv2 replaces the manual extraction with a Fourier MLP that generates adaptive spectral masks, making the solution end-to-end differentiable.
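The EVPv1-style extraction can be sketched as follows: take the 2D FFT of the image, zero out a fixed low-frequency square in the centred spectrum, and invert the transform. This is a minimal sketch, assuming PyTorch; the function name and the mask_ratio parameter controlling the size of the suppressed square are assumptions for illustration, not the paper's exact settings.

```python
import torch

def high_frequency_components(image: torch.Tensor, mask_ratio: float = 0.25) -> torch.Tensor:
    """Hypothetical EVPv1-style extraction: suppress a fixed low-frequency
    square in the centred spectrum and invert the FFT, keeping only the
    high-frequency content of the (B, C, H, W) input image."""
    _, _, H, W = image.shape
    spectrum = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))

    # Fixed binary mask: zeros over the central low-frequency square.
    mask = torch.ones(H, W, device=image.device)
    h, w = int(H * mask_ratio) // 2, int(W * mask_ratio) // 2
    mask[H // 2 - h: H // 2 + h, W // 2 - w: W // 2 + w] = 0.0

    hfc = torch.fft.ifft2(torch.fft.ifftshift(spectrum * mask, dim=(-2, -1)))
    return hfc.real
```

In EVPv2 the fixed mask above would be replaced by a learned, input-dependent mask produced by a small MLP over the spectrum, which is what makes the whole pipeline differentiable end to end.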
Experimental Evaluation and Results
Extensive experiments across fourteen datasets covering five distinct tasks underscore the advantage of EVP over full fine-tuning and other parameter-efficient approaches. The empirical results show that EVP consistently outperforms task-specific solutions while keeping the number of trainable parameters small. EVP is instantiated with both plain backbones such as the Vision Transformer (ViT) and hierarchical ones such as SegFormer, indicating its scalability and compatibility with different architectural paradigms.
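The "reduced footprint" claim boils down to freezing the backbone and training only the prompting modules. The helper below is illustrative only: the keyword used to identify EVP's extra parameters is an assumption about how such modules might be named in a given implementation.

```python
import torch.nn as nn

def freeze_backbone_and_report(model: nn.Module, prompt_keyword: str = "adaptor") -> None:
    """Illustrative only: freeze every parameter except those whose name
    contains `prompt_keyword` (assumed to mark EVP's extra modules), then
    report the trainable fraction left by parameter-efficient tuning."""
    for name, param in model.named_parameters():
        param.requires_grad = prompt_keyword in name

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```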
Notably, the reported numbers illustrate the efficacy of EVP in realistic settings. In salient object detection, for instance, EVP achieves strong results on datasets such as DUTS, ECSSD, and HKU-IS, with mean E-measure and F-measure surpassing prevailing models. Forgery detection, camouflaged object detection, and shadow detection likewise record marked improvements, underscoring the versatility and robustness of the EVP framework.
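For readers unfamiliar with the saliency metrics, the F-measure combines precision and recall of the thresholded prediction with beta^2 = 0.3, as is conventional in salient object detection. The sketch below uses an adaptive threshold of twice the mean prediction, a frequent convention that is an assumption here rather than the paper's exact evaluation protocol.

```python
import numpy as np

def f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """Sketch of the F-measure commonly reported for salient object detection.
    pred: predicted saliency map in [0, 1]; gt: binary ground-truth mask."""
    thresh = min(2.0 * pred.mean(), 1.0)          # adaptive threshold (assumed convention)
    binary = (pred >= thresh).astype(np.float64)
    tp = (binary * gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1.0 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```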
Implications and Speculation on Future Developments
The EVP framework's ability to generalize across multiple low-level segmentation tasks carries substantial implications for future research in visual prompting and computer vision at large. The methodology reduces dependence on task-specific designs, pointing toward universal frameworks that reuse and adapt foundation models across a wide variety of applications.
Looking forward, the EVP model opens avenues for further exploration in prompt design and efficient tuning strategies, potentially extending its utility beyond segmentation to other complex vision challenges. Moreover, as vision transformers continue to evolve, integrating EVP may strengthen their ability to handle multiple vision tasks within a single unified framework, improving both computational efficiency and performance.
In conclusion, "Explicit Visual Prompting for Universal Foreground Segmentations" makes a significant contribution to the field by advancing the concept of visual prompting, demonstrating its practical value in segmentation tasks, and laying groundwork for future explorations of efficient adaptation methods in computer vision.