
Visual Saliency Transformer (2104.12099v2)

Published 25 Apr 2021 in cs.CV

Abstract: Existing state-of-the-art saliency detection methods heavily rely on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which can not be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Unlike conventional architectures used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to get high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing methods on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models. Code is available at https://github.com/nnizhang/VST.

Citations (310)

Summary

  • The paper introduces a convolution-free Transformer paradigm that redefines salient object detection by eliminating traditional CNN operations.
  • It employs innovative multi-level token fusion and token upsampling techniques, achieving superior detection accuracy on benchmark datasets such as DUTLF-Depth and NJUD (a rough sketch of the upsampling idea follows this list).
  • The approach integrates joint saliency and boundary detection using a patch-task-attention mechanism, paving the way for broader applications in computer vision.
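The token upsampling mentioned above can be pictured with a short, hypothetical sketch. The code below is not the authors' implementation (the paper describes a reverse tokens-to-token style upsampling); it simply illustrates one common way to expand each token into an r x r block of finer tokens so the sequence maps back onto a higher-resolution grid. The class name, projection layer, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TokenUpsample(nn.Module):
    """Illustrative token upsampling: each token is expanded into an r x r
    block of finer tokens (a pixel-shuffle-style sketch, not the paper's
    exact reverse tokens-to-token operation)."""

    def __init__(self, dim: int, r: int = 2):
        super().__init__()
        self.r = r
        self.expand = nn.Linear(dim, dim * r * r)  # project each token to r*r sub-tokens

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, h*w, C), arranged row-major on an h x w grid
        B, N, C = tokens.shape
        x = self.expand(tokens).view(B, h, w, self.r, self.r, C)
        # interleave the sub-tokens so the output is a row-major (h*r) x (w*r) grid
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, h * self.r * w * self.r, C)
        return x  # (B, (h*r)*(w*r), C) finer token sequence


if __name__ == "__main__":
    t = torch.randn(2, 14 * 14, 384)                  # coarse 14x14 token grid
    print(TokenUpsample(384, r=2)(t, 14, 14).shape)   # torch.Size([2, 784, 384])
```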

Visual Saliency Transformer: A Convolution-Free Approach to Salient Object Detection

The paper "Visual Saliency Transformer" presents a novel exploration into salient object detection (SOD) leveraging a pure transformer framework. This marks a departure from the convolutional neural network (CNN)-based approaches that have traditionally dominated the field. The authors propose using a Visual Saliency Transformer (VST) to address the challenges of RGB and RGB-D saliency detection by modeling long-range dependencies inherent in visual data, providing a refreshed perspective on dense prediction tasks.

The core innovation of this work lies in reimagining SOD from a sequence-to-sequence perspective, eliminating convolution operations altogether. The VST introduces several novel components, including a multi-level token fusion strategy and a new token upsampling method, which are crucial for high-resolution saliency prediction. Further, the paper adopts a token-based multi-task decoder that uses task-related tokens and a patch-task-attention mechanism to perform saliency and boundary detection jointly. Together, these components lay the groundwork for transformer architectures in dense prediction.
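To make the decoder idea concrete, here is a minimal, hypothetical sketch of patch-task attention in PyTorch: a learned task token (for example, a saliency token) is appended to the sequence of patch tokens, and the patch tokens attend to it to produce task-aware features. The attention direction, class name, and tensor shapes are assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn as nn


class PatchTaskAttention(nn.Module):
    """Minimal sketch: patch tokens query a learned task token to produce
    task-specific patch features (illustrative, not the authors' code)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.task_token = nn.Parameter(torch.zeros(1, 1, dim))  # e.g. a saliency token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, C) sequence of patch embeddings
        B = patch_tokens.size(0)
        task = self.task_token.expand(B, -1, -1)        # (B, 1, C)
        kv = torch.cat([task, patch_tokens], dim=1)     # task token joins the sequence
        out, _ = self.attn(patch_tokens, kv, kv)        # patches attend to task + patches
        return self.proj(out)                           # task-aware patch tokens


if __name__ == "__main__":
    x = torch.randn(2, 196, 384)         # 14x14 patches, ViT-small width
    print(PatchTaskAttention(384)(x).shape)  # torch.Size([2, 196, 384])
```

In the full model, separate saliency and boundary tokens would drive two such branches, which is what allows the decoder to predict both maps jointly.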

Results from extensive experiments demonstrate the advantage of VST over established CNN-based methods across a range of benchmark datasets. For example, on RGB-D datasets such as DUTLF-Depth and NJUD, VST achieves higher structure measure (S_m) and maximum F-measure (maxF) values alongside lower mean absolute error (MAE), indicating improved detection accuracy.
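For reference, these metrics are standard in the SOD literature; a short NumPy sketch of MAE and maximum F-measure is given below (the structure measure S_m has a longer definition and is omitted). The β² = 0.3 weighting is the usual convention; the 256-threshold sweep and epsilon values are illustrative choices.

```python
import numpy as np


def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a saliency map and the ground truth,
    both scaled to [0, 1]."""
    return float(np.mean(np.abs(pred - gt)))


def max_f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """Maximum F-measure over 256 binarization thresholds (beta^2 = 0.3 is
    the common convention in the SOD literature)."""
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, 256):
        bin_pred = pred >= t
        tp = np.logical_and(bin_pred, gt_bin).sum()
        precision = tp / (bin_pred.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.random((64, 64))
    gt = (rng.random((64, 64)) > 0.5).astype(float)
    print(mae(pred, gt), max_f_measure(pred, gt))
```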

One bold implication of this work is the paradigm shift it suggests for computer vision tasks. By eschewing convolution, the VST sets a precedent for leveraging transformer architectures in other dense prediction tasks, potentially simplifying model architecture while improving performance and efficiency. This work challenges existing methodologies and opens a discourse on optimizing and applying transformer models across diverse vision tasks.

The future directions suggested by this paper point toward more efficient token-based representations and further refinement of cross-modality interaction for tasks beyond SOD. The presented transformer architecture can inspire broader application to tasks involving complex spatial structure and dependencies, such as depth estimation and semantic segmentation.

In conclusion, the Visual Saliency Transformer represents a significant stride toward integrating transformer architectures into vision tasks traditionally dominated by CNNs. While the paper's primary focus is SOD, its approaches and findings hold potential for broader impact across computer vision.