Vision Transformers: From Semantic Segmentation to Dense Prediction (2207.09339v4)

Published 19 Jul 2022 in cs.CV

Abstract: The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representation at full receptive field per layer across all the image patches, in comparison to the increasing receptive fields of CNNs across layers and other alternatives (e.g., large kernels and atrous convolution). In this work, for the first time we explore the global context learning potentials of ViTs for dense visual prediction (e.g., semantic segmentation). Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information, critical for dense prediction tasks. We first demonstrate that encoding an image as a sequence of patches, a vanilla ViT without local convolution and resolution reduction can yield stronger visual representation for semantic segmentation. For example, our model, termed as SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, the first position in the test leaderboard on the day of submission) and performs competitively on Cityscapes. However, the basic ViT architecture falls short in broader dense prediction applications, such as object detection and instance segmentation, due to its lack of a pyramidal structure, high computational demand, and insufficient local context. For tackling general dense visual prediction tasks in a cost-effective manner, we further formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture. Extensive experiments show that our methods achieve appealing performance on a variety of dense prediction tasks (e.g., object detection and instance segmentation and semantic segmentation) as well as image classification.

Vision Transformers: From Semantic Segmentation to Dense Prediction

The paper "Vision Transformers: From Semantic Segmentation to Dense Prediction" presents a methodological exploration into the use of Vision Transformers (ViTs) for dense visual prediction tasks, including semantic segmentation, object detection, and instance segmentation. The research leverages the global context learning abilities of ViTs to outperform traditional convolutional neural networks (CNNs), offering a strategic alternative to the latter's limitation in dealing with long-range dependencies due to their progressively increasing receptive fields.

Key Contributions

The authors introduce a novel architecture termed SEgmentation TRansformer (SETR), which adapts the ViT from image classification to dense prediction tasks. Unlike CNN-based fully convolutional networks (FCNs), which rely on modifications such as atrous convolutions or large kernels to enlarge their receptive field, SETR processes images as sequences of patches with a full receptive field at every layer, allowing comprehensive context modeling without spatial resolution reduction.
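
To make this concrete, the following is a minimal PyTorch sketch of a SETR-style pipeline: a strided-convolution patch embedding, a stack of standard transformer encoder layers that attend over all patches at every depth, and a simple 1x1-convolution head followed by bilinear upsampling (in the spirit of the paper's simplest decoder). The hyperparameters here (patch size 16, 12 layers, 768-dimensional embeddings, 150 classes as in ADE20K) are illustrative defaults, not necessarily the authors' exact configuration.

```python
# Minimal sketch of a "ViT as segmentation encoder" pipeline in the style of SETR.
# Hyperparameters are illustrative, not the paper's exact settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTSegmenter(nn.Module):
    def __init__(self, img_size=512, patch_size=16, dim=768, depth=12, heads=12, num_classes=150):
        super().__init__()
        # Patch embedding: a strided conv turns the image into a grid of patch tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Standard transformer encoder: every layer attends over all patch tokens,
        # i.e. a full receptive field at every depth, with no resolution reduction.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Simple decoder head: 1x1 conv to class logits, then bilinear upsampling
        # back to the input resolution (roughly the paper's simplest decoder variant).
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, x):
        B, _, H, W = x.shape
        feat = self.patch_embed(x)                      # (B, dim, H/16, W/16)
        h, w = feat.shape[2:]
        tokens = feat.flatten(2).transpose(1, 2)        # (B, N, dim)
        tokens = self.encoder(tokens + self.pos_embed[:, : tokens.size(1)])
        feat = tokens.transpose(1, 2).reshape(B, -1, h, w)
        logits = self.head(feat)
        return F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)

model = ViTSegmenter()
out = model(torch.randn(1, 3, 512, 512))                # -> (1, 150, 512, 512)
```

The paper's stronger decoder variants replace the single upsampling step with progressive upsampling or multi-level feature aggregation, but the encoder-side idea, full attention over patch tokens at constant resolution, is the same.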

Several innovative features characterize their approach:

  • Transformers for Dense Prediction: SETR leverages the self-attention mechanism, providing each layer with a global receptive field capable of capturing long-range dependencies more effectively than CNNs.
  • Hierarchical Local-Global (HLG) Transformers: To efficiently tackle dense visual prediction while controlling computational costs, the paper proposes HLG Transformers. This architecture integrates local attention within windows and global attention across windows within a pyramidal structure, enhancing both local feature extraction and global context modeling; a simplified sketch of this attention pattern follows the list below.
  • Semantic Segmentation Performance: The SETR model demonstrated superior performance on major semantic segmentation benchmarks, including ADE20K, Pascal Context, and Cityscapes. Progressive upsampling and multi-level feature aggregation in the decoder further boosted its performance.
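
The sketch below illustrates the local-global attention pattern referred to in the HLG bullet: self-attention restricted to non-overlapping windows, followed by attention from every token to pooled per-window summary tokens as a stand-in for cross-window global attention. This is one plausible reading of "local attention within windows and global attention across windows", not the authors' exact block design; the window size, dimensions, and mean-pooled global path are assumptions made for illustration.

```python
# Illustrative local-global attention stage: window-local self-attention,
# then attention from every token to pooled per-window summaries.
# This is a simplified interpretation, not the authors' exact HLG block.
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    def __init__(self, dim=96, heads=3, window=7):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) feature map at one pyramid level;
        # H and W are assumed divisible by the window size.
        B, H, W, C = x.shape
        w = self.window
        nH, nW = H // w, W // w
        # Local attention: partition into non-overlapping windows and let tokens
        # attend only within their own window.
        win = x.reshape(B, nH, w, nW, w, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(B * nH * nW, w * w, C)
        q = self.norm1(win)
        win = win + self.local_attn(q, q, q, need_weights=False)[0]
        # Global attention: every token attends to one pooled summary token per
        # window, a cheap stand-in for attention across windows.
        summaries = win.mean(dim=1).reshape(B, nH * nW, C)
        tokens = win.reshape(B, nH * nW * w * w, C)
        qg, kv = self.norm2(tokens), self.norm2(summaries)
        tokens = tokens + self.global_attn(qg, kv, kv, need_weights=False)[0]
        # Undo the window partitioning to restore the (B, H, W, C) layout.
        out = tokens.reshape(B, nH, nW, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)

block = LocalGlobalBlock(dim=96, heads=3, window=7)
y = block(torch.randn(2, 56, 56, 96))   # -> torch.Size([2, 56, 56, 96])
```

In a full hierarchical model, several such blocks would be stacked per stage, with patch merging between stages to build the pyramid of progressively coarser feature maps that object detection and instance segmentation heads expect.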

Experimental Results

The paper reports compelling numerical results:

  • On the ADE20K dataset, SETR achieved 50.09% mIoU with multi-scale inference, surpassing several state-of-the-art models.
  • On the Cityscapes validation set, the model achieved 82.15% mIoU with multi-scale inference.
  • HLG Transformers of different sizes performed strongly on image classification as well as dense prediction benchmarks spanning object detection, instance segmentation, and semantic segmentation, outperforming notable models like the Swin Transformer across various metrics.

Theoretical and Practical Implications

The research delineated in this paper illustrates a noteworthy shift in the approach to visual representation learning. With ViTs providing a robust alternative to convolution-based paradigms, this work lays the foundation for further exploration of non-local, self-attentive architectures in computer vision. The ability to capture rich, long-range dependencies stands to benefit applications requiring fine-grained analysis and a deep understanding of complex scenes.

Practically, the versatility shown in both lightweight and large-scale architectures affirms the potential integration of HLG Transformers across various real-world applications such as autonomous driving, medical imaging, and video surveillance.

Future Directions

Given the flexibility and effective performance of the proposed architectures, future avenues of research could explore:

  • Scaling Transformers for even larger image resolutions and more intricate semantic tasks.
  • Further optimizations in computational efficiency and real-time applicability.
  • Aspects of self-supervised or transfer learning within the context of ViTs for diverse visual tasks.

Overall, this paper significantly contributes to the ongoing discourse on the applicability of Transformers beyond NLP and into areas traditionally dominated by convolutional approaches, reinforcing the scope and impact of ViTs within the field of computer vision.

Authors (9)
  1. Li Zhang
  2. Jiachen Lu
  3. Sixiao Zheng
  4. Xinxuan Zhao
  5. Xiatian Zhu
  6. Yanwei Fu
  7. Tao Xiang
  8. Jianfeng Feng
  9. Philip H. S. Torr