Overview of "SegViT: Semantic Segmentation with Plain Vision Transformers"
The paper presents a novel approach to semantic segmentation built on plain Vision Transformers (ViTs), without the hierarchical feature structures typical of convolutional neural networks (CNNs). The proposed framework, named SegViT, introduces a lightweight Attention-to-Mask (ATM) module for semantic segmentation. Unlike conventional methods that perform pixel-level classification on ViT outputs, SegViT leverages the intrinsic attention mechanism of ViTs to produce segmentation masks directly.
Technical Contributions and Methodology
The core innovation of this work is the ATM module, which transforms similarity maps generated by the attention mechanism into segmentation masks. Each mask delineates the image region associated with a learnable, class-specific token, giving a mask-level paradigm for semantic segmentation. Concretely, the ATM module is a transformer block in which the learnable class tokens serve as queries, while the spatial feature maps serve as keys and values. The similarity between the queries (class tokens) and keys (spatial features) is computed with a dot product; this similarity map is then converted into segmentation masks via a sigmoid function, while classification predictions are derived from the updated class tokens.
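Below is a minimal PyTorch sketch of this query/key cross-attention mechanism, intended to illustrate the idea rather than reproduce the authors' implementation; the class name AttentionToMask, the single-head attention, the embedding size, the number of classes, and the extra "no-object" slot are assumptions made for the example.

```python
import torch
import torch.nn as nn

class AttentionToMask(nn.Module):
    """Single-head cross-attention decoder in the spirit of SegViT's ATM module (sketch)."""
    def __init__(self, embed_dim=1024, num_classes=150):
        super().__init__()
        # One learnable token per semantic class; these act as the queries.
        self.class_tokens = nn.Parameter(torch.randn(1, num_classes, embed_dim))
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        # Linear classifier applied to the updated class tokens (+1 "no-object" slot, assumed here).
        self.cls_head = nn.Linear(embed_dim, num_classes + 1)
        self.scale = embed_dim ** -0.5

    def forward(self, feats):
        # feats: (B, N, C) patch tokens from a plain ViT layer, with N = H * W patches.
        B = feats.shape[0]
        q = self.q_proj(self.class_tokens.expand(B, -1, -1))   # (B, K, C)
        k = self.k_proj(feats)                                  # (B, N, C)
        v = self.v_proj(feats)                                  # (B, N, C)

        # Dot-product similarity between class tokens (queries) and spatial keys.
        sim = torch.einsum('bkc,bnc->bkn', q, k) * self.scale   # (B, K, N)

        # Segmentation masks come directly from the similarity map via a sigmoid,
        # rather than from per-pixel classification of the backbone output.
        masks = sim.sigmoid()                                   # (B, K, N)

        # Class tokens are updated through the attention and then classified.
        updated = torch.einsum('bkn,bnc->bkc', sim.softmax(dim=-1), v)  # (B, K, C)
        class_logits = self.cls_head(updated)                   # (B, K, K + 1)
        return masks, class_logits
```

At inference, the per-class masks can be reshaped to the patch grid and weighted by the corresponding class scores to form the final segmentation, in the spirit of mask-classification decoders.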
The SegViT architecture applies multiple ATM modules in a cascaded fashion to features from different layers of the ViT, improving segmentation performance by incorporating multi-layer information. Additionally, to address the computational cost typically associated with plain ViT backbones, a Shrunk structure is proposed: query-based down-sampling (QD) and query-based up-sampling (QU) modules reduce the computation by up to 40% while maintaining competitive performance.
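To make the cascading concrete, the following sketch stacks several such ATM heads over features taken from different ViT layers, reusing the AttentionToMask class from the previous example. The number of stages, the choice of layers, and the simple summation used to fuse per-stage masks are illustrative assumptions rather than the paper's exact recipe, and the Shrunk structure's QD/QU modules are not shown.

```python
import torch
import torch.nn as nn

class CascadeATMHead(nn.Module):
    """Cascade of ATM-style heads over multi-layer ViT features (illustrative sketch)."""
    def __init__(self, embed_dim=1024, num_classes=150, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            [AttentionToMask(embed_dim, num_classes) for _ in range(num_stages)]
        )

    def forward(self, multi_layer_feats):
        # multi_layer_feats: list of (B, N, C) tensors taken from different
        # ViT layers (e.g. a shallow, a middle, and the final layer).
        mask_list, logit_list = [], []
        for feats, stage in zip(multi_layer_feats, self.stages):
            masks, logits = stage(feats)
            mask_list.append(masks)
            logit_list.append(logits)
        # Fuse per-stage mask predictions; a plain sum is used here as a stand-in
        # for the paper's combination scheme.
        fused_masks = torch.stack(mask_list, dim=0).sum(dim=0)
        return fused_masks, logit_list
```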
Results and Implications
Empirical evaluations on datasets such as ADE20K, PASCAL-Context, and COCO-Stuff-10K demonstrate that SegViT achieves state-of-the-art performance compared to existing ViT-based segmentation methods. Notably, the architecture achieves a mean Intersection over Union (mIoU) of 55.2% on ADE20K with the ViT-Large backbone, surpassing other state-of-the-art methods with similar backbones. Furthermore, the Shrunk version achieves nearly equivalent performance with significantly reduced computational demand, indicating the efficiency of the proposed optimizations.
The introduction of the ATM module and the Shrunk structure has significant implications for the design of transformer-based semantic segmentation architectures. By exploiting the spatial information already present in attention maps, SegViT reduces the need for computationally heavy components such as the feature pyramids used in hierarchical models, paving the way for lighter and more scalable solutions in settings where computational resources are constrained.
Future Directions
This research sets a promising precedent for future work on semantic segmentation with transformer-based architectures and encourages further exploration of non-hierarchical backbones for dense prediction tasks. One avenue for future work is improving the scalability of the ATM module to further reduce computational cost. Adapting the principles outlined in this paper to other dense prediction tasks, such as instance or panoptic segmentation, could also prove fruitful, as could evaluating the approach across a broader range of datasets and backbones.
In conclusion, the SegViT approach showcases the versatility and capability of Vision Transformers in semantic segmentation, marking an advancement in the efficient utilization of attention mechanisms within plain transformer frameworks.