Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
The paper "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions" introduces the Pyramid Vision Transformer (PVT), a purely Transformer-based model designed to serve as a backbone for a variety of computer vision tasks, including dense prediction tasks such as object detection and semantic segmentation. Unlike traditional convolutional neural networks (CNNs) and the recently introduced Vision Transformer (ViT), PVT combines the best of both worlds: the global receptive fields facilitated by Transformers and the hierarchical feature maps critical for dense predictions.
Core Innovations
- Progressive Shrinking Pyramid: In place of ViT's flat, single-scale structure, PVT adopts a progressive shrinking pyramid: early stages operate on high-resolution feature maps and deeper stages on progressively lower-resolution ones. This keeps computational and memory cost manageable while producing the multi-scale outputs that dense prediction heads expect (see the first sketch after this list).
- Spatial-Reduction Attention: To tame the computational and memory cost of standard multi-head attention on large feature maps, the paper introduces a Spatial-Reduction Attention (SRA) layer. SRA shrinks the spatial dimensions of the key and value inputs before attention is computed, substantially lowering the overhead (see the second sketch after this list).
- Versatility Across Tasks: PVT is applicable to multiple computer vision tasks without requiring convolutions. It demonstrates strong performance across image classification, object detection, and semantic and instance segmentation tasks.
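To ground the pyramid structure, here is a minimal PyTorch sketch of the staged tokenization. The strided-convolution patch embedding is equivalent to the non-overlapping linear patch projection the paper describes; the channel widths and patch sizes below follow the PVT-Small configuration, while the class and variable names are our own, and per-stage positional embeddings and normalization are omitted for brevity.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Tokenize a feature map with a strided conv (kernel = stride = patch
    size), so each stage also downsamples. Minimal sketch: the paper's
    version adds layer norm and per-stage positional embeddings."""
    def __init__(self, in_ch, embed_dim, patch_size):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                   # (B, C, H/P, W/P)
        B, C, H, W = x.shape
        return x.flatten(2).transpose(1, 2), H, W  # token sequence + grid size

# Four stages shrink a 224x224 image to strides 4, 8, 16, 32 -- the same
# multi-scale layout a ResNet-style backbone exposes to FPN-like necks.
dims, patch_sizes = [64, 128, 320, 512], [4, 2, 2, 2]
x, in_ch = torch.randn(1, 3, 224, 224), 3
for dim, p in zip(dims, patch_sizes):
    tokens, H, W = PatchEmbed(in_ch, dim, p)(x)
    # ... Transformer encoder blocks (with SRA) would run on `tokens` here ...
    x, in_ch = tokens.transpose(1, 2).reshape(1, dim, H, W), dim
    print(tuple(x.shape))  # (1,64,56,56) -> (1,128,28,28) -> (1,320,14,14) -> (1,512,7,7)
```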
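The second sketch condenses SRA itself. It mirrors the mechanism described in the paper: a strided convolution with kernel and stride equal to the reduction ratio R shrinks the key/value token grid by a factor of R^2 before ordinary multi-head attention, so the attention matrix becomes N x N/R^2 instead of N x N. Defaults and names here are illustrative.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Multi-head attention whose keys/values come from a spatially
    reduced copy of the input, cutting attention cost by ~R^2."""
    def __init__(self, dim, num_heads=8, sr_ratio=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # Strided conv merges each sr_ratio x sr_ratio block of tokens into one
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W tokens
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        if self.sr_ratio > 1:
            x_ = x.transpose(1, 2).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)  # N / R^2 tokens
            x_ = self.norm(x_)
        else:
            x_ = x
        kv = (self.kv(x_)
              .reshape(B, -1, 2, self.num_heads, self.head_dim)
              .permute(2, 0, 3, 1, 4))
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N/R^2)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Stage-1-like settings: 56x56 tokens, one head, reduction ratio 8.
sra = SpatialReductionAttention(dim=64, num_heads=1, sr_ratio=8)
x = torch.randn(1, 56 * 56, 64)
print(sra(x, 56, 56).shape)  # torch.Size([1, 3136, 64])
```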
Empirical Evaluation
The empirical evaluation benchmarks PVT against traditional CNN-based backbones such as ResNet and ResNeXt, as well as against ViT. The reported results show consistent gains, particularly on dense prediction tasks: with a RetinaNet detector, for example, PVT-Small reaches 40.4 AP on COCO, surpassing the ResNet50 backbone (36.3 AP) by 4.1 points. Comparable improvements are reported for semantic and instance segmentation.
Implications and Future Directions
Practical Implications:
- Enhanced Flexibility: The ability to generate multi-scale feature maps makes PVT an attractive backbone for a variety of dense prediction tasks, reducing the need for task-specific adjustments.
- Reduced Computational Load: The SRA mechanism makes multi-head attention affordable on high-resolution feature maps, which is what allows PVT to retain fine-grained features in its early stages (the token-count check after this list makes the saving concrete).
- Convolution-Free Pipelines: The paper demonstrates that it is possible to build effective object detection and segmentation models without convolutions, paving the way for new paradigms in computer vision model design.
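As a quick sanity check on that claim, the arithmetic below (ours, using the stage-1 settings reported in the paper: stride-4 features from a 224x224 input and reduction ratio R = 8) shows how much smaller the attention matrix becomes:

```python
# Stage-1 settings from the paper: stride-4 features, reduction ratio R = 8.
H = W = 224 // 4            # 56x56 feature map
N = H * W                   # 3136 query tokens
R = 8
N_kv = N // (R * R)         # 49 key/value tokens after spatial reduction
print(N, N_kv, N // N_kv)   # 3136 49 64
# The attention matrix shrinks from N x N to N x N_kv: 64x fewer entries.
```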
Theoretical Implications:
- Generalized Feature Extraction: By integrating the strengths of a pyramid architecture with Transformer encoders, PVT challenges the dominance of CNN backbones and sets a precedent for subsequent research on Transformer-based vision models.
- Task-Agnostic Design: The work motivates the development of more task-agnostic backbone networks, capable of delivering strong performance across diverse computer vision tasks through a unified architecture.
Speculations on Future Developments:
- Extended Architectures: Future research may extend the PVT architecture with dilated convolutions, more efficient attention variants, and advanced normalization techniques to further enhance performance.
- Architecture Search for Transformers: Given the promising results of PVT, there may be a surge in efforts to automate the architecture design for Transformer-based models through Neural Architecture Search (NAS) techniques.
- Applications Beyond Standard Vision Tasks: The versatility and efficiency of PVT could also benefit areas such as medical imaging, 3D object recognition, and real-time video analysis.
Conclusion
The Pyramid Vision Transformer (PVT) represents a substantial advance in the field of computer vision, presenting a viable alternative to widely adopted CNN backbones. By combining a pyramid structure with Transformer encoders and introducing the spatial-reduction attention mechanism, PVT achieves state-of-the-art results in dense prediction tasks while maintaining computational efficiency. This work significantly broadens the applicability of Transformer models in computer vision and sets the stage for future explorations in convolution-free architectures.