Quadtree Attention for Vision Transformers
The paper presents a novel approach for improving the computational efficiency of vision transformers through a mechanism termed Quadtree Attention. The method addresses the quadratic complexity of standard self-attention, which poses significant challenges for high-resolution vision tasks such as object detection and stereo matching. By reducing this complexity from quadratic to linear in the number of tokens, Quadtree Attention broadens the applicability and scalability of transformers in vision applications.
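To see where the linear scaling comes from, a back-of-the-envelope estimate helps. The sketch below assumes a token pyramid that downsamples by a factor of four per level and a fixed number K of top-scoring patches retained per query; it is a rough illustration, not the paper's exact derivation.

```latex
% Rough cost comparison (illustrative assumptions: 4x downsampling per level,
% a fixed number K of top-scoring patches kept per query).
\[
  C_{\mathrm{full}} = O(N^2),
  \qquad
  C_{\mathrm{quadtree}}
  = \sum_{\ell \ge 0}
    \underbrace{\frac{N}{4^{\ell}}}_{\text{queries at level }\ell}
    \cdot \underbrace{4K}_{\text{keys attended}}
  \;\le\; \tfrac{16}{3}\,K N = O(KN).
\]
% A dense attention term remains at the coarsest pyramid level, but for a
% fixed coarsest resolution it is a constant independent of N.
```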
Methodology
Quadtree Attention introduces a hierarchical approach to token processing: it builds token pyramids spanning multiple resolution levels and, instead of evaluating attention densely across the entire image, computes attention scores in a coarse-to-fine manner (a simplified code sketch follows the list below):
- Coarse-to-Fine Attention: At each pyramid level, the K patches with the highest attention scores are selected, and attention at the next, finer level is evaluated only within the sub-regions of those patches, limiting computation to the most relevant regions.
- Token Pyramids: The method builds pyramids by down-sampling query, key, and value tokens, enabling sparse attention mechanisms and reducing computational demands.
- QuadTree Formulation: Because each coarse patch is subdivided into four finer patches at the next level, the selected regions naturally form a quadtree, allowing the model to retain long-range dependencies while efficiently capturing fine-grained details.
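The mechanism can be made concrete with a small sketch. The code below is a hypothetical two-level illustration in PyTorch, not the paper's implementation: it pools tokens once to form a coarse level, runs dense attention there, keeps the top-K coarse key patches per coarse query patch, restricts fine-level attention to their 2x2 children, and combines the two messages with a plain sum. The function name `quadtree_attention_2level`, the pooling factor, and the sum-based aggregation are illustrative assumptions; the actual method recurses over more pyramid levels, and its variants differ in how per-level messages are weighted and aggregated.

```python
import torch
import torch.nn.functional as F


def quadtree_attention_2level(q, k, v, H, W, topk=4):
    """q, k, v: (B, H*W, C) fine-level tokens on an H x W grid (H, W even)."""
    B, N, C = q.shape
    assert N == H * W and H % 2 == 0 and W % 2 == 0
    scale = C ** -0.5

    def pool(x):
        # 2x2 average pooling builds the coarse level of the token pyramid.
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = F.avg_pool2d(x, 2)
        return x.flatten(2).transpose(1, 2)             # (B, N/4, C)

    qc, kc, vc = pool(q), pool(k), pool(v)              # coarse-level tokens
    Hc, Wc = H // 2, W // 2

    # 1) Dense attention at the coarse level (only (N/4)^2 scores).
    attn_c = (qc @ kc.transpose(-2, -1)) * scale        # (B, Nc, Nc)
    attn_c = attn_c.softmax(dim=-1)
    msg_c = attn_c @ vc                                  # coarse message per patch

    # 2) Keep only the top-K coarse key patches for every coarse query patch.
    topk_idx = attn_c.topk(topk, dim=-1).indices         # (B, Nc, K)

    # 3) Map each selected coarse patch to its four fine-level children.
    cy, cx = topk_idx // Wc, topk_idx % Wc
    child = torch.stack([(2 * cy + dy) * W + (2 * cx + dx)
                         for dy in (0, 1) for dx in (0, 1)], dim=-1)
    child = child.flatten(-2)                            # (B, Nc, 4K) fine key ids

    # 4) Fine queries attend only to the children of their patch's selection.
    out = torch.zeros_like(q)
    for b in range(B):                                   # loops kept for clarity
        for p in range(Hc * Wc):
            py, px = p // Wc, p % Wc
            q_ids = torch.tensor([(2 * py + dy) * W + (2 * px + dx)
                                  for dy in (0, 1) for dx in (0, 1)])
            k_ids = child[b, p]
            a = ((q[b, q_ids] @ k[b, k_ids].T) * scale).softmax(dim=-1)
            fine_msg = a @ v[b, k_ids]
            # Combine coarse and fine messages (simple sum here; the paper's
            # variants use more sophisticated weighting/aggregation).
            out[b, q_ids] = fine_msg + msg_c[b, p]
    return out
```

The explicit Python loops are kept for readability; an efficient implementation would batch the index gathers (e.g., with `torch.gather`) and recurse over additional pyramid levels rather than stopping at two.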
Two architectural variants, QuadTree-A and QuadTree-B, are explored, with the latter showing superior empirical results due to improved weighting and aggregation strategies.
Empirical Results
The Quadtree Attention approach has demonstrated significant performance improvements across various vision tasks:
- Feature Matching: Achieved a 4.0% improvement on ScanNet, enhancing the quality of camera pose estimation.
- Stereo Matching: Reduced computational overhead by 50% in terms of FLOPs while maintaining competitive End-Point-Error (EPE).
- Image Classification: Improved top-1 accuracy on ImageNet by up to 1.5% over previous state-of-the-art transformers.
- Object Detection and Semantic Segmentation: Enhanced performance on COCO with up to a 1.8% improvement in average precision.
The paper provides comprehensive evaluations and comparisons with other efficient transformer models, such as Swin and PVT, showcasing Quadtree Attention's consistent advantages in accuracy and efficiency.
Implications and Future Work
The proposed method is a significant step toward making transformers practical for vision tasks that involve high-resolution data. Reducing the computational complexity from quadratic to linear makes vision transformers more viable for real-time and resource-constrained environments, and the joint handling of coarse- and fine-level details supports more detailed and accurate predictions.
Future work could explore integrating Quadtree Attention with emerging transformer architectures and extending it to other domains that require high-resolution inputs. Additionally, further optimizing the GPU implementation could improve runtime efficiency, making the approach even more attractive for practical applications.
In conclusion, Quadtree Attention provides a robust and efficient method for improving vision transformers, facilitating their broader adoption in computer vision research and industry applications.