- The paper's main contribution is proposing LIT, which reduces self-attention in early layers by using MLPs for local feature extraction.
- It introduces a Deformable Token Merging module that adaptively fuses informative patches to address geometric variations in visual data.
- Experiments on ImageNet and COCO show that LIT achieves a strong balance of efficiency and performance, outperforming comparable models.
Less is More: Pay Less Attention in Vision Transformers
"Less is More: Pay Less Attention in Vision Transformers" by Zizheng Pan and colleagues makes a substantial contribution to computer vision by rethinking where self-attention is actually needed in Vision Transformers (ViTs). Transformers have gained popularity in vision because they model long-range dependencies well, yet the quadratic complexity of self-attention remains a significant obstacle, particularly for high-resolution inputs and dense prediction tasks.
Core Contributions
- Hierarchical Design with Layer-Specific Attention: The authors propose the Less attention vIsion Transformer (LIT), which strategically reduces reliance on self-attention in the early layers of the network. Early stages employ multi-layer perceptrons (MLPs) to encode local patterns, while conventional self-attention is reserved for the deeper layers. This design stems from the observation that shallow layers mainly capture local patterns, making costly self-attention largely redundant at those stages (a minimal sketch of this layer allocation follows the list).
- Deformable Token Merging: A further component of LIT is the Deformable Token Merging (DTM) module, inspired by deformable convolutions. By learning spatial offsets, DTM adaptively merges informative patches rather than sampling them from the rigid grid used in conventional patch-merging layers, improving the model's ability to handle geometric variations and transformations in the data (see the second sketch after the list).
- Efficiency and Performance Balance: By allocating computation carefully, using MLPs for local pattern extraction and invoking self-attention only where modeling global dependencies matters, LIT offers a favorable balance of efficiency and performance. Experiments on ImageNet show that the approach achieves competitive accuracy with significantly reduced computational demands.
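To make the layer-allocation idea concrete, here is a minimal PyTorch sketch of the two block types: a pure-MLP block for the early stages and a standard attention block for the later stages. The class names, the use of `nn.MultiheadAttention`, the stage layout, and all hyperparameters are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """Early-stage block: residual MLP only, no self-attention (assumed simplification)."""
    def __init__(self, dim, mlp_ratio=4.0):
        super().__init__()
        hidden = int(dim * mlp_ratio)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                 # x: (B, N, dim) token sequence
        return x + self.mlp(self.norm(x))

class AttentionBlock(nn.Module):
    """Later-stage block: standard pre-norm multi-head self-attention followed by an MLP."""
    def __init__(self, dim, num_heads, mlp_ratio=4.0):
        super().__init__()
        hidden = int(dim * mlp_ratio)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]   # global dependencies
        return x + self.mlp(self.norm2(x))

# Hypothetical stage layout: attention is skipped entirely in the first two stages.
stages = nn.ModuleList([
    nn.Sequential(*[MLPBlock(64) for _ in range(2)]),                       # stage 1: MLP only
    nn.Sequential(*[MLPBlock(128) for _ in range(2)]),                      # stage 2: MLP only
    nn.Sequential(*[AttentionBlock(256, num_heads=4) for _ in range(6)]),   # stage 3: attention
    nn.Sequential(*[AttentionBlock(512, num_heads=8) for _ in range(2)]),   # stage 4: attention
])
```

Between stages, a token-merging step reduces the sequence length and increases the channel dimension; that step is the subject of the next sketch.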
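The following is a hedged sketch of a DTM-style downsampling layer built on torchvision's `deform_conv2d`: a small convolution predicts per-location sampling offsets, and a stride-2 deformable convolution merges a learned neighbourhood of patches into each output token. The kernel size, initialization, and trailing LayerNorm are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableTokenMerging(nn.Module):
    """DTM-style downsampling: offsets are learned so that each merged token
    aggregates an adaptive neighbourhood instead of a fixed 2x2 grid."""
    def __init__(self, in_dim, out_dim, kernel_size=2, stride=2):
        super().__init__()
        self.stride = stride
        # Predicts 2 offsets (dy, dx) per kernel position for every output location.
        self.offset_pred = nn.Conv2d(
            in_dim, 2 * kernel_size * kernel_size,
            kernel_size=kernel_size, stride=stride,
        )
        self.weight = nn.Parameter(torch.empty(out_dim, in_dim, kernel_size, kernel_size))
        nn.init.kaiming_uniform_(self.weight)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):                        # x: (B, C, H, W) feature map
        offsets = self.offset_pred(x)            # (B, 2*k*k, H/2, W/2)
        merged = deform_conv2d(x, offsets, self.weight, stride=self.stride)
        B, C, H, W = merged.shape                # (B, out_dim, H/2, W/2)
        tokens = merged.flatten(2).transpose(1, 2)   # (B, H*W, out_dim)
        return self.norm(tokens), (H, W)

# Example: merge a 56x56 feature map into 28x28 tokens with doubled channels.
x = torch.randn(1, 64, 56, 56)
tokens, hw = DeformableTokenMerging(64, 128)(x)  # tokens: (1, 784, 128)
```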
Experimental Validation
LIT achieves commendable results across tasks and datasets. On ImageNet classification, LIT models consistently outperform comparable counterparts such as PVT and Swin Transformer in efficiency, delivering higher throughput and lower memory consumption while matching or surpassing their accuracy. Specifically, the LIT-Ti variant surpasses PVT-S by 1.3% in top-1 accuracy at a lower computational cost (0.2G fewer FLOPs).
The benefits of LIT extend to object detection and instance segmentation, validated on the COCO dataset with both the RetinaNet and Mask R-CNN frameworks, where LIT backbones outperform several state-of-the-art models, again demonstrating the effectiveness of the local-then-global feature learning strategy inherent in its design.
Theoretical and Practical Implications
The theoretical underpinning of LIT challenges the necessity of full self-attention in early ViT layers. This insight opens a new avenue for hierarchical architectures that balance local and global feature learning across depth, which is key for computational efficiency.
Furthermore, deformable token merging shifts patch sampling from a static grid to a learned, dynamic scheme, which could benefit other domains where spatial transformations affect predictive accuracy and may inspire similar designs across tasks and deployment settings.
Future Directions
The introduction of LIT suggests several paths for future research. Neural Architecture Search (NAS) could be used to find optimal configurations of MLP- and attention-based layers within LIT. Efficient attention mechanisms, such as kernelization or sparsity techniques, could further refine the architecture. Exploring LIT's impact on other domains and tasks, especially those requiring real-time adaptation to spatial features, also remains promising.
In sum, "Less is More" exemplifies an effective recalibration of vision transformer architectures, steering towards pragmatic configurations that respect computational budgets without sacrificing the deep representational capabilities that transformers afford.