Expediting Vision Transformers via Token Reorganizations
Vision Transformers (ViTs) have become prominent in computer vision due to their flexibility and their ability to model long-range dependencies. This advantage comes at a high computational cost: every image patch is treated as a token, and the cost of Multi-Head Self-Attention (MHSA) grows quadratically with the number of tokens. The paper "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations" addresses this burden with EViT, a method that reorganizes image tokens to cut computation while largely preserving accuracy.
Proposed Method: EViT
EViT introduces a token reorganization step that is incorporated into the ViT architecture during training. The key observation is that not all tokens contribute equally to the model's prediction; tokens from semantically irrelevant or distracting backgrounds contribute little. The method proceeds in two steps: it first scores image tokens by the attention they receive from the class token, then keeps the highest-scoring (attentive) tokens and fuses the remaining, less informative ones into a single token.
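To make the reorganization step concrete, here is a minimal PyTorch sketch of the operation described above. The function name, the default keep rate, and the assumption that the head-averaged class-token attention is passed in as a separate tensor are illustrative choices, not the paper's reference implementation.

```python
import torch

def reorganize_tokens(tokens: torch.Tensor,
                      cls_attn: torch.Tensor,
                      keep_rate: float = 0.7) -> torch.Tensor:
    """EViT-style token reorganization (illustrative sketch).

    tokens:   (B, 1 + N, D) -- class token followed by N patch tokens.
    cls_attn: (B, N)        -- attention from the class token to each patch,
                               averaged over heads.
    Returns:  (B, 1 + K + 1, D) -- class token, K kept patch tokens, and one
                                   token that fuses the discarded patches.
    """
    cls_tok, patch_tok = tokens[:, :1], tokens[:, 1:]
    num_keep = int(keep_rate * patch_tok.shape[1])

    # 1) Keep the patches that receive the most attention from the class token.
    keep_idx = cls_attn.topk(num_keep, dim=1).indices                     # (B, K)
    expand_idx = keep_idx.unsqueeze(-1).expand(-1, -1, patch_tok.shape[-1])
    kept = patch_tok.gather(1, expand_idx)                                # (B, K, D)

    # 2) Fuse the remaining (inattentive) patches into a single token,
    #    weighted by their attentiveness scores.
    drop_mask = torch.ones_like(cls_attn).scatter(1, keep_idx, 0.0)       # 1 = dropped
    drop_attn = (cls_attn * drop_mask).unsqueeze(-1)                      # (B, N, 1)
    fused = (patch_tok * drop_attn).sum(dim=1, keepdim=True) \
            / drop_attn.sum(dim=1, keepdim=True).clamp_min(1e-6)          # (B, 1, D)

    return torch.cat([cls_tok, kept, fused], dim=1)
```

In a full model, a function like this would be called on the token sequence inside selected transformer blocks, reusing the attention map the block has already computed, so no extra attention pass or learnable parameters are required.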
Key Aspects and Results
- Attentive Token Identification: EViT uses the attention values from the class token to all other tokens as a measure of how informative each image token is. These attentiveness scores determine which tokens are retained; the remaining, less important tokens are fused into a single token. The reorganization is applied at intermediate layers of the ViT, so the number of tokens forwarded to subsequent layers shrinks substantially.
- Efficiency Gains: The token reduction yields significant computational savings. For example, EViT increases the inference speed of DeiT-S by 50% with only a 0.3% drop in recognition accuracy on ImageNet, showing that accuracy can be largely preserved with considerably less computation (a back-of-the-envelope calculation after this list sketches where the savings come from).
- Higher-Resolution Inputs: Because token reorganization lowers the per-image cost, EViT can process higher-resolution inputs under the same computational budget as a standard ViT on lower-resolution inputs. In this setting, DeiT-S gains 1% recognition accuracy on ImageNet classification while keeping the same computational cost as vanilla DeiT-S.
- No Additional Parameters: A key advantage of the method is its lightweight nature; it introduces no additional parameters to the existing ViTs, making it a practical and direct enhancement to current models without complicating the training procedure.
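The efficiency and higher-resolution points above can be sanity-checked with a back-of-the-envelope FLOPs estimate. The sketch below uses a standard per-block approximation (QKV/output projections, attention matmuls, and a 4x-expansion MLP) with DeiT-S-like dimensions; the reorganization layers, keep rate, and higher-resolution input size are illustrative assumptions, and the resulting numbers are rough estimates rather than figures reported in the paper.

```python
def block_flops(n_tokens: int, dim: int) -> float:
    """Rough multiply-accumulate count for one transformer block:
    QKV + output projections, attention matmuls, and a 4x-expansion MLP."""
    attn = 4 * n_tokens * dim**2 + 2 * n_tokens**2 * dim
    mlp = 8 * n_tokens * dim**2
    return attn + mlp

def encoder_flops(n_patches: int, dim: int = 384, depth: int = 12,
                  keep_rate: float = 1.0, reduce_after=()) -> float:
    """Estimate for a DeiT-S-like encoder, optionally reorganizing tokens
    (keep the top ones, fuse the rest) after the given block indices."""
    tokens = n_patches + 1  # patch tokens + class token
    total = 0.0
    for layer in range(depth):
        total += block_flops(tokens, dim)
        if layer in reduce_after:
            kept = int(keep_rate * (tokens - 1))
            tokens = kept + 1 + 1  # kept patches + fused token + class token
    return total

baseline = encoder_flops(196)                                         # 224x224, 16x16 patches
reduced  = encoder_flops(196, keep_rate=0.7, reduce_after={3, 6, 9})
high_res = encoder_flops(256, keep_rate=0.7, reduce_after={3, 6, 9})  # 256x256 input

print(f"vanilla DeiT-S @224 : {baseline / 1e9:.1f} GFLOPs (approx.)")
print(f"with reduction @224 : {reduced / 1e9:.1f} GFLOPs (~{reduced / baseline:.0%} of baseline)")
print(f"with reduction @256 : {high_res / 1e9:.1f} GFLOPs (~{high_res / baseline:.0%} of baseline)")
```

Under these assumptions, the reduced model at 224x224 needs roughly two-thirds of the baseline compute, while the 256x256 variant lands near the baseline budget, which is the trade-off behind the higher-resolution result above.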
Implications and Future Directions
The exploration of token reorganization within ViTs opens new directions for improving the efficiency of transformer-based models. The proposed strategy offers a robust way to integrate attention-guided computation and could inspire further work on token handling in resource-efficient models. Beyond static image classification, the idea could also inform real-time inference and broader applications such as object detection and semantic segmentation in resource-constrained environments.
In conclusion, the paper makes a substantial contribution by making ViT architectures more efficient without architectural redesign or additional parameters. Future research could explore adaptive token selection mechanisms tuned to specific tasks, datasets, or objectives, further extending the efficiency and applicability of these foundational models in computer vision.