Incorporating Token Importance and Diversity in Vision Transformers
In "Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers," Long et al. introduce a novel approach to enhance the efficiency of Vision Transformers (ViTs) by addressing the dual aspects of token importance and diversity in token pruning. This paper advances the field by proposing a method that yields substantial computational savings with minimal impact on accuracy, mitigating a common challenge in ViTs related to the quadratic computational complexity associated with token interactions.
Core Contributions
- Token Decoupling and Merging Method: The authors present a token decoupling and merging method that balances retaining the most attentive tokens with preserving the diversity of the full token set, improving both the performance and the efficiency of ViTs.
- Incorporation of Token Diversity: Unlike prior approaches that focus primarily on token importance, this work emphasizes token diversity. Rather than discarding less attentive tokens outright, the method clusters and merges them, so the model retains diverse semantic information; the authors show this is especially beneficial to classification accuracy at low keep rates.
- Empirical Results: On DeiT-S, the method reduces FLOPs by 35% with only a 0.2% drop in accuracy; on DeiT-T, it improves accuracy by 0.1% while cutting FLOPs by 40%, underscoring the practical value of preserving token diversity.
Methodology
The approach involves two primary components:
- Token Decoupling: Tokens are first separated by their class-token attention scores, decoupling the attentive tokens that contribute most to the prediction from the less attentive ones.
- Token Merging: Instead of discarding the less informative tokens, a simplified density-peak clustering algorithm merges similar ones. A complementary matching step fuses homogeneous attentive tokens, improving efficiency without additional computational overhead. A simplified sketch of both steps follows below.
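The PyTorch sketch below illustrates the general shape of these two steps in one pruning stage. It is an illustrative sketch, not the authors' implementation: the decoupling mirrors the paper's top-k selection by class-token attention, but the merging stands in for density-peak clustering with a single importance-weighted average of the inattentive tokens (the paper merges them into multiple clusters), and `decouple_and_merge` with its `keep_rate` parameter is a hypothetical name introduced here.

```python
import torch

def decouple_and_merge(tokens: torch.Tensor, cls_attn: torch.Tensor,
                       keep_rate: float = 0.5) -> torch.Tensor:
    """Sketch of token decoupling + merging for one pruning stage.

    tokens:   (B, N, D) patch tokens, class token excluded
    cls_attn: (B, N) class-token attention to each patch, head-averaged
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_rate))  # number of attentive tokens to keep

    # Decoupling: rank tokens by class-token attention and split them.
    idx = cls_attn.argsort(dim=1, descending=True)
    keep_idx, drop_idx = idx[:, :k], idx[:, k:]
    attentive = torch.gather(tokens, 1, keep_idx[..., None].expand(-1, -1, D))
    inattentive = torch.gather(tokens, 1, drop_idx[..., None].expand(-1, -1, D))

    # Merging: instead of discarding the inattentive tokens, fuse them.
    # Here a single attention-weighted average stands in for the paper's
    # simplified density-peak clustering (which yields several clusters).
    w = torch.gather(cls_attn, 1, drop_idx)
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-6)
    merged = (w.unsqueeze(-1) * inattentive).sum(dim=1, keepdim=True)

    return torch.cat([attentive, merged], dim=1)  # (B, k + 1, D)

# Example with DeiT-S-like shapes: 196 patch tokens, embedding dim 384.
x = torch.randn(2, 196, 384)
a = torch.rand(2, 196).softmax(dim=1)
print(decouple_and_merge(x, a, keep_rate=0.5).shape)  # torch.Size([2, 99, 384])
```

In a full model, a step like this would sit inside selected transformer blocks, with the class token carried through untouched and the merged tokens participating in all subsequent attention.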
Implementation and Evaluation
Experiments were conducted on ImageNet across several ViT backbones, including DeiT-T, DeiT-S, DeiT-B, and LV-ViT-S. Across these configurations, the proposed technique outperforms state-of-the-art token pruning methods such as DynamicViT and EViT, striking a better balance between accuracy and computational cost.
Implications and Future Directions
The proposed approach underscores the value of accounting for both token importance and token diversity when designing efficient ViTs. This dual consideration may yield more general strategies applicable to transformer models beyond vision, potentially influencing architectures in both computer vision and natural language processing.
Looking ahead, this work opens up several avenues for further research:
- Generalization to Other Domains: Exploring whether similar principles can be applied to transformers in different domains such as text or mixed modalities.
- Integration with Other Efficiency Techniques: Combining this method with other efficiency-enhancement techniques like quantization or distillation to further optimize resource use.
- Improvement of Clustering Algorithms: Investigating more advanced or adaptive clustering algorithms that might offer even better trade-offs between efficiency and model performance.
The approach proposed by Long et al. marks a significant stride toward more efficient transformer architectures, reinforcing the dual role of importance and diversity in token management strategies. Their findings contribute valuable insights that can stimulate further exploration and innovation in transformer-based models.