An Essay on Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
The paper "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer," authored by Yifan Xu et al., addresses a significant challenge in the domain of vision transformers (ViTs): the computational inefficiency resulting from the dense modeling of dependencies among image tokens. The authors propose Evo-ViT, a dynamic vision transformer that embraces a novel slow-fast token evolution strategy to manage token redundancy effectively throughout the training process.
Overview of Evo-ViT
The conventional Vision Transformer (ViT) is inefficient because its self-attention cost grows quadratically with the number of input tokens. Two families of methods reduce the token count to mitigate this: structured spatial compression and unstructured token pruning. Unstructured pruning, however, has inherent limitations: it destroys the spatial structure that modern deep-narrow transformers rely on for structured spatial compression, and it typically depends on a time-consuming pre-trained model to judge which tokens are informative. Evo-ViT aims to overcome these limitations by distinguishing informative tokens from placeholder tokens in an unstructured, instance-wise manner while keeping the spatial structure intact.
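To make the quadratic scaling concrete, here is a back-of-the-envelope FLOP estimate for one self-attention layer; the token count and embedding dimension are assumed DeiT-S-like values, not figures taken from the paper.

```python
def attention_flops(num_tokens: int, dim: int) -> int:
    """Rough FLOP count for one multi-head self-attention layer.

    The Q/K/V and output projections scale linearly in the token count;
    the attention matrix itself (QK^T and attn @ V) scales quadratically.
    """
    projections = 4 * num_tokens * dim * dim        # Q, K, V, output projections
    attention = 2 * num_tokens * num_tokens * dim   # QK^T and attn @ V
    return projections + attention

# Assumed DeiT-S-like settings: 197 tokens (14x14 patches + [CLS]), dim 384.
full = attention_flops(197, 384)
half = attention_flops(99, 384)   # keep roughly half of the patch tokens
print(f"relative cost with half the tokens: {half / full:.2f}")
```

With roughly half the tokens the layer costs less than half as much, and the quadratic attention term alone shrinks by about a factor of four, which is what token-reduction methods exploit.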
Components of Evo-ViT:
- Structure-Preserving Token Selection: This module keeps the spatial structure of the token grid and exploits the global class attention already computed inside vision transformers to dynamically label each token as informative or placeholder, without requiring an extra pre-trained model.
- Slow-Fast Token Updating: Evo-ViT updates tokens along two distinct computation paths. Informative tokens take the slow path through the full self-attention and feed-forward layers, while placeholder tokens are aggregated into a summary token, updated cheaply, and written back, so that information flow is preserved across layers (a minimal sketch of both steps follows this list).
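The following is a minimal PyTorch sketch of these two ideas, not the authors' implementation: the class name, the keep ratio, and the use of a simple mean as the placeholder summary are illustrative assumptions, and the wrapped block can be any standard transformer encoder layer.

```python
import torch
import torch.nn as nn

class SlowFastTokenEvolution(nn.Module):
    """Sketch of structure-preserving selection plus slow-fast updating.

    Informative tokens take the slow path through the full transformer
    block; placeholder tokens are summarized into one representative
    token, updated cheaply, and the update is broadcast back.
    """

    def __init__(self, block: nn.Module, keep_ratio: float = 0.5):
        super().__init__()
        self.block = block          # any [B, N, C] -> [B, N, C] transformer block
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor, cls_attn: torch.Tensor) -> torch.Tensor:
        # x:        [B, N, C]  (token 0 is the class token)
        # cls_attn: [B, N-1]   class-token attention over the patch tokens
        B, N, C = x.shape
        num_keep = max(1, int(self.keep_ratio * (N - 1)))

        # Structure-preserving selection: rank patch tokens by class attention.
        idx = cls_attn.argsort(dim=1, descending=True)
        keep_idx = idx[:, :num_keep]                  # informative tokens
        drop_idx = idx[:, num_keep:]                  # placeholder tokens

        patches = x[:, 1:, :]
        gather = lambda i: patches.gather(1, i.unsqueeze(-1).expand(-1, -1, C))
        informative = gather(keep_idx)
        placeholder = gather(drop_idx)

        # Fast path: summarize placeholders into one representative token.
        summary = placeholder.mean(dim=1, keepdim=True)

        # Slow path: full block over [cls, informative tokens, summary token].
        slow_in = torch.cat([x[:, :1, :], informative, summary], dim=1)
        slow_out = self.block(slow_in)

        # Broadcast the summary token's residual back to the placeholder tokens.
        updated_summary = slow_out[:, -1:, :]
        placeholder = placeholder + (updated_summary - summary)

        # Scatter tokens back to their original positions (structure preserved).
        out_patches = torch.empty_like(patches)
        out_patches.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, C),
                             slow_out[:, 1:1 + num_keep, :])
        out_patches.scatter_(1, drop_idx.unsqueeze(-1).expand(-1, -1, C), placeholder)
        return torch.cat([slow_out[:, :1, :], out_patches], dim=1)

# Example usage with a generic encoder layer (illustrative settings):
# block = nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
# layer = SlowFastTokenEvolution(block, keep_ratio=0.5)
# out = layer(torch.randn(2, 197, 384), torch.rand(2, 196))
```

Writing the summary token's residual back to the placeholder tokens is what keeps the slow and fast paths exchangeable across layers: a token treated as a placeholder at one depth can still be selected as informative later without having gone stale.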
Experimental Results
The authors demonstrate the efficacy of Evo-ViT on popular ViT architectures, such as DeiT and LeViT, showing substantial throughput improvements. For instance, Evo-ViT accelerates DeiT-S by over 60% while only incurring a marginal 0.4% reduction in top-1 accuracy on the ImageNet-1K dataset. This performance outstrips current token pruning methods in both accuracy and efficiency metrics.
Implications and Future Directions
The advancements presented in Evo-ViT have significant implications for building more computationally efficient vision transformers and making them practical for real-time applications. In practice, the method could speed up image classification in domains that depend on rapid processing and analysis of visual data, such as autonomous vehicles and smart surveillance systems.
Additionally, Evo-ViT sets the stage for exploring similar strategies in other areas of AI, such as natural language processing, where token handling and model efficiency are equally critical. Future work could extend the approach to complex downstream tasks such as object detection and instance segmentation, and could examine alternative aggregation and updating schemes within the slow-fast framework to further improve its accuracy and adaptability.
In summary, the Evo-ViT method opens avenues for substantial improvements in the efficiency and applicability of vision transformers, heralding a promising direction for research in transformer models. The authors' approach offers a compelling solution to the inefficiencies historically associated with dense token handling in vision transformers, presenting a progressive step in optimizing deep learning models.