Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers (2403.10030v3)
Abstract: The Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. To make ViTs more efficient, recent works reduce the quadratic cost of the self-attention layer by pruning or fusing redundant tokens. However, these works face a speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose Multi-criteria Token Fusion (MCTF), which gradually fuses tokens based on multiple criteria (e.g., similarity, informativeness, and the size of fused tokens). Furthermore, we utilize one-step-ahead attention, an improved approach to capturing the informativeness of tokens. By training a model equipped with MCTF using token reduction consistency, we achieve the best speed-accuracy trade-off in image classification (ImageNet-1K). Experimental results show that MCTF consistently surpasses previous reduction methods both with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving performance over the base models (+0.5% and +0.3%, respectively). We also demonstrate the applicability of MCTF to various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least a 31% speedup without performance degradation. Code is available at https://github.com/mlvlab/MCTF.
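As a concrete illustration of the abstract's three fusion criteria, here is a minimal PyTorch sketch of a single fusion step. Everything in it is an assumption for illustration: the function `multi_criteria_fusion_step`, the multiplicative weighting of the criteria, and the use of the next layer's attention map as the "one-step-ahead" informativeness proxy are hypothetical stand-ins, not the official MCTF implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def multi_criteria_fusion_step(tokens, attn, size, r=1):
    """Hypothetical sketch: fuse r token pairs per image using three
    criteria: similarity, (un)informativeness, and fused-token size.

    tokens: (B, N, D) token embeddings
    attn:   (B, N, N) attention map of the next layer, used as a
            one-step-ahead proxy for token informativeness
    size:   (B, N) number of original patches fused into each token
    """
    B, N, _ = tokens.shape

    # Criterion 1: pairwise cosine similarity between tokens.
    normed = F.normalize(tokens, dim=-1)
    sim = normed @ normed.transpose(1, 2)                        # (B, N, N)

    # Criterion 2: informativeness, proxied by the mean attention each
    # token receives; pairs of uninformative tokens are cheap to fuse.
    info = attn.mean(dim=1)                                      # (B, N)
    uninfo = 1.0 - info / (info.amax(dim=-1, keepdim=True) + 1e-6)
    pair_uninfo = uninfo.unsqueeze(2) * uninfo.unsqueeze(1)      # (B, N, N)

    # Criterion 3: size penalty, so large fused tokens stop growing.
    pair_size = (size.unsqueeze(2) + size.unsqueeze(1)).float()  # (B, N, N)

    # Combine criteria into one fusion score (illustrative weighting).
    score = sim * pair_uninfo / pair_size

    # Consider each unordered pair once; forbid self-fusion.
    tril = torch.ones(N, N, device=tokens.device).tril().bool()
    score = score.masked_fill(tril, float("-inf"))

    out_tokens, out_size = tokens.clone(), size.clone()
    for b in range(B):
        # Greedily pick the r best pairs (a real implementation would
        # also prevent one token from joining two different pairs).
        for idx in score[b].flatten().topk(r).indices.tolist():
            i, j = idx // N, idx % N
            si, sj = out_size[b, i], out_size[b, j]
            # A size-weighted average keeps the fused token an unbiased
            # mean of every original patch it represents.
            out_tokens[b, i] = (si * out_tokens[b, i] + sj * out_tokens[b, j]) / (si + sj)
            out_size[b, i] = si + sj
            out_tokens[b, j] = 0.0  # consumed; a real impl. drops it
    return out_tokens, out_size

# Hypothetical usage with DeiT-T-like shapes (196 patches + CLS token).
B, N, D = 2, 197, 192
tokens = torch.randn(B, N, D)
attn = torch.softmax(torch.randn(B, N, N), dim=-1)
size = torch.ones(B, N)
fused, new_size = multi_criteria_fusion_step(tokens, attn, size, r=8)
```

Note that MCTF as described fuses tokens gradually across layers and fine-tunes with a token reduction consistency objective; the sketch above only illustrates how the three criteria might combine into a single fusion score.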
- Learned queries for efficient local attention. In CVPR, 2022.
- MultiMAE: Multi-modal multi-task masked autoencoders. In ECCV, 2022.
- Longformer: The long-document transformer. arXiv:2004.05150, 2020.
- Token merging: Your ViT but faster. In ICLR, 2023.
- End-to-end object detection with transformers. In ECCV, 2020.
- CrossViT: Cross-attention multi-scale vision transformer for image classification. In ICCV, 2021.
- TokenMixup: Efficient attention-guided token-level data augmentation for transformers. In NeurIPS, 2022.
- Rethinking attention with performers. In ICLR, 2021.
- Twins: Revisiting the design of spatial attention in vision transformers. In NeurIPS, 2021.
- ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, 2022.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Adaptive token sampling for efficient vision transformers. In ECCV, 2022.
- Masked autoencoders are scalable vision learners. In CVPR, 2022.
- Rethinking spatial dimensions of vision transformers. In ICCV, 2021.
- All tokens matter: Token labeling for training better vision transformers. In NeurIPS, 2021.
- Reformer: The efficient transformer. In ICLR, 2020.
- SPViT: Enabling faster vision transformers via latency-aware soft token pruning. In ECCV, 2022.
- Not all patches are what you need: Expediting vision transformers via token reorganizations. In ICLR, 2022.
- TokenMix: Rethinking image mixing for data augmentation in vision transformers. In ECCV, 2022.
- Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
- Beyond attentive tokens: Incorporating token importance and diversity for efficient vision transformers. In CVPR, 2023.
- SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
- Token pooling in vision transformers for image classification. In WACV, 2023.
- AdaViT: Adaptive vision transformers for efficient image recognition. In CVPR, 2022.
- IA-RED²: Interpretability-aware redundancy reduction for vision transformers. In NeurIPS, 2021.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- DynamicViT: Efficient vision transformers with dynamic token sparsification. In NeurIPS, 2021.
- Segmenter: Transformer for semantic segmentation. In ICCV, 2021.
- Training data-efficient image transformers & distillation through attention. In ICML, 2021.
- Going deeper with image transformers. In ICCV, 2021.
- Linformer: Self-attention with linear complexity. arXiv:2006.04768, 2020.
- Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
- Denoising masked autoencoders help robust classification. In ICLR, 2023.
- SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
- Unsupervised data augmentation for consistency training. In NeurIPS, 2020.
- Nyströmformer: A Nyström-based algorithm for approximating self-attention. In AAAI, 2021.
- Co-scale conv-attentional image transformers. In ICCV, 2021.
- Evo-ViT: Slow-fast token evolution for dynamic vision transformer. In AAAI, 2022.
- A-ViT: Adaptive tokens for efficient vision transformer. In CVPR, 2022.
- MetaFormer is actually what you need for vision. In CVPR, 2022.
- Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In ICCV, 2021.
- CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
- Scaling vision transformers. In CVPR, 2022.
- mixup: Beyond empirical risk minimization. In ICLR, 2018.
- Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.
Authors:
- Sanghyeok Lee
- Joonmyung Choi
- Hyunwoo J. Kim