Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers (2403.10030v3)
Abstract: The Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. To make ViTs more efficient, recent works reduce the quadratic cost of the self-attention layer by pruning or fusing redundant tokens. However, these works face a speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose Multi-criteria Token Fusion (MCTF), which gradually fuses tokens based on multiple criteria (e.g., similarity, informativeness, and the size of fused tokens). Furthermore, we utilize one-step-ahead attention, an improved approach to capturing the informativeness of tokens. By training a model equipped with MCTF using token reduction consistency, we achieve the best speed-accuracy trade-off in image classification (ImageNet-1K). Experimental results show that MCTF consistently surpasses previous reduction methods both with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving performance over the base models (+0.5% and +0.3%, respectively). We also demonstrate the applicability of MCTF to various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least a 31% speedup without performance degradation. Code is available at https://github.com/mlvlab/MCTF.
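As a concrete illustration of the abstract's three fusion criteria, here is a minimal PyTorch sketch of a single fusion step. Everything in it is an assumption for illustration: the function `multi_criteria_fusion_step`, the multiplicative weighting of the criteria, and the use of the next layer's attention map as the "one-step-ahead" informativeness proxy are hypothetical stand-ins, not the official MCTF implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def multi_criteria_fusion_step(tokens, attn, size, r=1):
    """Hypothetical sketch: fuse r token pairs per image using three
    criteria: similarity, (un)informativeness, and fused-token size.

    tokens: (B, N, D) token embeddings
    attn:   (B, N, N) attention map of the next layer, used as a
            one-step-ahead proxy for token informativeness
    size:   (B, N) number of original patches fused into each token
    """
    B, N, _ = tokens.shape

    # Criterion 1: pairwise cosine similarity between tokens.
    normed = F.normalize(tokens, dim=-1)
    sim = normed @ normed.transpose(1, 2)                        # (B, N, N)

    # Criterion 2: informativeness, proxied by the mean attention each
    # token receives; pairs of uninformative tokens are cheap to fuse.
    info = attn.mean(dim=1)                                      # (B, N)
    uninfo = 1.0 - info / (info.amax(dim=-1, keepdim=True) + 1e-6)
    pair_uninfo = uninfo.unsqueeze(2) * uninfo.unsqueeze(1)      # (B, N, N)

    # Criterion 3: size penalty, so large fused tokens stop growing.
    pair_size = (size.unsqueeze(2) + size.unsqueeze(1)).float()  # (B, N, N)

    # Combine criteria into one fusion score (illustrative weighting).
    score = sim * pair_uninfo / pair_size

    # Consider each unordered pair once; forbid self-fusion.
    tril = torch.ones(N, N, device=tokens.device).tril().bool()
    score = score.masked_fill(tril, float("-inf"))

    out_tokens, out_size = tokens.clone(), size.clone()
    for b in range(B):
        # Greedily pick the r best pairs (a real implementation would
        # also prevent one token from joining two different pairs).
        for idx in score[b].flatten().topk(r).indices.tolist():
            i, j = idx // N, idx % N
            si, sj = out_size[b, i], out_size[b, j]
            # A size-weighted average keeps the fused token an unbiased
            # mean of every original patch it represents.
            out_tokens[b, i] = (si * out_tokens[b, i] + sj * out_tokens[b, j]) / (si + sj)
            out_size[b, i] = si + sj
            out_tokens[b, j] = 0.0  # consumed; a real impl. drops it
    return out_tokens, out_size

# Hypothetical usage with DeiT-T-like shapes (196 patches + CLS token).
B, N, D = 2, 197, 192
tokens = torch.randn(B, N, D)
attn = torch.softmax(torch.randn(B, N, N), dim=-1)
size = torch.ones(B, N)
fused, new_size = multi_criteria_fusion_step(tokens, attn, size, r=8)
```

Note that MCTF as described fuses tokens gradually across layers and fine-tunes with a token reduction consistency objective; the sketch above only illustrates how the three criteria might combine into a single fusion score.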
- Learned queries for efficient local attention. In CVPR, 2022.
- MultiMAE: Multi-modal multi-task masked autoencoders. In ECCV, 2022.
- Longformer: The long-document transformer. arXiv:2004.05150, 2020.
- Token merging: Your ViT but faster. In ICLR, 2023.
- End-to-end object detection with transformers. In ECCV, 2020.
- CrossViT: Cross-attention multi-scale vision transformer for image classification. In ICCV, 2021.
- TokenMixup: Efficient attention-guided token-level data augmentation for transformers. In NeurIPS, 2022.
- Rethinking attention with performers. In ICLR, 2021.
- Twins: Revisiting the design of spatial attention in vision transformers. In NeurIPS, 2021.
- ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, 2022.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Adaptive token sampling for efficient vision transformers. In ECCV, 2022.
- Masked autoencoders are scalable vision learners. In CVPR, 2022.
- Rethinking spatial dimensions of vision transformers. In ICCV, 2021.
- All tokens matter: Token labeling for training better vision transformers. In NeurIPS, 2021.
- Reformer: The efficient transformer. In ICLR, 2020.
- SPViT: Enabling faster vision transformers via latency-aware soft token pruning. In ECCV, 2022.
- Not all patches are what you need: Expediting vision transformers via token reorganizations. In ICLR, 2022.
- TokenMix: Rethinking image mixing for data augmentation in vision transformers. In ECCV, 2022.
- Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
- Beyond attentive tokens: Incorporating token importance and diversity for efficient vision transformers. In CVPR, 2023.
- SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
- Token pooling in vision transformers for image classification. In WACV, 2023.
- AdaViT: Adaptive vision transformers for efficient image recognition. In CVPR, 2022.
- IA-RED²: Interpretability-aware redundancy reduction for vision transformers. In NeurIPS, 2021.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- DynamicViT: Efficient vision transformers with dynamic token sparsification. In NeurIPS, 2021.
- Segmenter: Transformer for semantic segmentation. In ICCV, 2021.
- Training data-efficient image transformers & distillation through attention. In ICML, 2021.
- Going deeper with image transformers. In ICCV, 2021.
- Linformer: Self-attention with linear complexity. arXiv:2006.04768, 2020.
- Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
- Denoising masked autoencoders help robust classification. In ICLR, 2023.
- SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
- Unsupervised data augmentation for consistency training. In NeurIPS, 2020.
- Nyströmformer: A Nyström-based algorithm for approximating self-attention. In AAAI, 2021.
- Co-scale conv-attentional image transformers. In ICCV, 2021.
- Evo-ViT: Slow-fast token evolution for dynamic vision transformer. In AAAI, 2022.
- A-ViT: Adaptive tokens for efficient vision transformer. In CVPR, 2022.
- MetaFormer is actually what you need for vision. In CVPR, 2022.
- Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In ICCV, 2021.
- CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
- Scaling vision transformers. In CVPR, 2022.
- mixup: Beyond empirical risk minimization. In ICLR, 2018.
- Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.
Authors:
- Sanghyeok Lee
- Joonmyung Choi
- Hyunwoo J. Kim