TPC-ViT: Token Propagation Controller for Efficient Vision Transformer (2401.01470v2)
Abstract: Vision transformers (ViTs) have achieved promising results on a variety of computer vision tasks; however, their quadratic complexity in the number of input tokens limits their application, especially in resource-constrained settings. Previous approaches that employ gradual token reduction to address this challenge assume that token redundancy in one layer implies redundancy in all subsequent layers. We empirically demonstrate that this assumption is often incorrect: tokens that are redundant in one layer can be useful in later layers. Building on this key insight, we propose a novel token propagation controller (TPC) that incorporates two token distributions, a pause probability and a restart probability, to control the reduction and reuse of tokens respectively, resulting in more efficient token utilization. To improve the estimates of these token distributions, we propose a smoothing mechanism that acts as a regularizer and helps remove noisy outliers. Furthermore, to improve the training stability of the proposed TPC, we introduce a model stabilizer that implicitly encodes local image structures and minimizes accuracy fluctuations during training. We present extensive experimental results on the ImageNet-1K dataset using DeiT, LV-ViT and Swin models to demonstrate the effectiveness of the proposed method. For example, compared to the baseline model, our method improves the inference speed of DeiT-S by 250% while increasing classification accuracy by 1.0%.
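The abstract describes the pause/restart mechanism only at a high level. The sketch below is a minimal PyTorch reconstruction of that idea, not the authors' implementation: the class name, the single linear head, the 0.5 thresholds, and the average-pooling smoother are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenPropagationController(nn.Module):
    """Illustrative per-token pause/restart controller.

    Predicts a pause probability (stop propagating a token to the next
    layer) and a restart probability (bring a previously paused token
    back), so paused tokens are masked out rather than discarded.
    """

    def __init__(self, dim: int, smooth_kernel: int = 3):
        super().__init__()
        # One small head emits both logits; smooth_kernel is an assumed
        # (odd) window size for the smoothing step named in the abstract.
        self.head = nn.Linear(dim, 2)
        self.smooth_kernel = smooth_kernel

    def _smooth(self, p: torch.Tensor) -> torch.Tensor:
        # Average-pool along the token axis to damp noisy outliers; a
        # simple stand-in for the paper's smoothing mechanism.
        k = self.smooth_kernel
        return F.avg_pool1d(p.unsqueeze(1), k, stride=1, padding=k // 2).squeeze(1)

    def forward(self, tokens: torch.Tensor, active: torch.Tensor):
        # tokens: (B, N, D) token embeddings; active: (B, N) bool mask.
        pause_p, restart_p = torch.sigmoid(self.head(tokens)).unbind(dim=-1)
        pause_p, restart_p = self._smooth(pause_p), self._smooth(restart_p)

        # Hard decisions for inference: active tokens pause when pause_p
        # is high; paused tokens restart when restart_p is high. Training
        # would use the soft probabilities (e.g. straight-through).
        new_active = torch.where(active, pause_p < 0.5, restart_p > 0.5)
        return new_active, pause_p, restart_p


# Usage: gate the tokens fed to each transformer block with the mask.
ctrl = TokenPropagationController(dim=384)             # DeiT-S embed dim
x = torch.randn(2, 196, 384)                           # (B, N, D) patch tokens
active = torch.ones(2, 196, dtype=torch.bool)          # all tokens start active
active, pause_p, restart_p = ctrl(x, active)
```

Note the design point implied by the abstract: paused tokens are masked rather than deleted, which is what allows the restart probability to bring them back in later layers.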
- An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
- Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357, 2021.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
- Panoptic SegFormer: Delving deeper into panoptic segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1280–1289, 2022.
- SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
- Generative adversarial transformers. In International Conference on Machine Learning, pages 4487–4499, 2021.
- TransGAN: Two pure transformers can make one strong GAN, and that can scale up. Advances in Neural Information Processing Systems, 34, 2021.
- End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
- UP-DETR: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1601–1610, 2021.
- Rethinking transformer-based set prediction for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3611–3620, 2021.
- LeViT: A vision transformer in ConvNet's clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12259–12269, 2021.
- Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.
- Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 558–567, 2021.
- EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 830–844, 2021.
- NViT: Vision transformer compression and parameter redistribution. CoRR, 2021.
- DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems, 34:13937–13949, 2021.
- Not all patches are what you need: Expediting vision transformers via token reorganizations. In International Conference on Learning Representations, 2022.
- A-ViT: Adaptive tokens for efficient vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10809–10818, 2022.
- Shifted chunk transformer for spatio-temporal representational learning. Advances in Neural Information Processing Systems, 34:11384–11396, 2021.
- Multiscale audio spectrogram transformer for efficient audio classification. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
- Efficient transformers: A survey. ACM Computing Surveys (CSUR), 2020.
- Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- Co-scale conv-attentional image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9981–9990, 2021.
- X-ViT: High performance linear vision transformer without softmax. CoRR, 2022.
- SOFT: Softmax-free transformer with linear complexity. Advances in Neural Information Processing Systems, 34:21297–21309, 2021.
- DaViT: Dual attention vision transformers. In Proceedings of the European Conference on Computer Vision, 2022.
- Adaptive token sampling for efficient vision transformers. In European Conference on Computer Vision, 2022.
- Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1039–1048, 2017.
- Adaptive computation time for recurrent neural networks. CoRR, 2016.
- PonderNet: Learning to ponder. In International Conference on Machine Learning Workshops, 2021.
- ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- All tokens matter: Token labeling for training better vision transformers. Advances in Neural Information Processing Systems, 34:18590–18602, 2021.
- Scalable vision transformers with hierarchical pooling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 377–386, 2021.
- FastBERT: A self-distilling BERT with adaptive inference time. In Annual Meeting of the Association for Computational Linguistics, pages 6035–6044, 2020.
- PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination. In International Conference on Machine Learning, pages 3690–3699, 2020.
- Depth-adaptive transformer. In International Conference on Learning Representations, 2020.
- TokenLearner: Adaptive space-time tokenization for videos. Advances in Neural Information Processing Systems, 34:12786–12797, 2021.
- IA-RED²: Interpretability-aware redundancy reduction for vision transformers. Advances in Neural Information Processing Systems, 34:24898–24911, 2021.
- Evo-ViT: Slow-fast token evolution for dynamic vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2964–2972, 2022.
- Dynamic grained encoder for vision transformers. Advances in Neural Information Processing Systems, 34:5770–5783, 2021.
- Patch slimming for efficient vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12165–12174, 2022.
Wentao Zhu