Multi-Tailed Vision Transformer for Efficient Inference (2203.01587v3)

Published 3 Mar 2022 in cs.CV

Abstract: Recently, the Vision Transformer (ViT) has achieved promising performance in image recognition and has gradually come to serve as a powerful backbone for various vision tasks. To satisfy the Transformer's sequential input, the tail of a ViT first splits each image into a sequence of visual tokens of fixed length. The subsequent self-attention layers then construct global relationships between tokens to produce useful representations for downstream tasks. Empirically, representing an image with more tokens leads to better performance, yet the quadratic computational complexity of self-attention in the number of tokens can seriously hurt the efficiency of ViT inference. To reduce computation, a few pruning methods progressively prune uninformative tokens inside the Transformer encoder, while leaving the number of tokens entering the Transformer untouched. In fact, feeding fewer tokens into the Transformer encoder directly reduces the subsequent computational cost. In this spirit, the paper proposes a Multi-Tailed Vision Transformer (MT-ViT). MT-ViT adopts multiple tails that produce visual sequences of different lengths for the following Transformer encoder, and a tail predictor is introduced to decide which tail is the most efficient one that still yields an accurate prediction for a given image. Both modules are optimized in an end-to-end fashion with the Gumbel-Softmax trick. Experiments on ImageNet-1K demonstrate that MT-ViT achieves a significant reduction in FLOPs with no degradation in accuracy, outperforming the compared methods in both accuracy and FLOPs.
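The abstract describes the mechanism only at a high level. The PyTorch sketch below illustrates the idea under stated assumptions: several tokenizer "tails" at different patch sizes (larger patches yield shorter token sequences), a lightweight tail predictor, and a straight-through Gumbel-Softmax gate that keeps the discrete tail choice differentiable during end-to-end training. All module names, layer sizes, and the toy two-layer encoder are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTViTSketch(nn.Module):
    """Minimal sketch of the MT-ViT scheme from the abstract: multiple
    tails tokenize the image at different granularities, a tiny tail
    predictor scores them, and a straight-through Gumbel-Softmax gate
    makes the discrete choice trainable end to end. All names and sizes
    here are illustrative assumptions, not the paper's code."""

    def __init__(self, num_classes=1000, dim=192, patch_sizes=(32, 16)):
        super().__init__()
        # One convolutional tail per patch size; a larger patch
        # produces fewer tokens and hence a cheaper encoder pass.
        self.tails = nn.ModuleList(
            nn.Conv2d(3, dim, kernel_size=p, stride=p) for p in patch_sizes
        )
        # Lightweight tail predictor operating on a downsampled image.
        self.predictor = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(3 * 8 * 8, len(patch_sizes)),
        )
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, tau=1.0):
        scores = self.predictor(x)                          # (B, T)
        if self.training:
            # Hard (discrete) sample in the forward pass, smooth
            # gradients to the predictor in the backward pass.
            gate = F.gumbel_softmax(scores, tau=tau, hard=True)
        else:
            gate = F.one_hot(scores.argmax(-1), scores.size(-1)).float()
        per_tail = []
        for tail in self.tails:
            tokens = tail(x).flatten(2).transpose(1, 2)     # (B, N_i, D)
            per_tail.append(self.head(self.encoder(tokens).mean(1)))
        logits = torch.stack(per_tail, dim=1)               # (B, T, C)
        # Running every tail here is a batch-training convenience; at
        # inference only the selected tail actually needs to execute.
        return (gate.unsqueeze(-1) * logits).sum(1)
```

During training, the hard Gumbel-Softmax sample keeps the tail choice discrete while gradients still reach the predictor through the soft relaxation; at inference the argmax tail is selected, so only that tail's token sequence is encoded, which is where the FLOPs saving comes from.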

Authors (4)
  1. Yunke Wang (11 papers)
  2. Bo Du (263 papers)
  3. Wenyuan Wang (50 papers)
  4. Chang Xu (323 papers)
Citations (5)