A Survey on Visual Transformer (2012.12556v6)

Published 23 Dec 2020 in cs.CV and cs.AI

Abstract: The transformer, first applied to natural language processing, is a type of deep neural network based mainly on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are exploring ways to apply transformers to computer vision tasks. On a variety of visual benchmarks, transformer-based models perform similarly to or better than other types of networks, such as convolutional and recurrent neural networks. Given their high performance and reduced need for vision-specific inductive biases, transformers are receiving increasing attention from the computer vision community. In this paper, we review these vision transformer models, categorizing them by task and analyzing their advantages and disadvantages. The main categories we explore include backbone networks, high/mid-level vision, low-level vision, and video processing. We also cover efficient transformer methods for pushing transformers into real device-based applications, and we briefly revisit the self-attention mechanism in computer vision, as it is the base component of the transformer. Toward the end of the paper, we discuss the remaining challenges and suggest several directions for further research on vision transformers.

A Survey on Visual Transformer

The paper "A Survey on Visual Transformer" by Han et al. provides a thorough and comprehensive review of the application of transformer models in the domain of computer vision (CV). Transformer models, well-known for their success in NLP, have recently been adapted for visual tasks, leading to significant advancements. This survey meticulously categorizes and analyzes the state-of-the-art vision transformers, covering their implementations across various tasks such as backbone networks, high/mid-level vision, low-level vision, and video processing.

Vision Transformers in Backbone Networks

One of the primary applications of transformers in CV is as backbone networks for image representation. Unlike traditional convolutional neural networks (CNNs), vision transformers leverage the self-attention mechanism to capture global dependencies across an image. The paper reviews several notable models, including the Vision Transformer (ViT), the Data-efficient image Transformer (DeiT), Transformer-in-Transformer (TNT), and the Swin Transformer. ViT takes a straightforward approach, treating an image as a sequence of patches and applying a transformer encoder to that sequence; it achieves impressive results when pre-trained on large datasets. DeiT introduces data-efficient training strategies, including distillation through attention, that yield competitive performance using only ImageNet. The Swin Transformer uses a hierarchical structure with shifted windows to balance local and global context efficiently.
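
To make the patch-as-token idea concrete, here is a minimal ViT-style classifier sketch in PyTorch. The patch size, embedding dimension, depth, and the TinyViT name are illustrative choices for a toy model, not settings taken from any of the surveyed architectures.

```python
# Minimal ViT-style classifier sketch (PyTorch). All sizes are illustrative.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided conv cuts the image into non-overlapping
        # patches and projects each one to a dim-dimensional token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                        # x: (B, 3, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)       # global self-attention over all patches
        return self.head(tokens[:, 0])      # classify from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```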

High/Mid-Level Vision Tasks

Transformers have been effectively applied to high-level vision tasks such as object detection, segmentation, and pose estimation. In object detection, the Detection Transformer (DETR) reformulates the task as a set prediction problem, eliminating the need for conventional components like anchor generation and non-maximum suppression. Deformable DETR enhances detection performance by employing a deformable attention mechanism, leading to faster convergence and improved accuracy, particularly on small objects.
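
The set-prediction formulation rests on a bipartite matching between the fixed set of learned object queries and the ground-truth boxes, so that each object is predicted exactly once and no anchors or NMS are needed. The sketch below shows that matching step in isolation using scipy's Hungarian solver; the cost weights, the cxcywh box format, the omission of the GIoU term, and the match helper itself are simplifications for illustration, not DETR's actual implementation.

```python
# Sketch of DETR-style bipartite matching between N query predictions and
# M ground-truth objects. Cost weights are illustrative; the GIoU term is omitted.
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes, w_cls=1.0, w_l1=5.0):
    # pred_logits: (N, C+1), pred_boxes: (N, 4) normalized cxcywh
    # gt_labels: (M,), gt_boxes: (M, 4)
    prob = pred_logits.softmax(-1)                    # (N, C+1)
    cost_cls = -prob[:, gt_labels]                    # (N, M) classification cost
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M) box regression cost
    cost = w_cls * cost_cls + w_l1 * cost_l1
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(rows.tolist(), cols.tolist()))    # (query_idx, gt_idx) pairs

pairs = match(torch.randn(100, 92), torch.rand(100, 4),
              torch.tensor([3, 17]), torch.rand(2, 4))
```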

For segmentation tasks, the paper reviews models like SETR and Max-DeepLab, which leverage transformer architectures for pixel-level predictions. SETR uses standard transformer encoders to extract image features, which are then used for semantic segmentation. Max-DeepLab, on the other hand, introduces a dual-path network combining CNN and transformer features for end-to-end panoptic segmentation.
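
To illustrate the SETR-style pipeline, the sketch below folds patch tokens from a transformer encoder back into a 2D feature map and upsamples them to dense class scores. The single 1x1-convolution head, the shapes, and the NaiveSegHead name are simplifications for illustration rather than SETR's actual decoder designs.

```python
# Sketch of a SETR-style decoder: patch tokens -> 2D feature map -> per-pixel scores.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveSegHead(nn.Module):
    def __init__(self, dim=192, num_classes=21, patch_size=16):
        super().__init__()
        self.patch_size = patch_size
        self.classifier = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, tokens, grid_hw):
        # tokens: (B, N, dim) patch tokens without the class token
        # grid_hw: spatial grid of patches, e.g. (H/16, W/16)
        b, n, d = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, d, *grid_hw)  # (B, dim, h, w)
        logits = self.classifier(feat)                         # per-patch class scores
        # Bilinear upsampling restores full image resolution.
        return F.interpolate(logits, scale_factor=self.patch_size,
                             mode="bilinear", align_corners=False)

out = NaiveSegHead()(torch.randn(2, 14 * 14, 192), (14, 14))   # -> (2, 21, 224, 224)
```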

In pose estimation, transformers have been utilized to capture the complex relationships in human and hand poses. Models like METRO (Mesh Transformer) and Hand-Transformer demonstrate the application of transformers in predicting 3D poses from 2D images or point clouds, showcasing their ability to handle intricate spatial dependencies.

Low-Level Vision Tasks

While low-level vision tasks such as image generation and enhancement have traditionally been dominated by CNNs, they too have benefited from transformer models. The Image Processing Transformer (IPT) is pre-trained on large datasets synthesized with various degradation processes and achieves state-of-the-art results in super-resolution, denoising, and deraining. Its multi-head, multi-tail architecture, in which task-specific heads and tails share a common transformer body, exemplifies the adaptability of transformers to different image processing tasks.
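
A rough sketch of the multi-head/multi-tail idea follows: task-specific heads and tails wrap a single shared transformer body. Module sizes, the patch-folding scheme, and the SharedBodyRestorer name are illustrative assumptions; IPT's actual model additionally uses a transformer decoder, task embeddings, and an upsampling tail for super-resolution.

```python
# Sketch of an IPT-like multi-head / multi-tail layout around a shared body.
import torch
import torch.nn as nn

class SharedBodyRestorer(nn.Module):
    def __init__(self, tasks=("denoise", "derain"), dim=64, patch=4):
        super().__init__()
        self.patch = patch
        self.heads = nn.ModuleDict({t: nn.Conv2d(3, dim, 3, padding=1) for t in tasks})
        self.tails = nn.ModuleDict({t: nn.Conv2d(dim, 3, 3, padding=1) for t in tasks})
        layer = nn.TransformerEncoderLayer(d_model=dim * patch * patch, nhead=4,
                                           batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)   # shared across tasks

    def forward(self, x, task):
        feat = self.heads[task](x)                               # task-specific head
        b, c, h, w = feat.shape
        p = self.patch
        # Fold the feature map into a sequence of flattened p x p patches.
        tokens = (feat.reshape(b, c, h // p, p, w // p, p)
                      .permute(0, 2, 4, 1, 3, 5).reshape(b, -1, c * p * p))
        tokens = self.body(tokens)                               # shared transformer body
        feat = (tokens.reshape(b, h // p, w // p, c, p, p)
                      .permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w))
        return self.tails[task](feat)                            # task-specific tail

out = SharedBodyRestorer()(torch.randn(1, 3, 64, 64), "denoise")  # -> (1, 3, 64, 64)
```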

In image generation, models such as TransGAN and Taming Transformers employ transformer architectures to generate high-resolution images, leveraging the global context captured by self-attention. These models achieve results competitive with, and in some settings better than, convolutional GANs, particularly in producing globally coherent and contextually consistent images.

Video Processing

Transformers have also demonstrated promising results in the video domain. For video action recognition, models such as the Action Transformer use self-attention to model the relationship between subjects and their context in a video frame. In video object detection, architectures like the spatiotemporal transformer integrate spatial and temporal information to improve detection performance.

Video inpainting is another area where transformers excel. The Spatial-Temporal Transformer Network (STTN), for instance, uses self-attention to fill in missing regions of video frames by borrowing contextual information from other spatial locations and other frames.
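
The core mechanism in these video models is self-attention computed jointly over space and time: patch tokens from every frame attend to tokens from every other frame, so visible content elsewhere in the clip can inform a masked region. A minimal sketch, assuming already-flattened patch tokens and a single attention layer, is shown below; it omits STTN's multi-scale patches, encoder-decoder structure, and adversarial training.

```python
# Sketch of joint spatial-temporal self-attention over tokens from several frames.
import torch
import torch.nn as nn

frames, tokens_per_frame, dim = 5, 64, 128          # e.g. 5 frames, an 8x8 patch grid
x = torch.randn(1, frames * tokens_per_frame, dim)  # all frames' tokens concatenated

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
out, weights = attn(x, x, x)      # every token attends across space AND time
print(out.shape, weights.shape)   # (1, 320, 128) and (1, 320, 320)
```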

Efficient Transformer Methods

Transformers inherently require significant computational resources, which is a bottleneck for their deployment in resource-limited environments. The survey discusses several methods to improve the efficiency of transformers, including network pruning, low-rank decomposition, knowledge distillation, and quantization. These techniques aim to reduce the computational burden while maintaining model performance, making transformers more viable for practical applications.
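
As one concrete example of these compression routes, the sketch below applies PyTorch's built-in post-training dynamic quantization to the linear layers of a toy transformer encoder; the encoder is a stand-in for a vision transformer backbone, not a model from the survey, and pruning, low-rank decomposition, and distillation would be applied analogously as separate steps.

```python
# Sketch: post-training dynamic quantization of a transformer's linear layers.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=6).eval()

# nn.Linear weights are stored in int8; activations are quantized on the fly,
# shrinking the model and typically speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 196, 256)      # e.g. 14 x 14 patch tokens
with torch.no_grad():
    y = quantized(x)
print(y.shape)                    # (1, 196, 256)
```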

Implications and Future Directions

The advancements in vision transformers represent substantial progress in CV, demonstrating their potential to outperform traditional CNNs across various tasks. From a theoretical standpoint, the ability of transformers to capture long-range dependencies and integrate global context is fundamentally transformative for CV models.

Future directions include the development of more efficient transformers, better understanding of the interpretability of their decisions, and leveraging their pre-training capabilities for robust generalization across domains. Research should also pursue unified models for multitask learning, potentially leading to grand unified models capable of addressing a wide array of visual and possibly multimodal tasks.

In conclusion, while transformers have already reshaped the trajectory of computer vision research, continued exploration into efficient architectures and broader applications promises even greater advancements and more versatile AI systems.

Authors (13)
  1. Kai Han (184 papers)
  2. Yunhe Wang (145 papers)
  3. Hanting Chen (52 papers)
  4. Xinghao Chen (66 papers)
  5. Jianyuan Guo (40 papers)
  6. Zhenhua Liu (47 papers)
  7. Yehui Tang (63 papers)
  8. An Xiao (7 papers)
  9. Chunjing Xu (66 papers)
  10. Yixing Xu (25 papers)
  11. Zhaohui Yang (193 papers)
  12. Yiman Zhang (5 papers)
  13. Dacheng Tao (826 papers)
Citations (1,687)