GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets (2404.04924v1)

Published 7 Apr 2024 in cs.CV and cs.AI

Abstract: Vision Transformers (ViTs) have achieved impressive results in large-scale image classification. However, when trained from scratch on small datasets, there is still a significant performance gap between ViTs and Convolutional Neural Networks (CNNs), attributed to the ViT's lack of inductive bias. To address this issue, we propose a Graph-based Vision Transformer (GvT) that utilizes graph convolutional projection and graph-pooling. In each block, queries and keys are calculated through graph convolutional projection based on the spatial adjacency matrix, while dot-product attention is used in another graph convolution to generate values. When using more attention heads, the queries and keys become lower-dimensional, making their dot product an uninformative matching function. To overcome this low-rank bottleneck in attention heads, we employ talking-heads attention based on bilinear pooled features and sparse selection of attention tensors. This allows interaction among filtered attention scores and enables each attention mechanism to depend on all queries and keys. Additionally, we apply graph-pooling between two intermediate blocks to reduce the number of tokens and aggregate semantic information more effectively. Our experimental results show that, without pre-training on large datasets, GvT achieves results comparable or superior to deep convolutional networks and surpasses vision transformers. The code for our proposed model is publicly available online.
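
The abstract describes the attention block in enough detail to sketch its shape. Below is a minimal PyTorch sketch, not the authors' released code: queries and keys come from a graph-convolutional projection over a spatial adjacency matrix, attention scores are mixed across heads, and a top-k rule stands in for the paper's sparse selection of attention tensors. The module and argument names, the top-k rule, and the use of standard linear talking-heads mixing (Shazeer et al., 2020) in place of the paper's bilinear-pooling variant are all illustrative assumptions; the graph-pooling between blocks is omitted.

```python
# Minimal sketch of the GvT-style attention block described in the abstract.
# NOT the authors' code: names, shapes, the top-k rule, and the linear
# talking-heads mixing are assumptions for illustration.
import torch
import torch.nn as nn


class GraphConvProjectionAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8, topk: int = 16):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dh, self.topk = heads, dim // heads, topk
        # Graph-convolutional projection: aggregate spatial neighbors via the
        # adjacency matrix, then project to queries/keys.
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        # Talking-heads: mix attention scores across the head axis
        # before and after the softmax.
        self.pre_mix = nn.Linear(heads, heads, bias=False)
        self.post_mix = nn.Linear(heads, heads, bias=False)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) patch tokens; adj: (N, N) row-normalized spatial adjacency.
        B, N, C = x.shape
        xg = adj @ x  # graph convolution: neighborhood aggregation over patches
        q = self.q_proj(xg).view(B, N, self.heads, self.dh).transpose(1, 2)
        k = self.k_proj(xg).view(B, N, self.heads, self.dh).transpose(1, 2)
        v = self.v_proj(x).view(B, N, self.heads, self.dh).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.dh ** 0.5  # (B, H, N, N)
        # Mix scores across heads so each head can depend on all queries/keys.
        attn = self.pre_mix(attn.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        # Sparse selection: keep only the top-k scores per query (assumed detail).
        if self.topk < N:
            kth = attn.topk(self.topk, dim=-1).values[..., -1:]
            attn = attn.masked_fill(attn < kth, float("-inf"))
        attn = attn.softmax(dim=-1)
        attn = self.post_mix(attn.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        # attn @ v can be read as the abstract's "another graph convolution":
        # the filtered attention matrix acts as a learned adjacency over values.
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out(out)


# Toy usage: 64 patch tokens of width 128. The identity adjacency is a
# placeholder; a real one would encode 2D patch neighborhoods.
x = torch.randn(2, 64, 128)
adj = torch.eye(64)
y = GraphConvProjectionAttention(dim=128)(x, adj)
print(y.shape)  # torch.Size([2, 64, 128])
```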

Authors (2)
  1. Dongjing Shan (1 paper)
  2. Guiqiang Chen (1 paper)