Interpreting and Improving Attention From the Perspective of Large Kernel Convolution (2401.05738v3)

Published 11 Jan 2024 in cs.CV

Abstract: Attention mechanisms have significantly advanced visual models by capturing global context effectively. However, their reliance on large-scale datasets and substantial computational resources poses challenges in data-scarce and resource-constrained scenarios. Moreover, traditional self-attention mechanisms lack inherent spatial inductive biases, making them suboptimal for modeling the local features critical to tasks involving smaller datasets. In this work, we introduce Large Kernel Convolutional Attention (LKCA), a novel formulation that reinterprets the attention operation as a single large-kernel convolution. This design unifies the strengths of convolutional architectures (locality and translation invariance) with the global-context modeling capabilities of self-attention. By embedding these properties in a computationally efficient framework, LKCA addresses key limitations of traditional attention mechanisms. The proposed LKCA achieves competitive performance across various visual tasks, particularly in data-constrained settings. Experimental results on CIFAR-10, CIFAR-100, SVHN, and Tiny-ImageNet demonstrate that it excels at image classification, outperforming conventional attention mechanisms and vision transformers in compact model settings. These findings highlight the effectiveness of LKCA in bridging local and global feature modeling, offering a practical and robust solution for real-world applications with limited data and resources.
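
The core idea, replacing the softmax attention operation with a single large-kernel convolution applied to the 2D token grid, can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions, not the authors' released implementation: the module name LKCABlock, the kernel size of 13, the depthwise (grouped) convolution, and the surrounding projections and residual connection are all illustrative choices.

```python
import torch
import torch.nn as nn

class LKCABlock(nn.Module):
    """Minimal sketch of the LKCA idea: one large depthwise convolution over
    the reshaped token grid stands in for spatial self-attention. All
    hyperparameters here are assumptions, not the paper's reference design."""
    def __init__(self, dim: int, kernel_size: int = 13):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj_in = nn.Linear(dim, dim)
        # A single large depthwise kernel: convolutional inductive biases
        # (locality, translation invariance) with a receptive field wide
        # enough to approximate global context mixing.
        self.large_kernel = nn.Conv2d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) token sequence from a square patch grid, N = H * W.
        B, N, C = x.shape
        H = W = int(N ** 0.5)
        assert H * W == N, "this sketch assumes a square token grid"
        y = self.proj_in(self.norm(x))
        y = y.transpose(1, 2).reshape(B, C, H, W)  # tokens -> 2D feature map
        y = self.large_kernel(y)                   # "attention" as one conv
        y = y.reshape(B, C, N).transpose(1, 2)     # back to token sequence
        return x + self.proj_out(y)                # residual connection

# Quick shape check on a 14x14 patch grid (e.g., a 224px image, 16px patches).
tokens = torch.randn(2, 196, 64)
print(LKCABlock(64)(tokens).shape)  # torch.Size([2, 196, 64])
```

Because the mixing is a convolution rather than an N x N attention map, the cost scales linearly with the number of tokens, which is consistent with the abstract's emphasis on data- and resource-constrained settings.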

Authors (10)
  1. Chenghao Li
  2. Boheng Zeng
  3. Yi Lu
  4. Pengbo Shi
  5. Qingzi Chen
  6. Jirui Liu
  7. Lingyun Zhu
  8. Chaoning Zhang
  9. Yang Yang
  10. Heng Tao Shen