Interpreting and Improving Attention From the Perspective of Large Kernel Convolution (2401.05738v3)
Abstract: Attention mechanisms have significantly advanced visual models by capturing global context effectively. However, their reliance on large-scale datasets and substantial computational resources poses challenges in data-scarce and resource-constrained scenarios. Moreover, traditional self-attention lacks inherent spatial inductive biases, making it suboptimal for modeling the local features that are critical when training data is limited. In this work, we introduce Large Kernel Convolutional Attention (LKCA), a novel formulation that reinterprets the attention operation as a single large-kernel convolution. This design unifies the strengths of convolutional architectures, namely locality and translation invariance, with the global context modeling of self-attention. By embedding these properties into a computationally efficient framework, LKCA addresses key limitations of traditional attention mechanisms and achieves competitive performance across various visual tasks, particularly in data-constrained settings. Experimental results on CIFAR-10, CIFAR-100, SVHN, and Tiny-ImageNet demonstrate that LKCA excels at image classification, outperforming conventional attention mechanisms and vision transformers in compact model settings. These findings highlight the effectiveness of LKCA in bridging local and global feature modeling, offering a practical and robust solution for real-world applications with limited data and resources.
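To make the core idea concrete, below is a minimal PyTorch sketch of an attention block in which a single large-kernel depthwise convolution plays the role of the attention operator. The kernel size (13), the 1x1 projections, the BatchNorm placement, and the module name are illustrative assumptions for exposition, not the paper's exact LKCA configuration.

```python
import torch
import torch.nn as nn


class LargeKernelConvAttention(nn.Module):
    """Illustrative stand-in for LKCA: the softmax attention map is replaced
    by a single large-kernel depthwise convolution over the spatial feature
    map, keeping locality/translation invariance while widening the
    receptive field toward global context."""

    def __init__(self, dim: int, kernel_size: int = 13):
        super().__init__()
        self.norm = nn.BatchNorm2d(dim)
        self.proj_in = nn.Conv2d(dim, dim, kernel_size=1)
        # One large depthwise kernel: parameters grow with kernel size,
        # not quadratically with the number of tokens as in self-attention.
        self.large_kernel = nn.Conv2d(
            dim, dim, kernel_size=kernel_size,
            padding=kernel_size // 2, groups=dim)
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); residual connection mirrors a transformer block.
        shortcut = x
        x = self.norm(x)
        x = self.proj_out(self.large_kernel(self.proj_in(x)))
        return shortcut + x


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)
    block = LargeKernelConvAttention(dim=64, kernel_size=13)
    print(block(feats).shape)  # torch.Size([2, 64, 32, 32])
```

Because the spatial mixing is a fixed-size (if large) convolution rather than a token-pair similarity matrix, cost scales linearly with the number of pixels, which is consistent with the abstract's claim that LKCA is attractive in compact-model and data-constrained settings.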
Authors: Chenghao Li, Boheng Zeng, Yi Lu, Pengbo Shi, Qingzi Chen, Jirui Liu, Lingyun Zhu, Chaoning Zhang, Yang Yang, Heng Tao Shen