Shift-ConvNets: Small Convolutional Kernel with Large Kernel Effects (2401.12736v1)
Abstract: Recent studies reveal that the remarkable performance of Vision Transformers (ViTs) benefits from large receptive fields. For this reason, large-kernel convolutional design has become an appealing way to make Convolutional Neural Networks (CNNs) great again. However, typical large convolutional kernels turn out to be hardware-unfriendly operators, resulting in reduced compatibility across hardware platforms. Thus, simply enlarging the convolutional kernel size is unwise. In this paper, we show that small convolutional kernels and convolution operations can approach the effect of large kernels. We then propose a shift-wise operator that enables CNNs to capture long-range dependencies through a sparse mechanism while remaining hardware-friendly. Experimental results show that our shift-wise operator significantly improves the accuracy of a regular CNN while markedly reducing computational requirements. On ImageNet-1k, our shift-wise enhanced CNN model outperforms state-of-the-art models. Code & models at https://github.com/lidc54/shift-wiseConv.
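The core claim, that shifted small-kernel convolutions can reproduce the effect of a single large kernel, can be illustrated with a minimal 1-D sketch. This is not the paper's implementation (which operates on 2-D feature maps with a learned sparse mechanism); it only demonstrates the underlying identity: splitting a large kernel into small sub-kernels and summing their outputs over shifted views of the input recovers the large-kernel result exactly.

```python
def conv1d(x, k):
    """Valid cross-correlation of sequence x with kernel k."""
    m = len(k)
    return [sum(x[i + t] * k[t] for t in range(m)) for i in range(len(x) - m + 1)]

# A "large" kernel of size 9 split into three size-3 sub-kernels.
x = [float(i % 7) - 3.0 for i in range(32)]
big = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 3.0, -2.0, 0.1]
parts = [big[0:3], big[3:6], big[6:9]]

ref = conv1d(x, big)          # one large-kernel pass
n_out = len(ref)

# Equivalent result: each small kernel runs on a shifted slice of the
# input, and the partial outputs are summed element-wise.
partials = [conv1d(x[3 * i: 3 * i + n_out + 2], p) for i, p in enumerate(parts)]
out = [sum(vals) for vals in zip(*partials)]

assert all(abs(a - b) < 1e-9 for a, b in zip(ref, out))
```

Because each shifted pass uses only a small kernel, the decomposition maps onto the dense small-kernel convolutions that hardware already executes efficiently, which is the hardware-friendliness argument the abstract makes.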