
PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution (2403.07589v2)

Published 12 Mar 2024 in cs.CV

Abstract: Recently, some large kernel ConvNets have struck back with appealing performance and efficiency. However, given the quadratic complexity of convolution, scaling up kernels brings an enormous number of parameters, and the proliferated parameters can induce severe optimization problems. Due to these issues, current CNNs compromise by scaling up only to 51x51 in the form of stripe convolution (i.e., 51x5 + 5x51) and start to saturate as the kernel size continues to grow. In this paper, we delve into addressing these vital issues and explore whether we can continue scaling up kernels for more performance gains. Inspired by human vision, we propose a human-like peripheral convolution that efficiently reduces over 90% of the parameter count of dense grid convolution through parameter sharing, and manages to scale kernels up to extremely large sizes. Our peripheral convolution behaves much like human vision, reducing the complexity of convolution from O(K^2) to O(log K) without hurting performance. Built on this, we propose the Parameter-efficient Large Kernel Network (PeLK). Our PeLK outperforms modern vision Transformer and ConvNet architectures such as Swin, ConvNeXt, RepLKNet and SLaK on various vision tasks, including ImageNet classification, semantic segmentation on ADE20K and object detection on MS COCO. For the first time, we successfully scale up the kernel size of CNNs to an unprecedented 101x101 and demonstrate consistent improvements.
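The parameter-sharing idea in the abstract — a fine-grained central region plus exponentially widening shared bands toward the periphery — can be sketched in a few lines of NumPy. Note that the partition below (the `dense_radius` value and the doubling band widths) is our own illustrative choice, not the paper's exact configuration; it only demonstrates how sharing cuts a K x K dense kernel down to O(log K) distinct parameters per axis.

```python
import numpy as np

def peripheral_index_map(k, dense_radius=2):
    """Map each of the k offsets along one kernel axis to a shared-parameter
    index. Offsets within dense_radius of the centre each get their own
    parameter (a fine-grained "fovea"); farther offsets are grouped into
    bands whose width grows exponentially, so the number of distinct indices
    per axis is O(log k). Illustrative partition, not the paper's exact scheme."""
    c = k // 2
    ids, idx, next_id = {}, np.empty(k, dtype=int), 0
    for p in range(k):
        d = abs(p - c)
        if d <= dense_radius:
            band = ("dense", d, p >= c)   # one parameter per central offset
        else:
            # band index grows logarithmically with distance from the centre
            band = ("peri", int(np.log2(d - dense_radius + 1)), p >= c)
        if band not in ids:
            ids[band] = next_id
            next_id += 1
        idx[p] = ids[band]
    return idx, next_id

k = 31
idx, n = peripheral_index_map(k)
# The full k x k kernel is gathered from a small n x n grid of shared
# parameters, so only n^2 (not k^2) weights need to be learned.
params = np.random.default_rng(0).standard_normal((n, n))
kernel = params[np.ix_(idx, idx)]         # shape (k, k)
print(kernel.shape, n * n, k * k)         # 121 distinct parameters vs. 961
```

Even at a modest 31x31, this sharing scheme keeps roughly 13% of the dense parameter count, and the gap widens as K grows, which is what makes kernel sizes like 101x101 tractable.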

Authors (5)
  1. Honghao Chen (5 papers)
  2. Xiangxiang Chu (62 papers)
  3. Yongjian Ren (1 paper)
  4. Xin Zhao (160 papers)
  5. Kaiqi Huang (60 papers)
Citations (13)
