- The paper introduces RepLKNet, a pure CNN architecture using 31x31 kernels that achieves 87.8% top-1 accuracy on ImageNet.
- It employs re-parameterization techniques and identity shortcuts to stabilize training and efficiently leverage large depth-wise convolutions.
- The study challenges conventional small-kernel designs by showing that larger receptive fields enhance shape-based representations and overall model performance.
Revisiting Large Kernel Design in CNNs: Enhancements and Implications
In the paper "Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs," the authors explore the use of large convolutional kernels in modern convolutional neural networks (CNNs), in contrast to the commonly employed stacks of small kernels. The investigation draws inspiration from the success of Vision Transformers (ViTs), whose attention mechanisms provide large receptive fields. The paper introduces RepLKNet, a pure CNN architecture that uses convolutional kernels as large as 31x31, a significant departure from the standard 3x3 kernels.
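To make the core design concrete, below is a minimal sketch of a RepLKNet-style block: a 31x31 depth-wise convolution wrapped by 1x1 convolutions and an identity shortcut. It assumes PyTorch, and the channel counts and layer ordering are illustrative simplifications rather than the authors' exact architecture.

```python
# Minimal sketch of a RepLKNet-style large-kernel block (assumes PyTorch;
# sizes are illustrative, not the paper's exact configuration).
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Depth-wise 31x31 convolution between 1x1 convs, with an identity shortcut."""
    def __init__(self, channels: int, kernel_size: int = 31):
        super().__init__()
        self.pw1 = nn.Conv2d(channels, channels, kernel_size=1)
        # Depth-wise large kernel: groups == channels keeps FLOPs manageable.
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.pw2 = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):
        # Identity shortcut around the block helps optimization with very large kernels.
        return x + self.pw2(self.act(self.dw(self.pw1(x))))

x = torch.randn(1, 64, 56, 56)
print(LargeKernelBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```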
Methodology and Findings
The authors propose five guidelines for effectively deploying large convolutional kernels in CNN architectures. Key takeaways: large depth-wise convolutions can be efficient in practice; identity shortcuts are vital to avoid optimization problems; structural re-parameterization with parallel small kernels aids training; and large kernels improve downstream tasks far more than they improve ImageNet classification. The authors attribute these gains to larger effective receptive fields and an increased shape bias (over texture bias), which strengthens CNNs on complex visual tasks. A simplified sketch of the re-parameterization trick follows.
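The sketch below shows the structural re-parameterization idea in a hedged form: during training a small (here 5x5) depth-wise branch runs in parallel with the 31x31 branch, and at inference its kernel is zero-padded and folded into the large kernel so only one convolution remains. BatchNorm fusion is omitted for brevity, and the class and parameter names are illustrative rather than taken from the official code.

```python
# Hedged sketch of structural re-parameterization for large kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReparamLargeKernelConv(nn.Module):
    def __init__(self, channels: int, large_k: int = 31, small_k: int = 5):
        super().__init__()
        self.large = nn.Conv2d(channels, channels, large_k,
                               padding=large_k // 2, groups=channels)
        self.small = nn.Conv2d(channels, channels, small_k,
                               padding=small_k // 2, groups=channels)
        self.large_k, self.small_k = large_k, small_k
        self.fused = False

    def forward(self, x):
        if self.fused:
            return self.large(x)
        # Training time: the parallel small-kernel branch eases optimization.
        return self.large(x) + self.small(x)

    @torch.no_grad()
    def fuse(self):
        # Zero-pad the small kernel to the large spatial size and add it in,
        # so inference needs only the single large convolution.
        pad = (self.large_k - self.small_k) // 2
        self.large.weight += F.pad(self.small.weight, [pad] * 4)
        self.large.bias += self.small.bias
        self.fused = True

m = ReparamLargeKernelConv(8)
x = torch.randn(1, 8, 32, 32)
before = m(x)
m.fuse()
after = m(x)
print(torch.allclose(before, after, atol=1e-4))  # True: outputs match after fusion
```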
Experimental results show that RepLKNet matches or surpasses established models such as Swin Transformer on ImageNet classification and on downstream tasks including semantic segmentation and object detection, reaching 87.8% top-1 accuracy on ImageNet and 56.0% mIoU on ADE20K. Compared with traditional small-kernel CNNs, RepLKNet exhibits much larger effective receptive fields, which helps explain its stronger shape-based representations.
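The effective-receptive-field comparison can be approximated with a simple gradient-based probe: back-propagate from the central activation of a feature map and measure how widely the gradient spreads over the input. The helper below is a rough sketch of that idea, not the paper's exact visualization pipeline, and again assumes PyTorch.

```python
# Rough gradient-based proxy for the effective receptive field (ERF).
import torch
import torch.nn as nn

def effective_receptive_field(feature_extractor, input_size=224, channels=3):
    """Return an (H, W) map of |d(center activation)/d(input)|."""
    x = torch.randn(1, channels, input_size, input_size, requires_grad=True)
    y = feature_extractor(x)                 # expects a (1, C, H, W) feature map
    h, w = y.shape[-2:]
    y[..., h // 2, w // 2].sum().backward()  # gradient of the central location
    return x.grad.abs().sum(dim=1).squeeze(0)

# Toy comparison: one 31x31 conv vs. a stack of two 3x3 convs.
large = nn.Sequential(nn.Conv2d(3, 8, 31, padding=15), nn.ReLU())
small = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())
# Fraction of input pixels with nonzero gradient: larger for the big kernel.
print(effective_receptive_field(large).gt(0).float().mean(),
      effective_receptive_field(small).gt(0).float().mean())
```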
Practical and Theoretical Implications
From a practical standpoint, combining large kernels with re-parameterization offers a promising way to strengthen CNNs without excessive computational cost. The reported results show favorable trade-offs between accuracy and latency, suggesting a viable path toward closing the performance gap between CNNs and ViTs.
Theoretically, this paper challenges the dominance of small kernels and suggests a reevaluation of CNN design principles, specifically in terms of receptive field construction. By achieving larger receptive fields with fewer layers, the paper offers a new perspective on optimizing CNN architectures for complex pattern recognition tasks.
Future Directions
Looking ahead, the research prompts several interesting directions. Integrating these insights with other modern architectures, such as ConvNeXt, or extending them to newer hybrid models could further unify the strengths of CNNs and transformer-based models. Additionally, assessing how well large kernels scale to ever-growing datasets and model sizes remains a crucial area for future exploration.
In conclusion, this paper provides significant contributions to CNN architecture design by reevaluating kernel size in modern networks. It highlights the importance of tailoring CNN architectures to task-specific needs and paves the way for further developments in scalable, efficient deep learning models.