A ConvNet for the 2020s: Insights and Implications
The paper "A ConvNet for the 2020s" by Zhuang Liu et al. presents a detailed, systematic exploration of what convolutional neural networks (ConvNets) can still achieve in visual recognition, showing that they can match hierarchical Vision Transformers (ViTs) such as the Swin Transformer. At its core, the research revisits and retrofits a standard ResNet architecture, resulting in ConvNeXt, a family of pure ConvNet models designed to compete with state-of-the-art ViTs without incorporating attention-based modules.
Key Contributions and Observations
The paper methodically addresses the perceived obsolescence of ConvNets in favor of ViTs. It contends that the prevailing assumption attributing performance superiority to intrinsic Transformer properties can be reevaluated by modernizing the ConvNet design. This involves integrating design principles borrowed from ViTs but retaining the simplicity and efficiency characteristic of ConvNets.
Strategic Enhancements: The research journey begins with a standard ResNet-50 and progressively incorporates design choices inspired by vision Transformers such as Swin-T. These include:
- Improved Training Techniques: Adopting a modern training recipe, with the AdamW optimizer and data augmentations such as Mixup, CutMix, and RandAugment, lifts the baseline ResNet-50 top-1 accuracy from 76.1% to 78.8% (a recipe sketch appears after this list).
- Architecture Modernization: Key modifications, illustrated in the block sketch after this list, include:
- Stage Compute Ratio: Redistributing blocks across network stages (from ResNet-50's (3, 4, 6, 3) to (3, 3, 9, 3)) to mirror the stage ratios of Swin-T, yielding a small but meaningful accuracy gain.
- Patchify Stem: Replacing the traditional ResNet stem (a 7x7 stride-2 convolution followed by max pooling) with a non-overlapping 4x4, stride-4 convolution akin to ViT patch embedding, simplifying the stem without hurting performance.
- Depthwise Convolution and Width Expansion: Embracing grouped convolutions in their extreme form (depthwise convolutions, with one group per channel) and widening the network from 64 to 96 channels, achieving an improved trade-off between accuracy and computational cost.
- Inverted Bottleneck: Adopting an inverted bottleneck whose hidden dimension is four times wider than the input, echoing the MLP block design in Transformers.
- Large Kernel Sizes: Revisiting larger convolutional kernels (7x7 depthwise convolutions), enlarging the receptive field in a way loosely analogous to the non-local self-attention of ViTs.
- Micro-level Adjustments: These adjustments further optimize the architecture:
- GELU Activations and Fewer Activation/Normalization Layers: Replacing ReLU with GELU, using a single activation and a single normalization layer per block, and substituting LayerNorm for BatchNorm, simplifying the design without sacrificing performance.
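First, a minimal sketch of the modernized training ingredients named above (AdamW plus Mixup, CutMix, and RandAugment), using PyTorch and the torchvision v2 transforms (torchvision >= 0.16). The hyperparameters and the inline placement of the augmentations are illustrative assumptions, not the paper's exact 300-epoch recipe, which also includes cosine decay, label smoothing, stochastic depth, and EMA.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torchvision import models
from torchvision.transforms import v2

# Baseline model plus the AdamW optimizer highlighted in the paper.
# The lr/weight-decay values here are illustrative, not the paper's exact schedule.
model = models.resnet50(weights=None)
optimizer = AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()  # accepts the soft targets MixUp/CutMix produce

# Augmentations: RandAugment per image, then MixUp or CutMix on the batch.
randaugment = v2.RandAugment()  # in practice applied per-sample in the dataloader
mixup_or_cutmix = v2.RandomChoice([v2.MixUp(num_classes=1000),
                                   v2.CutMix(num_classes=1000)])

def training_step(images: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """One optimization step on a uint8 image batch (N, 3, H, W) with
    integer labels (N,). Normalization and LR scheduling are omitted."""
    images = randaugment(images)                       # color/geometric ops on uint8
    images = images.float() / 255.0                    # to float in [0, 1]
    images, targets = mixup_or_cutmix(images, labels)  # targets become soft labels
    loss = criterion(model(images), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```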
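Taken together, the architectural changes converge on the ConvNeXt block: a 7x7 depthwise convolution, a single LayerNorm, and an inverted bottleneck of two pointwise layers with one GELU. Below is a minimal PyTorch sketch of the block and the patchify stem; the official implementation additionally uses layer scale, stochastic depth, a LayerNorm after the stem, and separate downsampling layers, all omitted here for brevity.

```python
import torch
from torch import nn

class ConvNeXtBlock(nn.Module):
    """7x7 depthwise conv -> LayerNorm -> 1x1 expand (4x) -> GELU -> 1x1
    project, wrapped in a residual connection."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        # Depthwise: groups == channels; the large 7x7 kernel widens the receptive field.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                   # one norm per block, over channels
        self.pwconv1 = nn.Linear(dim, expansion * dim)  # inverted bottleneck: expand 4x
        self.act = nn.GELU()                            # the block's single activation
        self.pwconv2 = nn.Linear(expansion * dim, dim)  # project back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)  # back to (N, C, H, W)
        return shortcut + x

def patchify_stem(in_channels: int = 3, dim: int = 96) -> nn.Module:
    """'Patchify' stem: a non-overlapping 4x4, stride-4 convolution,
    mirroring ViT/Swin patch embedding."""
    return nn.Conv2d(in_channels, dim, kernel_size=4, stride=4)

if __name__ == "__main__":
    x = torch.randn(2, 3, 224, 224)
    feats = patchify_stem()(x)             # (2, 96, 56, 56)
    print(ConvNeXtBlock(96)(feats).shape)  # torch.Size([2, 96, 56, 56])
```

Note how normalization and the pointwise layers operate in a channels-last layout, matching the Transformer convention the block borrows from.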
Empirical Evaluation
The suite of ConvNeXt models (ConvNeXt-T/S/B/L/XL) demonstrates compelling results across various benchmarks:
- ImageNet Classification: ConvNeXt variants achieve top-1 accuracies ranging from 82.1% (ConvNeXt-T) to 87.8% (ConvNeXt-XL with ImageNet-22K pre-training), matching or improving on their Swin counterparts while maintaining or enhancing inference throughput.
- Downstream Tasks: ConvNeXt models match or outperform Swin Transformers on tasks such as COCO object detection and ADE20K semantic segmentation, demonstrating robustness and scalability across different vision applications.
Implications and Speculations on AI Developments
Practical Implications: The simplicity and efficiency of ConvNeXt architectures suggest a resurgence of interest in optimized ConvNets for real-world applications where computational resources and deployment efficiency are critical. ConvNeXt, with comparable or superior performance, positions itself as a viable alternative to the more computation-heavy ViTs.
Theoretical Implications: By thoroughly analyzing and incorporating ViT design elements into ConvNets, the research underscores the enduring relevance of convolutional operations in modern neural network designs. It contests the premature dismissal of ConvNets and suggests that fundamental improvements and modern training techniques can rejuvenate well-structured architectures to meet current performance standards.
Future Directions: The results open avenues for further exploration of hybrid models that capitalize on the best features of both ConvNets and Transformers. The implications also extend beyond image recognition, to domains such as multi-modal learning and sparse data handling, where a balanced architectural approach could yield substantial benefits.
Conclusion
The paper exemplifies a methodical, evidence-based approach to architectural innovation. ConvNeXt demonstrates that ConvNets, significantly reimagined and optimized, are capable of matching or even exceeding the performance of leading ViTs like Swin Transformer. The work challenges established views, highlighting that certain perceived Transformer advantages can be matched with robust ConvNet designs, advocating for a balanced reconsideration of these foundational architectures in the context of evolving AI demands.