- The paper introduces a fully convolutional masked autoencoder (FCMAE) framework co-designed with ConvNets to achieve strong self-supervised learning performance.
- It employs Global Response Normalization to overcome feature collapse and enhance inter-channel competition, improving representation diversity.
- Experimental results show ConvNeXt V2 significantly outperforms previous models on benchmarks like ImageNet, COCO, and ADE20K.
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
Introduction
The domain of visual recognition has seen substantial advances, driven by improvements in both architectures and self-supervised learning frameworks. Modern convolutional neural networks (ConvNets) such as ConvNeXt have exhibited strong performance across a range of visual recognition tasks. However, these models, originally optimized for supervised settings like ImageNet classification, have been less successful when combined with self-supervised techniques such as Masked Autoencoders (MAE). The paper addresses this by proposing a synergistic framework that integrates a fully convolutional masked autoencoder with targeted architectural enhancements, resulting in ConvNeXt V2. This framework significantly improves the performance of pure ConvNets on multiple benchmarks, including ImageNet, COCO, and ADE20K.
Architectural Innovation: Fully Convolutional Masked Autoencoders
The paper proposes a framework that co-designs the network architecture and the masked autoencoder. The fully convolutional masked autoencoder treats the masked image as a sparse signal: sparse convolutions operate only on the visible parts of the input during pre-training, making processing more efficient. The masking strategy uses a masking ratio of 0.6, with a random mask generated at the coarsest stage and upsampled recursively to the finest resolution so that it respects ConvNeXt's hierarchical structure. At fine-tuning time, the sparse convolutions are converted back to standard dense convolutions, avoiding train-test inconsistency.
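To make the masking strategy concrete, here is a minimal PyTorch sketch of mask generation and recursive upsampling, based on the description above. The function names, the 32-pixel patch size, and the dense-tensor emulation of sparse convolution (zeroing out masked regions) are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def generate_mask(batch_size, image_size=224, mask_ratio=0.6, patch_size=32):
    """Draw a random binary mask at the coarsest stage; 1 = masked."""
    side = image_size // patch_size
    num_patches = side * side
    num_masked = int(mask_ratio * num_patches)
    noise = torch.rand(batch_size, num_patches)        # random score per patch
    ids = noise.argsort(dim=1)                         # random permutation of patches
    mask = torch.zeros(batch_size, num_patches)
    mask.scatter_(1, ids[:, :num_masked], 1.0)         # mask the first 60% of the shuffle
    return mask.reshape(batch_size, 1, side, side)

def upsample_mask(mask, scale):
    """Upsample the binary mask to a finer feature resolution (nearest neighbor)."""
    return mask.repeat_interleave(scale, dim=2).repeat_interleave(scale, dim=3)

# During pre-training, only visible regions carry information; zeroing masked
# pixels before each convolution emulates sparse convolution on dense tensors.
x = torch.randn(2, 3, 224, 224)
mask = generate_mask(batch_size=2)                     # (2, 1, 7, 7)
x_visible = x * (1.0 - upsample_mask(mask, scale=32))  # (2, 3, 224, 224)
```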
Additionally, the decoder is deliberately lightweight: a single ConvNeXt block rather than a transformer or hierarchical decoder. Following MAE, the reconstruction loss is the mean squared error (MSE) between the reconstructed and target images, computed only on the masked patches. This fully convolutional framework, named the Fully Convolutional Masked Autoencoder (FCMAE), proves effective across a range of model sizes, yielding significant improvements in representation quality.
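The masked reconstruction loss is equally compact. The sketch below assumes the prediction and target images have already been flattened into non-overlapping patches; the helper name and shapes are illustrative.

```python
def masked_mse_loss(pred_patches, target_patches, mask):
    """MSE averaged over masked patches only.

    pred_patches, target_patches: (N, L, patch_dim) flattened patch tensors.
    mask: (N, L) binary tensor, 1 = masked (to be reconstructed), 0 = visible.
    """
    loss = (pred_patches - target_patches) ** 2
    loss = loss.mean(dim=-1)                  # per-patch MSE, shape (N, L)
    return (loss * mask).sum() / mask.sum()   # average over masked patches only
```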
Global Response Normalization (GRN)
One of the paper's novel contributions is Global Response Normalization (GRN), introduced to mitigate the feature collapse observed during MAE-style pre-training. Feature collapse, characterized by dead or saturated neurons, was particularly evident in the MLP layers of ConvNeXt when trained with the FCMAE framework. GRN operates in three steps: it aggregates a global feature per channel via the spatial L2 norm, normalizes these aggregates across channels (divisive normalization), and uses the result to calibrate the input responses. By enhancing inter-channel feature competition, GRN promotes feature diversity, which is crucial for effective mask-based self-supervised learning.
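The GRN unit itself is only a few lines. The following channels-last PyTorch sketch follows the paper's three-step formulation (global aggregation, normalization, calibration), including the learnable affine parameters and residual connection; the epsilon for numerical stability is an assumption.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization over channels-last inputs (N, H, W, C)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        # Step 1: global aggregation -- per-channel L2 norm over spatial dims.
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)      # (N, 1, 1, C)
        # Step 2: divisive normalization across channels.
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)
        # Step 3: calibrate input responses, with learnable affine and residual.
        return self.gamma * (x * nx) + self.beta + x
```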
Co-Design: ConvNeXt V2
Combining the GRN architectural change with the FCMAE framework yields ConvNeXt V2, a markedly stronger model. The experiments show a significant performance improvement when GRN is integrated into the architecture and the model is trained with FCMAE, compared with a purely supervised ConvNeXt or ConvNeXt V1 pre-trained under the same masked autoencoding setup. The co-design is critical: it is the synchronized development of architecture and learning framework, rather than either change alone, that yields the optimized performance.
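Concretely, a ConvNeXt V2 block is a ConvNeXt block with GRN inserted after the activation in the MLP; per the paper, LayerScale becomes redundant and is removed. A minimal sketch reusing the GRN module above (drop path and initialization details omitted):

```python
class ConvNeXtV2Block(nn.Module):
    """One ConvNeXt V2 block: depthwise conv -> LayerNorm -> MLP with GRN."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # pointwise conv as linear (channels-last)
        self.act = nn.GELU()
        self.grn = GRN(4 * dim)                 # GRN on the expanded MLP features
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                       # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # to channels-last (N, H, W, C)
        x = self.pwconv2(self.grn(self.act(self.pwconv1(self.norm(x)))))
        x = x.permute(0, 3, 1, 2)               # back to (N, C, H, W)
        return residual + x
```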
Experimental Results
Extensive experiments demonstrate the efficacy of the proposed methods.
- ImageNet Classification: ConvNeXt V2 models, ranging from a 3.7M-parameter Atto model achieving 76.7% top-1 accuracy to a 650M-parameter Huge model achieving a state-of-the-art 88.9%, outperform their predecessors across computational regimes.
- Object Detection and Segmentation: On tasks like COCO and ADE20K, ConvNeXt V2 models pre-trained with FCMAE consistently surpassed both ConvNeXt V1 models and contemporary models like Swin Transformers pre-trained with SimMIM.
Theoretical and Practical Implications
From a theoretical perspective, the introduction of GRN and the findings around feature collapse provide significant insights into the learning behaviors of ConvNets under self-supervised frameworks. The practice of co-designing architecture and learning frameworks may spur further research in integrating other architectural innovations with self-supervised techniques. Practically, the improved performance across various benchmarks suggests that ConvNeXt V2 models are highly robust and efficient. They can be particularly advantageous in applications requiring scalable and high-performance visual recognition systems.
Future Directions
Future research could explore further optimization in the interplay between network architecture and self-supervised learning frameworks. Another avenue could be the application of GRN in different models and domains beyond visual recognition. Given the strong performance of ConvNeXt V2 in large-scale tasks, it would be beneficial to investigate its applications in real-world scenarios requiring real-time processing capabilities.
Conclusion
The paper effectively demonstrates the importance of co-designing neural network architecture and learning frameworks to fully capitalize on the advancements in self-supervised learning. ConvNeXt V2, through the integration of a fully convolutional masked autoencoder framework and the introduction of Global Response Normalization, sets a new standard in visual recognition performance, as evidenced by its exceptional results across multiple benchmarks. This work further solidifies the potential of ConvNets in contemporary AI research and applications, paving the way for future innovations in the field.