- The paper introduces a fully convolutional masked autoencoder (FCMAE) framework co-designed with ConvNets to achieve strong self-supervised learning performance.
- It employs Global Response Normalization to overcome feature collapse and enhance inter-channel competition, improving representation diversity.
- Experimental results show ConvNeXt V2 significantly outperforms previous models on benchmarks like ImageNet, COCO, and ADE20K.
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
Introduction
The domain of visual recognition has seen substantial advances, driven by improvements in both architectures and self-supervised learning frameworks. Modern convolutional neural networks (ConvNets) such as ConvNeXt have exhibited strong performance across a range of visual recognition tasks. However, these models, originally optimized for supervised settings like ImageNet classification, have been less successful when combined with self-supervised techniques such as Masked Autoencoders (MAE). The paper addresses this by proposing a synergistic framework that integrates a fully convolutional masked autoencoder with targeted architectural enhancements, resulting in ConvNeXt V2. This framework significantly improves the performance of pure ConvNets on multiple benchmarks, including ImageNet, COCO, and ADE20K.
Architectural Innovation: Fully Convolutional Masked Autoencoders
The paper proposes a framework that co-designs the network architecture and the masked autoencoder. The fully convolutional masked autoencoder treats the masked image as a sparse signal: sparse convolutions operate only on the visible parts of the input during pre-training, making processing more efficient. The masking strategy uses a masking ratio of 0.6, with a random mask generated at the coarsest stage and upsampled recursively to the finest resolution so that it respects ConvNeXt's hierarchical structure. At fine-tuning time, the sparse convolutions are converted back to standard dense convolutions, avoiding train-test inconsistency.
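To make the masking strategy concrete, here is a minimal PyTorch sketch of mask generation and recursive upsampling, based on the description above. The function names, the 32-pixel patch size, and the dense-tensor emulation of sparse convolution (zeroing out masked regions) are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def generate_mask(batch_size, image_size=224, mask_ratio=0.6, patch_size=32):
    """Draw a random binary mask at the coarsest stage; 1 = masked."""
    side = image_size // patch_size
    num_patches = side * side
    num_masked = int(mask_ratio * num_patches)
    noise = torch.rand(batch_size, num_patches)        # random score per patch
    ids = noise.argsort(dim=1)                         # random permutation of patches
    mask = torch.zeros(batch_size, num_patches)
    mask.scatter_(1, ids[:, :num_masked], 1.0)         # mask the first 60% of the shuffle
    return mask.reshape(batch_size, 1, side, side)

def upsample_mask(mask, scale):
    """Upsample the binary mask to a finer feature resolution (nearest neighbor)."""
    return mask.repeat_interleave(scale, dim=2).repeat_interleave(scale, dim=3)

# During pre-training, only visible regions carry information; zeroing masked
# pixels before each convolution emulates sparse convolution on dense tensors.
x = torch.randn(2, 3, 224, 224)
mask = generate_mask(batch_size=2)                     # (2, 1, 7, 7)
x_visible = x * (1.0 - upsample_mask(mask, scale=32))  # (2, 3, 224, 224)
```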
Additionally, the decoder is deliberately lightweight: a single ConvNeXt block rather than a transformer or hierarchical decoder. Following MAE, the reconstruction loss is the mean squared error (MSE) between the reconstructed and target images, computed only on the masked patches. This fully convolutional framework, named the Fully Convolutional Masked Autoencoder (FCMAE), proves effective across a range of model sizes, yielding significant improvements in representation quality.
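The masked reconstruction loss is equally compact. The sketch below assumes the prediction and target images have already been flattened into non-overlapping patches; the helper name and shapes are illustrative.

```python
def masked_mse_loss(pred_patches, target_patches, mask):
    """MSE averaged over masked patches only.

    pred_patches, target_patches: (N, L, patch_dim) flattened patch tensors.
    mask: (N, L) binary tensor, 1 = masked (to be reconstructed), 0 = visible.
    """
    loss = (pred_patches - target_patches) ** 2
    loss = loss.mean(dim=-1)                  # per-patch MSE, shape (N, L)
    return (loss * mask).sum() / mask.sum()   # average over masked patches only
```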
Global Response Normalization (GRN)
One of the paper's novel contributions is Global Response Normalization (GRN), introduced to mitigate the feature collapse observed during MAE-style pre-training. Feature collapse, characterized by dead or saturated neurons, was particularly evident in the MLP layers of ConvNeXt when trained with the FCMAE framework. GRN operates in three steps: it aggregates a global feature per channel via the spatial L2 norm, normalizes these aggregates across channels (divisive normalization), and uses the result to calibrate the input responses. By enhancing inter-channel feature competition, GRN promotes feature diversity, which is crucial for effective mask-based self-supervised learning.
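The GRN unit itself is only a few lines. The following channels-last PyTorch sketch follows the paper's three-step formulation (global aggregation, normalization, calibration), including the learnable affine parameters and residual connection; the epsilon for numerical stability is an assumption.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization over channels-last inputs (N, H, W, C)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        # Step 1: global aggregation -- per-channel L2 norm over spatial dims.
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)      # (N, 1, 1, C)
        # Step 2: divisive normalization across channels.
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)
        # Step 3: calibrate input responses, with learnable affine and residual.
        return self.gamma * (x * nx) + self.beta + x
```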
Co-Design: ConvNeXt V2
Combining the GRN architectural change with the FCMAE framework yields ConvNeXt V2, a markedly stronger model. The experiments show a significant performance improvement when GRN is integrated into the architecture and the model is trained with FCMAE, compared with a purely supervised ConvNeXt or ConvNeXt V1 pre-trained under the same masked autoencoding setup. The co-design is critical: it is the synchronized development of architecture and learning framework, rather than either change alone, that yields the optimized performance.
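Concretely, a ConvNeXt V2 block is a ConvNeXt block with GRN inserted after the activation in the MLP; per the paper, LayerScale becomes redundant and is removed. A minimal sketch reusing the GRN module above (drop path and initialization details omitted):

```python
class ConvNeXtV2Block(nn.Module):
    """One ConvNeXt V2 block: depthwise conv -> LayerNorm -> MLP with GRN."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # pointwise conv as linear (channels-last)
        self.act = nn.GELU()
        self.grn = GRN(4 * dim)                 # GRN on the expanded MLP features
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                       # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # to channels-last (N, H, W, C)
        x = self.pwconv2(self.grn(self.act(self.pwconv1(self.norm(x)))))
        x = x.permute(0, 3, 1, 2)               # back to (N, C, H, W)
        return residual + x
```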
Experimental Results
Extensive experiments demonstrate the efficacy of the proposed methods.
- ImageNet Classification: ConvNeXt V2 models, ranging from a 3.7M-parameter Atto model achieving 76.7% top-1 accuracy to a 650M-parameter Huge model achieving a state-of-the-art 88.9%, outperform their predecessors across computational regimes.
- Object Detection and Segmentation: On tasks like COCO and ADE20K, ConvNeXt V2 models pre-trained with FCMAE consistently surpassed both ConvNeXt V1 models and contemporary models like Swin Transformers pre-trained with SimMIM.
Theoretical and Practical Implications
From a theoretical perspective, the introduction of GRN and the findings around feature collapse provide significant insights into the learning behaviors of ConvNets under self-supervised frameworks. The practice of co-designing architecture and learning frameworks may spur further research in integrating other architectural innovations with self-supervised techniques. Practically, the improved performance across various benchmarks suggests that ConvNeXt V2 models are highly robust and efficient. They can be particularly advantageous in applications requiring scalable and high-performance visual recognition systems.
Future Directions
Future research could explore further optimization in the interplay between network architecture and self-supervised learning frameworks. Another avenue could be the application of GRN in different models and domains beyond visual recognition. Given the strong performance of ConvNeXt V2 in large-scale tasks, it would be beneficial to investigate its applications in real-world scenarios requiring real-time processing capabilities.
Conclusion
The paper effectively demonstrates the importance of co-designing neural network architecture and learning frameworks to fully capitalize on the advancements in self-supervised learning. ConvNeXt V2, through the integration of a fully convolutional masked autoencoder framework and the introduction of Global Response Normalization, sets a new standard in visual recognition performance, as evidenced by its exceptional results across multiple benchmarks. This work further solidifies the potential of ConvNets in contemporary AI research and applications, paving the way for future innovations in the field.