- The paper demonstrates that models trained from random initialization can match or outperform ImageNet pre-trained models on COCO, achieving up to 42.7 AP with ResNet-101.
- It reveals that techniques like Group Normalization and proper weight initialization enable competitive performance without the need for extensive pre-training.
- The study indicates that while pre-training accelerates early convergence, prolonged training on target data can effectively eliminate its advantage.
Rethinking ImageNet Pre-training
The paper "Rethinking ImageNet Pre-training" by Kaiming He, Ross Girshick, and Piotr Doll challenges the conventional paradigm in computer vision where models are pre-trained on large-scale datasets like ImageNet and then fine-tuned on specific target tasks. The paper presents empirical evidence suggesting that training models from random initialization can achieve performance comparable to (and occasionally better than) models that utilize ImageNet pre-training, given that adequate data and computational resources are available.
Summary of Findings
The paper's experiments focus primarily on the COCO dataset for object detection and instance segmentation tasks, using the Mask R-CNN framework with a variety of backbones including ResNet, ResNeXt, and VGG. Key observations include:
- Competitive Performance Without Pre-training: Models trained from random initialization achieve comparable results to their ImageNet pre-trained counterparts when sufficient training iterations are provided. This includes achieving 41.3 AP for ResNet-50 and 42.7 AP for ResNet-101 on COCO object detection, without utilizing any pre-trained weights.
- Enhanced Architectures and Training Techniques: Normalization layers that do not depend on large per-GPU batches, such as Group Normalization (GN) and Synchronized Batch Normalization (SyncBN), make training from scratch feasible with the small batches typical of detection. Proper weight initialization further facilitates this process (a minimal sketch follows this list).
- Convergence Behaviour: ImageNet pre-training mainly accelerates early-stage convergence but does not necessarily improve final accuracy. When randomly initialized models are given a sufficiently long training schedule, the advantage held by pre-trained models diminishes (a schedule-scaling sketch also follows this list).
- Data Sufficiency: Even when the available training data is reduced to 10% of the COCO dataset (around 10k images), models trained from scratch managed to perform competitively, reaching up to 25.9 AP compared to 26.0 AP of pre-trained models. However, the performance gap widens when the data size is reduced to 3.5k images or fewer, where pre-trained models show a clear advantage.
- Task Sensitivity: For tasks that demand fine spatial localization, such as keypoint detection or evaluation at higher IoU thresholds, models trained from scratch match or even exceed their pre-trained counterparts, suggesting that classification-style pre-training does not adequately capture localization-sensitive features.
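To make the normalization and initialization points concrete, here is a minimal PyTorch sketch; the function name, channel sizes, and group count are illustrative assumptions, not the authors' released code. It shows a convolutional block that uses Group Normalization instead of Batch Normalization together with He (Kaiming) initialization, so training from random initialization does not hinge on large per-GPU batches.

```python
import torch
import torch.nn as nn

def conv_gn_block(in_ch: int, out_ch: int, groups: int = 32) -> nn.Sequential:
    """3x3 conv + GroupNorm + ReLU. GN normalizes over channel groups and is
    independent of batch size, unlike BatchNorm, so the small detection
    batches (e.g. 2 images per GPU) are not a problem."""
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
    # He (Kaiming) initialization, appropriate for ReLU networks trained
    # from scratch.
    nn.init.kaiming_normal_(conv.weight, mode="fan_out", nonlinearity="relu")
    return nn.Sequential(
        conv,
        nn.GroupNorm(num_groups=groups, num_channels=out_ch),
        nn.ReLU(inplace=True),
    )

# Example: a tiny stack of such blocks, trainable from random initialization.
backbone_stub = nn.Sequential(
    conv_gn_block(3, 64),
    conv_gn_block(64, 128),
)
x = torch.randn(2, 3, 224, 224)   # batch of only 2 images
print(backbone_stub(x).shape)     # torch.Size([2, 128, 224, 224])
```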
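The convergence finding is largely about schedule length rather than any special technique. The sketch below assumes a Detectron-style step schedule measured in iterations; the base length, milestone fractions, and the stand-in model are illustrative assumptions, not the paper's configs. It shows how the same learning-rate drops are simply pushed later when training from scratch, which is what "allowing sufficient training duration" amounts to in practice.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3)  # stand-in for a detector; illustrative only
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9,
                            weight_decay=1e-4)

def step_milestones(base_iters: int, multiplier: int) -> list:
    """Scale the learning-rate drop points of a short baseline schedule by
    `multiplier`, giving a longer schedule for training from random init."""
    return [int(base_iters * 0.67) * multiplier,
            int(base_iters * 0.89) * multiplier]

# A pre-trained backbone typically converges within the short baseline
# schedule; a randomly initialized one is given several times more
# iterations before the same learning-rate drops are applied.
base_iters = 90_000                       # illustrative baseline length
milestones = step_milestones(base_iters, multiplier=6)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=milestones, gamma=0.1)
```

With enough extra iterations under such a stretched schedule, the from-scratch curve catches up to the fine-tuned one, which is the paper's central convergence observation.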
Implications of the Research
The key implications of this paper are as follows:
- Invest in Target-Domain Data: In scenarios where large-scale target-domain data collection is feasible, investing in that data is likely more beneficial than relying on generic large-scale pre-training datasets like ImageNet. This can yield stronger domain-specific performance and reduce dependence on pre-trained models.
- Re-evaluate Baselines: Researchers should reconsider the de facto reliance on ImageNet pre-training, especially when assessing new methodologies for model initialization or self-supervised learning. Evaluating these methodologies without an ImageNet pre-training baseline may reveal additional insights.
- Universal Representations: The notion of learning universal feature representations remains valid but needs careful empirical validation. The diminishing returns from enlarging classification pre-training datasets suggest that other avenues, such as training directly on large amounts of domain-specific data, may be more fruitful.
Future Directions
These findings prompt several potential research directions:
- Optimization Schedules: Investigating more sophisticated optimization schedules and normalization strategies that can expedite convergence for models trained from scratch.
- Hybrid Approaches: Combining minimal pre-training on smaller or synthetic datasets with extensive fine-tuning on target tasks to balance convergence speed and accuracy.
- Task-specific Pre-training: Exploring task-specific pre-training datasets that better capture the nuances of the target tasks, especially those involving detailed spatial or localization requirements.
- Automated Hyper-parameter Search: Implementing more robust automated methods for hyper-parameter search that can mitigate the overfitting observed when training from scratch on smaller datasets.
- Continual Learning: Developing frameworks where models can continually learn from new data, focusing on both maintaining performance on old tasks and improving on new tasks without the explicit need for large pre-trained models.
Conclusion
This paper significantly contributes to the ongoing discourse on the role of pre-training in deep learning, particularly in computer vision. By empirically demonstrating that models trained from scratch can match the performance of those relying on extensive pre-training, it prompts a reassessment of long-standing assumptions. As the field moves forward, these insights will likely catalyze more nuanced, domain-specific, and efficient training methodologies that leverage the inherent capacities of deep neural networks effectively.