ConvNets Match Vision Transformers at Scale
The paper "ConvNets Match Vision Transformers at Scale," authored by Samuel L. Smith, Andrew Brock, Leonard Berrada, and Soham De from Google DeepMind, critically examines the prevalent assertion in computer vision that Vision Transformers (ViTs) surpass Convolutional Neural Networks (ConvNets) at scale. This paper explores an extensive empirical comparison between a ConvNet architecture and ViTs, utilizing substantial computational resources for pre-training on a large-scale dataset.
Introduction
ConvNets have been foundational in the field of computer vision, achieving early successes and dominating benchmarks for nearly a decade. However, since the advent of ViTs, the field has shifted towards transformer-based architectures for image recognition tasks. The current consensus suggests that ViTs exhibit superior scaling properties when trained on substantial datasets collected from the web. This paper challenges that view by rigorously evaluating the NFNet model family, a state-of-the-art ConvNet architecture, pre-training it on the JFT-4B dataset and comparing its performance against ViTs under equivalent computational budgets.
Methodology
The NFNet models, spanning configurations from F0 to F7+, were pre-trained on the JFT-4B dataset, which contains roughly 4 billion labeled images. Pre-training was carried out across compute budgets ranging from 0.4k to 110k TPU-v4 core compute hours. The training recipe followed established NFNet practice, using SGD with momentum, Adaptive Gradient Clipping (AGC), and different image resolutions at training and evaluation time.
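To make the AGC step concrete, the following is a minimal NumPy sketch of unit-wise adaptive gradient clipping in the spirit of the original NFNet work; the clipping threshold and epsilon values are illustrative defaults rather than the exact settings used in this paper, and the per-output-row unit convention is an assumption of this sketch.

```python
import numpy as np

def unitwise_norm(x):
    """L2 norm per 'unit' (here: per output row/filter); vectors and scalars use element-wise |x|."""
    if x.ndim <= 1:
        return np.abs(x)
    return np.sqrt(np.sum(x ** 2, axis=tuple(range(1, x.ndim)), keepdims=True))

def adaptive_gradient_clip(grad, param, clip=0.01, eps=1e-3):
    """Rescale gradient units whose norm exceeds clip * (parameter-unit norm)."""
    max_norm = clip * np.maximum(unitwise_norm(param), eps)
    g_norm = unitwise_norm(grad)
    # Only units with g_norm > max_norm are rescaled; the rest pass through unchanged.
    scale = np.where(g_norm > max_norm, max_norm / np.maximum(g_norm, 1e-6), 1.0)
    return grad * scale

# Example: clip the gradient of a (out_features, in_features) weight matrix.
w = np.random.randn(128, 256)
g = np.random.randn(128, 256) * 10.0  # deliberately large gradient
g_clipped = adaptive_gradient_clip(g, w)
```

The key property is that clipping is relative to the parameter norm of each unit, so large, well-scaled weights tolerate proportionally larger gradients, which is what allows NFNets to train stably without batch normalization.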
A clear log-log scaling law emerged between held-out validation loss and pre-training compute budget, mirroring the scaling laws observed in language modelling with transformers. The paper also examined optimal epoch budgets and learning rates across model sizes, finding that model size and the number of training epochs should both be scaled up as the compute budget grows, consistent with earlier findings in language modelling.
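As a rough illustration of what a linear log-log trend implies, the short sketch below fits a power law of the form loss ≈ a · compute^(−b) to hypothetical (compute, loss) pairs; the numbers are invented for demonstration and are not taken from the paper.

```python
import numpy as np

# Hypothetical (compute, validation loss) pairs; in the paper each point would be the
# lowest held-out loss reached by the best NFNet at a given TPU-v4 core-hour budget.
compute = np.array([0.4e3, 1.5e3, 6.0e3, 25e3, 110e3])   # TPU-v4 core hours (illustrative)
val_loss = np.array([2.60, 2.41, 2.25, 2.11, 1.98])      # invented values for demonstration

# A straight line in log-log space is equivalent to a power law: loss ~= a * compute**(-b).
slope, intercept = np.polyfit(np.log(compute), np.log(val_loss), deg=1)
a, b = np.exp(intercept), -slope
print(f"fitted power law: loss ~= {a:.2f} * compute**(-{b:.3f})")
```

The paper's curves show this kind of straight-line behaviour across the NFNet family, which is what motivates the direct comparison with transformer scaling laws.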
Results
After fine-tuning the pre-trained NFNets on ImageNet, the models matched their ViT counterparts in Top-1 accuracy at comparable pre-training compute budgets. Notably, the largest model, NFNet-F7+, reached 90.4% Top-1 accuracy when fine-tuned with repeated augmentation, a significant improvement over previous NFNet results obtained without additional data. This parity at matched compute highlights the efficacy of ConvNets at scale.
Analysis
The results underscore that, with sufficient computational resources and dataset sizes, ConvNets can match the performance of ViTs. This finding contests the prevailing assumption that ViTs inherently possess superior scaling properties. The linear trend observed in the log-log scaling between validation loss and compute budget prompts a re-evaluation of current biases towards transformer architectures.
Discussion
The paper's implications are twofold. Practically, it suggests that researchers can still rely on ConvNets for competitive performance in large-scale vision tasks, provided adequate compute and data. Theoretically, it challenges the narrative favoring transformers, advocating for a more nuanced view that considers the critical role of compute and data irrespective of the model architecture. Future developments in AI will likely explore hybrid models, leveraging the strengths of both ConvNets and transformers.
Conclusion
In conclusion, the paper robustly demonstrates that ConvNets can indeed match the performance of ViTs at scale, challenging the orthodoxy in contemporary computer vision research. By providing rigorous empirical evidence, the paper invites the research community to re-assess the comparative advantages of these architectures, potentially fostering a more balanced approach to developing future AI models.